• Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: minuteman
    • Labels:


      In one of our running DC/OS clusters (v1.9.0) we've had some trouble with the l4lb. One of our services has /waldo/web:3000 as its VIP_0 label. This hasn't caused any issues in the past, but recently we've found that about 50% of the requests to it fail with "connection refused".

      I've dug into how the l4lb works. I'm pretty limited in my ability to read Erlang, but my understanding is that either Navstar or Minuteman manipulates the Linux IPVS kernel module to add entries in the subnet which map to actual container ip:port combinations in the DC/OS cluster. Whichever service administers that somehow fetches updates from Mesos.

      In our case the IP address of the domain works out to 11.42.69.221 (hex 0B2A45DD). I can inspect the state of IPVS by running "cat /proc/net/ip_vs" on CoreOS. Doing so with a bit of grepping shows this:

      TCP  0B2A45DD:0BB8 wlc
        -> 0A0001A0:3396  Masq  1  0  0
        -> 0A000099:2DD8  Masq  1  0  0

      0A0001A0:3396 (10.0.1.160:13206) is a valid container IP in the cluster.

      0A000099:2DD8 (10.0.0.153:11736), however, failed about 4 days ago.
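
For anyone repeating this check, the hex entries can be decoded with a short script. This is only a minimal sketch, assuming the IPv4 layout seen in /proc/net/ip_vs (hex address, a colon, hex port). The names decode_endpoint and parse_ip_vs are illustrative and not part of DC/OS, and the sample text is the cleaned-up output from this report rather than a live read of the file.

```python
import ipaddress


def decode_endpoint(hex_endpoint):
    """Turn 'HHHHHHHH:PPPP' (hex IPv4 address and port) into 'a.b.c.d:port'."""
    hex_ip, hex_port = hex_endpoint.split(":")
    ip = ipaddress.IPv4Address(int(hex_ip, 16))
    return f"{ip}:{int(hex_port, 16)}"


def parse_ip_vs(text):
    """Map each virtual service to the list of real servers listed under it."""
    services = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            endpoint = decode_endpoint(fields[1])
        except ValueError:
            # Header lines and non-IPv4 entries do not decode; skip them.
            continue
        if fields[0] in ("TCP", "UDP"):
            current = endpoint
            services[current] = []
        elif fields[0] == "->" and current is not None:
            services[current].append(endpoint)
    return services


# Sample taken from this report, with the wiki markup removed.
SAMPLE = """\
TCP  0B2A45DD:0BB8 wlc
  -> 0A0001A0:3396  Masq  1  0  0
  -> 0A000099:2DD8  Masq  1  0  0
"""

if __name__ == "__main__":
    for vip, backends in parse_ip_vs(SAMPLE).items():
        print(vip, "->", ", ".join(backends))
```

On a node, the same function could be fed the contents of /proc/net/ip_vs instead of SAMPLE, and the decoded backends compared by hand against the tasks Mesos reports as running.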


      So it seems that Minuteman/Navstar has gotten out of sync. I tried restarting the service and scaling it up and down, but nothing seems to clear out the old container's remote address. Any ideas how to fix it? Do you know how the system may have gotten into this state, so we can avoid it happening again?




            • Assignee:
              dgoel Deepak Goel
              whoward whoward
              ( DO NOT USE ) Networking Team
              Deepak Goel, Marian Zange, whoward
            • Watchers:
              3


              • Created: