Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Won't Do
    • Affects Version/s: DC/OS 1.8.7, DC/OS 1.9.2
    • Fix Version/s: DC/OS 1.9.5, DC/OS 1.10.1
    • Component/s: networking
    • Labels:
      None

      Description

      Hi,
      We are running DC/OS 1.8.7 on CoreOS Stable 1235.5.0 and have been encountering stability issues with VIPs.

      The VIPs seem to work, but we get a lot of connection errors in our application logs. Sometimes they stop working completely until we reboot the affected nodes.

      In our Python web app, we get many errors like this on the connection to the PostgreSQL server:

      Exception : server closed the connection unexpectedly
              This probably means the server terminated abnormally
              before or while processing the request.
      

      In our GitLab instance, we get the same kind of error on the connection to the external PostgreSQL server:

      ActiveRecord::StatementInvalid (PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly
              This probably means the server terminated abnormally
              before or while processing the request.
      

      In our nginx frontend, which connects to our Python app, we get errors like this:

      2017/01/10 11:42:49 [error] 5#5: *169 recv() failed (104: Connection reset by peer) while reading response
       header from upstream, client: OUR_TEST_IP, server: SERVER_URL, request: "GET / HTTP/1.1", upstream:
       "http://11.154.108.73:8069/", host: "SERVER_URL"
      

      We use curl to check whether the VIPs are working:

      core@dcos-vm-sbg1-sp03 ~ $ curl labcrmodoo9.marathon.l4lb.thisdcos.directory:8069
      <html><head><script>window.location = '/web' + location.hash;</script></head></html>
      

      So it works when I access the Python web app from a public node.

      But if I run the same command from inside our nginx container:

      / # curl labcrmodoo9.marathon.l4lb.thisdcos.directory:8069
      curl: (56) Recv failure: Connection reset by peer
      

      curl fails 9 times out of 10 from inside the nginx container.
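To put a number on the failure rate instead of counting manual curl runs, a small probe along these lines could be run from both the public node and the nginx container. This is only a sketch: `probe_once` and `failure_rate` are hypothetical helper names, and it assumes Python 3 is available where the probe runs and that any response bytes from the VIP count as success.

```python
import socket
from typing import Callable


def probe_once(host: str, port: int, timeout: float = 3.0) -> bool:
    """Send a minimal HTTP GET and return True if any response bytes come back."""
    request = (
        f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            # An empty read means the peer closed without answering.
            return len(sock.recv(1024)) > 0
    except OSError:
        # Covers connection reset, connection refused, timeouts and DNS errors.
        return False


def failure_rate(probe: Callable[[], bool], attempts: int = 100) -> float:
    """Run the probe `attempts` times and return the fraction that failed."""
    failures = sum(1 for _ in range(attempts) if not probe())
    return failures / attempts
```

For example, `failure_rate(lambda: probe_once("labcrmodoo9.marathon.l4lb.thisdcos.directory", 8069))` returns the fraction of failed attempts for the VIP used in the curl checks above, which makes runs before and after a node reboot easier to compare.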

      The connections usually improve after we reboot the affected nodes, but they start failing again after a while.

      We also found this thread: https://groups.google.com/a/dcos.io/forum/#!searchin/users/vips/users/bKv9mucQBi0/QxgwmczmAAAJ, which looks similar to our problem, so we added the file /etc/sysctl.d/netfilter.conf with the following content on every node:

      net.netfilter.nf_conntrack_tcp_be_liberal=1
      net.netfilter.ip_conntrack_tcp_be_liberal=1
      net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
      

      However, this did not solve our network issue.
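Since a file in /etc/sysctl.d only takes effect once it has been loaded, and a key the running kernel does not expose cannot be set at all, it may be worth confirming what each node actually reports. The following is a minimal sketch, assuming the standard /proc/sys layout where dots in a key map to path separators; `read_sysctl` and `report` are hypothetical helper names.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional

# The three keys from /etc/sysctl.d/netfilter.conf above.
KEYS = [
    "net.netfilter.nf_conntrack_tcp_be_liberal",
    "net.netfilter.ip_conntrack_tcp_be_liberal",
    "net.ipv4.netfilter.ip_conntrack_tcp_be_liberal",
]


def read_sysctl(key: str, root: Path = Path("/proc/sys")) -> Optional[str]:
    """Return the live value of a sysctl key, or None if it cannot be read."""
    path = root.joinpath(*key.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        # The key is not exposed by this kernel, or is not readable.
        return None


def report(
    keys: Iterable[str] = KEYS, root: Path = Path("/proc/sys")
) -> Dict[str, Optional[str]]:
    """Map each key to its live value (None when the key is unavailable)."""
    return {key: read_sysctl(key, root) for key in keys}
```

Printing `report()` on each node shows which of the three keys exist and whether they are set to 1; a `None` entry means that key is not available on that node.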

      Do you have any idea where the problem could be?
      We can provide more information about our configuration and environment if necessary.

            People

            • Assignee: dgoel Deepak Goel
            • Reporter: gberna Guilhem
            • Watchers: Albert Strasheim (Inactive), anatoly yakovenko (Inactive), Anatoly Yakovenko (Inactive), Bekir Dogan, Bertrand RETIF, Cathy Daw, Deepak Goel, Guilhem, LinkMJB, Marian Zange, Nicholas Sun (Inactive), Senthil Kumaran (Inactive), Senthil Kumaran (Inactive)