We are running DC/OS 1.8.7 on CoreOS Stable 1235.5.0 and have been encountering stability issues with VIPs.
The VIPs seem to work, but we get a lot of connection errors in our application logs. Sometimes they stop working completely until we reboot the nodes involved.
In our Python web app we get a lot of errors like these about the connection to the PostgreSQL server:
In our GitLab instance we see the same kind of error about the connection to the external PostgreSQL server:
In our nginx frontend, which tries to connect to our Python app, we get these errors:
We are using curl to check whether the VIPs are working.
Here it works when I try to access a Python web app from a public node.
But if I run the same command from inside our nginx container:
The curl fails 9 out of 10 times from inside the nginx container.
The connections usually improve when we reboot the nodes involved, but they start failing again after a while.
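To put a number on the failure rate instead of rerunning curl by hand, a small probe like the one below could be run from the public node and from inside the nginx container. This is only a minimal sketch: the function name probe_vip, the attempt count, and the timeout are illustrative choices, and the host and port have to be replaced with the VIP of the app being tested. It only checks that a TCP connection can be opened, not that the HTTP response is correct.

```python
import socket
import sys
import time


def probe_vip(host, port, attempts=10, timeout=2.0, pause=0.0):
    """Open a TCP connection `attempts` times; return (ok, failed) counts.

    `probe_vip` is a hypothetical helper name, not part of DC/OS.
    """
    ok = failed = 0
    for _ in range(attempts):
        try:
            # create_connection completes the TCP handshake; the `with`
            # block closes the socket again right away.
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            failed += 1
        time.sleep(pause)
    return ok, failed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python probe.py <vip-host> <vip-port>
    host, port = sys.argv[1], int(sys.argv[2])
    ok, failed = probe_vip(host, port, attempts=100, pause=0.1)
    print(f"{host}:{port} ok={ok} failed={failed}")
```

Running it at the same time from a node where curl works and from the nginx container should show whether the failures depend on the client location or on the backend.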
We also found this thread, which looks similar to our problem: https://groups.google.com/a/dcos.io/forum/#!searchin/users/vips/users/bKv9mucQBi0/QxgwmczmAAAJ. Based on it, we added the file /etc/sysctl.d/netfilter.conf with the following content on every node:
But it did not solve our network issue.
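Since files in /etc/sysctl.d are normally applied at boot or when the sysctl settings are reloaded, it may be worth confirming that the values are actually active on each node. The sketch below reads a key back from /proc/sys. The helper names are hypothetical, and the key used in the usage comment (net.netfilter.nf_conntrack_tcp_be_liberal) is only an example of a real netfilter sysctl, not necessarily the one set in the file above.

```python
from pathlib import Path


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    # Assumes no path component contains a dot (true for netfilter keys).
    return Path(root).joinpath(*key.split("."))


def read_sysctl(key, root="/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(key, root).read_text().strip()
    except OSError:
        # Key missing (e.g. module not loaded) or not readable.
        return None


# Example usage on a node (the key is an illustrative assumption):
#   print(read_sysctl("net.netfilter.nf_conntrack_tcp_be_liberal"))
```

A result of None would suggest that the key does not exist on that node, for example because the conntrack module is not loaded.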
Do you have any idea where the problem could be?
We can provide more information about our configuration and environment if necessary.