[DCOS_OSS-3625] arp cache neighbor table overflow Created: 13/Jun/18 Updated: 09/Nov/18 Resolved: 13/Aug/18 |
|
Status: | Resolved |
Project: | DC/OS |
Component/s: | dcos-net-spartan, navstar, networking |
Affects Version/s: | DC/OS 1.11.0, DC/OS 1.11.1, DC/OS 1.11.2 |
Fix Version/s: | None |
Type: | Bug | Priority: | Medium |
Reporter: | tahaalibra (Inactive) | Assignee: | Deepak Goel |
Resolution: | Cannot Reproduce | ||
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Team: |
Description |
Hi, I have a DC/OS cluster with around 20 agents and around 50 running tasks. The problem I am facing is that components on the agents and master go into unhealthy states. After some digging, I found the following error in dmesg: `arp cache neighbor table overflow`. I increased the following settings: net.ipv4.neigh.default.gc_thresh2, net.ipv4.neigh.default.gc_thresh3.
My question is how the limit is reached with only 50 tasks running. Also, can DC/OS tune gc_thresh for its use case? |
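For reference, the neighbor-table thresholds named in the description can be inspected before they are changed. The following is a minimal Python sketch, assuming the standard Linux procfs location `/proc/sys/net/ipv4/neigh/default`; the function name `read_gc_thresholds` is illustrative and not part of DC/OS.

```python
from pathlib import Path

# Standard Linux procfs location for the IPv4 neighbor-table defaults.
NEIGH_DEFAULT_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def read_gc_thresholds(base_dir=NEIGH_DEFAULT_DIR):
    """Return {name: value} for gc_thresh1..3, skipping files that are missing."""
    thresholds = {}
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        path = Path(base_dir) / name
        if path.is_file():
            thresholds[name] = int(path.read_text().strip())
    return thresholds


if __name__ == "__main__":
    # gc_thresh3 is the hard upper limit on neighbor entries.
    for name, value in read_gc_thresholds().items():
        print(f"net.ipv4.neigh.default.{name} = {value}")
```

Reading the values first makes it easier to tell whether a node is still on the kernel defaults or has already been tuned.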
Comments |
Comment by Deepak Goel [ 21/Jun/18 ] |
tahaalibra, thanks for reporting this issue. I agree, this shouldn't cause the ARP cache to overflow. Could you please check how many entries are present in your ARP cache? |
Comment by tahaalibra (Inactive) [ 21/Jun/18 ] |
It's a little over 1024 (1024 is the default value for net.ipv4.neigh.default.gc_thresh3). |
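One way to obtain such a count is to read `/proc/net/arp`. The sketch below is only an illustration: it assumes the usual layout of that file (one header line followed by one row per entry), and the helper names and the 0.9 warning margin are arbitrary choices, not anything DC/OS provides. The file reflects only the current network namespace, so the number may differ from the kernel's internal total.

```python
from pathlib import Path

ARP_TABLE = Path("/proc/net/arp")  # IPv4 neighbor entries visible in this network namespace


def count_arp_entries(table_text):
    """Count data rows in /proc/net/arp-style text (the first line is a header)."""
    rows = [line for line in table_text.splitlines() if line.strip()]
    return max(len(rows) - 1, 0)


def is_near_limit(entry_count, gc_thresh3=1024, margin=0.9):
    """True when the count reaches `margin` of gc_thresh3 (0.9 is an arbitrary margin)."""
    return entry_count >= gc_thresh3 * margin


if __name__ == "__main__":
    if ARP_TABLE.is_file():
        count = count_arp_entries(ARP_TABLE.read_text())
        print(f"{count} ARP entries; near limit: {is_near_limit(count)}")
```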
Comment by Deepak Goel [ 21/Jun/18 ] |
Something is not right there, because even if you add up 20 agents and 50 tasks, that is only 70 entries. |
Comment by tahaalibra (Inactive) [ 21/Jun/18 ] |
Some more information: we remove and add new Mesos agents frequently (this is done for autoscaling our agent fleet and for some other automation). |
Comment by Deepak Goel [ 21/Jun/18 ] |
Still, that wouldn't be faster than the ARP timeout (default 4 hours). |
Comment by tahaalibra (Inactive) [ 21/Jun/18 ] |
I totally agree with you. Can you test it out? We have 3 clusters, and all have the same problem. By the way, for an overlay network, net.ipv4.neigh.default.gc_thresh3=1024 seems to be low, and it should be tuned by the application during installation. |
Comment by Deepak Goel [ 25/Jun/18 ] |
tahaalibra, in the case of the overlay, the number of entries depends on the number of agents in your cluster and not on the number of containers, because the next hop for all the containers on an agent is the agent itself. Are you sure your ARP table has legitimate entries? |
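A rough way to check whether the entries look legitimate is to break the table down by interface and count unresolved rows. The Python sketch below assumes the usual six-column `/proc/net/arp` layout (IP address, HW type, Flags, HW address, Mask, Device) and treats a flags value of 0x0 as an incomplete entry; `summarize_arp` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def summarize_arp(table_text):
    """Return (entries per device, incomplete entries per device)."""
    per_device = Counter()
    incomplete = Counter()
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or malformed rows
        flags, device = int(fields[2], 16), fields[5]
        per_device[device] += 1
        if flags == 0:  # 0x0: no resolved hardware address for this entry
            incomplete[device] += 1
    return per_device, incomplete


if __name__ == "__main__":
    arp_file = Path("/proc/net/arp")
    if arp_file.is_file():
        per_device, incomplete = summarize_arp(arp_file.read_text())
        for device, total in per_device.most_common():
            print(f"{device}: {total} entries, {incomplete[device]} incomplete")
```

If the overlay interface holds roughly one entry per agent, as described above, a much larger count on some other interface, or many incomplete rows, would point at where the extra entries come from.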
Comment by Deepak Goel [ 28/Jun/18 ] |
tahaalibra, where are you running your cluster? Is it on AWS, Azure, GCE, or on-prem? I tried reproducing it on AWS with 1 master, 20 agents, and 50 tasks but didn't see my ARP table bloating. |
Comment by Deepak Goel [ 13/Aug/18 ] |
Closing this due to no response. Please feel free to reopen it if you see it again. |