[DCOS_OSS-3625] arp cache neighbor table overflow Created: 13/Jun/18  Updated: 09/Nov/18  Resolved: 13/Aug/18

Status: Resolved
Project: DC/OS
Component/s: dcos-net-spartan, navstar, networking
Affects Version/s: DC/OS 1.11.0, DC/OS 1.11.1, DC/OS 1.11.2
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: tahaalibra (Inactive) Assignee: Deepak Goel
Resolution: Cannot Reproduce  
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Team: DELETE Networking Team

 Description   

Hey,

I have a DC/OS cluster with around 20 agents and around 50 running tasks.

The problem I am facing is that components on the agents and the master go into an unhealthy state.

After some digging, I found the following error in dmesg: `arp cache neighbor table overflow`. I increased the following settings:

net.ipv4.neigh.default.gc_thresh1

net.ipv4.neigh.default.gc_thresh2

net.ipv4.neigh.default.gc_thresh3

 

My question is how the limit is reached with only 50 tasks running. Also, could DC/OS tune gc_thresh for its use case?
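For reference, the values currently in effect can be read back from procfs. The following is a minimal sketch, assuming a Linux host where the default IPv4 neighbor settings are exposed under `/proc/sys/net/ipv4/neigh/default`; the helper name `read_gc_thresholds` is hypothetical and not part of DC/OS.

```python
from pathlib import Path

# Standard Linux procfs location for the default IPv4 neighbor-table settings.
NEIGH_DEFAULT_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def read_gc_thresholds(base_dir=NEIGH_DEFAULT_DIR):
    """Return {name: value} for gc_thresh1..3, skipping unreadable files."""
    thresholds = {}
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        path = Path(base_dir) / name
        try:
            thresholds[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not an integer: leave it out.
            continue
    return thresholds


if __name__ == "__main__":
    for name, value in read_gc_thresholds().items():
        print(f"net.ipv4.neigh.default.{name} = {value}")
```

Running this before and after a change shows whether the new thresholds actually took effect on a given agent or master.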



 Comments   
Comment by Deepak Goel [ 21/Jun/18 ]

tahaalibra thanks for reporting this issue. I agree, this shouldn't cause the ARP cache to overflow. Could you please check how many entries are present in your ARP cache?
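One way to get that count is to parse `/proc/net/arp`, which lists the IPv4 neighbor entries on a Linux host. This is only a sketch: the function names `parse_proc_net_arp` and `describe_usage` are hypothetical, and the value 1024 passed for `gc_thresh3` is an assumed limit that should be replaced with the value configured on the host.

```python
def parse_proc_net_arp(text):
    """Parse the content of /proc/net/arp into a list of entry dicts."""
    entries = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        entries.append({"ip": ip, "flags": flags, "mac": mac, "device": device})
    return entries


def describe_usage(entry_count, gc_thresh3):
    """Format the entry count as a share of the gc_thresh3 hard limit."""
    percent = 100.0 * entry_count / gc_thresh3
    return f"{entry_count} entries, {percent:.0f}% of gc_thresh3={gc_thresh3}"


if __name__ == "__main__":
    try:
        with open("/proc/net/arp") as handle:
            entries = parse_proc_net_arp(handle.read())
    except OSError as exc:
        print(f"could not read /proc/net/arp: {exc}")
    else:
        # 1024 is an assumed limit; substitute the configured gc_thresh3.
        print(describe_usage(len(entries), gc_thresh3=1024))
```

The parsing is kept separate from the file read so the same function can be run against a saved copy of the table from an affected node.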

Comment by tahaalibra (Inactive) [ 21/Jun/18 ]

It's a little over 1024 (1024 is the default value for net.ipv4.neigh.default.gc_thresh3).

Comment by Deepak Goel [ 21/Jun/18 ]

Something is not right there, because even if you add up 20 agents and 50 tasks, that is only 70 entries.

Comment by tahaalibra (Inactive) [ 21/Jun/18 ]

Some more information: we remove and add new Mesos agents frequently (this is done to autoscale our agent fleet and for some other automation).

Comment by Deepak Goel [ 21/Jun/18 ]

Still, it wouldn't be faster than the ARP timeout (default 4 hours).

Comment by tahaalibra (Inactive) [ 21/Jun/18 ]

I totally agree with you. Can you test it out? We have 3 clusters and all of them have the same problem. BTW, for an overlay network, net.ipv4.neigh.default.gc_thresh3=1024 seems to be low, and it should be tweaked by the application during installation.

Comment by Deepak Goel [ 25/Jun/18 ]

tahaalibra In the case of the overlay, the ARP table size depends on the number of agents in your cluster and not on the number of containers, because the next hop for all the containers on an agent is the agent itself. Are you sure your ARP table has legitimate entries?
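A quick way to judge whether the entries look legitimate is to group them by interface and by neighbor state. The sketch below assumes the usual text output of `ip -4 neigh show` (address first, the interface after the `dev` keyword, the state as the last token); the function name `summarize_neighbors` is hypothetical.

```python
from collections import Counter


def summarize_neighbors(ip_neigh_output):
    """Count neighbor entries per device and per state from `ip neigh show` text."""
    by_device = Counter()
    by_state = Counter()
    for line in ip_neigh_output.splitlines():
        tokens = line.split()
        # A usable line has at least: <ip> dev <name> <STATE>
        if len(tokens) < 4 or "dev" not in tokens:
            continue
        dev_index = tokens.index("dev") + 1
        if dev_index >= len(tokens):
            continue
        by_device[tokens[dev_index]] += 1
        by_state[tokens[-1]] += 1  # e.g. REACHABLE, STALE, FAILED
    return by_device, by_state
```

Feeding it the saved output of `ip -4 neigh show` from an affected node shows at a glance whether most entries sit on the overlay interface or somewhere else, and whether many of them are in a FAILED or INCOMPLETE state rather than pointing at real agents.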

Comment by Deepak Goel [ 28/Jun/18 ]

tahaalibra where are you running your cluster? Is it on AWS, Azure, GCE, or on-prem? I tried reproducing it on AWS with 1 master, 20 agents, and 50 tasks but didn't see my ARP table bloating.

Comment by Deepak Goel [ 13/Aug/18 ]

Closing this due to no response. Please feel free to reopen it if you see it again.

Generated at Tue May 24 04:46:51 CDT 2022 using JIRA 7.8.4#78004-sha1:5704c55c9196a87d91490cbb295eb482fa3e65cf.