Details
- Type: Bug
- Status: Resolved
- Priority: Medium
- Resolution: Won't Do
- Affects Version/s: DC/OS 1.10.0
- Fix Version/s: DC/OS 1.11.0
- Labels:
- Sprint: Networking Team 1.11 Sprint 12
- Story Points: 1
Description
We have found several instances of spartan using all available CPU while cluster load was very low. On further investigation, we discovered a very large volume of localhost UDP traffic with both source and destination ports set to 62053.
Wireshark showed this to be odd-looking DNS traffic with many duplicate transaction IDs:
root@prd-ge-controller03:~# tshark -i lo -c 100 -f 'udp src port 62053 and dst port 62053' -d 'udp.port==62053,dns' -O dns 2> /dev/null | grep Transaction | sort | uniq -c | head -n 5
      5 Transaction ID: 0x07f7
      4 Transaction ID: 0x1b3d
      6 Transaction ID: 0x3172
      5 Transaction ID: 0x32f7
      4 Transaction ID: 0x3ebe
When I dug into the spartan and erl-dns code, I found that erl-dns doesn't check the QR flag in the DNS message header and thus happily treats a response message as if it were a query. This means that when it replies to a query whose source port is the same UDP port it's listening on, the reply lands back on its own socket and is treated as a new query, so it keeps replying to its own responses as fast as the network and CPU will allow.
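To illustrate the missing check, here is a minimal sketch in Python (not the erl-dns code, which is Erlang). It assumes only the standard 12-byte DNS header layout, where QR is the top bit of the 16-bit flags word (0 = query, 1 = response). The names `is_query` and `handle_packet` are hypothetical and exist only for this example.

```python
import struct

# DNS header: id, flags, qdcount, ancount, nscount, arcount (network byte order)
DNS_HEADER = struct.Struct("!HHHHHH")
QR_MASK = 0x8000  # top bit of the flags word: 0 = query, 1 = response


def is_query(packet: bytes) -> bool:
    """Return True only if the header is complete and the QR bit is 0."""
    if len(packet) < DNS_HEADER.size:
        return False  # too short to carry a DNS header
    _txid, flags, *_counts = DNS_HEADER.unpack_from(packet)
    return (flags & QR_MASK) == 0


def handle_packet(packet: bytes):
    """Hypothetical handler: drop responses instead of answering them."""
    if not is_query(packet):
        return None  # answering a response is what sustains the loop
    txid, flags, _qd, _an, _ns, _ar = DNS_HEADER.unpack_from(packet)
    # Toy reply: same transaction ID, QR bit set, no records.
    return DNS_HEADER.pack(txid, flags | QR_MASK, 0, 0, 0, 0)
```

With a check like this in place, a reply that arrives back on the listening socket is dropped rather than answered, so a looped packet dies after one hop instead of bouncing indefinitely.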
I don't have any insight into how the query loops started in the first place, but the server I was investigating had 22 distinct transaction IDs bouncing at a combined rate of about 8k queries per second.
Issue Links
- is resolved by: DCOS_OSS-1946 Get rid of erldns udp and tcp servers (Resolved)