[DCOS_OSS-4184] Mesos offers ports that are already in use Created: 25/Sep/18  Updated: 02/Dec/20  Resolved: 02/Dec/20

Status: Resolved
Project: DC/OS
Component/s: mesos, networking
Affects Version/s: DC/OS 1.9.10, DC/OS 1.10.8, DC/OS 1.11.5, DC/OS 1.12.0, DC/OS 1.13.0
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Ivan Chernetsky (Inactive) Assignee: Andrei Sekretenko
Resolution: Won't Do  
Labels: foundations, mesos, networking
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Team: DELETE DKP Workloads Team
Sprint: Core Sprint 2018-29, Core RI-6 Sprint 2018-30
Story Points: 5


The port ranges that are allocated for Mesos to offer to frameworks do not take into account all the ports that are used by the DC/OS components. Please refer to https://github.com/dcos/dcos/blob/ec599b00cdf4b7df90a364d2a6712476cfd34f8b/gen/dcos-config.yaml#L604

  - path: /etc/mesos-slave
    content: |
      MESOS_RESOURCES=[{"name":"ports","type":"RANGES","ranges": {"range": [{"begin": 1025, "end": 2180},{"begin": 2182, "end": 3887},{"begin": 3889, "end": 5049},{"begin": 5052, "end": 8079},{"begin": 8082, "end": 8180},{"begin": 8182, "end": 32000}]}}]
  - path: /etc/mesos-slave-public
    content: |
      MESOS_RESOURCES=[{"name":"ports","type":"RANGES","ranges": {"range": [{"begin": 1, "end": 21},{"begin": 23, "end": 5050},{"begin": 5052, "end": 32000}]}}]

dcos-net uses port 53, but on `slave_public` agent nodes Mesos is allowed to offer this port to frameworks. If a framework decides to use it, the task will fail upon launch.

core@ip-10-0-7-47 ~ $ sudo netstat -ntulp | grep :53
tcp        0      0*               LISTEN      4948/dcos-net       
tcp        0      0*               LISTEN      4948/dcos-net       
tcp        0      0*               LISTEN      4948/dcos-net       
tcp6       0      0 fd01:d::c633:6401:53    :::*                    LISTEN      4948/dcos-net       
udp        0      0*                           4948/dcos-net       
udp        0      0*                           4948/dcos-net       
udp        0      0*                           4948/dcos-net       
udp6       0      0 fd01:d::c633:6401:53    :::*                                4948/dcos-net

We need to make sure that all the ports that are used by the DC/OS components are excluded from the port ranges that Mesos offers to frameworks.
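The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of DC/OS: it parses the public-agent `MESOS_RESOURCES` value quoted in this ticket and reports which ports from a given list fall inside the offered ranges. The function names and the list of ports passed in are assumptions made for illustration.

```python
import json

# Value of MESOS_RESOURCES for public agents, copied from this ticket.
PUBLIC_AGENT_RESOURCES = (
    '[{"name":"ports","type":"RANGES","ranges": {"range": ['
    '{"begin": 1, "end": 21},{"begin": 23, "end": 5050},'
    '{"begin": 5052, "end": 32000}]}}]'
)


def offered_port_ranges(resources_json):
    """Return the inclusive (begin, end) pairs of the "ports" resource."""
    ranges = []
    for resource in json.loads(resources_json):
        if resource.get("name") == "ports" and resource.get("type") == "RANGES":
            for r in resource["ranges"]["range"]:
                ranges.append((r["begin"], r["end"]))
    return ranges


def conflicting_ports(resources_json, ports_in_use):
    """Return the ports in use that Mesos would still offer to frameworks."""
    ranges = offered_port_ranges(resources_json)
    return sorted(
        port
        for port in set(ports_in_use)
        if any(begin <= port <= end for begin, end in ranges)
    )


# Hypothetical input: 53 is the dcos-net port from this ticket; 22 and 5051
# are the two ports that the public-agent ranges already leave out.
print(conflicting_ports(PUBLIC_AGENT_RESOURCES, [22, 53, 5051]))  # [53]
```

Feeding such a check with the full list of ports used by DC/OS components would show every port that still needs to be excluded, not only port 53.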

Comment by Sergey Urbanovich (Inactive) [ 25/Sep/18 ]




Comment by Ivan Chernetsky (Inactive) [ 25/Sep/18 ]

Sergey Urbanovich, thanks for posting the links. I don't think the docs are up-to-date. For instance, the Marathon HTTPS port is missing. I guess the work on this ticket should include updating the docs as well.

Comment by Benno Evers (Inactive) [ 27/Sep/18 ]

As far as I can tell from the linked docs, port 53 is the only problematic port right now, so I've opened https://github.com/dcos/dcos/pull/3517 with the straightforward fix.
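To illustrate what excluding a single port from the offered ranges amounts to, here is a small sketch. It is not taken from the linked PR; the helper name `exclude_port` is hypothetical, and the input is the public-agent range list from the ticket description.

```python
def exclude_port(ranges, port):
    """Split any inclusive (begin, end) range that contains `port`."""
    result = []
    for begin, end in ranges:
        if port < begin or port > end:
            # The range does not contain the port; keep it unchanged.
            result.append((begin, end))
            continue
        if begin <= port - 1:
            result.append((begin, port - 1))
        if port + 1 <= end:
            result.append((port + 1, end))
    return result


# Public-agent port ranges as listed in the ticket description.
public_ranges = [(1, 21), (23, 5050), (5052, 32000)]
print(exclude_port(public_ranges, 53))
# [(1, 21), (23, 52), (54, 5050), (5052, 32000)]
```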

If it turns out that there are additional ports that were not documented, we might want to change to a slightly more general solution, like declaring the ports used by DC/OS in some configuration variable and introducing a static reservation to some internal role for them on all hosts.

Also, I'm not sure how important it is, but "normal" applications won't start services listening on privileged ports, so the only way this bug could have surfaced is if someone was trying to run a DNS server on DC/OS. But then, the linked PR won't solve the issue for them - in fact, it will get worse because instead of getting a "port already used" error they now have to wonder why they're not getting any fitting offers.

So if this is something people are actually trying to do in practice, we might want to plan for some solution where ports are reported per network interface, and only the internal port 53 is marked as being used by dcos-net.

Comment by Benno Evers (Inactive) [ 01/Oct/18 ]

After thinking a bit more about this, it seems that all reasonably simple fixes will fail for this issue:

1) Excluding ports from agent resources: This works fine for new DC/OS installations, but (as pointed out by Jie Yu), will cause upgrades to fail because right now we can not remove resources from agents without fully wiping them. (and https://reviews.apache.org/r/64384/ to fix that will not be merged in time for 1.12)

It would still be possible to add some logic to the installer to exclude the resources only in case of fresh installations or when the user is wiping the agents anyway, but fully testing that in time for the release does not look realistic, and it would make the install script more complicated and future maintenance harder.

2) Adding a static reservation. The problem is that these are parsed additively, so a specification like "ports:[0-1024];ports(__dcos_internal_reserved):[53-53]" is interpreted as saying that frameworks are allowed to use all of the ports 0-1024, and in addition the role __dcos_internal_reserved is allowed to use one port 53.

I've opened https://issues.apache.org/jira/browse/MESOS-9280 to track the necessary Mesos work to enable this.

3) Adding a dynamic reservation. These require getting an offer with the specified port first, so they're not really suitable for this purpose.
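The additive interpretation described in option 2 can be illustrated with a toy parser. This is a sketch of the semantics only, not Mesos's actual parsing code; the function names are hypothetical, and the key `"*"` is used here to stand for resources without an explicit role.

```python
import re

ITEM_PATTERN = re.compile(
    r"\s*ports(?:\((?P<role>[^)]+)\))?:\[(?P<ranges>[^\]]*)\]\s*"
)


def parse_port_resources(spec):
    """Toy parser: map each role to its inclusive (begin, end) port ranges."""
    by_role = {}
    for item in spec.split(";"):
        match = ITEM_PATTERN.fullmatch(item)
        if match is None:
            raise ValueError(f"unsupported resource item: {item!r}")
        role = match.group("role") or "*"
        for part in match.group("ranges").split(","):
            begin, end = (int(x) for x in part.strip().split("-"))
            # Each item adds to the role's pool; nothing is subtracted
            # from the pool of any other role.
            by_role.setdefault(role, []).append((begin, end))
    return by_role


def contains(ranges, port):
    return any(begin <= port <= end for begin, end in ranges)


spec = "ports:[0-1024];ports(__dcos_internal_reserved):[53-53]"
parsed = parse_port_resources(spec)
print(parsed["*"])                # [(0, 1024)]
print(contains(parsed["*"], 53))  # True
```

Under this reading, port 53 remains in the pool without a role even though it also appears under `__dcos_internal_reserved`, which is why the static reservation alone does not keep it away from frameworks.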

Given that this issue has been around for a long time without complaints, I guess it's fairly safe to assume that there are currently no customers trying to run their own DNS server inside DC/OS, so for now I'll downgrade this to "Medium" and remove the attached fix versions.

Maybe as a follow-up, we could ask the docs team to specifically point out that we don't support running DNS servers inside DC/OS, and revisit this once we're trying to reserve another port for internal use.

Comment by Ivan Chernetsky (Inactive) [ 01/Oct/18 ]

Benno Evers, I do get why it is not easy to make Mesos stop offering a particular port on existing DC/OS installations, but I don't get why you closed this ticket. Any task can be offered port 53 on a slave_public node. Such a task will fail and be restarted with another port, because ports are chosen randomly from the available ones unless a user specifies a port in the app definition. The failure can therefore be considered transient, and if it happens in a customer's cluster, I believe it doesn't get much attention. So, it is not about running DNS servers.

Comment by Benno Evers (Inactive) [ 02/Oct/18 ]

Ivan Chernetsky, the idea behind closing was to capture the fact that there's no immediate solution in sight, and so there's no work left to be done until either the linked MESOS-9280 ticket is resolved or we can find some better workaround. I've changed it to 'Blocked' now, maybe that fits better.

Comment by Benno Evers (Inactive) [ 04/Oct/18 ]

After further discussions with the Marathon team, it seems like this issue isn't easily solvable on their side either:

  • First, they're bound by the same backwards compatibility considerations as Mesos is, so just not offering privileged ports by default would be a breaking change that would need to be hidden behind a feature flag that is disabled by default.
  • Second, "ports" does not have any special internal meaning to either Marathon or Mesos, it's just a range-type variable, so adding any application-level special handling seems cumbersome.
  • Third, Marathon does not support multiple roles, so reserving privileged ports to e.g. a __dcos_privileged_networking role would make them inaccessible to Marathon clients.

So, in summary, I think the best way forward would be to
1) Merge `--reconfiguration_policy=any` patches in Mesos, switch to that in DC/OS and just remove port 53 from public agents. Since this is a problem purely arising from the way Mesos is configured on public agents, it seems appropriate to solve it by changing configuration.
2) In the meantime, add an application-level check for port < 1024 to the test_vip test, which seems to currently be the only application seriously affected by this.
3) Once it is possible, reserve all privileged ports to a special role so that frameworks can not get them offered by default.
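The application-level check from step 2 could look roughly like the sketch below. It is not the actual test_vip code; the helper name and its parameter are hypothetical, and it simply drops everything below port 1024 from a list of offered ranges.

```python
def non_privileged_ranges(ranges, first_unprivileged=1024):
    """Clip inclusive (begin, end) ranges so that no port below 1024 remains."""
    result = []
    for begin, end in ranges:
        begin = max(begin, first_unprivileged)
        if begin <= end:
            result.append((begin, end))
    return result


# Public-agent port ranges as listed in the ticket description.
public_ranges = [(1, 21), (23, 5050), (5052, 32000)]
print(non_privileged_ranges(public_ranges))
# [(1024, 5050), (5052, 32000)]
```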

Comment by Dominik Dary (Inactive) [ 02/Dec/20 ]

This ticket is being resolved as "Won't Do" as we are not planning to fix any non-blocker bugs. Feel free to comment or re-open if you feel differently.

Generated at Sun May 22 08:12:14 CDT 2022 using JIRA 7.8.4#78004-sha1:5704c55c9196a87d91490cbb295eb482fa3e65cf.