Affects Version/s: None
Fix Version/s: DC/OS 1.10.0
Epic Name: Admin Router: Improvement of the DNS resolution in AR upstream configuration
Total Story Points: 10
This tracks the effort and Jiras related to deprecating the periodic NGINX reload approach in favour of dynamic DNS resolution.
Originally, AR reloads were introduced because of limitations in the Open/Core version of NGINX that we are using. Some of the upstream definitions have to be specified using well-known DC/OS DNS aliases, e.g.:
The contents of these aliases can change over time, for example due to Mesos leader re-election. Unfortunately, the Open/Core version of NGINX resolves these entries only during NGINX startup or reload. Hence a new systemd service was introduced that periodically performs a graceful reload of the dcos-adminrouter service:
This approach has two major disadvantages:
- it creates lots of confusing log messages; some customers feel anxious when they notice the service being restarted every 30s
- graceful reloads can affect service handling, as depicted in DCOS-15783. Even though it looks like an AR bug, pinpointing it may result in significant engineering effort and some meddling with NGINX core
So, based on e.g. https://forum.nginx.org/read.php?2,215830,215832#msg-215832 and https://www.jethrocarr.com/2013/11/02/nginx-reverse-proxies-and-dns-resolution/, we can try to emulate NGINX Plus behaviour. It is also vital to test all the changes thoroughly, so some extra effort will have to be made to write decent unit tests that make sure that:
- Lua code re-resolves changed DNS entries, e.g. in cache update subroutines
- all affected endpoints re-resolve DNS-based upstreams in a timely manner.
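To make the "timely manner" requirement above more concrete, here is a minimal Python sketch of the behaviour such a unit test could assert. It is not AR code: the class `ReResolvingUpstream`, the alias `leader.example.internal`, the addresses, and the 5-second validity window are all hypothetical and chosen only for illustration. The idea is that a resolved address is cached for a bounded time and then looked up again, so a DNS change (e.g. after a leader re-election) is picked up without a reload.

```python
import time


class ReResolvingUpstream:
    """Hypothetical helper: caches a resolved address for at most
    `valid` seconds, then resolves the name again on the next use."""

    def __init__(self, name, resolve, valid=5.0, clock=time.monotonic):
        self.name = name
        self._resolve = resolve  # callable: hostname -> address
        self._valid = valid      # assumed validity window, in seconds
        self._clock = clock      # injectable so tests need not sleep
        self._cached = None      # (address, expiry time) or None

    def address(self):
        now = self._clock()
        if self._cached is None or now >= self._cached[1]:
            # First use or cached entry expired: resolve again.
            self._cached = (self._resolve(self.name), now + self._valid)
        return self._cached[0]


# Fake DNS table and fake clock standing in for real infrastructure.
records = {"leader.example.internal": "10.0.0.1"}
now = [0.0]

upstream = ReResolvingUpstream(
    "leader.example.internal",
    resolve=records.__getitem__,
    valid=5.0,
    clock=lambda: now[0],
)

first = upstream.address()

# Simulate the alias changing, e.g. after a leader re-election.
records["leader.example.internal"] = "10.0.0.2"

now[0] = 1.0
stale = upstream.address()   # still inside the validity window

now[0] = 6.0
fresh = upstream.address()   # window elapsed, name is resolved again
```

Injecting the resolver and the clock keeps the test deterministic: it can assert both that the old address is served while the entry is valid and that the new address appears once the window has elapsed.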