  1. DC/OS
  2. DCOS_OSS-1446

Admin Router: Dynamic DNS resolution of upstreams

    Details

    • Epic Name:
      Admin Router: Improvement of the DNS resolution in AR upstream configuration
    • Epic Status:
      Done
    • Total Story Points:
      10
    • Progress Meter:
      0 SP 0 SP 10 SP

      Description

      This epic tracks the effort and the Jira issues related to deprecating the periodic NGINX reload approach in favour of dynamic DNS resolution.

      Originally, Admin Router (AR) reloads were introduced due to limitations of the Open/Core version of NGINX that we are using. Some of the upstream definitions need to be defined using well-known DC/OS DNS aliases, for example:

      • master.mesos
      • leader.mesos
      • marathon.mesos
      • etc...

      The addresses behind these aliases can change over time, for example due to a Mesos leader re-election. Unfortunately, the Open/Core version of NGINX resolves these entries only during NGINX startup or reload. Hence a new systemd service was introduced that periodically performs a graceful reload of the dcos-adminrouter service:

      $ cat ./dcos-adminrouter-reload.service
      [Unit]
      Description=Admin Router Reloader: reloads Admin Router to pick up domain resolution changes
      
      [Service]
      Type=oneshot
      EnvironmentFile=/opt/mesosphere/environment
      ExecStart=-$PKG_PATH/nginx/sbin/adminrouter.sh -c $PKG_PATH/nginx/conf/nginx.master.conf -s reload
      
      # vespian @ budrys in ~/work/git_repos/dcos/packages/adminrouter/extra/systemd on git:master o [22:23:32]
      $ cat dcos-adminrouter-reload.timer
      [Unit]
      Description=Admin Router Reloader Timer: periodically reloads Admin Router to pick up domain resolution changes
      [Timer]
      OnBootSec=5sec
      OnUnitActiveSec=30s
      

      This approach has two major disadvantages:

      • it creates many confusing log messages; some customers become anxious when they notice the service being reloaded every 30s
      • a graceful reload can affect service handling, as depicted in DCOS-15783. Even though it looks like an AR bug, pinpointing it may require significant engineering effort and some meddling with NGINX core

      So, based on e.g. https://forum.nginx.org/read.php?2,215830,215832#msg-215832 and https://www.jethrocarr.com/2013/11/02/nginx-reverse-proxies-and-dns-resolution/, we can try to emulate NGINX Plus behaviour. It is also vital to test all the changes thoroughly, so some extra effort will be needed to write decent unit tests that make sure that:

      • Lua code re-resolves the changes, e.g. in cache update subroutines
      • all affected endpoints re-resolve DNS-based upstreams in a timely manner.
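
      For illustration, the workaround described in the links above can be sketched as an NGINX configuration fragment: when `proxy_pass` is given a variable instead of a literal hostname, NGINX resolves the name at request time through the configured `resolver` rather than only at startup/reload. This is a minimal sketch, not the actual Admin Router configuration; the resolver address, the `valid=5s` cache time, the `/mesos/` location and the `$mesos_upstream` variable name are all assumptions made for the example.

      ```nginx
      # Hypothetical fragment of an http {} block; all values are assumptions.
      http {
          # DNS server queried at request time (address is a placeholder).
          # valid=5s overrides the record TTL, so answers are cached for 5 seconds.
          resolver 127.0.0.1 valid=5s;

          server {
              listen 80;

              location /mesos/ {
                  # Storing the upstream in a variable forces NGINX to resolve
                  # leader.mesos while processing the request, not only at
                  # startup or reload.
                  set $mesos_upstream http://leader.mesos:5050;

                  # No URI part in the variable, so the original request URI
                  # is passed to the upstream unchanged.
                  proxy_pass $mesos_upstream;
              }
          }
      }
      ```

      With a setup along these lines, a change of the address behind leader.mesos should be picked up once the cached answer expires, without reloading NGINX. Note that a `proxy_pass` with a variable bypasses `upstream {}` blocks unless the variable's value names one, so features configured there would need separate consideration.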

            People

            • Assignee:
              prozlach Pawel Rozlach
              Reporter:
              prozlach Pawel Rozlach
              Team:
              DELETE Security Team
              Watchers:
              Jan-Philip Gehrcke (Inactive), Pawel Rozlach