DCOS_OSS-1523

DC/OS Vagrant: Improve how memory resources are defined for agents


    • Type: Task
    • Status: Resolved
    • Priority: High
    • Resolution: Done
    • Affects Version/s: DC/OS 1.10.0
    • Fix Version/s: DC/OS 1.10.0
    • Component/s: dcos-vagrant
    • Labels:


      DCOS_OSS-1467 revealed a race in how dcos-vagrant defines resources for agent nodes, which results in flaky tests. Quoting the issue:

      * `Tests pass` scenario:
        * port definitions are initially defined in /opt/mesosphere/etc/mesos-slave-public
        * the make_disk_resources.py script honours the MESOS_RESOURCES settings from previous files and does not override them (https://github.com/dcos/dcos/blob/master/packages/mesos/extra/make_disk_resources.py#L151), so the port definitions are left intact
        * install-mesos-memory.sh, when run AFTER make_disk_resources.py, also honours MESOS_RESOURCES settings and does not override them (https://github.com/dcos/dcos-vagrant/blob/master/provision/bin/install-mesos-memory.sh#L33) - memory resources definition is stored in /var/lib/dcos/mesos-resources
        * port definitions are preserved!
      * `Tests fail` scenario:
        * port definitions are initially defined in /opt/mesosphere/etc/mesos-slave-public
        * install-mesos-memory.sh, when run BEFORE make_disk_resources.py, overrides the settings from /opt/mesosphere/etc/mesos-slave-public and creates the /var/lib/dcos/mesos-slave-common file
        * /var/lib/dcos/mesos-slave-common has the highest priority when the make_disk_resources.py script runs, and it does not contain the port definitions from /opt/mesosphere/etc/mesos-slave-public
        * make_disk_resources.py just appends disk resources to the data from /var/lib/dcos/mesos-slave-common file and creates /var/lib/dcos/mesos-resources file which now has the highest priority
        * the resulting file does not contain proper port definitions
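
      The two orderings above can be replayed with a small in-memory model. This is only a sketch under stated assumptions, not the real scripts: files are modelled as dictionary entries, MESOS_RESOURCES is assumed to be a JSON list, the resource values are made up, and the helpers `effective` and `run` are hypothetical names. The two step functions only mimic the behaviour described in the quoted scenarios.

```python
import json

PUBLIC = "/opt/mesosphere/etc/mesos-slave-public"
COMMON = "/var/lib/dcos/mesos-slave-common"
RESOURCES = "/var/lib/dcos/mesos-resources"

# Illustrative values only; the real definitions live in the files above.
PORTS = {"name": "ports", "type": "RANGES",
         "ranges": {"range": [{"begin": 1, "end": 21}]}}
MEM = {"name": "mem", "type": "SCALAR", "scalar": {"value": 1024}}
DISK = {"name": "disk", "type": "SCALAR", "scalar": {"value": 4096}}


def effective(files):
    """Return the resource list from the highest-priority file present."""
    for path in (RESOURCES, COMMON, PUBLIC):
        if path in files:
            return json.loads(files[path])
    return []


def make_disk_resources(files):
    # Modelled behaviour: keep whatever is currently effective, append disk.
    files[RESOURCES] = json.dumps(effective(files) + [DISK])


def install_mesos_memory(files):
    if RESOURCES in files:
        # Ran late: add memory to the already generated resources file.
        files[RESOURCES] = json.dumps(json.loads(files[RESOURCES]) + [MEM])
    else:
        # Ran early: write a memory-only file that shadows the ports file.
        files[COMMON] = json.dumps([MEM])


def run(steps):
    """Apply the steps in order and return the effective resource names."""
    files = {PUBLIC: json.dumps([PORTS])}
    for step in steps:
        step(files)
    return sorted(r["name"] for r in effective(files))


passing = run([make_disk_resources, install_mesos_memory])
failing = run([install_mesos_memory, make_disk_resources])
print(passing)  # ['disk', 'mem', 'ports']
print(failing)  # ['disk', 'mem']
```

      In this model the only difference between the two runs is the order of the steps, yet the second run ends up without a `ports` entry.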

      We need to make sure that, no matter when `install-mesos-memory.sh` runs (before or after make_disk_resources.py), it does not overwrite the port definitions from /opt/mesosphere/etc/mesos-slave-public. So far this was ensured by waiting for `dcos-diagnostics --check` to finish; unfortunately, this can no longer be relied on. While debugging DCOS_OSS-1467 it was revealed that `dcos-diagnostics` can give false positives (return with non-zero exit code), even though the installation process has not finished. This seems like a bug and will be dealt with in a separate issue.
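
      One possible way to meet this requirement is to merge the memory entry into whatever MESOS_RESOURCES value is currently in effect, instead of writing a memory-only value. The sketch below is an illustration, not the fix that was actually implemented: it assumes MESOS_RESOURCES is a JSON list, and `with_memory` is a hypothetical helper name.

```python
import json


def with_memory(resources_json, mem_mb):
    """Return MESOS_RESOURCES JSON with `mem` set and every other entry kept."""
    resources = json.loads(resources_json) if resources_json.strip() else []
    # Drop any previous memory entry so that repeated runs do not duplicate it.
    kept = [r for r in resources if r.get("name") != "mem"]
    kept.append({"name": "mem", "type": "SCALAR", "scalar": {"value": mem_mb}})
    return json.dumps(kept)


# Illustrative input: a ports-only definition like the one in mesos-slave-public.
ports_only = json.dumps([{"name": "ports", "type": "RANGES",
                          "ranges": {"range": [{"begin": 1, "end": 21}]}}])

once = with_memory(ports_only, 1024)
twice = with_memory(once, 1024)
print([r["name"] for r in json.loads(once)])  # ['ports', 'mem']
print(once == twice)  # True
```

      Because the helper keeps every non-memory entry and replaces any existing memory entry, applying it before or after another script has extended the value leaves the port definitions in place.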

      Please let me know if I managed to provide sufficient information to solve this issue. If not, do not hesitate to drop me a line or check DCOS_OSS-1467 for more details.






              • Assignee:
                karl Karl Isenberg (Inactive)
                prozlach Pawel Rozlach
                Jan-Philip Gehrcke (Inactive), Karl Isenberg (Inactive), Pawel Rozlach
              • Watchers:
                3

