DC/OS / DCOS_OSS-340

test_systemd_units_health for dcos-spartan-watchdog.service failing regularly on Azure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Cannot Reproduce
    • Affects Version/s: DC/OS 1.8.0
    • Fix Version/s: DC/OS 1.9.0
    • Component/s: networking
    • Labels:

      Description

      The dcos-spartan-watchdog.service unit fails regularly in our integration tests on Azure. The test appears to be sensitive to system startup time, so we may need to adjust the watchdog service so that it starts only once the spartan service is fully initialized.
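
      One way to picture "only start once spartan is fully initialized" is a readiness gate that polls until DNS answers or a deadline passes. The following is only a minimal sketch under stated assumptions, not the actual watchdog implementation: `wait_until_ready` and `resolves` are hypothetical helpers, and `resolves` queries the system resolver rather than a specific spartan address.

```python
import socket
import time
from typing import Callable


def resolves(name: str) -> bool:
    """Return True if the system resolver can resolve `name` (hypothetical probe)."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 60.0,
                     interval: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Treat resolver/socket errors as "not ready yet" and keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

      A caller could gate the watchdog check with something like `wait_until_ready(lambda: resolves(name))`, where `name` is whatever record the deployment expects spartan to serve (an assumption here). The timeout would have to fit within the unit's start timeout, or systemd may still kill the process during startup.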

      Failure snippet:

              if unhealthy_output:
      >           raise AssertionError('\n'.join(unhealthy_output))
      E           AssertionError: Unhealthy unit dcos-spartan-watchdog.service has been found on node 10.32.0.6, health status 1. journalctl output dcos-spartan-watchdog.service state is not one of the possible states [active inactive activating]. Current state is [ failed ]. Please check `systemctl show all dcos-spartan-watchdog.service` to check current unit state. 
      E           -- Logs begin at Fri 2016-08-26 17:47:58 UTC, end at Fri 2016-08-26 17:54:10 UTC. --
      E           Aug 26 17:50:39 dcos-agent-private-01234567000002 systemd[1]: Starting DNS Dispatcher Watchdog: Make sure spartan is running...
      E           Aug 26 17:51:44 dcos-agent-private-01234567000002 toybox[17107]: host: Host not found.
      E           Aug 26 17:51:44 dcos-agent-private-01234567000002 toybox[17107]: Using domain server 198.51.100.1:
      E           Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Main process exited, code=killed, status=9/KILL
      E           Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: Failed to start DNS Dispatcher Watchdog: Make sure spartan is running.
      E           Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Unit entered failed state.
      E           Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Failed with result 'signal'.
      

      Example of the integration test failure:
      https://teamcity.mesosphere.io/viewLog.html?buildId=382623&buildTypeId=ClosedSource_Dcos_IntegrationTests_CloudIntegrationTests_DcosOssAzureIntegration&tab=buildLog

      This was discovered as part of our new Azure integration tests here: https://github.com/dcos/dcos/pull/591

            People

            • Assignee: Albert Strasheim (Inactive)
            • Reporter: Jeremy Lingmann (Inactive)
            • Watchers: Adam Bordelon (Inactive), Albert Strasheim (Inactive), Cody Maloney (Inactive), Jeremy Lingmann (Inactive), Sargun Dhillon (Inactive)
