[DCOS_OSS-340] test_systemd_units_health for dcos-spartan-watchdog.service failing regularly on Azure
Created: 26/Aug/16  Updated: 09/Nov/18  Resolved: 17/Mar/17

Status: Resolved
Project: DC/OS
Component/s: networking
Affects Version/s: DC/OS 1.8.0
Fix Version/s: DC/OS 1.9.0

Type: Bug
Priority: Medium
Reporter: Jeremy Lingmann (Inactive)
Assignee: Albert Strasheim (Inactive)
Resolution: Cannot Reproduce
Labels: azure
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The dcos-spartan-watchdog.service unit fails regularly in our integration tests on Azure. The test appears to be sensitive to system startup time, so we may need to adjust the watchdog service so that it starts only once the spartan service is fully initialized.
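One way to make a startup check tolerant of slow boots is to poll a readiness probe until a deadline instead of failing on the first miss. The sketch below is only an illustration of that idea, not the watchdog's actual implementation: `wait_until_ready`, `dns_probe`, the timeout values, and the hostname passed to the probe are all hypothetical.

```python
import socket
import time
from typing import Callable


def dns_probe(hostname: str) -> Callable[[], bool]:
    """Build a probe that reports whether `hostname` currently resolves.

    The hostname is an assumption; this ticket does not say which name
    the real watchdog looks up.
    """
    def probe() -> bool:
        try:
            socket.gethostbyname(hostname)
            return True
        except OSError:
            # Resolution failed (socket.gaierror is a subclass of OSError).
            return False
    return probe


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` repeatedly until it succeeds or `timeout` seconds pass.

    Returns True as soon as the probe succeeds, False once the deadline
    has passed without a success. The probe always runs at least once.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would use something like `wait_until_ready(dns_probe("some.name.example"), timeout=120.0)` and treat only a `False` result as a genuine failure. Whether the real fix belongs in the watchdog, in unit ordering, or in the test itself is left open by this ticket.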

Failure snippet:

        if unhealthy_output:
>           raise AssertionError('\n'.join(unhealthy_output))
E           AssertionError: Unhealthy unit dcos-spartan-watchdog.service has been found on node 10.32.0.6, health status 1. journalctl output dcos-spartan-watchdog.service state is not one of the possible states [active inactive activating]. Current state is [ failed ]. Please check `systemctl show all dcos-spartan-watchdog.service` to check current unit state. 
E           -- Logs begin at Fri 2016-08-26 17:47:58 UTC, end at Fri 2016-08-26 17:54:10 UTC. --
E           Aug 26 17:50:39 dcos-agent-private-01234567000002 systemd[1]: Starting DNS Dispatcher Watchdog: Make sure spartan is running...
E           Aug 26 17:51:44 dcos-agent-private-01234567000002 toybox[17107]: host: Host not found.
E           Aug 26 17:51:44 dcos-agent-private-01234567000002 toybox[17107]: Using domain server 198.51.100.1:
E           Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Main process exited, code=killed, status=9/KILL
E           Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: Failed to start DNS Dispatcher Watchdog: Make sure spartan is running.
E           Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Unit entered failed state.
E           Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Failed with result 'signal'.

Example of the integration test failure:
https://teamcity.mesosphere.io/viewLog.html?buildId=382623&buildTypeId=ClosedSource_Dcos_IntegrationTests_CloudIntegrationTests_DcosOssAzureIntegration&tab=buildLog

This was discovered as part of our new Azure integration tests here: https://github.com/dcos/dcos/pull/591



 Comments   
Comment by Jeremy Lingmann (Inactive) [ 26/Aug/16 ]

We've muted this particular failure in our CI job until 9/2/2016.

Comment by Jeremy Lingmann (Inactive) [ 30/Aug/16 ]

Any updates, Sargun Dhillon?

Comment by Cody Maloney (Inactive) [ 05/Sep/16 ]

Jeremy Lingmann, what are we using to signal to Azure that a host is up? If we could make that wait until 3dt reports all units on the host are healthy, that would solve this.
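A rough sketch of that gating idea follows, under stated assumptions: the health report is modeled as a list of dicts with `id` and `health` keys, where `0` means healthy (the failure above shows `health status 1` for an unhealthy unit). `fetch_units` is a hypothetical callable standing in for however the 3dt report would actually be retrieved; its real API is not described in this ticket.

```python
import time
from typing import Callable, Dict, List

Unit = Dict[str, object]


def unhealthy_units(units: List[Unit]) -> List[str]:
    """Return the ids of units whose health value is not 0 (assumed healthy)."""
    return [str(u["id"]) for u in units if u["health"] != 0]


def wait_for_healthy_node(
    fetch_units: Callable[[], List[Unit]],
    timeout: float = 300.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll the health report until every unit is healthy or time runs out.

    Returns an empty list once all units are healthy; otherwise returns
    the ids that were still unhealthy when the deadline passed.
    """
    deadline = clock() + timeout
    while True:
        bad = unhealthy_units(fetch_units())
        if not bad:
            return []
        if clock() >= deadline:
            return bad
        sleep(interval)
```

The host would be signaled as "up" only when the returned list is empty; a non-empty list names the units to investigate.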

Comment by Sargun Dhillon (Inactive) [ 12/Sep/16 ]

See discussion here: https://github.com/dcos/dcos/pull/657

Comment by Albert Strasheim (Inactive) [ 27/Feb/17 ]

Azure tests seem to be having a bad time for unrelated reasons right now.

https://teamcity.mesosphere.io/viewType.html?buildTypeId=ClosedSource_Dcos_IntegrationTests_CloudIntegrationTests_DcosOssAzureIntegration&tab=buildTypeHistoryList&branch_ClosedSource_Dcos_IntegrationTests_CloudIntegrationTests=__all_branches__

Comment by Adam Bordelon (Inactive) [ 08/Mar/17 ]

Albert Strasheim, how are these tests looking now? When can we close this out?

Comment by Adam Bordelon (Inactive) [ 15/Mar/17 ]

What's the latest on the Azure CI?

Comment by Albert Strasheim (Inactive) [ 17/Mar/17 ]

Not seeing this failure on Azure anymore. Only issue there seems to be test_vips.

Generated at Wed May 18 09:10:50 CDT 2022 using JIRA 7.8.4#78004-sha1:5704c55c9196a87d91490cbb295eb482fa3e65cf.