[DCOS_OSS-340] test_systemd_units_health for dcos-spartan-watchdog.service failing regularly on Azure Created: 26/Aug/16 Updated: 09/Nov/18 Resolved: 17/Mar/17 |
|
Status: | Resolved |
Project: | DC/OS |
Component/s: | networking |
Affects Version/s: | DC/OS 1.8.0 |
Fix Version/s: | DC/OS 1.9.0 |
Type: | Bug | Priority: | Medium |
Reporter: | Jeremy Lingmann (Inactive) | Assignee: | Albert Strasheim (Inactive) |
Resolution: | Cannot Reproduce | ||
Labels: | azure | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Description |
The dcos-spartan-watchdog.service is failing regularly in our integration tests on Azure. The test appears to be sensitive to system startup time, and we may need to adjust the watchdog service so that it starts only once the spartan service is fully initialized.

Failure snippet:

if unhealthy_output:
> raise AssertionError('\n'.join(unhealthy_output))
E AssertionError: Unhealthy unit dcos-spartan-watchdog.service has been found on node 10.32.0.6, health status 1. journalctl output dcos-spartan-watchdog.service state is not one of the possible states [active inactive activating]. Current state is [ failed ]. Please check `systemctl show all dcos-spartan-watchdog.service` to check current unit state.
E -- Logs begin at Fri 2016-08-26 17:47:58 UTC, end at Fri 2016-08-26 17:54:10 UTC. --
E Aug 26 17:50:39 dcos-agent-private-01234567000002 systemd[1]: Starting DNS Dispatcher Watchdog: Make sure spartan is running...
E Aug 26 17:51:44 dcos-agent-private-01234567000002 toybox[17107]: host: Host not found.
E Aug 26 17:51:44 dcos-agent-private-01234567000002 toybox[17107]: Using domain server 198.51.100.1:
E Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Main process exited, code=killed, status=9/KILL
E Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: Failed to start DNS Dispatcher Watchdog: Make sure spartan is running.
E Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Unit entered failed state.
E Aug 26 17:51:44 dcos-agent-private-01234567000002 systemd[1]: dcos-spartan-watchdog.service: Failed with result 'signal'.

Example of the integration test failure:

This was discovered as part of our new Azure integration tests here: https://github.com/dcos/dcos/pull/591 |
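If the failure really is a startup race, one possible mitigation is to poll the readiness check against a deadline instead of failing on the first unsuccessful lookup. The following is only a minimal sketch of that idea, not the watchdog's actual implementation: `wait_until`, its parameters, and the probe passed to it are all hypothetical names introduced for illustration.

```python
import time


def wait_until(probe, timeout=120.0, interval=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns a truthy value or
    `timeout` seconds have elapsed. Returns True on success, False on timeout.

    Hypothetical helper: not part of DC/OS or its test suite.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Lookup and connection errors (socket.gaierror is a subclass
            # of OSError) are treated as "not ready yet" and retried.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller could pass a probe such as `lambda: socket.gethostbyname(name)` for whatever hostname the readiness check resolves; the hostname is deliberately left as a parameter here because the log above does not show which name was queried.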
Comments |
Comment by Jeremy Lingmann (Inactive) [ 26/Aug/16 ] |
We've muted this particular failure in our CI job until 9/2/2016. |
Comment by Jeremy Lingmann (Inactive) [ 30/Aug/16 ] |
Any updates Sargun Dhillon? |
Comment by Cody Maloney (Inactive) [ 05/Sep/16 ] |
Jeremy Lingmann, what are we using to signal to Azure that a host is up? If we could make that signal wait until 3dt reports all units on the host as healthy, that would solve this. |
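The gating suggested above could be sketched roughly as follows. This assumes the unit states have already been collected into a plain dictionary; how they are obtained from 3dt is not shown, and the function names are hypothetical. The set of accepted states is taken from the failure message in the description ("active inactive activating").

```python
# States the health check in the description accepts as healthy.
HEALTHY_STATES = {"active", "inactive", "activating"}


def unhealthy_units(unit_states):
    """Return the sorted names of units whose state is not accepted.

    `unit_states` maps unit name -> state string (hypothetical input shape).
    """
    return sorted(name for name, state in unit_states.items()
                  if state not in HEALTHY_STATES)


def host_is_ready(unit_states):
    """True only when every unit on the host is in an accepted state."""
    return not unhealthy_units(unit_states)
```

With this shape, a host whose dcos-spartan-watchdog.service is in the `failed` state would not be reported as up until the unit recovers.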
Comment by Sargun Dhillon (Inactive) [ 12/Sep/16 ] |
See discussion here: https://github.com/dcos/dcos/pull/657 |
Comment by Albert Strasheim (Inactive) [ 27/Feb/17 ] |
Azure tests seem to be having a bad time for unrelated reasons right now. |
Comment by Adam Bordelon (Inactive) [ 08/Mar/17 ] |
Albert Strasheim How are these tests looking now? When can we close this out? |
Comment by Adam Bordelon (Inactive) [ 15/Mar/17 ] |
What's the latest on the Azure CI? |
Comment by Albert Strasheim (Inactive) [ 17/Mar/17 ] |
Not seeing this failure on Azure anymore. Only issue there seems to be test_vips. |