[DCOS_OSS-1467] integration-test/vagrant-bash: 5 tests are reproducibly failing in train 195 Created: 26/Jul/17  Updated: 09/Nov/18  Resolved: 09/Aug/17

Status: Resolved
Project: DC/OS
Component/s: adminrouter, testing
Affects Version/s: None
Fix Version/s: DC/OS 1.10.0

Type: Bug Priority: High
Reporter: Jan-Philip Gehrcke (Inactive) Assignee: Pawel Rozlach
Resolution: Won't Do  
Labels: security, testing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File full_log_public_slave.log     File mesos_agent_state.json     PNG File no_port_offers.png     File patch.run_all    
Issue Links:
Problem/Incident
causes DCOS_OSS-1523 DC/OS Vagrant: Improve how memory res... Resolved
Relates
relates to DCOS_OSS-1524 `dcos-diagnostics --diag` returns fal... Resolved
relates to DCOS_OSS-1167 dcos-docker test failure: test_appli... Resolved
Epic Link: Admin Router: Improvement of the DNS resolution in AR upstream configuration
Team: DELETE Security Team
Sprint: Security Team Sprint 9, Security Team Sprint 10
Story Points: 6

 Description   

5 tests are reproducibly failing in train 195's integration-test/vagrant-bash CI run:

https://github.com/dcos/dcos/pull/1764#issuecomment-318034394

https://teamcity.mesosphere.io/viewLog.html?buildId=717001
https://teamcity.mesosphere.io/viewLog.html?buildId=716919
https://teamcity.mesosphere.io/viewLog.html?buildId=717100

test_applications.test_octarine
test_networking.test_ip_per_container
test_service_discovery.test_service_discovery_docker_bridge
test_service_discovery.test_service_discovery_docker_bridged_network
test_service_discovery.test_service_discovery_docker_overlay_port_mapping

This needs to be debugged. Have these tests previously been muted, and did they now unmute themselves? Or are they known to be flaky, and is a certain race condition now hitting reproducibly because of infrastructure slowness?



 Comments   
Comment by Jan-Philip Gehrcke (Inactive) [ 26/Jul/17 ]

Also see https://mesosphere.slack.com/archives/C0L6U3ECQ/p1500673559855729 (Pawel Rozlach's input)

Comment by Jeremy Lingmann (Inactive) [ 26/Jul/17 ]

I believe the only thing potentially new / problematic here is the `test_applications.test_octarine` failure. Per TC stats this test succeeds 98% of the time on master. So if it is indeed failing all the time, we have either a new failure mode with this train or an issue with the test that needs to be investigated. The rest of those look like our normal flaky network tests... Michael Ellenburg would you agree?

Comment by Michael Ellenburg (Inactive) [ 26/Jul/17 ]

Jeremy Lingmann I don't think this is a new failure mode. test_octarine is especially flaky on vagrant, so this was mentally cataloged under "all vagrant tests are flaky"

This might be worth simply skipping on vagrant runs...

Comment by Michael Ellenburg (Inactive) [ 26/Jul/17 ]

Oops, I see that this is filed against the PR, and not master. The PR is looking rough... trying again one last time in case yesterday was a dockerhub freak occurrence

Comment by Pawel Rozlach [ 27/Jul/17 ]

Just my 2 cents.
I wonder if we could re-run all the tests on master every 3 h and gather test-reliability data in some DB. Then we could have a tool that extracts this data and draws a graph of the moving average of failed tests over the last 8 (1 day), 24 (3 days), and 56 (7 days) runs. Based on that we could inspect all the tests that are above x% and quickly pinpoint when a given test began to be flaky. I do not know if this is already possible with TC; the graphs I have seen so far aren't very useful as they cover the last 1k-2k executions.

This way we pull the time horizon into the picture and can better determine if/when a given test became flaky and how significant "24 failures" really are.

Comment by Martin Hrabovcin [ 28/Jul/17 ]

Maybe it would be possible to build a TeamCity trigger that runs such a test every 3 hours, and then use the REST API (https://confluence.jetbrains.com/display/TCD10/REST+API) to pull the moving-average stats we want.
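
A rough sketch of what that could look like (Python; the build configuration id, the credentials and the exact locator syntax are placeholders that would need to be checked against the REST API docs linked above):

    # Pull recent test occurrences from the TeamCity REST API and compute
    # moving-average failure rates over the last 8/24/56 runs.
    # BUILD_TYPE and AUTH below are placeholders, not real values.
    import requests

    TC = "https://teamcity.mesosphere.io"
    BUILD_TYPE = "Placeholder_IntegrationTestVagrantBash"  # placeholder build configuration id
    AUTH = ("user", "password")                            # placeholder credentials

    def recent_statuses(test_name, count=56):
        """Return pass/fail booleans for the last `count` occurrences of a test."""
        locator = "test:(name:{}),buildType:(id:{}),count:{}".format(test_name, BUILD_TYPE, count)
        resp = requests.get(TC + "/app/rest/testOccurrences",
                            params={"locator": locator},
                            headers={"Accept": "application/json"},
                            auth=AUTH)
        resp.raise_for_status()
        return [occ["status"] == "SUCCESS" for occ in resp.json().get("testOccurrence", [])]

    def failure_rates(test_name):
        """Moving-average failure rates over the last 8 / 24 / 56 runs."""
        statuses = recent_statuses(test_name)
        rates = {}
        for label, window in (("1d/8 runs", 8), ("3d/24 runs", 24), ("7d/56 runs", 56)):
            recent = statuses[:window]
            rates[label] = (1.0 - sum(recent) / len(recent)) if recent else None
        return rates

    if __name__ == "__main__":
        for test in ("test_applications.test_octarine",
                     "test_networking.test_ip_per_container"):
            print(test, failure_rates(test))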

Comment by Pawel Rozlach [ 31/Jul/17 ]

Quick dump of my findings so far:

  • Tests are reproducibly failing after setting up dcos-vagrant using the dcos/dcos:test_util/run-all script, with the parts responsible for cleanup on a failed test patched as in the attached patch.run_all
  • a quick look at the DC/OS GUI during a test_octarine run shows that there is a problem with offers
  • a quick look at the tests reveals that all of them specify a port for their test application:
    • test_octarine requires port 80 on the agent
    • test_service_discovery_docker_bridge requires port 9080
    • test_service_discovery_docker_overlay requires port 9080
    • test_service_discovery_docker_overlay_port_mapping requires port 9080
    • test_ip_per_container requires port 9080
  • A quick look at the systemd unit of the public Mesos agent:
    https://github.com/dcos/dcos/blob/master/packages/mesos/extra/dcos-mesos-slave-public.service
    (...)
    EnvironmentFile=/opt/mesosphere/environment
    EnvironmentFile=/opt/mesosphere/etc/mesos-slave-common
    EnvironmentFile=/opt/mesosphere/etc/mesos-slave-public
    EnvironmentFile=/opt/mesosphere/etc/proxy.env
    EnvironmentFile=-/opt/mesosphere/etc/mesos-slave-common-extras
    EnvironmentFile=-/var/lib/dcos/mesos-slave-common
    EnvironmentFile=-/var/lib/dcos/mesos-resources
    EnvironmentFile=-/run/dcos/etc/mesos-slave-public
    (...)
    ExecStartPre=/opt/mesosphere/bin/make_disk_resources.py /var/lib/dcos/mesos-resources
    (...)
    ExecStart=$PKG_PATH/bin/mesos-agent
    
  • the MESOS_RESOURCES variable is set in the following files:
     [root@p1 ~]# grep --color MESOS_RESOURCES /opt/mesosphere/environment /opt/mesosphere/etc/mesos-slave-common /opt/mesosphere/etc/mesos-slave-public /opt/mesosphere/etc/proxy.env /opt/mesosphere/etc/mesos-slave-common-extras /var/lib/dcos/mesos-slave-common /var/lib/dcos/mesos-resources /run/dcos/etc/mesos-slave-public
     /opt/mesosphere/etc/mesos-slave-public:MESOS_RESOURCES=[{"name":"ports","type":"RANGES","ranges": {"range": [ {"begin": 1, "end": 21}, {"begin": 23, "end": 5050}, {"begin": 5052, "end": 32000}]}}]
     grep: /opt/mesosphere/etc/mesos-slave-common-extras: No such file or directory
     /var/lib/dcos/mesos-slave-common:MESOS_RESOURCES='[{"role":"slave_public","type":"SCALAR","name":"mem","scalar":{"value":1024}}]'
     /var/lib/dcos/mesos-resources:MESOS_RESOURCES='[{"scalar": ...}, {"role": ...}]'
     grep: /run/dcos/etc/mesos-slave-public: No such file or directory
     [root@p1 ~]#
     

    So, based on https://www.freedesktop.org/software/systemd/man/systemd.exec.html#EnvironmentFile= the order of the EnvironmentFile entries is taken into consideration (later entries override earlier ones):
    1) MESOS_RESOURCES is first defined in /opt/mesosphere/etc/mesos-slave-public
    2) then it is overridden in /var/lib/dcos/mesos-slave-common with a value that only contains memory resources
    3) then it is overridden by /var/lib/dcos/mesos-resources with the same memory resources AND disk resources

The first file is generated by gen (more about the second file a bit later on), and the third one is written by https://github.com/dcos/dcos/blob/master/packages/mesos/extra/make_disk_resources.py
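
To make the precedence easy to check on a box, here is a small helper sketch (not something that ships with DC/OS; it only approximates systemd's "later EnvironmentFile wins" parsing, using the file order from the unit above):

    # Emulate systemd's EnvironmentFile precedence: files listed later in the
    # unit override values from earlier ones; a leading "-" means "ignore if
    # the file is missing". Approximation only - enough to see which file
    # "wins" for MESOS_RESOURCES on an agent.
    import os

    ENV_FILES = [  # order copied from dcos-mesos-slave-public.service
        "/opt/mesosphere/environment",
        "/opt/mesosphere/etc/mesos-slave-common",
        "/opt/mesosphere/etc/mesos-slave-public",
        "/opt/mesosphere/etc/proxy.env",
        "-/opt/mesosphere/etc/mesos-slave-common-extras",
        "-/var/lib/dcos/mesos-slave-common",
        "-/var/lib/dcos/mesos-resources",
        "-/run/dcos/etc/mesos-slave-public",
    ]

    def winning_definition(var="MESOS_RESOURCES"):
        winner, value = None, None
        for entry in ENV_FILES:
            optional = entry.startswith("-")
            path = entry.lstrip("-")
            if not os.path.exists(path):
                if optional:
                    continue
                raise FileNotFoundError(path)
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line.startswith(var + "="):
                        winner, value = path, line.split("=", 1)[1]
        return winner, value

    if __name__ == "__main__":
        path, value = winning_definition()
        print("MESOS_RESOURCES is finally taken from {}:".format(path))
        print(value)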

  • I did a test:
    • removed the file /var/lib/dcos/mesos-slave-common
    • also removed /var/lib/dcos/mesos-resources so that it gets re-generated (https://github.com/dcos/dcos/blob/master/packages/mesos/extra/make_disk_resources.py#L127-L129)
    • /var/lib/dcos/mesos-resources (the highest-priority file) now contains valid port definitions
    • ran the following on all agents:
      sudo rm -fv /var/lib/dcos/mesos-slave-common  /var/lib/dcos/mesos-resources
      sudo systemctl stop  dcos-mesos-slave.service
      sudo rm -f /var/lib/mesos/slave/meta/slaves/latest
      sudo systemctl start  dcos-mesos-slave.service
      

 

I still have no idea why the tests were flaky and not always failing, and why the AR patch makes them hard-fail :| I will try to take a look at it tomorrow.

Comment by Jeremy Lingmann (Inactive) [ 31/Jul/17 ]

Pawel Rozlach just spitballing here, but is this something you could try and repro with the new dcos-docker environment & scripts that Jan-Philip Gehrcke mentioned? I mention this because there is a possibility these tests are not running on the current dcos-docker test environment (but we should still confirm). If so, maybe that would help ease the debugging and dev/test cycle time?

Comment by Pawel Rozlach [ 01/Aug/17 ]

Jeremy Lingmann
> is this something you could try and repro with the new dcos-docker environment & scripts that Jan-Philip Gehrcke mentioned?
You mean e2e tests?

> I mention this because there is a possibility these tests are not running on the current dcos-docker test environment
Which tests exactly?

Comment by Pawel Rozlach [ 02/Aug/17 ]

Another quick dump of my findings:

  • let's diff the logs of the AR PR vagrant tests; `-` lines are from a successful run, `+` lines are from a failed one (test_octarine and friends):
      :      [Step 2/3] ==> a1: [sudo]$ dcos-postflight^M
      :      [Step 2/3] ==> a1: Setting Mesos Memory: 5632 (role=*)^M
      :      [Step 2/3] ==> a1: [sudo]$ mesos-memory 5632^M
    - :      [Step 2/3]     a1:       Updating /var/lib/dcos/mesos-resources^M
    + :      [Step 2/3]     a1:       Updating /var/lib/dcos/mesos-slave-common^M
    

    The full log for a public agent install from a failed test can be found in the attached full_log_public_slave.log file.

  • in the dcos-mesos-slave-public systemd unit the /run/dcos/etc/mesos-slave-public ENV file has priority over /var/lib/dcos/mesos-resources:
    https://github.com/dcos/dcos/blob/master/packages/mesos/extra/dcos-mesos-slave-public.service
    (...)
    EnvironmentFile=/opt/mesosphere/etc/mesos-slave-public
    EnvironmentFile=-/var/lib/dcos/mesos-slave-common
    EnvironmentFile=-/var/lib/dcos/mesos-resources
    EnvironmentFile=-/run/dcos/etc/mesos-slave-public
    (...)
    ExecStartPre=/opt/mesosphere/bin/make_disk_resources.py /var/lib/dcos/mesos-resources
    (...)
    ExecStart=$PKG_PATH/bin/mesos-agent
    
  • /var/lib/dcos/mesos-resources is created by:
    packages/mesos/extra/dcos-mesos-slave.service:ExecStartPre=/opt/mesosphere/bin/make_disk_resources.py /var/lib/dcos/mesos-resources
  • so we may have a race (see the sketch below):
    • `Tests pass` scenario (presumably): make_disk_resources.py has already generated /var/lib/dcos/mesos-resources (which includes the port ranges) before the memory override runs, so the override only updates that file - compare the "Updating /var/lib/dcos/mesos-resources" line from the successful run in the diff above
    • `Tests fail` scenario:
      • port definitions are initially defined in /opt/mesosphere/etc/mesos-slave-public
      • install-mesos-memory.sh, when run BEFORE make_disk_resources.py, overrides the settings from /opt/mesosphere/etc/mesos-slave-public and creates the /var/lib/dcos/mesos-slave-common file
      • /var/lib/dcos/mesos-slave-common has the highest priority when the make_disk_resources.py script runs, and it does not contain the port definitions from /opt/mesosphere/etc/mesos-slave-public
      • make_disk_resources.py just appends disk resources to the data from the /var/lib/dcos/mesos-slave-common file and creates the /var/lib/dcos/mesos-resources file, which now has the highest priority
      • the resulting file does not contain proper port definitions
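
A toy model of the two orderings (function names are illustrative only, not the real dcos-vagrant / DC/OS scripts):

    # Toy model of the suspected race. A resources "file" is modelled as a
    # dict; whatever ends up in the highest-priority file (mesos-resources)
    # is what the agent offers.
    PUBLIC_DEFAULTS = {"ports": "[1-21, 23-5050, 5052-32000]"}  # /opt/mesosphere/etc/mesos-slave-public

    def memory_override():
        # models install-mesos-memory.sh: writes ONLY a mem resource
        return {"mem": 5632}

    def make_disk_resources(base):
        # models make_disk_resources.py: appends disk resources to its input
        out = dict(base)
        out["disk"] = 10240
        return out

    def final_resources(memory_script_runs_first):
        if memory_script_runs_first:
            slave_common = memory_override()              # "ports" is already lost here
            return make_disk_resources(slave_common)      # -> mem + disk, no ports
        mesos_resources = make_disk_resources(PUBLIC_DEFAULTS)  # -> ports + disk
        mesos_resources.update(memory_override())               # memory patched in place
        return mesos_resources

    print(final_resources(True))    # {'mem': 5632, 'disk': 10240}             -> no port offers, tests fail
    print(final_resources(False))   # {'ports': ..., 'disk': ..., 'mem': ...}  -> tests pass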

Especially interesting is how the list of units is obtained - what if dcos-diagnostics produces false positives because it is run during the install process, when not all systemd units are in the directory yet?

Stay tuned for another episode!

Comment by Jan-Philip Gehrcke (Inactive) [ 08/Aug/17 ]

Pawel Rozlach thanks for all the debugging work. Can you please summarize what you have found out so far in like 1-3 sentences so that we can understand if this really still blocks DCOS-5809? Did you maybe in the meantime collect enough data to prove that these test failures are not actually caused by your changes for the patch for DCOS-5809?

Comment by Pawel Rozlach [ 08/Aug/17 ]

Jan-Philip Gehrcke There are two bugs working together here that result in a race condition and the failures we observed. One I have been able to sufficiently identify and understand (the dcos-vagrant issue); for the other - dcos-diagnostics interacting in a racy way with pkgpanda - I need ~1 more day of uninterrupted work to complete the debugging.

Fixing the vagrant issue will unblock the AR PR / enable us to get the tests green. Fully understanding the pkgpanda <-> dcos-diagnostics issue will explain why exactly the tests went red only for this particular PR and, quite possibly, fix a bug in DC/OS / prevent flakiness in other tests.

Comment by Jan-Philip Gehrcke (Inactive) [ 09/Aug/17 ]

Thanks for the response, Pawel. I wanted to understand "if this really still blocks DCOS-5809". You did not explicitly answer that, but my interpretation is "no": it seems like you already have enough data that shows that your AR change is not the cause of the integration test failures observed here. Please, mark DCOS-5809 as unblocked and prioritize it again, accordingly (with the data collected here we can ignore the CI failures).

One I have been able to sufficiently identify and understand (dcos-vagrant issue),

Can you please create a JIRA ticket for this one and leave it unassigned, pinging Cathy Daw?

for the other - dcos-diagnostics interacting in a racy way with pkgpanda - I need ~1 more day of uninterrupted work to complete the debugging.

Can you please create a JIRA ticket for tracking this debugging work and leave it unassigned so far?

Comment by Pawel Rozlach [ 09/Aug/17 ]

Just a last, quick dump of my remaining findings (done yesterday), so that this Jira contains everything I have so far:

The next step is to understand how exactly Pkgpanda does the install and how the AR changes made the PR fail much more often than usual. My guess is that the order in which the packages are decompressed/installed creates a small window where part of the units are green and `dcos-diagnostics` returns `0` even though not all units have finished installing. This still remains to be proven/validated though.
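
To make the suspected window concrete, a naive health check along these lines would report success mid-install if it only looks at the units that already exist at the moment it runs (unit names and the check itself are purely illustrative, not how dcos-diagnostics is actually implemented):

    # Purely illustrative: a check that only inspects the dcos-* units present
    # right now will happily return "all healthy" in the middle of an install,
    # because units that pkgpanda has not laid down yet are simply not checked.
    import glob
    import os

    EXPECTED_UNITS = {                       # example subset of a fully installed node
        "dcos-mesos-slave-public.service",
        "dcos-adminrouter.service",
        "dcos-diagnostics.service",
    }

    def unit_is_active(unit):
        # stand-in for `systemctl is-active <unit>`
        return os.system("systemctl is-active --quiet {}".format(unit)) == 0

    def installed_units(unit_dir="/etc/systemd/system"):
        return {os.path.basename(p)
                for p in glob.glob(os.path.join(unit_dir, "dcos-*.service"))}

    def naive_check():
        # what the race would exploit: trivially true if only a few units exist yet
        return all(unit_is_active(u) for u in installed_units())

    def strict_check():
        # also require that every expected unit has actually been installed
        present = installed_units()
        return EXPECTED_UNITS <= present and all(unit_is_active(u) for u in present)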

Comment by Pawel Rozlach [ 09/Aug/17 ]

Vagrant issue: https://jira.mesosphere.com/browse/DCOS_OSS-1523
dcos-diagnostics issue: https://jira.mesosphere.com/browse/DCOS_OSS-1524

I am closing this issue now.
