[DCOS_OSS-2115] test_vip failed with RetryError on MarathonApp.wait Created: 14/Nov/17  Updated: 15/Dec/18  Resolved: 26/Nov/18

Status: Resolved
Project: DC/OS
Component/s: marathon, mesos, networking
Affects Version/s: DC/OS 1.9.7, DC/OS 1.10.5, DC/OS 1.11.0, DC/OS 1.12.0
Fix Version/s: None

Type: Bug Priority: High
Reporter: Jan-Philip Gehrcke (Inactive) Assignee: Unassigned
Resolution: Done  
Labels: flaky-bug, mergebot-override, mesos, networking, type:ci-failure
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 181126_test_vip_dcososs2115_override_rate_since_dec17_without_title.png     PNG File 2018-09-11_override-command-rate-for-top-jira-tickets.png     Zip Archive bundle-2018-07-03-1530620037.zip     Zip Archive diagnostics.zip     File integration-test-0749fbd35959418787aa0e015a5f0bc9.json     File integration-test-vip-user-host-proxy-8639690e02544ddf91c5258a9ffce698.tar.gz     Text File marathon.log     Text File mesos-agent.log     Text File mesos-master.log     File python_test_server.logs.md     HTML File result1     HTML File result2     File sandbox_10_0_0_221.tar.gz     File sandbox_10_0_2_170.tar.gz     File sandbox_10_0_2_30.tar.gz     File sandbox_10_0_2_84.tar.gz     File sandbox_10_0_3_199.tar.gz     File sandbox_10_0_3_77.tar.gz     Text File test_vip.log     File test_vip.tar.gz    
Issue Links:
Blocks
Duplicate
is duplicated by DCOS_OSS-2264 test_vip proxy app fails to deploy in... Resolved
is duplicated by DCOS_OSS-3747 test_vip[Container.DOCKER-Network.HOS... Resolved
Relates
relates to DCOS_OSS-1463 test_networking.test_vip test is flaky Resolved
relates to MARATHON-8235 Reconcile overdue tasks instead of ex... Resolved
relates to DCOS_OSS-3736 Update marathon app configuration to ... Resolved
relates to DCOS_OSS-3747 test_vip[Container.DOCKER-Network.HOS... Resolved
Epic Link: DC/OS Test Flakiness
Sprint: Core Sprint 2018-28, Core Sprint 2018-29, Core RI-6 Sprint 2018-30
Story Points: 8
Product (inherited): DC/OS
Transition Due Date:

 Description   

[open_source_tests.test_networking.test_vip[Container_MESOS-Network_USER-Network_USER]] failed in
https://teamcity.mesosphere.io/viewLog.html?buildId=858535
https://github.com/mesosphere/dcos-enterprise/pull/1712

With a RetryError upon the attempt to launch the corresponding Marathon application:

self = <retrying.Retrying object at 0x7f533113d278>
fn = <function MarathonApp.wait at 0x7f53320a2048>
args = (<test_networking.MarathonApp object at 0x7f53310ff4a8>, <dcos_test_utils.enterprise.EnterpriseApiSession object at 0x7f5332196da0>)
kwargs = {}, start_time = 1510670454766, attempt_number = 240
attempt = Attempts: 240, Value: False, delay_since_first_attempt_ms = 1200912
sleep = 5000

    def call(self, fn, *args, **kwargs):
        start_time = int(round(time.time() * 1000))
        attempt_number = 1
        while True:
            try:
                attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
            except:
                tb = sys.exc_info()
                attempt = Attempt(tb, attempt_number, True)
    
            if not self.should_reject(attempt):
                return attempt.get(self._wrap_exception)
    
            delay_since_first_attempt_ms = int(round(time.time() * 1000)) - start_time
            if self.stop(attempt_number, delay_since_first_attempt_ms):
                if not self._wrap_exception and attempt.has_exception:
                    # get() on an attempt with an exception should cause it to be raised, but raise just in case
                    raise attempt.get()
                else:
>                   raise RetryError(attempt)
E                   retrying.RetryError: RetryError[Attempts: 240, Value: False]
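
For context, a minimal sketch of the retry pattern behind MarathonApp.wait, assuming the `retrying` library shown in the traceback; the parameter values below merely mirror the traceback (sleep = 5000 ms, 240 attempts), the real ones live in the test utilities:

from retrying import retry

@retry(wait_fixed=5000,                           # 5 s between attempts (sleep = 5000 above)
       stop_max_attempt_number=240,               # 240 attempts, roughly 20 minutes total
       retry_on_result=lambda done: done is False)
def wait_for_deployment(is_deployed):
    # Return False while the app is not deployed yet; once every attempt has
    # returned False, retrying raises RetryError[Attempts: 240, Value: False].
    return is_deployed()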


 Comments   
Comment by Senthil Kumaran (Inactive) [ 15/Nov/17 ]

Observed this failure again today in the dcos-docker suite:

https://teamcity.mesosphere.io/viewLog.html?buildId=860118&buildTypeId=DcosIo_Dcos_DockerIntegrationTests_IntegrationTestDcosDockerPr&tab=buildResultsDiv 

This is fixable in test code.

Comment by Adam Dangoor (Inactive) [ 21/Nov/17 ]

Example at https://teamcity.mesosphere.io/viewLog.html?buildId=868620&buildTypeId=DcOs_Enterprise_ManualTriggers_IntegrationTest_AwsOnpremWStaticBackendAndSecurit

open_source_tests/test_networking.py:214 (test_vip[Container.MESOS-Network.HOST-Network.USER])
dcos_api_session = <dcos_test_utils.enterprise.EnterpriseApiSession object at 0x7f420dab7fd0>
container = <Container.MESOS: 'MESOS'>, vip_net = <Network.HOST: 'HOST'>
proxy_net = <Network.USER: 'USER'>

    @pytest.mark.slow
    @pytest.mark.skipif(
        not lb_enabled(),
        reason='Load Balancer disabled')
    @pytest.mark.parametrize(
        'container,vip_net,proxy_net',
        generate_vip_app_permutations())
    def test_vip(dcos_api_session,
                 container: marathon.Container,
                 vip_net: marathon.Network,
                 proxy_net: marathon.Network):
        '''Test VIPs between the following source and destination configurations:
            * containers: DOCKER, UCR and NONE
            * networks: USER, BRIDGE (docker only), HOST
        * agents: source and destinations on same agent or different agents
            * vips: named and unnamed vip
    
        Origin app will be deployed to the cluster with a VIP. Proxy app will be
        deployed either to the same host or elsewhere. Finally, a thread will be
        started on localhost (which should be a master) to submit a command to the
        proxy container that will ping the origin container VIP and then assert
        that the expected origin app UUID was returned
        '''
        errors = 0
>       tests = setup_vip_workload_tests(dcos_api_session, container, vip_net, proxy_net)

open_source_tests/test_networking.py:239: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
open_source_tests/test_networking.py:272: in setup_vip_workload_tests
    origin_app.wait(dcos_api_session)
../../lib/python3.5/site-packages/retrying.py:49: in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <retrying.Retrying object at 0x7f420c9975c0>
fn = <function MarathonApp.wait at 0x7f420da56400>
args = (<test_networking.MarathonApp object at 0x7f420cb61a58>, <dcos_test_utils.enterprise.EnterpriseApiSession object at 0x7f420dab7fd0>)
kwargs = {}, start_time = 1511276625442, attempt_number = 240
attempt = Attempts: 240, Value: False, delay_since_first_attempt_ms = 1201505
sleep = 5000

    def call(self, fn, *args, **kwargs):
        start_time = int(round(time.time() * 1000))
        attempt_number = 1
        while True:
            try:
                attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
            except:
                tb = sys.exc_info()
                attempt = Attempt(tb, attempt_number, True)
    
            if not self.should_reject(attempt):
                return attempt.get(self._wrap_exception)
    
            delay_since_first_attempt_ms = int(round(time.time() * 1000)) - start_time
            if self.stop(attempt_number, delay_since_first_attempt_ms):
                if not self._wrap_exception and attempt.has_exception:
                    # get() on an attempt with an exception should cause it to be raised, but raise just in case
                    raise attempt.get()
                else:
>                   raise RetryError(attempt)
E                   retrying.RetryError: RetryError[Attempts: 240, Value: False]
Comment by Orlando Hohmeier (Inactive) [ 04/Dec/17 ]

Observed the same failure on https://github.com/mesosphere/dcos-enterprise/pull/1783 ( https://teamcity.mesosphere.io/viewLog.html?buildId=878155&buildTypeId=DcOs_Enterprise_Test_Inte_AwsOnpremWStaticBackendAndSecurityStrict )

Comment by Mergebot [ 04/Dec/17 ]

Github PR: https://github.com/mesosphere/dcos-enterprise/pull/1783 status teamcity/dcos/test/aws/onprem/static/strict was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 05/Dec/17 ]

Github PR: https://github.com/mesosphere/dcos-enterprise/pull/1755 status teamcity/dcos/test/aws/cloudformation/simple was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 06/Dec/17 ]

Github PR: https://github.com/dcos/dcos/pull/2165 status teamcity/dcos/test/docker was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 06/Dec/17 ]

Github PR: https://github.com/dcos/dcos/pull/2165 status teamcity/dcos/test/aws/onprem/static-redhat was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/Dec/17 ]

Github PR: https://github.com/dcos/dcos/pull/2159 status teamcity/dcos/test/azure/arm was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 12/Dec/17 ]

@skumaran overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2196 with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Dec/17 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2159 with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Dec/17 ]

@skumaran overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2200 with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Adam Bordelon (Inactive) [ 17/Dec/17 ]

I got the same thing in https://teamcity.mesosphere.io/viewLog.html?buildId=900731 but with MarathonPod.wait. Pods and Apps both flake.

Comment by Mergebot [ 17/Dec/17 ]

@skumaran overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2046 (Title: Bump dcos-mesos to latest master a4b1134., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Dec/17 ]

@prozlach overrode teamcity/dcos/test/aws/onprem/static-redhat status of dcos/dcos/pull/2220 (Title: Bump pkgpanda pkgs kazoo and gunicorn, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Dec/17 ]

@jeremy overrode mergebot/enterprise/build-status/aggregate status of dcos/dcos/pull/2107 (Title: Use the more recent marathon endpoint., Branch: master) with the failure noted in this JIRA.

Comment by Mergebot [ 19/Dec/17 ]

@jeremy overrode teamcity/dcos/test/aws/onprem/static-redhat status of dcos/dcos/pull/2107 (Title: Use the more recent marathon endpoint., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Dec/17 ]

@jeremy overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2107 (Title: Use the more recent marathon endpoint., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Sergey Urbanovich (Inactive) [ 19/Dec/17 ]
Dec 15 03:42:54 dcos-docker-master1 java[969]: [myid:] INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x160582962c9002e type:setData cxid:0x2656 zxid:0x1b41 txntype:-1 reqpath:n/a Error Path:/marathon/state/group/2/root/2017-12-15T03:42:54.914Z Error:KeeperErrorCode = NoNode for /marathon/state/group/2/root/2017-12-15T03:42:54.914Z
Comment by Mergebot [ 19/Dec/17 ]

@jeremy overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2231 (Title: Train 293, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Senthil Kumaran (Inactive) [ 20/Dec/17 ]

Hi Sergey Urbanovich - What is the significance of that exception that you pointed out?

Comment by Mergebot [ 20/Dec/17 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2210 (Title: Bump dcos-test-utils, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Sergey Urbanovich (Inactive) [ 20/Dec/17 ]

Senthil Kumaran it's evidence that the test_vip flakiness is most likely caused by Marathon; see dcos-docker-master1.log in the artifacts.

Comment by Mergebot [ 21/Dec/17 ]

@adam overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2244 (Title: Train 295, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 21/Dec/17 ]

@adam overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2244 (Title: Train 295, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 09/Jan/18 ]

@michael.ellenburg overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2267 (Title: fix cloud_images CI yum error, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 10/Jan/18 ]

@aekbote overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2281 (Title: Admin Router: Minimizing software version information reported by the AR [1.10], Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 11/Jan/18 ]

@michael.ellenburg overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2297 (Title: Train 304, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Jan/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2088 (Title: [master] Mergebot Automated Train PR - 2018-Jan-19-15-33, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 23/Jan/18 ]

@jp overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2346 (Title: [master] Mergebot Automated Train PR - 2018-Jan-22-16-43, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Adam Dangoor (Inactive) [ 29/Jan/18 ]

I have moved this issue to DCOS-OSS as it is an OSS test and affects OSS builds.

Comment by Mergebot [ 29/Jan/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static-redhat status of dcos/dcos/pull/2371 (Title: [master] Mergebot Automated Train PR - 2018-Jan-26-17-41, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 30/Jan/18 ]

@michael.ellenburg overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2356 (Title: Bump dcos-test-utils, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Marco Monaco [ 07/Feb/18 ]

Senthil Kumaran Is this something we need to resolve before 1.11 GA? Who is or will be working on that? Thanks

Comment by Senthil Kumaran (Inactive) [ 07/Feb/18 ]

Let's collect more information on the failure using this PR: https://github.com/dcos/dcos/pull/2421

A RetryError is really a bad exception without more information. I was thinking of solving it in the test libraries so that we could provide targeted exception information. For now, I am going to add logging to the test methods in our code.

Ken Sipe, Karsten Jeschkies and Matthias Eichstedt - We really think this is a Marathon issue, as a GET request to

"/v2/pods/{id}::status"

is not returning `STABLE` when we deploy the workload app before we test the networking. Running it multiple times in https://github.com/dcos/dcos/pull/2421 could reveal more information on this flaky issue. Once you have the required information, I'd like to assign this bug to one of you.
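
A hedged sketch of that status poll (the helper name and URL handling are illustrative, not the actual test code): the wait only succeeds once Marathon reports the pod as STABLE.

import requests

def pod_is_stable(marathon_url: str, pod_id: str) -> bool:
    # GET /v2/pods/<id>::status and check the top-level status field.
    r = requests.get('{}/v2/pods/{}::status'.format(marathon_url, pod_id))
    r.raise_for_status()
    return r.json().get('status') == 'STABLE'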

Thank you!

Comment by Karsten Jeschkies (Inactive) [ 08/Feb/18 ]

Senthil Kumaran, I just started digging into the test a little. So take my comments with a grain of salt. The failing test has assert errors == 0. What do you think about

errors = list()
.
.
except Exception as e:
    errors.append(e)
.
.
assert len(errors) == 0

This should print the errors in the JUnit stack trace and simplify debugging quite a bit.
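
For illustration, a slightly fuller sketch of that pattern (the helper name is hypothetical), so the JUnit report shows the collected exceptions instead of a bare count:

def run_vip_checks(checks):
    errors = list()
    for check in checks:
        try:
            check()
        except Exception as e:
            # Record the failure and keep testing the remaining permutations.
            errors.append(e)
    assert len(errors) == 0, 'VIP checks failed: {}'.format(errors)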

Comment by Karsten Jeschkies (Inactive) [ 08/Feb/18 ]

I took the liberty of changing things a bit: https://github.com/dcos/dcos/pull/2426.

Comment by Karsten Jeschkies (Inactive) [ 09/Feb/18 ]

Alright, the job failed. This does not seem to be a flake to me:

>       assert self._info['status'] == 'STABLE'
E       assert 'DEGRADED' == 'STABLE'
E         - DEGRADED
E         + STABLE
Comment by Mergebot [ 12/Feb/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2450 (Title: [master] Bump Mesos to nightly master d4b000f, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Karsten Jeschkies (Inactive) [ 13/Feb/18 ]

Hm, do we have some stats on this? The test seems to fail every time for me: https://github.com/dcos/dcos/pull/2426. This seems to be a bug and not a flake.

Comment by Senthil Kumaran (Inactive) [ 13/Feb/18 ]

Hi Karsten Jeschkies - If you click on the TeamCity link - https://teamcity.mesosphere.io/viewLog.html?buildId=966772&tab=buildResultsDiv&buildTypeId=DcOs_Open_Test_IntegrationTest_AzureArm - and follow the test results, you will find statistical information on those tests. For example: https://teamcity.mesosphere.io/project.html?projectId=DcOs_Open_Test_IntegrationTest&testNameId=-2566244230057288490&tab=testDetails

I am +1 to the assertion change that you made in your PR. That will give us more information than a generic RetryError.

However, it is interesting to note that the test succeeds on other platforms and fails only on the Azure ARM install.

I have re-triggered the build to collect more stats, and have queued just the test_networking::test_vip test on Azure ARM to get specific stats: https://teamcity.mesosphere.io/viewLog.html?buildId=971555 Let's wait for the results of these runs.

Comment by Senthil Kumaran (Inactive) [ 14/Feb/18 ]

Karsten Jeschkies - We had a success for the flaky test during the re-trigger. This is why we have categorized it as flaky.

It is a bug that `assert self._info['status'] == 'STABLE', error_msg` will *never* be true under certain conditions. But the conditions under which it does not succeed are unknown.

Comment by Mergebot [ 22/Feb/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2496 (Title: bump dcos-cni, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 22/Feb/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2426 (Title: Assert HTTP responses and prettify errors., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 22/Feb/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2493 (Title: chore(dcos-ui): update package, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 26/Feb/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2510 (Title: [1.11] Mergebot Automated Train PR - 2018-Feb-26-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 27/Feb/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2519 (Title: [1.11] Mergebot Automated Train PR - 2018-Feb-27-21-13, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Mar/18 ]

@gpaul overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2511 (Title: Admin Router: Prevent reusing tcp sockets by AR's cache code [master], Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Mar/18 ]

@michael.ellenburg overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2538 (Title: Admin Router: Prevent reusing tcp sockets by AR's cache code [1.11], Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 05/Mar/18 ]

@prozlach overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2164 (Title: [DCOS-19243] Test the permissions required to access the /v2/leader endpoint, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Mar/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2419 (Title: 1.11.0 Integration Train for UI Changes and Version Update., Branch: 1.11.0-GA) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Mar/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2589 (Title: Fixed broken Azure & AWS documentation links., Branch: 1.11.0-GA) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Mar/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2589 (Title: Fixed broken Azure & AWS documentation links., Branch: 1.11.0-GA) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/Mar/18 ]

@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2568 (Title: Admin Router: Support for custom 'Host' header and response status for generic tests, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/Mar/18 ]

@gpaul overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2568 (Title: Admin Router: Support for custom 'Host' header and response status for generic tests, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 14/Mar/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2620 (Title: [1.11] Avoid python dependency break for python-dateutil, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 15/Mar/18 ]

@prozlach overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2615 (Title: [1.10] Mergebot Automated Train PR - 2018-Mar-14-10-00, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Sergey Urbanovich (Inactive) [ 19/Mar/18 ]

Matthias Eichstedt Could you add some details on why you closed this JIRA?

Comment by Senthil Kumaran (Inactive) [ 19/Mar/18 ]

Matthias Eichstedt - this was observed again in master today - https://teamcity.mesosphere.io/viewLog.html?buildId=1009460&buildTypeId=DcOs_Enterprise_ManualTriggers_IntegrationTest_AwsOnpremWStaticBackendAndSecurit

If we close this, we will need at least another JIRA that tracks the work towards fixing this flakiness in Marathon.

 

Comment by Senthil Kumaran (Inactive) [ 19/Mar/18 ]

Re-opening to use this for overrides.

Comment by Mergebot [ 19/Mar/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2470 (Title: [master] Mergebot Automated Train PR - 2018-Mar-19-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Mar/18 ]

@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2637 (Title: Use pytest-dcos plugin, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Mar/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2476 (Title: [master] Mergebot Automated Train PR - 2018-Mar-20-00-04, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Karsten Jeschkies (Inactive) [ 20/Mar/18 ]

When I investigated the issue I found that the test code should be improved first, since the errors do not give much information. However, I don't know who the owner is, and it was so cumbersome to make any changes myself that I just gave up. Overall I'm not sure this is a Marathon issue.

Comment by Sergey Urbanovich (Inactive) [ 20/Mar/18 ]

Hi Karsten Jeschkies,

It's an integration test and it checks how dcos-l4lb, marathon, mesos, dcos-overlay and others work together. It's hard to find the owner, and that's one of the reasons why DC/OS has been suffering from this issue for so long. However, let me be the owner. I can help you with the test_vip code and the networking stuff, and I hope Michael Ellenburg can help us with the testing infrastructure (test_helper, dcos_test_utils, etc.).

My speculation is that we see a lot of failures in test_vip just because this test has tens of sub-tests and they start hundreds of tasks. Maybe we have exactly the same issues in other tests, but we aren’t experiencing them so frequently.

Let's improve the test code to prove or refute completely that it's a Marathon issue. What should we change in the code?

Please feel free to reach out to me on Slack directly, or we can discuss it in the #test-vip-flakiness channel (yes, this issue has its own channel!).

I hope Avinash Sridharan, Matthias Eichstedt, and Artem Harutyunyan can help us to prioritize the issue. It’s the most flaky test in DC/OS.

Thank you.

Comment by Mergebot [ 20/Mar/18 ]

@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2653 (Title: [master] Mergebot Automated Train PR - 2018-Mar-20-23-22, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 09/Apr/18 ]

@skumaran overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2711 (Title: [master] Mergebot Automated Train PR - 2018-Apr-06-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 16/Apr/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2748 (Title: DCOS_OSS-2372[1.11] Use `pip download` to prepare TeamCity rather than the removed `pip i…, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Apr/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2758 (Title: bump from latest dcos-net master, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 24/Apr/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2783 (Title: [master] Mergebot Automated Train PR - 2018-Apr-20-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 25/Apr/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2636 (Title: [1.11] Mergebot Automated Train PR - 2018-Apr-20-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 03/May/18 ]

@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2816 (Title: Add owners from the Mesos pool of committers., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/May/18 ]

@branden overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2729 (Title: improve dcos-diagnostics integration test and fix dcos-diagnostics system account permissions 1.11, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/May/18 ]

@branden overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2441 (Title: Remove web installer, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Karsten Jeschkies (Inactive) [ 31/May/18 ]

Sergey Urbanovich, I'm sorry. I've missed your comment.

Let’s improve the test code to prove or refute completely that it’s a marathon issue. What should we change in the code?

I tried to add some logs and change the code but gave up because a PR took more than a week to merge. I don't know how the security team manages this.

Anyhow, I proposed to remove the test. If there are reasonable use cases I'm happy to spec the tests and implement them in the Marathon system test suite. The Marathon team would be a clear owner then. What do you think, Matthias Eichstedt?

So, I'm happy to take ownership but then it moves into our repo.

Comment by Mergebot [ 31/May/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2890 (Title: Fix upgrade issues with ipv6 and flaky service discovery integration test, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Matthias Eichstedt (Inactive) [ 01/Jun/18 ]

Karsten Jeschkies I'm OK with moving it into our repo and effectively owning it – Jan-Philip Gehrcke, since you created the ticket originally, do you veto?

Comment by Sergey Urbanovich (Inactive) [ 01/Jun/18 ]

Hi Karsten Jeschkies and Matthias Eichstedt,

Thanks for replying!

As I mentioned before, test_vip checks how mesos, marathon, dcos-l4lb, dcos-overlay, the Linux kernel, etc. work together on different clouds with different security configurations. We must be sure that the feature is not broken when we update any of them. It is the main integration test for the DC/OS networking stack. Most of these goals could not be reached if we moved the test to the Marathon repo. Unfortunately, we don't have the option that you suggested.

I do totally understand your pain with the dcos workflow; it's unreasonably slow, and it usually takes several days to merge any PR to the master branch (sometimes it takes months, no kidding). However, one of the biggest issues with the whole experience is that we have tons of flaky integration tests. I can promise you that I will shepherd all test_vip PRs, and I believe they will be merged faster.

Comment by Senthil Kumaran (Inactive) [ 01/Jun/18 ]

Hello Matthias Eichstedt - I am with Sergey Urbanovich here. If anything, we should work on fixing this in the dcos/dcos repo instead of moving it out of the repo.
Moving it out would solve the problem in a 'shallow' manner: it would unblock folks, but we would miss out on the coverage that this test brings.

Sergey Urbanovich - On PRs not moving forward for days/months, is it still the case? The idea with @dcos-owners and the ability to override flakes is an attempt to solve that. Has this not been helping? Once the CI is sufficiently stable, we have plans to do away with trains and land PRs immediately. Let us keep involving the dcos-owners, and reduce the flakiness in the system to improve this. Thank you!

Comment by Karsten Jeschkies (Inactive) [ 04/Jun/18 ]

It is the main integration test for dcos networking stack

So the owner should be the networking team then, right? If the wiki is still up to date that would be Sergey Urbanovich and Deepak Goel. I'm happy to help if you can provide the app definitions being deployed, the Marathon logs during the test runs and the logs of the test itself.

Comment by Sergey Urbanovich (Inactive) [ 04/Jun/18 ]

Karsten Jeschkies Well, it's easy! Please check any mergebot comment above. Let's look at the last one (link).

The test waited for a pod with id /integration-test-51752de892914eb58c16530e1c842b4c; in the log you can find the pod definition as a Python dict (see below). You can also find all Marathon logs in artifacts -> master_journald.log.

[2018-05-31 01:48:02,306|test_networking|INFO]: Origin app: {'id': '/integration-test-51752de892914eb58c16530e1c842b4c', 'scheduling': {'placement': {'acceptedResourceRoles': ['*', 'slave_public'], 'constraints': [{'fieldName': 'hostname', 'operator': 'CLUSTER', 'value': '10.0.1.44'}]}}, 'containers': [{'name': 'app-51752de892914eb58c16530e1c842b4c', 'resources': {'cpus': 0.01, 'mem': 32}, 'image': {'kind': 'DOCKER', 'id': 'debian:jessie'}, 'exec': {'command': {'shell': '/opt/mesosphere/bin/dcos-shell python /opt/mesosphere/active/dcos-integration-test/util/python_test_server.py $ENDPOINT_TEST'}}, 'volumeMounts': [{'name': 'opt', 'mountPath': '/opt/mesosphere'}], 'endpoints': [{'name': 'test', 'protocol': ['tcp'], 'hostPort': 0, 'labels': {'VIP_0': '1.1.1.7:10176'}}], 'environment': {'DCOS_TEST_UUID': '51752de892914eb58c16530e1c842b4c', 'HOME': '/'}}], 'networks': [{'mode': 'host'}], 'volumes': [{'name': 'opt', 'host': '/opt/mesosphere'}]}
Comment by Karsten Jeschkies (Inactive) [ 05/Jun/18 ]

Thanks for the pointers. So the logs say the following:

  28076 2018-05-31 01:48:05: [2018-05-31 01:48:05,167] INFO  Processing LaunchEphemeral(Instance(instance [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063],AgentInfo(10.0.1.44,Some(72859f6f-babb-4975-912d-c2885c6417ef-S0),None,None,Vector()),InstanceState(Created,2018-05-31T01:48:05.112Z,None,None),Map(task [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c
  28076 ] -> Task(task [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c],2018-05-31T01:48:02.319Z,Status(2018-05-31T01:48:05.112Z,None,None,Created,NetworkInfo(10.0.1.44,Vector(20704),List())))),2018-05-31T01:48:02.319Z,UnreachableEnabled(0 seconds,0 seconds),None)) for instance [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063] (mesosphere.marathon.core.l
  28076 auncher.impl.OfferProcessorImpl:scala-execution-context-global-1808)
...
28085 2018-05-31 01:48:06: [2018-05-31 01:48:06,714] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING () (mesosphere.marathon.MarathonScheduler:Thread-1476)
  28086 2018-05-31 01:48:06: [2018-05-31 01:48:06,717] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING () (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1949)
...
  28169 2018-05-31 01:48:22: [2018-05-31 01:48:22,529] INFO  10.0.5.50 - - [31/May/2018:01:48:22 +0000] "GET //10.0.5.50/v2/pods/integration-test-51752de892914eb58c16530e1c842b4c::status HTTP/1.1" 200 2509 "-" "python-requests/2.18.4"  (mesosphere.chaos.http.ChaosRequestLog:qtp2077738191-47)
  28170 2018-05-31 01:48:27: [2018-05-31 01:48:27,542] INFO  10.0.5.50 - - [31/May/2018:01:48:27 +0000] "GET //10.0.5.50/v2/pods/integration-test-51752de892914eb58c16530e1c842b4c::status HTTP/1.1" 200 2509 "-" "python-requests/2.18.4"  (mesosphere.chaos.http.ChaosRequestLog:qtp2077738191-50)
...
  28258 2018-05-31 01:53:09: [2018-05-31 01:53:09,883] WARN  Should kill: task [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c] was launched 304s ago and was not confirmed yet (mesosphere.marathon.core.task.jobs.impl.OverdueTasksActor$Support:scala-execution-context-global-1962)
  28259 2018-05-31 01:53:09: [2018-05-31 01:53:09,883] INFO  Killing overdue instance [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063] (mesosphere.marathon.core.task.jobs.impl.OverdueTasksActor$Support:scala-execution-context-global-1962)
...

So the task does not start in time. And then there is

› rg "integration-test-51752de892914eb58c16530e1c842b4c.*TASK_" dcos-marathon.service
28085:2018-05-31 01:48:06: [2018-05-31 01:48:06,714] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING () (mesosphere.marathon.MarathonScheduler:Thread-1476)
28086:2018-05-31 01:48:06: [2018-05-31 01:48:06,717] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING () (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1949)
28538:2018-05-31 01:56:34: [2018-05-31 01:56:34,968] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:Thread-1525)
28546:2018-05-31 01:56:34: [2018-05-31 01:56:34,971] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:Thread-1533)
28561:2018-05-31 01:56:35: [2018-05-31 01:56:34,978] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1962)
28569:2018-05-31 01:56:35: [2018-05-31 01:56:34,984] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1962)
29362:2018-05-31 02:06:34: [2018-05-31 02:06:34,973] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:Thread-1581)
29369:2018-05-31 02:06:34: [2018-05-31 02:06:34,981] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1901)
29375:2018-05-31 02:06:35: [2018-05-31 02:06:34,982] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:Thread-1589)
29383:2018-05-31 02:06:35: [2018-05-31 02:06:34,989] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1808)
29810:2018-05-31 02:08:21: [2018-05-31 02:08:20,992] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:Thread-1616)
29831:2018-05-31 02:08:21: [2018-05-31 02:08:21,060] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_RUNNING () (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1962)
29858:2018-05-31 02:08:39: [2018-05-31 02:08:39,982] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_KILLING () (mesosphere.marathon.MarathonScheduler:Thread-1630)
29859:2018-05-31 02:08:39: [2018-05-31 02:08:39,995] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_KILLING () (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-2234)
29860:2018-05-31 02:08:40: [2018-05-31 02:08:40,057] INFO  Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_KILLED (Command terminated with signal Terminated) (mesosphere.marathon.MarathonScheduler:Thread-1631)
29870:2018-05-31 02:08:40: [2018-05-31 02:08:40,062] INFO  Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_KILLED (Command terminated with signal Terminated) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-2206)

It takes 20 minutes for the task to reach TASK_RUNNING!

Comment by Karsten Jeschkies (Inactive) [ 05/Jun/18 ]

Sergey Urbanovich, what is dcos-integration-test/util/python_test_server.py doing? Where can I find the sandboxes from Mesos with the logs of the executor and python_test_server.py?

Comment by Mergebot [ 05/Jun/18 ]

@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2949 (Title: pin urllib3 to 1.22 for compatibility with requests, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Sergey Urbanovich (Inactive) [ 05/Jun/18 ]

Karsten Jeschkies python_test_server.py is a simple HTTP server; you can find it in the dcos repo here. Please check the diagnostic bundle in artifacts, maybe the mesos-slave logs have some valuable information. IIRC our test infrastructure doesn't collect logs for mesos sandboxes. Senthil Kumaran please correct me if I'm wrong.
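
For readers unfamiliar with it, a minimal sketch of what such a test server could look like (the endpoint names here are illustrative; the real script is dcos-integration-test/util/python_test_server.py in the dcos repo): it serves the DCOS_TEST_UUID from its environment so the proxy app can verify it reached the right origin.

import os
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reply with the test UUID on one path and a trivial liveness answer otherwise.
        body = os.environ.get('DCOS_TEST_UUID', '') if self.path == '/test_uuid' else 'pong'
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', int(sys.argv[1])), Handler).serve_forever()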

Comment by Senthil Kumaran (Inactive) [ 05/Jun/18 ]

Sergey Urbanovich - you are right, we don't collect logs for mesos sandboxes. We only bundle the journald logs from the master and the agents.

Comment by Mergebot [ 05/Jun/18 ]

@skumaran overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/2940 (Title: Modifies test relying on mesos logging in the stdout of a task, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Karsten Jeschkies (Inactive) [ 06/Jun/18 ]

we don't collect logs for mesos sandboxes. We only bundle the journald logs that are on master and the agents.

This makes it almost impossible to debug. We had some issues in the Marathon integration tests that we only found with logs from the executors and apps. If we are lucky we find some things by digging into the Mesos logs. However, if a task is in TASK_STARTING and does not become TASK_RUNNING, Marathon cannot do anything about it.

Comment by Ioannis Charalampidis (Inactive) [ 06/Jun/18 ]

It might be unrelated, or I might be missing something, but I see that the python job on TeamCity is waiting for the tasks to be "healthy":

self._info = r.json()
> assert self._info['app']['tasksHealthy'] == self.app['instances']
E assert 0 == 1

test_networking.py:72: AssertionError

But I did not see any health checks defined in the pod definition that Sergey posted above.

Comment by Mergebot [ 06/Jun/18 ]

@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2853 (Title: rexray: upgrade to v0.11.1 [Backport to 1.11], Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Gustav Paul (Inactive) [ 06/Jun/18 ]

I'm not 100% sure this last failure should be tracked by this issue, but it looks similar enough that I'll leave it to Sergey Urbanovich to decide whether this merits a different ticket.

Comment by Matthias Eichstedt (Inactive) [ 06/Jun/18 ]

Karsten Jeschkies is right – the log snippets he provided earlier clearly show that a task is reported STARTING and then does not turn RUNNING within 5 minutes. The default behavior of Marathon is to kill such a task (aka expunge all information about it) after 5 minutes.

A mitigation could be to increase the task_launch_timeout to e.g. 1200000L (20 minutes), but we should find out why it takes so long for the task to turn running. Linking MARATHON-8235 since this is slightly related.

Senthil Kumaran we are kind of blocked triaging this. We could (1) increase the above timeout, but we (2) should have sandboxes available. I don't think we're investigating a Marathon problem here – tasks are not reported Running in time, so either the docker daemon, the agent, or other things are severely slow to respond. (Increasing the Marathon timeout is not a substitute for an RCA.)

Comment by Sergey Urbanovich (Inactive) [ 06/Jun/18 ]

Senthil Kumaran It seems like we have to add the mesos sandbox dirs to the artifacts, or do we have any other options? May I kindly ask you to create a blocker JIRA for that?

Comment by Mergebot [ 18/Jun/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2968 (Title: [master] Mergebot Automated Train PR - 2018-Jun-11-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 25/Jun/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3000 (Title: [master] Mergebot Automated Train PR - 2018-Jun-25-06-56, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 25/Jun/18 ]

@kapil overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3003 (Title: [master] Bump Mesos to nightly master d22a3d7, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 26/Jun/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2990 (Title: Don't skip checks that are limited to a specific role, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 27/Jun/18 ]

@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2992 (Title: Bump CoreOS AMI to v1745.7.0, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 29/Jun/18 ]

@gpaul overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3017 (Title: Added second EBS drive to agents and public agents (1.10 backport)., Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 29/Jun/18 ]

@jp overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2992 (Title: Bump CoreOS AMI to v1745.7.0, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 29/Jun/18 ]

@kapil overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3021 (Title: [master] Bump Mesos to nightly master 22471b8, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 29/Jun/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3006 (Title: Adds network information to be collected as part of diagnostic bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Karsten Jeschkies (Inactive) [ 03/Jul/18 ]

We ran test_networking.py::test_vip on a cluster and it failed for two instances, integration-test-2f50d63b9f5f44d29489e89e383d30d5 and integration-test-0749fbd35959418787aa0e015a5f0bc9. See the test_vip.log.

Marathon does start the Python server. See python_test_server.logs.md and the attached Mesos sandbox logs sandbox_10_0_3_199.tar.gz, sandbox_10_0_3_77.tar.gz and sandbox_10_0_0_221.tar.gz. Mesos can even ping the app, so it is healthy. We think this issue is related to test_docker_port_mapping in Marathon.

I assume this is a networking issue, Sergey Urbanovich. Are there other logs we could look at? The full bundle is bundle-2018-07-03-1530620037.zip.

Comment by Aleksey Dukhovniy (Inactive) [ 03/Jul/18 ]

 A few things:

  • at least the VIP definition is partially deprecated. ipAddresses is removed, container.docker.network is removed in favor of container.networks, container.docker.portMappings is moved to container.portMappings. See this networking migration guide from 1.4 to 1.5 for more details and update the definitions accordingly.

This is how it should look:

{
    "id": "integration-test-0749fbd35959418787aa0e015a5f0bc9",
    "cpus": 0.1,
    "mem": 32,
    "instances": 1,
    "cmd": "/opt/mesosphere/bin/dcos-shell python /opt/mesosphere/active/dcos-integration-test/util/python_test_server.py 10043",
    "env": {
        "DCOS_TEST_UUID": "0749fbd35959418787aa0e015a5f0bc9",
        "HOME": "/"
    },
    "healthChecks": [
        {
            "protocol": "MESOS_HTTP",
            "path": "/ping",
            "gracePeriodSeconds": 5,
            "intervalSeconds": 10,
            "timeoutSeconds": 10,
            "maxConsecutiveFailures": 120,
            "port": 10043
        }
    ],
    "networks": [
        {
            "mode": "container",
            "name": "dcos"
        }
    ],
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "debian:jessie"
        },
        "portMappings": [
                {
                    "containerPort": 10043,
                    "protocol": "tcp",
                    "name": "test",
                    "labels": {
                        "VIP_0": "/namedvip:10042"
                    }
                }
            ],
        "volumes": [
            {
                "containerPath": "/opt/mesosphere",
                "hostPath": "/opt/mesosphere",
                "mode": "RO"
            }
        ]
    },
    "constraints": [
        [
            "hostname",
            "CLUSTER",
            "10.0.0.221"
        ]
    ],
    "acceptedResourceRoles": [
        "*",
        "slave_public"
    ]
}

Nevertheless: I can run both the deprecated and the updated app definitions manually, and both succeed in isolation.

Comment by Sergey Urbanovich (Inactive) [ 03/Jul/18 ]

Hi Karsten Jeschkies! You've caught another bug with test_vip; it's not related to the case we have been tracking here. Your logs show that all applications were ready, and the test failed on an assert [1]. It definitely looks like a networking issue on CoreOS v1745.7.0, DCOS_OSS-3707. It is not flakiness, there were 9 test failures in a row [2].

The summary of this JIRA is "test_vip failed with RetryError on MarathonApp.wait". In that case, you would see a stack trace that starts with the setup_vip_workload_tests function [3].

Comment by Karsten Jeschkies (Inactive) [ 04/Jul/18 ]

Hi, Sergey Urbanovich, without the sandboxes we cannot do much here.

Comment by Mergebot [ 04/Jul/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3010 (Title: [1.11] Mergebot Automated Train PR - 2018-Jun-27-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Adam Dangoor (Inactive) [ 05/Jul/18 ]

Senthil Kumaran I don't think that the previous override is relevant to this issue.

Comment by Pawel Rozlach [ 05/Jul/18 ]

The problem described by Karsten Jeschkies in [1] has been narrowed down in [2]: basically, mesos-modules needs some patching before we can bump to the newer Docker version (and, by proxy, to the newer CoreOS version). As Sergey Urbanovich already pointed out, this is not a flakiness issue but a genuine failure detected by the test_vip integration test.

[1] https://jira.mesosphere.com/browse/DCOS_OSS-2115?focusedCommentId=161189&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-161189
[2] https://jira.mesosphere.com/browse/DCOS_OSS-3707?focusedCommentId=161750&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-161750

Comment by Pawel Rozlach [ 05/Jul/18 ]

at least the VIP definition is partially deprecated. ipAddresses is removed, container.docker.network is removed in favor of container.networks, container.docker.portMappings is moved to container.portMappings. See this networking migration guide from 1.4 to 1.5 for more details and update the definitions accordingly.

I have created DCOS_OSS-3736 to address that.

Comment by Sergey Urbanovich (Inactive) [ 08/Jul/18 ]

> It seems like we have to add mesos sandboxes dirs to artifacts or do we have any other options? May I kindly ask you to create a blocker JIRA for that?

Senthil Kumaran would you please provide any updates on this matter?

Comment by Jan-Philip Gehrcke (Inactive) [ 09/Jul/18 ]

DCOS_OSS-3747 focuses on the case where a test_vip fails with `assert 0 == 1` in `wait_for_tasks_healthy`.

Comment by Karsten Jeschkies (Inactive) [ 09/Jul/18 ]

I tried to debug test_vip. However, the test deploys 288 apps if I'm not mistaken. These do not finish due to unfulfilled roles on my test cluster. Sergey Urbanovich, what cluster do we require?

Comment by Adam Dangoor (Inactive) [ 09/Jul/18 ]

Karsten Jeschkies In case it helps, the integration tests are run on a cluster with one master, two private agents, one public agent.

Comment by Pawel Rozlach [ 09/Jul/18 ]

Discussed things briefly with Karsten Jeschkies and Aleksey Dukhovniy:

  • the test needs to be split up, i.e. extract the "inner" parametrization (same host vs. different host, named VIP vs. plain VIP) into a proper pytest.parametrize, as this will allow narrowing the flakiness down; see the sketch after this comment. In total there should be 36*4=144 test cases for test_vip instead of the 36 that we have now.
  • the error messages need to be refactored to something more meaningful, instead of just a plain assert statement
  • the test needs to run in the loop via TeamCity, for a week or so, so that we can gather the flakiness data for the same DC/OS version and the same components version, and multiple tests runs.

This should allow us to work around the 288 apps issue that Karsten Jeschkies mentioned a few comments earlier.

CC: Sergey Urbanovich
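
A hedged sketch of the proposed split (the names of the inner parameters are illustrative; generate_vip_app_permutations is the existing helper, stubbed here so the snippet is self-contained):

import pytest

def generate_vip_app_permutations():
    # Stub: the real helper yields all (container, vip_net, proxy_net) combinations.
    return [('MESOS', 'USER', 'USER')]

@pytest.mark.parametrize('container,vip_net,proxy_net', generate_vip_app_permutations())
@pytest.mark.parametrize('same_host', [True, False])
@pytest.mark.parametrize('named_vip', [True, False])
def test_vip(container, vip_net, proxy_net, same_host, named_vip):
    # One VIP workload per test case instead of looping over the four inner
    # combinations inside one test, so a failure pinpoints the exact permutation.
    pass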

Comment by Gustav Paul (Inactive) [ 11/Jul/18 ]

Every permutation appears to have some small percentage chance of failing. This is not as simple as finding the one permutation that fails.

If we got the logs from the task sandboxes and the journal I think Sergey Urbanovich would be happy to comb through the 36 test cases (as opposed to 144). I believe Tools Infra are going to work on that soon.

Comment by Karsten Jeschkies (Inactive) [ 12/Jul/18 ]

Here are the logs from 100 runs: test_vip.tar.gz.

Five runs failed:

› rg "====.*failed" tar/test_vip*.log
tar/test_vip_34.log
7676:============ 1 failed, 30 passed, 1042 warnings in 2292.29 seconds =============

tar/test_vip_28.log
7881:============ 1 failed, 31 passed, 1068 warnings in 2315.99 seconds =============

tar/test_vip_41.log
8614:============ 1 failed, 34 passed, 1159 warnings in 2416.24 seconds =============

tar/test_vip_77.log
6411:============= 1 failed, 21 passed, 887 warnings in 2149.93 seconds =============

tar/test_vip_94.log
4648:============= 1 failed, 13 passed, 659 warnings in 1877.69 seconds =============

Three container pod tests fail with

 63         error_msg = 'Status was {}: {}'.format(self._info['status'], self._info.get('message', 'no message'))
 64 >       assert self._info['status'] == 'STABLE', error_msg
 65 E       AssertionError: Status was DEGRADED: no message
 66 E       assert 'DEGRADED' == 'STABLE'
 67 E         - DEGRADED
 68 E         + STABLE

Two others fail with

 58         self._info = r.json()
 59 >       assert self._info['app']['tasksHealthy'] == self.app['instances']
 60 E       assert 0 == 1
Comment by Karsten Jeschkies (Inactive) [ 12/Jul/18 ]

See the sandboxes

And the diagnostics.zip bundle.

One failed app was integration-test-vip-user-host-proxy-8639690e02544ddf91c5258a9ffce698.tar.gz if I'm not mistaken. Its sandbox is on 10.0.2.30.

Comment by Jan-Philip Gehrcke (Inactive) [ 12/Jul/18 ]

I love the development that I see here. Thank you everyone.

Comment by Sergey Urbanovich (Inactive) [ 12/Jul/18 ]

Logs in the diagnostic bundle start from 2018-07-11 12:29:01 on the leader nodes, which corresponds to test_vip_60. I've checked a failure from test_vip_94.log: integration-test-vip-user-user-proxy-9a6a11d810294a42b9008a408fc63ffd failed to start on 10.0.2.170, a UCR container on the DC/OS overlay network.

2018-07-12 00:15:00: I0712 00:15:00.342406  2267 containerizer.cpp:2006] Checkpointing container's forked pid 25922 to '/var/lib/mesos/slave/meta/slaves/7b81ef39-41f7-4906-b5fb-b5f11c0d4c5b-S3/frameworks/7b81ef39-41f7-4906-b5fb-b5f11c0d4c5b-0001/executors/integration-test-vip-user-user-proxy-9a6a11d810294a42b9008a408fc63ffd.9d5a6549-8568-11e8-b41a-aae668466825/runs/251be515-f13e-4ae0-b333-2cae5baa1bd0/pids/forked.pid'
2018-07-12 00:24:59: I0712 00:24:59.834357  2262 slave.cpp:6792] Terminating executor 'integration-test-vip-user-user-proxy-9a6a11d810294a42b9008a408fc63ffd.9d5a6549-8568-11e8-b41a-aae668466825' of framework 7b81ef39-41f7-4906-b5fb-b5f11c0d4c5b-0001 because it did not register within 10mins
2018-07-12 00:35:22: I0712 00:35:22.863657  2266 slave.cpp:3633] Asked to kill task integration-test-vip-user-user-proxy-9a6a11d810294a42b9008a408fc63ffd.9d5a6549-8568-11e8-b41a-aae668466825 of framework 7b81ef39-41f7-4906-b5fb-b5f11c0d4c5b-0001
Comment by Mergebot [ 13/Jul/18 ]

@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3071 (Title: Change Adminrouter access_log logging facility to daemon [Backport 1.10], Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Jul/18 ]

@gpaul overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2866 (Title: Increase the limit on worker_connections to 10K, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Jie Yu (Inactive) [ 13/Jul/18 ]

Vinod Kone can you have someone from the Mesos team take a look? Looks like the executor cannot register within 10min.

Comment by Karsten Jeschkies (Inactive) [ 16/Jul/18 ]

Here are the stats we use for our loops.

Failing test cases:

for f in tar/*.xml; do echo $(xmlstarlet sel -t -v "/testsuite/testcase[failure]/@name" "$f"); done | sort | uniq -c
   1 test_vip[Container.MESOS-Network.USER-Network.USER]
   1 test_vip[Container.NONE-Network.USER-Network.HOST]
   1 test_vip[Container.POD-Network.BRIDGE-Network.USER]
   1 test_vip[Container.POD-Network.USER-Network.HOST]
   1 test_vip[Container.POD-Network.USER-Network.USER]

Sergey Urbanovich, do you see any pattern in the network types?

Unique error causes:

for f in tar/*.xml; do echo $(xmlstarlet sel -t -v "/testsuite/testcase/failure/@message" "$f"); done | sort | uniq -c
   3 AssertionError: Status was DEGRADED: no message assert 'DEGRADED' == 'STABLE' - DEGRADED + STABLE
   2 assert 0 == 1
Comment by Mergebot [ 16/Jul/18 ]

@alexr overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2997 (Title: Split test_ee_signal and improve debug logging on failure., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 16/Jul/18 ]

@gpaul overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3064 (Title: [1.12/master] Use ngx.timer.every() for the AR cache update, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 16/Jul/18 ]

@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3032 (Title: Add an integration test for auto load cgroups subsystems and container-specific cgroups mounts., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Sergey Urbanovich (Inactive) [ 16/Jul/18 ]

Karsten Jeschkies I'd say that 2 out of 5 could be related to a network issue. I've recently rewritten the whole mesos polling in dcos-net and added some logs. The patch will be merged with the next master train. I would like to wait for some time and collect new test failures with those logs and the sandbox data. At the moment I don't see any marathon-related issues.

Comment by Karsten Jeschkies (Inactive) [ 17/Jul/18 ]

Sergey Urbanovich, thanks for the feedback. Senthil Kumaran, would it be possible to set up a loop for DC/OS master to gather the data constantly? The comments by the merge bot are hard to analyze and come from pull requests, which distorts the results.

Comment by Pawel Rozlach [ 17/Jul/18 ]

Karsten Jeschkies I already created a Jira (DCOS-17519) for that and tried to get lots of different people to notice it, but so far it has been ignored. I hope that, given enough time, enough people will get the same idea and that it will finally get prioritized.

Comment by Senthil Kumaran (Inactive) [ 17/Jul/18 ]

> Senthil Kumaran, would it be possible to set up a loop for DC/OS master to gather the data constantly?

Karsten Jeschkies - Yes, I have it today. It is good that we are focusing on this problem, let us not lose momentum on this.

Comment by Mergebot [ 18/Jul/18 ]

@alexr overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2702 (Title: dcos-checks: bump for cockroachdb ranges check and enable config, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Jul/18 ]

@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2975 (Title: [master] Provide Adminrouter URL for IAM access, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Jul/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3062 (Title: Adds dataDir to ucr bridge cni configuration, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Senthil Kumaran (Inactive) [ 19/Jul/18 ]

> would it be possible to set up a loop for DC/OS master to gather the data constantly?

Karsten Jeschkies / Sergey Urbanovich

Let's keep an eye on this - https://teamcity.mesosphere.io/viewType.html?buildTypeId=DcOs_Enterprise_Test_Inte_TestVipExclusive&branch_DcOs_Enterprise_Test_Inte=%3Cdefault%3E&tab=buildTypeStatusDiv

This is test_vip exclusive; it is going to exercise only `pytest -k test_vip` every 3 hours, and if the test step fails, the cluster won't be deleted. Let's monitor this one.

Comment by Gustav Paul (Inactive) [ 19/Jul/18 ]

First failures:
https://teamcity.mesosphere.io/viewLog.html?buildId=1134310&tab=buildResultsDiv&buildTypeId=DcOs_Enterprise_Test_Inte_TestVipExclusive

Comment by Sergey Urbanovich (Inactive) [ 20/Jul/18 ]

Senthil Kumaran In that job all tests on UCR are failing consistently. It doesn't sound like the test_vip flakiness.

Comment by Senthil Kumaran (Inactive) [ 23/Jul/18 ]

Hey Sergey Urbanovich - You are right, the UCR failure is unrelated to this. I am investigating it further here. If this is broken in master, then it is being observed only in the AWS Onprem w/ Static Backend and Security Strict test suite. Further investigation is in progress in https://jira.mesosphere.com/browse/DCOS-39700 as we don't want to merge those failures with the flaky behavior of test_vip.

Comment by Mergebot [ 24/Jul/18 ]

@alexr overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3133 (Title: gen/calc: normalize check timeouts, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 30/Jul/18 ]

@alexr overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3102 (Title: Enable Mesos jemalloc and memory profiling support in DC/OS, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 31/Jul/18 ]

@alexr overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3123 (Title: Enable Mesos jemalloc and memory profiling support in DC/OS, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 31/Jul/18 ]

@branden overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3175 (Title: [1.10] Mergebot Automated Train PR - 2018-Jul-31-10-00, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Aug/18 ]

@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3120 (Title: Add Telegraf as a DC/OS component, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 02/Aug/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3164 (Title: [1.11] Mergebot Automated Train PR - 2018-Aug-02-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 03/Aug/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3174 (Title: [master] Mergebot Automated Train PR - 2018-Aug-02-23-24, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Senthil Kumaran (Inactive) [ 07/Aug/18 ]

Hi Karsten Jeschkies / Sergey Urbanovich - I hope you noticed that the sandbox logs are being collected on these jobs (done as explained in https://jira.mesosphere.com/browse/DCOS-39211).

Moreover, we have a periodic execution of just the test_vip test case that has been showing a consistent pattern of failing intermittently here - https://teamcity.mesosphere.io/viewType.html?buildTypeId=FooBar_DcOs_Enterprise_Test_Inte_TestVipExclusive&tab=buildTypeHistoryList&branch_DcOs_Enterprise_Test_Inte=1.12.DCOS-39700.t1 I hope that is useful to debug further.

Comment by Mergebot [ 07/Aug/18 ]

@timweidner overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3226 (Title: [1.10] packages/java: Update Java to 8u181 version, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Aug/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3227 (Title: [master] Mergebot Automated Train PR - 2018-Aug-07-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Senthil Kumaran (Inactive) [ 09/Aug/18 ]

I am setting the priority to High. It has been a "Blocker" status bug for a long time and we have not blocked any releases due to this bug.

Comment by Mergebot [ 09/Aug/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3233 (Title: [Backport][1.11] locks the glide version to fix build issue *, Branch: *1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Senthil Kumaran (Inactive) [ 10/Aug/18 ]

Hello, we have been tracking this issue as a flaky bug/task. Please make sure that the metadata such as Priority and Issue Type reflects the status accurately.

If this is a frequently observed flake, please set the priority to Blocker or High.
If this flake was observed only once or twice, please set the priority to Medium or Low, and feel free to close this issue.

Comment by Mergebot [ 10/Aug/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3185 (Title: packages/bouncer: replace `dig` for leader detection with python only implementation, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Adam Dangoor (Inactive) [ 10/Aug/18 ]

Sergey Urbanovich Karsten Jeschkies - it looks like recently reported failures show a different failure from the one in the description. Is this because of work done to make the error clearer?

In particular:

self = <test_networking.MarathonPod object at 0x7efdc4194b00>
dcos_api_session = <dcos_test_utils.enterprise.EnterpriseApiSession object at 0x7efdfb957d30>

    @retrying.retry(
        wait_fixed=5000,
        stop_max_delay=20 * 60 * 1000,
        retry_on_result=lambda res: res is False)
    def wait(self, dcos_api_session):
        r = dcos_api_session.marathon.get('/v2/pods/{}::status'.format(self.id))
        assert_response_ok(r)
    
        self._info = r.json()
        error_msg = 'Status was {}: {}'.format(self._info['status'], self._info.get('message', 'no message'))
>       assert self._info['status'] == 'STABLE', error_msg
E       AssertionError: Status was DEGRADED: no message
E       assert 'DEGRADED' == 'STABLE'
E         - DEGRADED
E         + STABLE

I will ask for an override against this issue, and I'd ask that, if possible, you change the description of this issue.

Comment by Mergebot [ 10/Aug/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3228 (Title: [master] Mergebot Automated Train PR - 2018-Aug-09-23-45, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 10/Aug/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3259 (Title: [1.10] Mergebot Automated Train PR - 2018-Aug-09-23-45, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 10/Aug/18 ]

@skumaran overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3259 (Title: [1.10] Mergebot Automated Train PR - 2018-Aug-09-23-45, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Karsten Jeschkies (Inactive) [ 13/Aug/18 ]

Adam Dangoor,

this test has multiple flakes. See my comment. Since test_vip is not split up and covers a lot of DC/OS, this JIRA has become a pool of all sorts of flake reports.

Our (i.e. Aleksey Dukhovniy's and my) suggestion was:

  1. Split up the test into individual test cases or make it clear that over 36 test cases are run.
  2. Stop spamming the JIRAs with overrides.
  3. Run a loop on DC/OS master and not just test_vip as we do now.

This way we would know when an override is appropriate or not. AFAIK 1. is not going to happen.

Comment by Mergebot [ 13/Aug/18 ]

@jp overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3248 (Title: [1.11] changelog: Add Java update note, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Aug/18 ]

@drozhkov overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3210 (Title: Bump ui to 1.11+v1.17.0, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Aug/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3268 (Title: [master] Mergebot Automated Train PR - 2018-Aug-13-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Adam Dangoor (Inactive) [ 15/Aug/18 ]

Karsten Jeschkies Can you suggest an alternative for someone with a PR that hits this flake?

Comment by Mergebot [ 15/Aug/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2998 (Title: Test that a configurable permissions cache is used by various authorizers, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 15/Aug/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3221 (Title: Implement bootstrap methods for telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 15/Aug/18 ]

@cprovencher overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3274 (Title: [master] Mergebot Automated Train PR - 2018-Aug-14-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 16/Aug/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3250 (Title: Implement bootstrap methods for telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 20/Aug/18 ]

@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3206 (Title: setup.py: specify all files in ./pkgpanda/docker/dcos-builder, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 21/Aug/18 ]

@cprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3273 (Title: [master] Mergebot Automated Train PR - 2018-Aug-20-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 21/Aug/18 ]

@cprovencher overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3313 (Title: [1.10] Mergebot Automated Train PR - 2018-Aug-21-10-00, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 23/Aug/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3295 (Title: [WIP] Bump cosmos testing version, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 24/Aug/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3265 (Title: Bumping dcos-test-utils version, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 24/Aug/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3329 (Title: [master] Mergebot Automated Train PR - 2018-Aug-23-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 24/Aug/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3329 (Title: [master] Mergebot Automated Train PR - 2018-Aug-23-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 27/Aug/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3310 (Title: [DCOS-39776] Remove disabled security mode, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 27/Aug/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3346 (Title: [master] Mergebot Automated Train PR - 2018-Aug-27-23-27, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 27/Aug/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3344 (Title: [1.11] Mergebot Automated Train PR - 2018-Aug-27-19-19, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 28/Aug/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3321 (Title: [1.11] Mergebot Automated Train PR - 2018-Aug-27-19-20, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 28/Aug/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3321 (Title: [1.11] Mergebot Automated Train PR - 2018-Aug-27-19-20, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 28/Aug/18 ]

@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3148 (Title: Add timestamp for dmesg, distro version, timedatectl and systemd unit status to diag bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Sergey Urbanovich (Inactive) [ 28/Aug/18 ]

https://teamcity.mesosphere.io/viewLog.html?buildId=1183572&buildTypeId=DcOs_Open_Test_IntegrationTest_AwsOnpremWStaticBackend&tab=buildResultsDiv

Here is a really good example for the mesos team. Mesos couldn't start a UCR container on the host network. Vinod Kone could someone help us with this? Please check the Artifacts tab for logs and mesos sandboxes.

2018-08-28 18:14:39: I0828 18:14:39.182575 13694 containerizer.cpp:2006] Checkpointing container's forked pid 28601 to '/var/lib/mesos/slave/meta/slaves/55852c35-cc19-415d-a747-9dcfb7472e9d-S1/frameworks/55852c35-cc19-415d-a747-9dcfb7472e9d-0001/executors/integration-test-628cbd1f301b4c07bd6946cf4eb35168.39e5ccf3-aaee-11e8-a0d8-fe211abb3180/runs/a6ebb482-8bcf-4524-9b6b-0b91b3150efb/pids/forked.pid'
2018-08-28 18:24:38: I0828 18:24:38.591409 13695 slave.cpp:6790] Terminating executor 'integration-test-628cbd1f301b4c07bd6946cf4eb35168.39e5ccf3-aaee-11e8-a0d8-fe211abb3180' of framework 55852c35-cc19-415d-a747-9dcfb7472e9d-0001 because it did not register within 10mins
Comment by Mergebot [ 29/Aug/18 ]

@skumaran overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3346 (Title: [master] Mergebot Automated Train PR - 2018-Aug-27-23-27, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 30/Aug/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3362 (Title: Disable a watchdog for stuck processes, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 30/Aug/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3359 (Title: [1.10] Disable a watchdog for stuck processes, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 31/Aug/18 ]

@jp overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3370 (Title: Fix the release create stage., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 31/Aug/18 ]

@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3372 (Title: [master] Mergebot Automated Train PR - 2018-Aug-31-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Jan-Philip Gehrcke (Inactive) [ 03/Sep/18 ]

Terminating executor 'integration-test-628cbd1f301b4c07bd6946cf4eb35168.39e5ccf3-aaee-11e8-a0d8-fe211abb3180' of framework 55852c35-cc19-415d-a747-9dcfb7472e9d-0001 because it did not register within 10mins

Highly relevant discussion: https://github.com/dcos/dcos/pull/1801

Also see DCOS_OSS-1463 where we concluded before: "Docker pulls are causing test cases to fail by timeout and could be greatly improved with a dedicated proxy and minimalist docker image".

Comment by Mergebot [ 04/Sep/18 ]

@drozhkov overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3382 (Title: chore(dcos-ui): bump DC/OS UI dcos-ui/master+v2.19.4, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 04/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3359 (Title: Bumping marathon to 1.7.111, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 04/Sep/18 ]

@branden overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3347 (Title: [WIP] Add plugins to Telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 04/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3352 (Title: [master] Mergebot Automated Train PR - 2018-Sep-03-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 05/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3365 (Title: adminrouter: authentication architecture adjustments (WIP), Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 05/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3351 (Title: [1.11] Bump Mesos to nightly 1.5.x 19d17ce, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 05/Sep/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3366 (Title: [WIP] Add plugins to Telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 05/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3366 (Title: [WIP] Add plugins to Telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 05/Sep/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3366 (Title: [master] Mergebot Automated Train PR - 2018-Sep-05-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Vinod Kone (Inactive) [ 06/Sep/18 ]

Gilbert Song and Qian Zhang will triage this.

Comment by Mergebot [ 07/Sep/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3254 (Title: Prevent dcos-history leaking auth tokens, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Qian Zhang (Inactive) [ 07/Sep/18 ]

For the Docker containerizer's case (i.e., the test `test_networking.test_vip[Container_DOCKER-Network_BRIDGE-Network_HOST]`), I checked the stderr of the Docker executor and found an error:

...
E0830 00:09:37.303499 2428 executor.cpp:385] Failed to inspect container 'mesos-eaa4f455-0a2c-47ff-bf98-8bd0ad243740': Unable to create container: Unable to find Id in container
[2018-08-30 00:09:37,745] INFO: HTTP server is starting, port: 3511, test-UUID: '0d4176ad55894360907e0e4ea6ce0f81'
...

So the Docker executor has already launched the Docker container, but the output of `docker inspect` does not include the container's ID. This is weird; I have never seen this issue before.
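
For context, a minimal sketch in Python (the real executor is C++) of the kind of check that fails here: `docker inspect` prints a JSON array whose entries normally carry an `Id` field, and the error above corresponds to that field being missing. The helper name is made up.

import json
import subprocess

def docker_container_id(name):
    # `docker inspect <name>` prints a JSON array with one object per container.
    out = subprocess.run(['docker', 'inspect', name],
                         capture_output=True, text=True, check=True).stdout
    info = json.loads(out)
    # The failure described above corresponds to 'Id' being absent here.
    if not info or 'Id' not in info[0]:
        return None
    return info[0]['Id']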

Comment by Qian Zhang (Inactive) [ 07/Sep/18 ]

For the Mesos containerizer's cases, after checking the logs, I found there are actually two different failure modes.

Case 1:

In the stderr of the executor, I see only one message:

Failed to synchronize with agent (it's probably exited)

This is an error which could happen when reading a pipe.

Case 2:

The stderr of the executor is empty, and in agent log I see:

2018-09-03 20:30:06: I0903 20:30:06.007843 13710 cni.cpp:952] Bind mounted '/proc/13189/ns/net' to '/run/mesos/isolators/network/cni/d955d3cb-099e-496a-87a9-fc89ef3567ef/ns' for container d955d3cb-099e-496a-87a9-fc89ef3567ef
2018-09-03 20:50:13: I0903 20:50:13.872481 13713 cni.cpp:1383] Got assigned IPv4 address '172.31.254.22/24' from CNI network 'mesos-bridge' for container d955d3cb-099e-496a-87a9-fc89ef3567ef

So it took 20 minutes for the CNI isolator to get an IP for the container, which is weird.

Comment by Karsten Jeschkies (Inactive) [ 10/Sep/18 ]

Qian Zhang, do you think it would make sense to track the issues separately? This flake still shows up as a single issue in Carter Gawron's summaries even though there are multiple independent issues.

Comment by Carter Gawron [ 10/Sep/18 ]

Anything we can do to split this up and resolve it would be great. We have overridden this issue 119 times. That's ~40 hours' worth of work just doing that.

Comment by Senthil Kumaran (Inactive) [ 10/Sep/18 ]

Karsten - Those multiple independent issues can be tracked separately and this issue made a dependency on those. We have already done that for this ticket; please see the issue links. Once the core issues are fixed and the flakiness is resolved, we should close this issue. I hope you do not mean that we should close this issue and track those independent issues instead; that won't help much IMO.

We are close to resolution on this problem, and it will be great to get across the finish line with this.

Comment by Qian Zhang (Inactive) [ 10/Sep/18 ]

I suspect this issue (at least the Mesos containerizer's case) may be caused by the FD leak bug that Gilbert recently fixed in Mesos; that fix landed in the DC/OS master branch 5 days ago.

If this issue happens again, this ticket will be automatically updated by Mergebot with a new comment, right? I will keep monitoring this ticket and see if there is anything different if this issue happens again.

Comment by Mergebot [ 10/Sep/18 ]

@charlesprovencher overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3416 (Title: [1.11] Mergebot Automated Train PR - 2018-Sep-10-16-49, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 10/Sep/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3416 (Title: [1.11] Mergebot Automated Train PR - 2018-Sep-10-16-49, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 11/Sep/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3414 (Title: [1.10] Backported detailed resource logging for some allocator errors in Mesos., Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 11/Sep/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3371 (Title: Add wait command to dcos-docker instructions, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Qian Zhang (Inactive) [ 11/Sep/18 ]

This test failed again for the Mesos containerizer case (see the last comment added by MergeBot), in another place. This time, in the executor's stderr, I see the task started successfully and the health check returned 200, which is also good.

[2018-09-07 23:18:36,926] INFO: HTTP server is starting, port: 12830, test-UUID: '0979a2280fc3431e9904885603a0c810'
[2018-09-07 23:18:50,273] INFO: REQ: 127.0.0.1 "GET /ping HTTP/1.1" 200 -
I0907 23:18:50.315387 9 checker_process.cpp:1140] HTTP health check for task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' returned: 200
I0907 23:18:50.315495 9 executor.cpp:350] Received task health update, healthy: true

But the weird thing is, the agent did not receive any status updates for this task from the executor.

$ grep integration-test-0979a2280fc3431e9904885603a0c810 ~/Downloads/dcos-mesos-slave.service
2018-09-07 23:18:34: I0907 23:18:34.559394 16711 slave.cpp:2035] Got assigned task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:18:34: I0907 23:18:34.559952 16711 slave.cpp:2409] Authorizing task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:18:34: I0907 23:18:34.560616 16711 slave.cpp:2852] Launching task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:18:34: I0907 23:18:34.560703 16711 paths.cpp:745] Creating sandbox '/var/lib/mesos/slave/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f' for user 'root'
2018-09-07 23:18:34: I0907 23:18:34.561393 16711 paths.cpp:748] Creating sandbox '/var/lib/mesos/slave/meta/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f'
2018-09-07 23:18:34: I0907 23:18:34.561604 16711 slave.cpp:9015] Launching executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 with resources [{"allocation_info":{"role":"slave_public"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/slave/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f'
2018-09-07 23:18:34: I0907 23:18:34.561944 16711 slave.cpp:3530] Launching container e56b8602-0de7-4b57-bc61-c839a28e554f for executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:18:34: I0907 23:18:34.562489 16711 slave.cpp:3049] Queued task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:18:34: I0907 23:18:34.603900 16711 containerizer.cpp:2022] Checkpointing container's forked pid 28352 to '/var/lib/mesos/slave/meta/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f/pids/forked.pid'
2018-09-07 23:18:35: I0907 23:18:35.235574 16710 slave.cpp:4824] Got registration for executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 from executor(1)@10.10.0.145:33052
2018-09-07 23:18:35: I0907 23:18:35.263344 16708 slave.cpp:3262] Sending queued task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' to executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 at executor(1)@10.10.0.145:33052
2018-09-07 23:38:38: I0907 23:38:38.775454 16713 slave.cpp:3636] Asked to kill task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:38:39: I0907 23:38:39.933745 16710 slave.cpp:6310] Executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 exited with status 0
2018-09-07 23:38:39: I0907 23:38:39.933826 16710 slave.cpp:5290] Handling status update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 from @0.0.0.0:0
2018-09-07 23:38:39: E0907 23:38:39.933982 16710 slave.cpp:5621] Failed to update resources for container e56b8602-0de7-4b57-bc61-c839a28e554f of executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' running task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 on status update for terminal task, destroying container: Container not found
2018-09-07 23:38:39: I0907 23:38:39.934042 16710 task_status_update_manager.cpp:328] Received task status update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:38:39: I0907 23:38:39.934334 16710 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:38:39: I0907 23:38:39.934468 16712 slave.cpp:5782] Forwarding the update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 to master@10.10.0.104:5050
2018-09-07 23:38:39: I0907 23:38:39.950619 16712 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:38:39: I0907 23:38:39.950681 16712 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000
2018-09-07 23:38:39: I0907 23:38:39.950841 16712 slave.cpp:6408] Cleaning up executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 at executor(1)@10.10.0.145:33052
2018-09-07 23:38:39: I0907 23:38:39.951190 16712 gc.cpp:95] Scheduling '/var/lib/mesos/slave/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f' for gc 1.9999889926163days in the future
2018-09-07 23:38:39: I0907 23:38:39.951225 16712 gc.cpp:95] Scheduling '/var/lib/mesos/slave/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for gc 1.99998899182815days in the future
2018-09-07 23:38:39: I0907 23:38:39.951248 16712 gc.cpp:95] Scheduling '/var/lib/mesos/slave/meta/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f' for gc 1.99998899146963days in the future
2018-09-07 23:38:39: I0907 23:38:39.951269 16712 gc.cpp:95] Scheduling '/var/lib/mesos/slave/meta/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for gc 1.99998899119111days in the future

Only one task status (`TASK_FAILED`) was handled by the agent for this task, but I suspect that status update was generated by the agent itself rather than sent by the executor. It looks like the executor cannot send any status updates to the agent.

Comment by Karsten Jeschkies (Inactive) [ 12/Sep/18 ]

> Karsten - Those multiple independent issues can be tracked separately and this issue made a dependency on those. We have already done that for this ticket; please see the issue links. Once the core issues are fixed and the flakiness is resolved, we should close this issue. I hope you do not mean that we should close this issue and track those independent issues instead; that won't help much IMO.
>
> We are close to resolution on this problem, and it will be great to get across the finish line with this.

Well, as I said, Carter's weekly summary did not surface these details. Maybe it should not. However, as an engineer, I find this ticket very hard to follow. Mergebot is generating a lot of noise, and the core issues under investigation are not obvious from looking at this ticket. Also, the only two open related issues are MARATHON-8235 and DCOS_OSS-3736, neither of which seems to address the symptoms Qian Zhang describes.

As Aleksey Dukhovniy and I mentioned before to Pawel Rozlach and Fabricio de Sousa Nascimento, the test should be split up, and so should this ticket. There is no way we can help and get ahead of this if the rest of the company thinks that test_vip is one flaky test. It is not.

Anyways, kudos to Qian for diving into this.

Comment by Jan-Philip Gehrcke (Inactive) [ 12/Sep/18 ]

The override command data shows that the test_vip instability has hurt us really badly in the past two months (more than any other instability) and justifies assembling a "tiger team" of individual domain experts that focuses on finding and addressing the individual causes of the test_vip instability.

From the looks of it we almost have such a team (comprised of Aleksey Dukhovniy, Sergey Urbanovich, Karsten Jeschkies, Qian Zhang, ...), but I think we should make this a first-class effort and make sure that they can focus. CC Artem Harutyunyan Chandler Hoisington.

> As Aleksey Dukhovniy and I mentioned before to Pawel Rozlach and Fabricio de Sousa Nascimento, the test should be split up, and so should this ticket. There is no way we can help and get ahead of this if the rest of the company thinks that test_vip is one flaky test. It is not.

While some might indeed think that, you can be sure that others (like me) know how diverse and mean the test_vip instability is. We know that it runs many more Marathon apps than other tests, which is why, statistically, it suffers from even minor instabilities around app and task launches. And getting to the bottom of the individual, independent causes is indeed what we must focus on.

I propose:

  • Let's keep this ticket open and let's consider it to be the right ticket to issue override commands against (we should not expect all developers issuing override commands to be able to distinguish the different sub-types of test_vip failure).
  • Let's have most of the casual discussion on Slack in #test-vip-flakiness.
  • I would really love to see a concise summary of our current state of knowledge about individual, independent causes (or theories about them). Let's try to assemble this in a shared Google Doc maybe? Every theory / cause should get its own JIRA ticket and its own section in the Google Doc.
Comment by Mergebot [ 12/Sep/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3426 (Title: [master] Mergebot Automated Train PR - 2018-Sep-12-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Qian Zhang (Inactive) [ 12/Sep/18 ]

This test failed again, but it seems there are no logs?

Comment by Mergebot [ 12/Sep/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3357 (Title: [Backport] [1.11] bump mesos-module to include the fix for coreos 1800.7.0, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3406 (Title: exhibitor package: bump ZooKeeper to 3.4.13 release, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Sep/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3431 (Title: chore: bump dcos-ui v1.10+v1.10.9-rc3, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Sep/18 ]

@jonathangiddy overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3428 (Title: exhibitor package: bump ZooKeeper to 3.4.13 release, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Sep/18 ]

@gauripowale overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3437 (Title: [1.11] Mergebot Automated Train PR - 2018-Sep-13-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Qian Zhang (Inactive) [ 13/Sep/18 ]

I created an OSS ticket (https://issues.apache.org/jira/browse/MESOS-9231) to trace the issue of Docker containerizer, and I will try to manually reproduce the issue of UCR with `dcos-launch`.

Comment by Mergebot [ 13/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3404 (Title: [master] Mergebot Automated Train PR - 2018-Sep-12-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Sep/18 ]

@kapil overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3350 (Title: Bumped Mesos SHA for dc/os 1.11 container cleanup EBUSY fix., Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Sep/18 ]

@jp overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3447 (Title: [1.11] Bump Mesos to nightly 1.5.x 5a7ad47, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3434 (Title: openssl: bump to 1.0.2p, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Sep/18 ]

@gpaul overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3443 (Title: openssl: bump to 1.0.2p, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Sep/18 ]

@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3443 (Title: openssl: bump to 1.0.2p, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Gustav Paul (Inactive) [ 17/Sep/18 ]

Any idea why we no longer artifact the sandbox logs?

We were tracking collection of sandbox logs in https://jira.mesosphere.com/browse/DCOS-39211 which is resolved, yet I don't see any sandbox logs in any of today's overrides.

This test is unbelievably flaky and we're about to GA a release while this test (which exercises our stack end-to-end) is failing several times per day. As far as I understand the current status, there is no way to make progress without the sandbox logs.

Senthil Kumaran Charles Provencher Patrick Crews Carter Gawron Please help!

Comment by Senthil Kumaran (Inactive) [ 17/Sep/18 ]

Hi Gustav Paul - The sandbox logs are collected. Also, the general trend for us is to "improve" and collect more logs, not to regress on what we already collect.

The latest failures:

1) teamcity/dcos/test/aws/onprem/static - the DC/OS installation itself had failed. It can be seen from the build logs in the console: `+ ./dcos-launch wait` kept waiting for the cluster to come up, and it never came up. Transient network issues, perhaps? Would journald logs from the bootstrap node help in addition to the console logs? (That would be an addition, but we can already gather from the console logs what happened.) Since DC/OS didn't come up, we don't have master or sandbox logs here.

2) teamcity/dcos/test/dcos-docker/static - Collection of logs for this has not been added yet. Only the enterprise side was added by Charles recently, and the steps need to be copied over to the open-source status checks too. This is in progress as we are trying to make sure all status checks have a consistent set of logs (https://jira.mesosphere.com/browse/DCOS-41749).

In this specific scenario, re-triggering teamcity/dcos/test/aws/onprem/static must have helped, and Qian is assisting us with the mesos bug here and has gotten access to valuable logs so far. HTH.

Comment by Senthil Kumaran (Inactive) [ 17/Sep/18 ]

Gustav Paul - If you look at any other failure that is an actual test_vip failure, and not a cluster creation failure (which shouldn't be linked to this ticket), you will find the sandbox logs available - e.g. https://teamcity.mesosphere.io/viewLog.html?buildId=1210700&buildTypeId=DcOs_Open_Test_IntegrationTest_AwsOnpremWStaticBackend&tab=artifacts

Comment by Gustav Paul (Inactive) [ 17/Sep/18 ]

Thanks Senthil Kumaran! I'm still confused though, for example this build from yesterday:
https://teamcity.mesosphere.io/viewLog.html?buildId=1210127&buildTypeId=DcOs_Enterprise_Test_DockerBased_DockerWStaticBackendAndSecurityStrict&tab=artifacts#!bxqmlynw4w0,-tz1l3ge8t9dp,12ba5z8v1dgud,-rugbygsxq5yj,-rugbygsxqa5x,1fommawh58u7w

That is an Enterprise strict mode build, test_vip failed, this was yesterday, and I don't see the sandbox logs, while I do see them for the build you linked (awesome, btw.)

Do you perhaps mean that the Enterprise builds don't collect sandbox logs yet but the OSS builds do?

Comment by Senthil Kumaran (Inactive) [ 17/Sep/18 ]

Hi Gustav Paul - That was a miss on our side (Tools Infra). The job that you linked should have sandbox logs collected. Looks like we failed to add it to the Docker job on Enterprise. I have added a comment on that task, https://jira.mesosphere.com/browse/DCOS_OSS-3738, asking for it to be reopened and completed.

This epic - DCOS-41749 - tracks making sure all relevant logs are made available consistently for the TC jobs.

Comment by Gustav Paul (Inactive) [ 17/Sep/18 ]

Thanks Senthil Kumaran - the log collection effort is fiddly, but I believe it's going to pay for itself a thousand times over.

Comment by Mergebot [ 17/Sep/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3452 (Title: 1.12.0 beta2 train, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Sep/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3436 (Title: Fix 500 responses from v0 metrics API, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3455 (Title: [1.11] packages/bootstrap: Do not remove permissions from dcos_marathon and dcos_metronome service accounts, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@drozhkov overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3462 (Title: Bump ui to v1.22.0, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3242 (Title: Add LDAP_GROUP_IMPORT_LIMIT_SECONDS Bouncer configuration variable, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3411 (Title: Enabled GC of nested container sandboxes by the Mesos agent., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3448 (Title: Fix expected sha1 value for the rewrite_amd64_en-US.msi installer, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3448 (Title: Fix expected sha1 value for the rewrite_amd64_en-US.msi installer, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3457 (Title: [1.10] Mergebot Automated Train PR - 2018-Sep-19-10-00, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@jonathangiddy overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3464 (Title: (1.12) Fix 500 responses from v0 metrics API, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Sep/18 ]

@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3472 (Title: Add more context to vip test app names, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 20/Sep/18 ]

@kapil overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3477 (Title: [1.12] Bump Mesos to nightly 1.7.x 06eb5ba, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Qian Zhang (Inactive) [ 21/Sep/18 ]

I reproduced a failed case of the Mesos containerizer in a DC/OS cluster launched with `dcos-docker`. It was caused by a container which was stuck in the `ISOLATING` state; here are the agent logs for that container:

Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.343425 1275 paths.cpp:745] Creating sandbox '/var/lib/mesos/slave/slaves/453c7b65-353f-4944-9b49-d5dcbba2e6f5-S2/frameworks/453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001/executors/integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003/runs/85809953-a904-4823-9279-a46b023be09a' for user 'root'
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.345252 1275 paths.cpp:748] Creating sandbox '/var/lib/mesos/slave/meta/slaves/453c7b65-353f-4944-9b49-d5dcbba2e6f5-S2/frameworks/453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001/executors/integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003/runs/85809953-a904-4823-9279-a46b023be09a'
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.347165 1275 slave.cpp:8997] Launching executor 'integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003' of framework 453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001 with resources [{"allocation_info":{"role":"slave_public"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/slave/slaves/453c7b65-353f-4944-9b49-d5dcbba2e6f5-S2/frameworks/453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001/executors/integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003/runs/85809953-a904-4823-9279-a46b023be09a'
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.348870 1275 slave.cpp:3530] Launching container 85809953-a904-4823-9279-a46b023be09a for executor 'integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003' of framework 453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.352116 1275 containerizer.cpp:1282] Starting container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.354837 1275 provisioner.cpp:545] Provisioning image rootfs '/var/lib/mesos/slave/provisioner/containers/85809953-a904-4823-9279-a46b023be09a/backends/overlay/rootfses/02148442-f072-46e2-8809-b43f982e784d' for container 85809953-a904-4823-9279-a46b023be09a using overlay backend
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.360311 1275 containerizer.cpp:3120] Transitioning the state of container 85809953-a904-4823-9279-a46b023be09a from PROVISIONING to PREPARING
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.375891 1279 memory.cpp:478] Started listening for OOM events for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.376188 1279 memory.cpp:590] Started listening on 'low' memory pressure events for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.376269 1279 memory.cpp:590] Started listening on 'medium' memory pressure events for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.376343 1279 memory.cpp:590] Started listening on 'critical' memory pressure events for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.381487 1275 cpu.cpp:92] Updated 'cpu.shares' to 204 (cpus 0.2) for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.381837 1275 cpu.cpp:112] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.382069 1275 memory.cpp:198] Updated 'memory.soft_limit_in_bytes' to 64MB for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.393718 1275 memory.cpp:227] Updated 'memory.limit_in_bytes' to 64MB for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.422675 1279 secret.cpp:309] 0 secrets have been resolved for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.521315 1277 switchboard.cpp:316] Container logger module finished preparing container 85809953-a904-4823-9279-a46b023be09a; IOSwitchboard server is not required
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.534044 1275 linux_launcher.cpp:492] Launching container 85809953-a904-4823-9279-a46b023be09a and cloning with namespaces CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNET
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.800432 1277 containerizer.cpp:2046] Checkpointing container's forked pid 20039 to '/var/lib/mesos/slave/meta/slaves/453c7b65-353f-4944-9b49-d5dcbba2e6f5-S2/frameworks/453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001/executors/integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003/runs/85809953-a904-4823-9279-a46b023be09a/pids/forked.pid'
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.805660 1277 containerizer.cpp:3120] Transitioning the state of container 85809953-a904-4823-9279-a46b023be09a from PREPARING to ISOLATING
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.814954 1279 cni.cpp:962] Bind mounted '/proc/20039/ns/net' to '/run/mesos/isolators/network/cni/85809953-a904-4823-9279-a46b023be09a/ns' for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:41 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:41.238293 1281 cni.cpp:1394] Got assigned IPv4 address '172.31.254.185/24' from CNI network 'mesos-bridge' for container 85809953-a904-4823-9279-a46b023be09a
Sep 21 07:21:41 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:41.239543 1281 cni.cpp:1102] Unable to find DNS nameservers for container 85809953-a904-4823-9279-a46b023be09a, using host '/etc/resolv.conf'
Sep 21 07:31:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:31:39.350037 1276 containerizer.cpp:2457] Destroying container 85809953-a904-4823-9279-a46b023be09a in ISOLATING state
Sep 21 07:31:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:31:39.350167 1276 containerizer.cpp:3120] Transitioning the state of container 85809953-a904-4823-9279-a46b023be09a from ISOLATING to DESTROYING

So the container was stuck in the `ISOLATING` state for 10 minutes, after which the containerizer tried to destroy it. The destroy can never finish, though, because it has to wait for the isolators to finish isolating, which means some isolator's `isolate()` method never returned. I will add more logging to the isolators and try to figure out which one caused this issue.
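
To make the failure mode above concrete, here is a minimal sketch (hypothetical Python using asyncio; the actual Mesos containerizer is C++) of how a single hanging isolate() pins a container in ISOLATING and, in turn, blocks its destruction. The isolator names and container ID are illustrative only:

# Hypothetical sketch, not Mesos code: the containerizer waits for every
# isolator's isolate() before leaving ISOLATING, and destroy in turn waits
# for the in-flight isolation, so one hanging isolator blocks both.
import asyncio

async def cgroups_isolate(container_id: str) -> None:
    await asyncio.sleep(0.01)  # a well-behaved isolator returns quickly

async def cni_isolate(container_id: str) -> None:
    # Stand-in for an isolate() call that never returns (the suspected culprit).
    await asyncio.Event().wait()

async def launch(container_id: str) -> None:
    print(f"{container_id}: PREPARING -> ISOLATING")
    # All isolators must finish before the container may leave ISOLATING.
    await asyncio.gather(cgroups_isolate(container_id), cni_isolate(container_id))
    print(f"{container_id}: ISOLATING -> RUNNING")  # never reached in this repro

async def main() -> None:
    cid = "85809953-a904-4823-9279-a46b023be09a"
    isolating = asyncio.create_task(launch(cid))
    await asyncio.sleep(1)  # stand-in for the 10-minute registration timeout
    print(f"{cid}: ISOLATING -> DESTROYING")
    # Destroy has to wait for the pending isolation; give up after 2 seconds
    # here to show that it would otherwise wait forever.
    done, pending = await asyncio.wait({isolating}, timeout=2)
    if pending:
        print(f"{cid}: destroy is stuck waiting on a hanging isolate()")

if __name__ == "__main__":
    asyncio.run(main())

Running the sketch prints the same PREPARING -> ISOLATING -> DESTROYING sequence seen in the agent logs, with the destroy never completing on its own.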

Comment by Mergebot [ 24/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3461 (Title: [master] Mergebot Automated Train PR - 2018-Sep-19-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 24/Sep/18 ]

Pull Request, https://github.com/dcos/dcos/pull/3475, associated with the JIRA ticket was merged into DC/OS 1.12.0

Comment by Mergebot [ 25/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3330 (Title: Mesos modules: increase network timeout, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 25/Sep/18 ]

@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3505 (Title: [1.11] Add more data to diagnostics bundle, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 25/Sep/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3507 (Title: [BACKPORT] Mesos modules Increased IAM timeout, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 27/Sep/18 ]

@philip overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3519 (Title: [1.12] Telegraf fixes, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 27/Sep/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3515 (Title: [master] Mergebot Automated Train PR - 2018-Sep-26-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 27/Sep/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3512 (Title: [master] Mergebot Automated Train PR - 2018-Sep-26-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 28/Sep/18 ]

@philip overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3533 (Title: Marathon remove precheck on single node 1.11, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 28/Sep/18 ]

@philip overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3531 (Title: Marathon remove precheck on single node 1.11, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Oct/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3536 (Title: New config.yaml to support Windows Build Artifacts in Separate S3 Bucket, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Matthias Eichstedt (Inactive) [ 02/Oct/18 ]

DCOS-19619 is also suffering from TaskGroup containers stuck in STARTING – I've linked it as a duplicate, but there are no sandboxes available to verify that the root cause is the same.

Comment by Mergebot [ 02/Oct/18 ]

@drozhkov overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3523 (Title: chore(dcos-ui): bump DC/OS UI v2.24.4, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 02/Oct/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3539 (Title: [1.12][DCOS-42419] Add UCR Support for package registry by adding v2 schema 1 manifests, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 02/Oct/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3553 (Title: 1.12: Pass ssl_keystore_password via MARATHON_ environment variables, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 03/Oct/18 ]

@jonathangiddy overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3567 (Title: [master] packages/bouncer: bump bouncer, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 04/Oct/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3503 (Title: bump mesos-dns to bring in changes for mesos state endpoint, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 04/Oct/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3504 (Title: bump mesos-dns to bring in changes for mesos state endpoint, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 05/Oct/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3530 (Title: Marathon remove precheck on single node Master, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/Oct/18 ]

@branden overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3571 (Title: [1.12] Grant containers dir ownership to dcos_telegraf, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 09/Oct/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3578 (Title: [1.12] Add root capabilities to dcos-diagnostics, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 11/Oct/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3626 (Title: Skip test_packaging_api, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 11/Oct/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3627 (Title: Skip test_packaging_api, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 11/Oct/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3607 (Title: [1.11] Bump dcos-net, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 11/Oct/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3608 (Title: [1.10] Bump navstar, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 12/Oct/18 ]

@timweidner overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3629 (Title: [1.12] Backport tweidner/adangoor/fix-mesos-api-test-flake, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 12/Oct/18 ]

@branden overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3610 (Title: (1.12) Assert system clock is synced before starting dcos-exhibitor, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 15/Oct/18 ]

@timweidner overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3575 (Title: Handle exceptions during Metronome startup, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 16/Oct/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3625 (Title: Bump to the newest Metronome, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 16/Oct/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3653 (Title: [1.12] Bump dcos-net, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Oct/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3667 (Title: 1.11 train 10/17, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Oct/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3662 (Title: [master] Mergebot Automated Train PR - 2018-Oct-17-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Oct/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3662 (Title: [master] Mergebot Automated Train PR - 2018-Oct-17-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 17/Oct/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3661 (Title: [1.12] Mergebot Automated Train PR - 2018-Oct-17-12-01, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3645 (Title: [1.12] Update dcos-diagnostics, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@jp overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3642 (Title: Update dcos-diagnostics, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3642 (Title: Update dcos-diagnostics, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3648 (Title: maintenance_mode is enabled by default in 1.8 (1.13), Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3649 (Title: [1.10] Bump Mesos to nightly 1.4.x 82df2a4, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3654 (Title: Bump dcos-net, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3670 (Title: [1.12] Always prefer to serve schema 2 over schema 1 docker manifest, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3673 (Title: [master] Mergebot Automated Train PR - 2018-Oct-18-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3652 (Title: [master] Mergebot Automated Train PR - 2018-Oct-18-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 18/Oct/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3665 (Title: packages/dcos-integration-test/test_tls: Enable dcos-net TLS tests, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Oct/18 ]

@jp overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3638 (Title: chore(dcos-ui): bump package to 1.11+v1.24.0, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Oct/18 ]

@skumaran overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3582 (Title: Do not pull overlay data when overlay is disable, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Oct/18 ]

@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3568 (Title: [1.11] Do not pull overlay data when overlay is disable, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 19/Oct/18 ]

@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3654 (Title: bump marathon 1.6.654, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 20/Oct/18 ]

@klueska overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3645 (Title: Bump dcos-log, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 20/Oct/18 ]

@klueska overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 22/Oct/18 ]

@klueska overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3645 (Title: Bump dcos-log, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 23/Oct/18 ]

@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3682 (Title: [1.11] Bump Mesos to nightly 1.5.x 2ead30d, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Adam Dangoor (Inactive) [ 23/Oct/18 ]

What does it mean that this is "In Progress" but not assigned?

Comment by Mergebot [ 23/Oct/18 ]

@timweidner overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3692 (Title: [DCOS-43342] Retry reserving disk in Mesos v0 scheduler test., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 23/Oct/18 ]

@greg overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3590 (Title: [1.11] Add Mesos patches to ensure TEARDOWN is sent in v1 Java shim., Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 24/Oct/18 ]

@greg overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3591 (Title: [1.10] Add Mesos patches to ensure TEARDOWN is sent in v1 Java shim., Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Greg Mann (Inactive) [ 25/Oct/18 ]

It seems we may have multiple failure modes leading to this test failure. Here are the Mesos agent logs from a repro I just obtained of test_vip[Container.POD-Network.USER-Network.BRIDGE], filtered for the task and container ID of the test_vip task:

Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.486881  2084 slave.cpp:2035] Got assigned task group containing tasks [ integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 ] for framework 00dc552a-2133-45b6-b2f1-15651df01139-0001
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.488046  2084 slave.cpp:2409] Authorizing task group containing tasks [ integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 ] for framework 00dc552a-2133-45b6-b2f1-15651df01139-0001
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.489056  2084 slave.cpp:8469] Authorizing framework principal 'dcos_marathon' to launch task integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.500901  2084 slave.cpp:2852] Launching task group containing tasks [ integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 ] for framework 00dc552a-2133-45b6-b2f1-15651df01139-0001
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.500946  2084 paths.cpp:745] Creating sandbox '/var/lib/mesos/slave/slaves/00dc552a-2133-45b6-b2f1-15651df01139-S1/frameworks/00dc552a-2133-45b6-b2f1-15651df01139-0001/executors/instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001/runs/a06b3776-7b56-4ebc-9926-144ae795877a' for user 'nobody'
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.501479  2084 paths.cpp:748] Creating sandbox '/var/lib/mesos/slave/meta/slaves/00dc552a-2133-45b6-b2f1-15651df01139-S1/frameworks/00dc552a-2133-45b6-b2f1-15651df01139-0001/executors/instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001/runs/a06b3776-7b56-4ebc-9926-144ae795877a'
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.501626  2084 slave.cpp:8997] Launching executor 'instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001' of framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 with resources [{"allocation_info":{"role":"slave_public"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"disk","scalar":{"value":10.0},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"ports","ranges":{"range":[{"begin":13463,"end":13463}]},"type":"RANGES"}] in work directory '/var/lib/mesos/slave/slaves/00dc552a-2133-45b6-b2f1-15651df01139-S1/frameworks/00dc552a-2133-45b6-b2f1-15651df01139-0001/executors/instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001/runs/a06b3776-7b56-4ebc-9926-144ae795877a'
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.501796  2084 jwt_secret_generator.cpp:71] Generated token 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJjaWQiOiJhMDZiMzc3Ni03YjU2LTRlYmMtOTkyNi0xNDRhZTc5NTg3N2EiLCJlaWQiOiJpbnN0YW5jZS1pbnRlZ3JhdGlvbi10ZXN0LWJmODUxN2VjNTcxMTQ1NDY5MWY2ZjFjMjgxODRmYTA3LmI2YmY2Zjg5LWQ4ODEtMTFlOC05YmYzLTcwYjNkNTgwMDAwMSIsImZpZCI6IjAwZGM1NTJhLTIxMzMtNDViNi1iMmYxLTE1NjUxZGYwMTEzOS0wMDAxIn0.im3hKnkvU-ztJIBU8-BLfRjHLzxP0-7BRg_egQNphO8' for principal '{"claims":{"cid":"a06b3776-7b56-4ebc-9926-144ae795877a","fid":"00dc552a-2133-45b6-b2f1-15651df01139-0001","eid":"instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001"}}' using secret (base64) 'e0s2LVp0ZDNDdklXYDQkekx0Km0tTiQ/dzNvNzYoOVE0RXJEIVQ5Ul9UOHlHNlY7S09zS019SkVkanA0KTBOM2ZtXnQ9WVBHckR2anhFbTs3fH1zTkA9Y1pzcUwpYWs+YFlgV25QcUU7JGZISDBELSNePT1HIWNuPT8/QjMhfCRiSGZnST9qVT9jZGRDfT56QmxCaFlKMTcrV2Z8S2g/N3dYOSpXaV5fYXRNc3NRc3h8WUR3aWVaUUJuY0FNQG9gdlpvZlNVSDVBLVdqX3dOWmQ4dE8tVmFfbXItY3lgY0UwYGs0XmNuVzJmJGgpaVp1cW8kIXV8OTdPcGd3bUBQe0F+fXR2fEZGVSlUflcqZ0tWUHdvQnxPWWxqemFWPlJDNGhpaVJOMStFdStDbzhSWVhSWFRXfFFFUnBKME0/KHo='
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.502132  2084 slave.cpp:3049] Queued task group containing tasks [ integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 ] for executor 'instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001' of framework 00dc552a-2133-45b6-b2f1-15651df01139-0001
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.503170  2084 slave.cpp:3530] Launching container a06b3776-7b56-4ebc-9926-144ae795877a for executor 'instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001' of framework 00dc552a-2133-45b6-b2f1-15651df01139-0001
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.504715  2085 containerizer.cpp:1282] Starting container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.505230  2085 containerizer.cpp:3124] Transitioning the state of container a06b3776-7b56-4ebc-9926-144ae795877a from PROVISIONING to PREPARING
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.519476  2081 memory.cpp:478] Started listening for OOM events for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.519562  2081 memory.cpp:590] Started listening on 'low' memory pressure events for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.519753  2081 memory.cpp:590] Started listening on 'medium' memory pressure events for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.519860  2081 memory.cpp:590] Started listening on 'critical' memory pressure events for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.532474  2082 cpu.cpp:92] Updated 'cpu.shares' to 102 (cpus 0.1) for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.532487  2086 memory.cpp:198] Updated 'memory.soft_limit_in_bytes' to 32MB for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.532569  2082 cpu.cpp:112] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.532605  2086 memory.cpp:227] Updated 'memory.limit_in_bytes' to 32MB for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.535212  2082 secret.cpp:309] 0 secrets have been resolved for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.634297  2080 switchboard.cpp:316] Container logger module finished preparing container a06b3776-7b56-4ebc-9926-144ae795877a; IOSwitchboard server is not required
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.637706  2087 linux_launcher.cpp:492] Launching container a06b3776-7b56-4ebc-9926-144ae795877a and cloning with namespaces CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNET
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.648129  2086 containerizer.cpp:2046] Checkpointing container's forked pid 25844 to '/var/lib/mesos/slave/meta/slaves/00dc552a-2133-45b6-b2f1-15651df01139-S1/frameworks/00dc552a-2133-45b6-b2f1-15651df01139-0001/executors/instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001/runs/a06b3776-7b56-4ebc-9926-144ae795877a/pids/forked.pid'
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.649010  2086 containerizer.cpp:3124] Transitioning the state of container a06b3776-7b56-4ebc-9926-144ae795877a from PREPARING to ISOLATING
Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.690057  2080 cni.cpp:960] Bind mounted '/proc/25844/ns/net' to '/run/mesos/isolators/network/cni/a06b3776-7b56-4ebc-9926-144ae795877a/ns' for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:47 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:47.922582  2080 cni.cpp:1394] Got assigned IPv4 address '172.31.254.22/24' from CNI network 'mesos-bridge' for container a06b3776-7b56-4ebc-9926-144ae795877a
Oct 25 18:13:47 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:47.924751  2083 cni.cpp:1100] Unable to find DNS nameservers for container a06b3776-7b56-4ebc-9926-144ae795877a, using host '/etc/resolv.conf'
Oct 25 18:23:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:23:46.504267  2085 slave.cpp:6793] Terminating executor 'instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001' of framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 because it did not register within 10mins
Oct 25 18:23:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:23:46.504505  2085 containerizer.cpp:2457] Destroying container a06b3776-7b56-4ebc-9926-144ae795877a in ISOLATING state
Oct 25 18:23:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:23:46.504531  2085 containerizer.cpp:3124] Transitioning the state of container a06b3776-7b56-4ebc-9926-144ae795877a from ISOLATING to DESTROYING

It looks like in this particular case the container was, again, stuck in the ISOLATING state.
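
For anyone reproducing this from an attached agent log, a rough sketch of one way to filter a saved log down to a single container's state transitions, similar to the excerpt above (Python; the file name and container ID are just examples):

# Hypothetical log-filtering helper: extract the containerizer lifecycle
# transitions for one container from a saved mesos-agent log.
import re
import sys

def container_timeline(log_path, container_id):
    """Return the lines that show the container's lifecycle transitions."""
    pattern = re.compile(
        r"Transitioning the state of container|Destroying container|Starting container")
    timeline = []
    with open(log_path, errors="replace") as fh:
        for line in fh:
            if container_id in line and pattern.search(line):
                timeline.append(line.rstrip())
    return timeline

if __name__ == "__main__":
    # Example: python container_timeline.py mesos-agent.log a06b3776-7b56-4ebc-9926-144ae795877a
    for entry in container_timeline(sys.argv[1], sys.argv[2]):
        print(entry)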

Comment by Mergebot [ 25/Oct/18 ]

@gaston overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3679 (Title: bump marathon to 1.5.12, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Qian Zhang (Inactive) [ 25/Oct/18 ]

Thanks Greg Mann, I think that is MESOS-9334; we will fix it soon.

Comment by Mergebot [ 30/Oct/18 ]

@philip overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3696 (Title: [1.11] Merge dependent tests into one big scenario, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 30/Oct/18 ]

@philip overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3589 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 30/Oct/18 ]

@klueska overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3663 (Title: bump dcos-log, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 30/Oct/18 ]

@klueska overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3644 (Title: Bump dcos-log, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 31/Oct/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3673 (Title: [master] Mergebot Automated Train PR - 2018-Oct-23-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 31/Oct/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3680 (Title: [1.12] Mergebot Automated Train PR - 2018-Oct-19-12-00, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Nov/18 ]

@gauripowale overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3693 (Title: Add fetch_cluster_logs.bash, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Nov/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3695 (Title: [master] Mergebot Automated Train PR - 2018-Oct-23-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Nov/18 ]

@philip overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3735 (Title: (1.12) Fix Telegraf dcos_statsd plugin race condition, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Nov/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3727 (Title: Add fetch_cluster_logs.bash, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 01/Nov/18 ]

@alex overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3731 (Title: [1.12] Bump Mesos to nightly 1.7.x cb07b69, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 06/Nov/18 ]

@jonathangiddy overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3712 (Title: Mh/java 8u192 1.12, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 06/Nov/18 ]

@jonathangiddy overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3712 (Title: Mh/java 8u192 1.12, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 06/Nov/18 ]

@jonathangiddy overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3681 (Title: packages/java: Update to 8u192, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 06/Nov/18 ]

@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3760 (Title: Add missing error check, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 06/Nov/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3760 (Title: Add missing error check, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Nov/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3750 (Title: Fix TLS handshake, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Nov/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3743 (Title: [1.11] Bump dcos-net, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Nov/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3743 (Title: [1.11] Bump dcos-net, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Nov/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3743 (Title: [1.11] Bump dcos-net, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 07/Nov/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3771 (Title: 1.12 train 11/06/2018, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/Nov/18 ]

@klueska overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/Nov/18 ]

@gauripowale overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3776 (Title: Bump cosmos-enterprise and package registry and add a new integration test with spark fwk, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/Nov/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3734 (Title: [1.12] Bump dcos-net, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 08/Nov/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3734 (Title: [1.12] Bump dcos-net, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 09/Nov/18 ]

@jonathangiddy overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3744 (Title: Collect ZooKeeper Metrics using DC/OS Telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Nov/18 ]

@alex overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Nov/18 ]

@alex overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Nov/18 ]

@alex overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Nov/18 ]

@branden overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3430 (Title: Add SELinux details to diagnostics bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 13/Nov/18 ]

@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3777 (Title: Adding required ending forwardslash to download_url, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 14/Nov/18 ]

@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3812 (Title: [1.12] Mergebot Automated Train PR - 2018-Nov-14-02-38, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Mergebot [ 16/Nov/18 ]

@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3821 (Title: [1.12] Bump dcos-net, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference.

Comment by Jan-Philip Gehrcke (Inactive) [ 26/Nov/18 ]

We have not had an override for 10 days, the longest period of silence in many months, while DC/OS pull request throughput remained roughly constant. We can therefore conclude that the rate at which the underlying instabilities create problems has been significantly reduced (probably by more than an order of magnitude, though that is hard to quantify precisely from such a short observation window).

This is a major success. We have effectively addressed all instabilities resulting in this symptom.

I think it's a good time to close this ticket (after about a year!). If we ever observe a test_vip instability again we should track the symptom(s) in separate JIRA ticket(s).

Comment by Jan-Philip Gehrcke (Inactive) [ 26/Nov/18 ]

For posterity, the following graph shows the evolution of the override command rate for DCOS_OSS-2115 (and DCOS-19542, which was at some point renamed to DCOS_OSS-2115) over the course of one year:

Comment by Jan-Philip Gehrcke (Inactive) [ 26/Nov/18 ]

Closing. Thanks to everyone who helped fix the underlying instabilities (most of which were in DC/OS, not in the test method itself!)

Comment by Jan-Philip Gehrcke (Inactive) [ 15/Dec/18 ]

The story went on with DCOS-45799 and DCOS-46220, but we seem to have it under control!
