[DCOS_OSS-2115] test_vip failed with RetryError on MarathonApp.wait Created: 14/Nov/17 Updated: 15/Dec/18 Resolved: 26/Nov/18 |
|
Status: | Resolved |
Project: | DC/OS |
Component/s: | marathon, mesos, networking |
Affects Version/s: | DC/OS 1.9.7, DC/OS 1.10.5, DC/OS 1.11.0, DC/OS 1.12.0 |
Fix Version/s: | None |
Type: | Bug | Priority: | High |
Reporter: | Jan-Philip Gehrcke (Inactive) | Assignee: | Unassigned |
Resolution: | Done |
Labels: | flaky-bug, mergebot-override, mesos, networking, type:ci-failure |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Epic Link: | DC/OS Test Flakiness |
Sprint: | Core Sprint 2018-28, Core Sprint 2018-29, Core RI-6 Sprint 2018-30 |
Story Points: | 8 |
Product (inherited): | DC/OS |
Transition Due Date: |
Description |
[open_source_tests.test_networking.test_vip[Container_MESOS-Network_USER-Network_USER]] failed with a RetryError upon the attempt to launch the corresponding Marathon application:

self = <retrying.Retrying object at 0x7f533113d278>
fn = <function MarathonApp.wait at 0x7f53320a2048>
args = (<test_networking.MarathonApp object at 0x7f53310ff4a8>, <dcos_test_utils.enterprise.EnterpriseApiSession object at 0x7f5332196da0>)
kwargs = {}, start_time = 1510670454766, attempt_number = 240
attempt = Attempts: 240, Value: False, delay_since_first_attempt_ms = 1200912
sleep = 5000

    def call(self, fn, *args, **kwargs):
        start_time = int(round(time.time() * 1000))
        attempt_number = 1
        while True:
            try:
                attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
            except:
                tb = sys.exc_info()
                attempt = Attempt(tb, attempt_number, True)

            if not self.should_reject(attempt):
                return attempt.get(self._wrap_exception)

            delay_since_first_attempt_ms = int(round(time.time() * 1000)) - start_time
            if self.stop(attempt_number, delay_since_first_attempt_ms):
                if not self._wrap_exception and attempt.has_exception:
                    # get() on an attempt with an exception should cause it to be raised, but raise just in case
                    raise attempt.get()
                else:
>                   raise RetryError(attempt)
E                   retrying.RetryError: RetryError[Attempts: 240, Value: False]
Comments |
Comment by Senthil Kumaran (Inactive) [ 15/Nov/17 ] |
Observed this failure again today in the dcos-docker suite. This is fixable in test code. |
Comment by Adam Dangoor (Inactive) [ 21/Nov/17 ] |
open_source_tests/test_networking.py:214 (test_vip[Container.MESOS-Network.HOST-Network.USER])

dcos_api_session = <dcos_test_utils.enterprise.EnterpriseApiSession object at 0x7f420dab7fd0>
container = <Container.MESOS: 'MESOS'>, vip_net = <Network.HOST: 'HOST'>
proxy_net = <Network.USER: 'USER'>

    @pytest.mark.slow
    @pytest.mark.skipif(
        not lb_enabled(),
        reason='Load Balancer disabled')
    @pytest.mark.parametrize(
        'container,vip_net,proxy_net',
        generate_vip_app_permutations())
    def test_vip(dcos_api_session,
                 container: marathon.Container,
                 vip_net: marathon.Network,
                 proxy_net: marathon.Network):
        '''Test VIPs between the following source and destination configurations:
        * containers: DOCKER, UCR and NONE
        * networks: USER, BRIDGE (docker only), HOST
        * agents: source and destinations on same agent or different agents
        * vips: named and unnamed vip

        Origin app will be deployed to the cluster with a VIP. Proxy app will be
        deployed either to the same host or elsewhere. Finally, a thread will be
        started on localhost (which should be a master) to submit a command to the
        proxy container that will ping the origin container VIP and then assert
        that the expected origin app UUID was returned
        '''
        errors = 0
>       tests = setup_vip_workload_tests(dcos_api_session, container, vip_net, proxy_net)

open_source_tests/test_networking.py:239:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
open_source_tests/test_networking.py:272: in setup_vip_workload_tests
    origin_app.wait(dcos_api_session)
../../lib/python3.5/site-packages/retrying.py:49: in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <retrying.Retrying object at 0x7f420c9975c0>
fn = <function MarathonApp.wait at 0x7f420da56400>
args = (<test_networking.MarathonApp object at 0x7f420cb61a58>, <dcos_test_utils.enterprise.EnterpriseApiSession object at 0x7f420dab7fd0>)
kwargs = {}, start_time = 1511276625442, attempt_number = 240
attempt = Attempts: 240, Value: False, delay_since_first_attempt_ms = 1201505
sleep = 5000

    def call(self, fn, *args, **kwargs):
        start_time = int(round(time.time() * 1000))
        attempt_number = 1
        while True:
            try:
                attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
            except:
                tb = sys.exc_info()
                attempt = Attempt(tb, attempt_number, True)

            if not self.should_reject(attempt):
                return attempt.get(self._wrap_exception)

            delay_since_first_attempt_ms = int(round(time.time() * 1000)) - start_time
            if self.stop(attempt_number, delay_since_first_attempt_ms):
                if not self._wrap_exception and attempt.has_exception:
                    # get() on an attempt with an exception should cause it to be raised, but raise just in case
                    raise attempt.get()
                else:
>                   raise RetryError(attempt)
E                   retrying.RetryError: RetryError[Attempts: 240, Value: False]
Comment by Orlando Hohmeier (Inactive) [ 04/Dec/17 ] |
Observed the same failure on https://github.com/mesosphere/dcos-enterprise/pull/1783 ( https://teamcity.mesosphere.io/viewLog.html?buildId=878155&buildTypeId=DcOs_Enterprise_Test_Inte_AwsOnpremWStaticBackendAndSecurityStrict ) |
Comment by Mergebot [ 04/Dec/17 ] |
Github PR: https://github.com/mesosphere/dcos-enterprise/pull/1783 status teamcity/dcos/test/aws/onprem/static/strict was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 05/Dec/17 ] |
Github PR: https://github.com/mesosphere/dcos-enterprise/pull/1755 status teamcity/dcos/test/aws/cloudformation/simple was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 06/Dec/17 ] |
Github PR: https://github.com/dcos/dcos/pull/2165 status teamcity/dcos/test/docker was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 06/Dec/17 ] |
Github PR: https://github.com/dcos/dcos/pull/2165 status teamcity/dcos/test/aws/onprem/static-redhat was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/Dec/17 ] |
Github PR: https://github.com/dcos/dcos/pull/2159 status teamcity/dcos/test/azure/arm was overridden with a failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 12/Dec/17 ] |
@skumaran overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2196 with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Dec/17 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2159 with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Dec/17 ] |
@skumaran overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2200 with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Adam Bordelon (Inactive) [ 17/Dec/17 ] |
I got the same thing in https://teamcity.mesosphere.io/viewLog.html?buildId=900731 but with MarathonPod.wait. Pods and Apps both flake. |
Comment by Mergebot [ 17/Dec/17 ] |
@skumaran overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2046 (Title: Bump dcos-mesos to latest master a4b1134., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Dec/17 ] |
@prozlach overrode teamcity/dcos/test/aws/onprem/static-redhat status of dcos/dcos/pull/2220 (Title: Bump pkgpanda pkgs kazoo and gunicorn, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Dec/17 ] |
@jeremy overrode mergebot/enterprise/build-status/aggregate status of dcos/dcos/pull/2107 (Title: Use the more recent marathon endpoint., Branch: master) with the failure noted in this JIRA. |
Comment by Mergebot [ 19/Dec/17 ] |
@jeremy overrode teamcity/dcos/test/aws/onprem/static-redhat status of dcos/dcos/pull/2107 (Title: Use the more recent marathon endpoint., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Dec/17 ] |
@jeremy overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2107 (Title: Use the more recent marathon endpoint., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Sergey Urbanovich (Inactive) [ 19/Dec/17 ] |
Dec 15 03:42:54 dcos-docker-master1 java[969]: [myid:] INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x160582962c9002e type:setData cxid:0x2656 zxid:0x1b41 txntype:-1 reqpath:n/a Error Path:/marathon/state/group/2/root/2017-12-15T03:42:54.914Z Error:KeeperErrorCode = NoNode for /marathon/state/group/2/root/2017-12-15T03:42:54.914Z
|
Comment by Mergebot [ 19/Dec/17 ] |
@jeremy overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2231 (Title: Train 293, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Senthil Kumaran (Inactive) [ 20/Dec/17 ] |
Hi Sergey Urbanovich - What is the significance of that exception that you pointed out? |
Comment by Mergebot [ 20/Dec/17 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2210 (Title: Bump dcos-test-utils, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Sergey Urbanovich (Inactive) [ 20/Dec/17 ] |
Senthil Kumaran it's evidence that the test_vip flakiness is most likely caused by marathon; see dcos-docker-master1.log in the artifacts |
Comment by Mergebot [ 21/Dec/17 ] |
@adam overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2244 (Title: Train 295, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 21/Dec/17 ] |
@adam overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2244 (Title: Train 295, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 09/Jan/18 ] |
@michael.ellenburg overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2267 (Title: fix cloud_images CI yum error, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 10/Jan/18 ] |
@aekbote overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2281 (Title: Admin Router: Minimizing software version information reported by the AR [1.10], Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 11/Jan/18 ] |
@michael.ellenburg overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2297 (Title: Train 304, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Jan/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2088 (Title: [master] Mergebot Automated Train PR - 2018-Jan-19-15-33, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 23/Jan/18 ] |
@jp overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2346 (Title: [master] Mergebot Automated Train PR - 2018-Jan-22-16-43, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Adam Dangoor (Inactive) [ 29/Jan/18 ] |
I have moved this issue to DCOS-OSS as it is an OSS test and affects OSS builds. |
Comment by Mergebot [ 29/Jan/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static-redhat status of dcos/dcos/pull/2371 (Title: [master] Mergebot Automated Train PR - 2018-Jan-26-17-41, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 30/Jan/18 ] |
@michael.ellenburg overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2356 (Title: Bump dcos-test-utils, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Marco Monaco [ 07/Feb/18 ] |
Senthil Kumaran Is this something we need to resolve before 1.11 GA? Who is, or will be, working on that? Thanks |
Comment by Senthil Kumaran (Inactive) [ 07/Feb/18 ] |
Let's collect more information on the failure using this PR: https://github.com/dcos/dcos/pull/2421. A RetryError is really a bad exception to surface without more information; I was thinking of solving it in the test libraries so that we could provide targeted exception information. For now, I am going to add logging to the test methods in our code. Ken Sipe, Karsten Jeschkies and Matthias Eichstedt - We really think this is a Marathon issue, as the GET request to "/v2/pods/{id}::status"
is not returning `STABLE` when we deploy the workload app before we test the networking. Running it multiple times in https://github.com/dcos/dcos/pull/2421 could reveal more information on this flaky issue. Once you have the required information, I'd like to assign this bug to one of you. Thank you! |
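The "targeted exception information" idea can be illustrated with a small sketch: instead of the wait function returning False (which only ever surfaces as RetryError[Attempts: 240, Value: False]), the poll can raise an AssertionError that carries the status returned by /v2/pods/{id}::status. This is only a sketch of the approach, not the actual test-library change; MARATHON_URL and POD_ID are hypothetical stand-ins for what the real test obtains via dcos_api_session.

```python
# Sketch only: poll the Marathon pod status endpoint and fail with a
# self-describing AssertionError instead of a bare RetryError.
# MARATHON_URL and POD_ID are assumptions for illustration.
import requests
from retrying import retry

MARATHON_URL = "http://leader.mesos:8080"   # hypothetical Marathon address
POD_ID = "/integration-test-example"        # hypothetical pod id

@retry(wait_fixed=5000, stop_max_attempt_number=240,
       retry_on_exception=lambda e: isinstance(e, AssertionError))
def wait_for_stable_pod():
    r = requests.get("{}/v2/pods{}::status".format(MARATHON_URL, POD_ID))
    r.raise_for_status()
    info = r.json()
    # The status and message end up in the test report if the pod never
    # becomes STABLE, instead of an opaque RetryError.
    assert info["status"] == "STABLE", "pod {} is {}: {}".format(
        POD_ID, info["status"], info.get("message", "no message"))
    return info

if __name__ == "__main__":
    wait_for_stable_pod()
```

When all attempts are exhausted, the retrying library re-raises the last AssertionError (per the call() implementation quoted above), so the failure message names the DEGRADED status rather than just the attempt count.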
Comment by Karsten Jeschkies (Inactive) [ 08/Feb/18 ] |
Senthil Kumaran, I just started digging into the test a little, so take my comments with a grain of salt. The failing test has `assert errors == 0`. What do you think about

    errors = list()
    ...
    except Exception as e:
        errors.append(e)
    ...
    assert len(errors) == 0

This should print the errors in the JUnit stack trace and simplify debugging quite a bit. |
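A self-contained sketch of the pattern described above, with hypothetical check functions standing in for the real VIP probes; the point is that every caught exception is preserved and shown in the assertion message rather than reduced to a bare count.

```python
# Sketch of the error-collection pattern (hypothetical checks, not the real test).
def check_named_vip():
    raise RuntimeError("connection to named VIP timed out")

def check_unnamed_vip():
    pass  # succeeds

def run_checks(checks):
    errors = list()
    for check in checks:
        try:
            check()
        except Exception as e:
            # Keep the full exception so the JUnit output explains *why* a
            # permutation failed, instead of only failing `assert errors == 0`.
            errors.append(e)
    assert len(errors) == 0, "checks failed: {}".format(errors)

if __name__ == "__main__":
    run_checks([check_unnamed_vip, check_named_vip])
```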
Comment by Karsten Jeschkies (Inactive) [ 08/Feb/18 ] |
I took the liberty to change things a bit https://github.com/dcos/dcos/pull/2426. |
Comment by Karsten Jeschkies (Inactive) [ 09/Feb/18 ] |
Alright, the job failed. This does not seem to be a flake to me:

>       assert self._info['status'] == 'STABLE'
E       assert 'DEGRADED' == 'STABLE'
E       - DEGRADED
E       + STABLE
Comment by Mergebot [ 12/Feb/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2450 (Title: [master] Bump Mesos to nightly master d4b000f, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Karsten Jeschkies (Inactive) [ 13/Feb/18 ] |
Hm, do we have some stats on this? The test seems to fail every time for me: https://github.com/dcos/dcos/pull/2426. This seems to be a bug and not a flake. |
Comment by Senthil Kumaran (Inactive) [ 13/Feb/18 ] |
Hi Karsten Jeschkies - If you click on the TeamCity link - https://teamcity.mesosphere.io/viewLog.html?buildId=966772&tab=buildResultsDiv&buildTypeId=DcOs_Open_Test_IntegrationTest_AzureArm - and follow the test results, you will find statistical information on those tests, e.g.: https://teamcity.mesosphere.io/project.html?projectId=DcOs_Open_Test_IntegrationTest&testNameId=-2566244230057288490&tab=testDetails I am +1 to the assertion change that you made in the PR; that will give us more information than a simple generic RetryError. However, it is interesting to note that the test is succeeding on other platforms and failing only on the Azure ARM install. I have re-triggered to collect more stats, and have queued just the test_networking::test_vip test on Azure ARM to get specific stats: https://teamcity.mesosphere.io/viewLog.html?buildId=971555 Let's wait for the results of these runs. |
Comment by Senthil Kumaran (Inactive) [ 14/Feb/18 ] |
Karsten Jeschkies - We had a success for the flaky test during the re-trigger. This is why we have categorized it as flaky. It is a bug that `assert self._info['status'] == 'STABLE', error_msg` will *never* be true under certain conditions, but the conditions under which it does not succeed are unknown. |
Comment by Mergebot [ 22/Feb/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2496 (Title: bump dcos-cni, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 22/Feb/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2426 (Title: Assert HTTP responses and prettify errors., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 22/Feb/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2493 (Title: chore(dcos-ui): update package, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 26/Feb/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2510 (Title: [1.11] Mergebot Automated Train PR - 2018-Feb-26-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 27/Feb/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2519 (Title: [1.11] Mergebot Automated Train PR - 2018-Feb-27-21-13, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Mar/18 ] |
@gpaul overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2511 (Title: Admin Router: Prevent reusing tcp sockets by AR's cache code [master], Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Mar/18 ] |
@michael.ellenburg overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2538 (Title: Admin Router: Prevent reusing tcp sockets by AR's cache code [1.11], Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 05/Mar/18 ] |
@prozlach overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2164 (Title: [DCOS-19243] Test the permissions required to access the /v2/leader endpoint, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Mar/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2419 (Title: 1.11.0 Integration Train for UI Changes and Version Update., Branch: 1.11.0-GA) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Mar/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2589 (Title: Fixed broken Azure & AWS documentation links., Branch: 1.11.0-GA) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Mar/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2589 (Title: Fixed broken Azure & AWS documentation links., Branch: 1.11.0-GA) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/Mar/18 ] |
@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2568 (Title: Admin Router: Support for custom 'Host' header and response status for generic tests, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/Mar/18 ] |
@gpaul overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2568 (Title: Admin Router: Support for custom 'Host' header and response status for generic tests, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 14/Mar/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2620 (Title: [1.11] Avoid python dependency break for python-dateutil, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 15/Mar/18 ] |
@prozlach overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2615 (Title: [1.10] Mergebot Automated Train PR - 2018-Mar-14-10-00, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Sergey Urbanovich (Inactive) [ 19/Mar/18 ] |
Matthias Eichstedt Could you add some details why you closed this JIRA? |
Comment by Senthil Kumaran (Inactive) [ 19/Mar/18 ] |
Matthias Eichstedt - this was observed again in master today - https://teamcity.mesosphere.io/viewLog.html?buildId=1009460&buildTypeId=DcOs_Enterprise_ManualTriggers_IntegrationTest_AwsOnpremWStaticBackendAndSecurit If we close this, we will need at least another JIRA tracking the work to fix this flakiness in Marathon.
|
Comment by Senthil Kumaran (Inactive) [ 19/Mar/18 ] |
Re-opening to use this for override. |
Comment by Mergebot [ 19/Mar/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2470 (Title: [master] Mergebot Automated Train PR - 2018-Mar-19-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Mar/18 ] |
@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2637 (Title: Use pytest-dcos plugin, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Mar/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2476 (Title: [master] Mergebot Automated Train PR - 2018-Mar-20-00-04, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Karsten Jeschkies (Inactive) [ 20/Mar/18 ] |
When I investigated the issue I've found that the test code should be improved first since the errors do not give much information. However, I don't know who the owner is and it was so cumbersome to make any changes by myself that I just gave up. Overall I'm not sure this is a Marathon issue. |
Comment by Sergey Urbanovich (Inactive) [ 20/Mar/18 ] |
It’s an integration test and it checks how dcos-l4lb, marathon, mesos, dcos-overlay and others work together. It’s hard to find the owner, and that is one of the reasons why dcos has been suffering from the issue for so long. However, let me be the owner. I can help you with the test_vip code and the networking stuff, and I hope Michael Ellenburg can help us with the testing infrastructure (test_helper, dcos_test_utils, etc). My speculation is that we see a lot of failures in test_vip just because this test has tens of sub-tests and they start hundreds of tasks. Maybe we have exactly the same issues in other tests, but we aren’t experiencing them so frequently. Let’s improve the test code to prove or refute completely that it’s a marathon issue. What should we change in the code? Please feel free to reach out to me on Slack directly or we can discuss it in the #test-vip-flakiness channel (yes, this issue has its own channel!). I hope Avinash Sridharan, Matthias Eichstedt, and Artem Harutyunyan can help us to prioritize the issue. It’s the most flaky test in DC/OS. Thank you. |
Comment by Mergebot [ 20/Mar/18 ] |
@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2653 (Title: [master] Mergebot Automated Train PR - 2018-Mar-20-23-22, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 09/Apr/18 ] |
@skumaran overrode teamcity/dcos/test/docker status of dcos/dcos/pull/2711 (Title: [master] Mergebot Automated Train PR - 2018-Apr-06-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 16/Apr/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2748 (Title: |
Comment by Mergebot [ 17/Apr/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2758 (Title: bump from latest dcos-net master, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 24/Apr/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/2783 (Title: [master] Mergebot Automated Train PR - 2018-Apr-20-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 25/Apr/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static/disabled status of mesosphere/dcos-enterprise/pull/2636 (Title: [1.11] Mergebot Automated Train PR - 2018-Apr-20-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 03/May/18 ] |
@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2816 (Title: Add owners from the Mesos pool of committers., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/May/18 ] |
@branden overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2729 (Title: improve dcos-diagnostics integration test and fix dcos-diagnostics system account permissions 1.11, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/May/18 ] |
@branden overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2441 (Title: Remove web installer, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Karsten Jeschkies (Inactive) [ 31/May/18 ] |
Sergey Urbanovich, I'm sorry. I've missed your comment.
I tried to add some logs and change the code but gave up because a PR took more than a week to merge. I don't know how the security team manages this. Anyhow, I proposed to remove the test. If there are reasonable use cases I'm happy to spec the tests and implement them in the Marathon system test suite. The Marathon team would be a clear owner then. What do you think, Matthias Eichstedt? So, I'm happy to take ownership but then it moves into our repo. |
Comment by Mergebot [ 31/May/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2890 (Title: Fix upgrade issues with ipv6 and flaky service discovery integration test, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Matthias Eichstedt (Inactive) [ 01/Jun/18 ] |
Karsten Jeschkies I'm OK with moving it into our repo and effectively owning it – Jan-Philip Gehrcke since you created the ticket originally, do you veto? |
Comment by Sergey Urbanovich (Inactive) [ 01/Jun/18 ] |
Hi Karsten Jeschkies and Matthias Eichstedt, Thanks for replying! As I mentioned before, test_vip checks how mesos, marathon, dcos-l4lb, dcos-overlay, the linux kernel, etc. work together on different clouds with different security configurations. We must be sure that the feature is not broken when we update any of them. It is the main integration test for the dcos networking stack. Most of these goals could not be reached if we move the test to the marathon repo; unfortunately, we don't have the option that you suggested. I do totally understand your pain with the dcos workflow, it's unreasonably slow and usually it takes several days to merge any PR to the master branch (sometimes it takes months, no kidding). However, one of the biggest issues with the whole experience is that we have tons of flaky integration tests. I can promise you that I will be shepherding all test_vip PRs and I believe they will be merged faster. |
Comment by Senthil Kumaran (Inactive) [ 01/Jun/18 ] |
Hello Matthias Eichstedt - I am with Sergey Urbanovich here. If anything, we should work on fixing this in the dcos/dcos repo instead of moving it out of the repo. Sergey Urbanovich - On PRs not moving forward for days/months, is it still the case? The idea with @dcos-owners and the ability to override flakes is an attempt to solve that. Has this not been helping? Once the CI is sufficiently stable, we have plans to do away with trains and land the PRs immediately. Let us keep involving the dcos-owners, and reduce the flakiness in the system to improve this. Thank you! |
Comment by Karsten Jeschkies (Inactive) [ 04/Jun/18 ] |
So the owner should be the networking team then, right? If the wiki is still up to date that would be Sergey Urbanovich and Deepak Goel. I'm happy to help if you can provide the app definitions being deployed, the Marathon logs during the test runs and the logs of the test itself. |
Comment by Sergey Urbanovich (Inactive) [ 04/Jun/18 ] |
Karsten Jeschkies Well, it's easy! Please check any mergebot comment above. Let's look at the last one (link). The test waited for a pod with id /integration-test-51752de892914eb58c16530e1c842b4c; in the log you can find the pod definition as a Python dict (see below). You can also find all marathon logs in artifacts -> master_journald.log.

[2018-05-31 01:48:02,306|test_networking|INFO]: Origin app: {'id': '/integration-test-51752de892914eb58c16530e1c842b4c', 'scheduling': {'placement': {'acceptedResourceRoles': ['*', 'slave_public'], 'constraints': [{'fieldName': 'hostname', 'operator': 'CLUSTER', 'value': '10.0.1.44'}]}}, 'containers': [{'name': 'app-51752de892914eb58c16530e1c842b4c', 'resources': {'cpus': 0.01, 'mem': 32}, 'image': {'kind': 'DOCKER', 'id': 'debian:jessie'}, 'exec': {'command': {'shell': '/opt/mesosphere/bin/dcos-shell python /opt/mesosphere/active/dcos-integration-test/util/python_test_server.py $ENDPOINT_TEST'}}, 'volumeMounts': [{'name': 'opt', 'mountPath': '/opt/mesosphere'}], 'endpoints': [{'name': 'test', 'protocol': ['tcp'], 'hostPort': 0, 'labels': {'VIP_0': '1.1.1.7:10176'}}], 'environment': {'DCOS_TEST_UUID': '51752de892914eb58c16530e1c842b4c', 'HOME': '/'}}], 'networks': [{'mode': 'host'}], 'volumes': [{'name': 'opt', 'host': '/opt/mesosphere'}]} |
Comment by Karsten Jeschkies (Inactive) [ 05/Jun/18 ] |
Thanks for the pointers. So the logs say the following:

28076 2018-05-31 01:48:05: [2018-05-31 01:48:05,167] INFO Processing LaunchEphemeral(Instance(instance [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063],AgentInfo(10.0.1.44,Some(72859f6f-babb-4975-912d-c2885c6417ef-S0),None,None,Vector()),InstanceState(Created,2018-05-31T01:48:05.112Z,None,None),Map(task [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c] -> Task(task [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c],2018-05-31T01:48:02.319Z,Status(2018-05-31T01:48:05.112Z,None,None,Created,NetworkInfo(10.0.1.44,Vector(20704),List())))),2018-05-31T01:48:02.319Z,UnreachableEnabled(0 seconds,0 seconds),None)) for instance [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063] (mesosphere.marathon.core.launcher.impl.OfferProcessorImpl:scala-execution-context-global-1808)
...
28085 2018-05-31 01:48:06: [2018-05-31 01:48:06,714] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING () (mesosphere.marathon.MarathonScheduler:Thread-1476)
28086 2018-05-31 01:48:06: [2018-05-31 01:48:06,717] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING () (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1949)
...
28169 2018-05-31 01:48:22: [2018-05-31 01:48:22,529] INFO 10.0.5.50 - - [31/May/2018:01:48:22 +0000] "GET //10.0.5.50/v2/pods/integration-test-51752de892914eb58c16530e1c842b4c::status HTTP/1.1" 200 2509 "-" "python-requests/2.18.4" (mesosphere.chaos.http.ChaosRequestLog:qtp2077738191-47)
28170 2018-05-31 01:48:27: [2018-05-31 01:48:27,542] INFO 10.0.5.50 - - [31/May/2018:01:48:27 +0000] "GET //10.0.5.50/v2/pods/integration-test-51752de892914eb58c16530e1c842b4c::status HTTP/1.1" 200 2509 "-" "python-requests/2.18.4" (mesosphere.chaos.http.ChaosRequestLog:qtp2077738191-50)
...
28258 2018-05-31 01:53:09: [2018-05-31 01:53:09,883] WARN Should kill: task [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c] was launched 304s ago and was not confirmed yet (mesosphere.marathon.core.task.jobs.impl.OverdueTasksActor$Support:scala-execution-context-global-1962)
28259 2018-05-31 01:53:09: [2018-05-31 01:53:09,883] INFO Killing overdue instance [integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063] (mesosphere.marathon.core.task.jobs.impl.OverdueTasksActor$Support:scala-execution-context-global-1962)
...

So the task does not start in time.

And then there is

› rg "integration-test-51752de892914eb58c16530e1c842b4c.*TASK_" dcos-marathon.service
28085:2018-05-31 01:48:06: [2018-05-31 01:48:06,714] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING () (mesosphere.marathon.MarathonScheduler:Thread-1476)
28086:2018-05-31 01:48:06: [2018-05-31 01:48:06,717] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING () (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1949)
28538:2018-05-31 01:56:34: [2018-05-31 01:56:34,968] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:Thread-1525)
28546:2018-05-31 01:56:34: [2018-05-31 01:56:34,971] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:Thread-1533)
28561:2018-05-31 01:56:35: [2018-05-31 01:56:34,978] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1962)
28569:2018-05-31 01:56:35: [2018-05-31 01:56:34,984] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1962)
29362:2018-05-31 02:06:34: [2018-05-31 02:06:34,973] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:Thread-1581)
29369:2018-05-31 02:06:34: [2018-05-31 02:06:34,981] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1901)
29375:2018-05-31 02:06:35: [2018-05-31 02:06:34,982] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:Thread-1589)
29383:2018-05-31 02:06:35: [2018-05-31 02:06:34,989] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_STARTING (Reconciliation: Latest task state) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1808)
29810:2018-05-31 02:08:21: [2018-05-31 02:08:20,992] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:Thread-1616)
29831:2018-05-31 02:08:21: [2018-05-31 02:08:21,060] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_RUNNING () (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-1962)
29858:2018-05-31 02:08:39: [2018-05-31 02:08:39,982] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_KILLING () (mesosphere.marathon.MarathonScheduler:Thread-1630)
29859:2018-05-31 02:08:39: [2018-05-31 02:08:39,995] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_KILLING () (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-2234)
29860:2018-05-31 02:08:40: [2018-05-31 02:08:40,057] INFO Received status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_KILLED (Command terminated with signal Terminated) (mesosphere.marathon.MarathonScheduler:Thread-1631)
29870:2018-05-31 02:08:40: [2018-05-31 02:08:40,062] INFO Acknowledge status update for task integration-test-51752de892914eb58c16530e1c842b4c.instance-a91e4f80-6474-11e8-874e-e639d421a063.app-51752de892914eb58c16530e1c842b4c: TASK_KILLED (Command terminated with signal Terminated) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-2206)

It takes 20 minutes for the task to launch!
Comment by Karsten Jeschkies (Inactive) [ 05/Jun/18 ] |
Sergey Urbanovich, what is dcos-integration-test/util/python_test_server.py doing? Where can I find the sandboxes from Mesos with the logs of the executor and python_test_server.py? |
Comment by Mergebot [ 05/Jun/18 ] |
@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2949 (Title: pin urllib3 to 1.22 for compatibility with requests, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Sergey Urbanovich (Inactive) [ 05/Jun/18 ] |
Karsten Jeschkies python_test_server.py is a simple http server, you can find it in the dcos repo here. Please check the diagnostic bundle in the artifacts; maybe the mesos-slave logs have some valuable information. IIRC our test infrastructure doesn't collect logs from mesos sandboxes. Senthil Kumaran please correct me if I'm wrong. |
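For readers who don't want to dig up the real file: conceptually the test server just echoes the task's DCOS_TEST_UUID back over HTTP, which is what the proxy side of test_vip checks through the VIP. The toy sketch below illustrates that idea only; it is not the actual util/python_test_server.py from the dcos repo.

```python
# Toy stand-in for a UUID-echoing test server (NOT the real python_test_server.py).
import os
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

class UuidHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every request with the task's DCOS_TEST_UUID so a client that
        # reaches this server through a VIP can verify it hit the right task.
        body = os.environ.get("DCOS_TEST_UUID", "unknown").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 8080
    HTTPServer(("", port), UuidHandler).serve_forever()
```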
Comment by Senthil Kumaran (Inactive) [ 05/Jun/18 ] |
Sergey Urbanovich - you are right, we don't collect logs for mesos sandboxes. We only bundle the journald logs that are on master and the agents. |
Comment by Mergebot [ 05/Jun/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/2940 (Title: Modifies test relying on mesos logging in the stdout of a task, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Karsten Jeschkies (Inactive) [ 06/Jun/18 ] |
This makes it almost impossible to debug. We had some issues in the Marathon integration tests that we only found with logs from the executors and apps. If we are lucky we find some things by digging into the Mesos logs. However, if a task is in TASK_STARTING and is not becoming TASK_RUNNING Marathon cannot do anything about this. |
Comment by Ioannis Charalampidis (Inactive) [ 06/Jun/18 ] |
It might be unrelated, or I might be missing something, but I see that the python job on TeamCity is waiting for the tasks to be "healthy":

    self._info = r.json()
>   assert self._info['app']['tasksHealthy'] == self.app['instances']
E   assert 0 == 1

test_networking.py:72: AssertionError

But I did not see any health checks defined in the pod definition that Sergey posted above. |
Comment by Mergebot [ 06/Jun/18 ] |
@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2853 (Title: rexray: upgrade to v0.11.1 [Backport to 1.11], Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Gustav Paul (Inactive) [ 06/Jun/18 ] |
I'm not 100% sure this last failure should be tracked by this issue, but it looks similar enough that I'll leave it to Sergey Urbanovich to decide whether this merits a different ticket. |
Comment by Matthias Eichstedt (Inactive) [ 06/Jun/18 ] |
Karsten Jeschkies is right – the log snippets he provided earlier clearly show that a task is reported STARTING and then does not turn RUNNING within 5 minutes. The default behavior of Marathon is to kill such a task (and expunge all information about it) after 5 minutes. A mitigation could be to increase the task_launch_timeout to e.g. 1200000L (20 minutes), but we should find out why it takes so long for the task to turn running. Linking Senthil Kumaran – we are kind of blocked triaging this. We could (1) increase the above timeout, but we (2) should have sandboxes available. I don't think we're investigating a Marathon problem here – tasks are not reported Running in time, so either the docker daemon, the agent, or other things are severely slow to respond. (Increasing the Marathon timeout is not a substitute for an RCA.) |
Comment by Sergey Urbanovich (Inactive) [ 06/Jun/18 ] |
Senthil Kumaran It seems like we have to add mesos sandboxes dirs to artifacts or do we have any other options? May I kindly ask you to create a blocker JIRA for that? |
Comment by Mergebot [ 18/Jun/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2968 (Title: [master] Mergebot Automated Train PR - 2018-Jun-11-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 25/Jun/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3000 (Title: [master] Mergebot Automated Train PR - 2018-Jun-25-06-56, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 25/Jun/18 ] |
@kapil overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3003 (Title: [master] Bump Mesos to nightly master d22a3d7, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 26/Jun/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/2990 (Title: Don't skip checks that are limited to a specific role, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 27/Jun/18 ] |
@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2992 (Title: Bump CoreOS AMI to v1745.7.0, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 29/Jun/18 ] |
@gpaul overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3017 (Title: Added second EBS drive to agents and public agents (1.10 backport)., Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 29/Jun/18 ] |
@jp overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2992 (Title: Bump CoreOS AMI to v1745.7.0, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 29/Jun/18 ] |
@kapil overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3021 (Title: [master] Bump Mesos to nightly master 22471b8, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 29/Jun/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3006 (Title: Adds network information to be collected as part of diagnostic bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Karsten Jeschkies (Inactive) [ 03/Jul/18 ] |
We've run test_networking.py::test_vip on a cluster and it failed for two instances, integration-test-2f50d63b9f5f44d29489e89e383d30d5 and integration-test-0749fbd35959418787aa0e015a5f0bc9. See the test_vip.log. Marathon does start the Python server; see python_test_server.logs.md. I assume this is a networking issue, Sergey Urbanovich. Are there other logs we could look at? The full bundle is bundle-2018-07-03-1530620037.zip |
Comment by Aleksey Dukhovniy (Inactive) [ 03/Jul/18 ] |
A few things:
This is how it should look:

{
  "id": "integration-test-0749fbd35959418787aa0e015a5f0bc9",
  "cpus": 0.1,
  "mem": 32,
  "instances": 1,
  "cmd": "/opt/mesosphere/bin/dcos-shell python /opt/mesosphere/active/dcos-integration-test/util/python_test_server.py 10043",
  "env": {
    "DCOS_TEST_UUID": "0749fbd35959418787aa0e015a5f0bc9",
    "HOME": "/"
  },
  "healthChecks": [
    {
      "protocol": "MESOS_HTTP",
      "path": "/ping",
      "gracePeriodSeconds": 5,
      "intervalSeconds": 10,
      "timeoutSeconds": 10,
      "maxConsecutiveFailures": 120,
      "port": 10043
    }
  ],
  "networks": [
    {
      "mode": "container",
      "name": "dcos"
    }
  ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "debian:jessie"
    },
    "portMappings": [
      {
        "containerPort": 10043,
        "protocol": "tcp",
        "name": "test",
        "labels": {
          "VIP_0": "/namedvip:10042"
        }
      }
    ],
    "volumes": [
      {
        "containerPath": "/opt/mesosphere",
        "hostPath": "/opt/mesosphere",
        "mode": "RO"
      }
    ]
  },
  "constraints": [
    [
      "hostname",
      "CLUSTER",
      "10.0.0.221"
    ]
  ],
  "acceptedResourceRoles": [
    "*",
    "slave_public"
  ]
}

Nevertheless: I can run both the deprecated and the proper app definitions manually and they are successful in isolation. |
Comment by Sergey Urbanovich (Inactive) [ 03/Jul/18 ] |
Hi Karsten Jeschkies! You've caught another bug with test_vip; it's not related to the case we have been tracking here. Your logs show that all applications were ready and the test failed on an assert [1]. It definitely looks like a networking issue on CoreOS v1745.7.0. The summary of this JIRA is "test_vip failed with RetryError on MarathonApp.wait"; in that case, you would see a stack trace that starts in the setup_vip_workload_tests function [3]. |
Comment by Karsten Jeschkies (Inactive) [ 04/Jul/18 ] |
Hi, Sergey Urbanovich, without the sandboxes we cannot do much here. |
Comment by Mergebot [ 04/Jul/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3010 (Title: [1.11] Mergebot Automated Train PR - 2018-Jun-27-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Adam Dangoor (Inactive) [ 05/Jul/18 ] |
Senthil Kumaran I don't think that the previous override is relevant to this issue. |
Comment by Pawel Rozlach [ 05/Jul/18 ] |
The problem described by Karsten Jeschkies in [1] has been narrowed down in [2]: basically, mesos-modules need some patching before we can bump to the newer Docker version (and, by proxy, to the newer CoreOS version). As Sergey Urbanovich already pointed out, this is not a flakiness issue but a genuine failure detected by the test_vip integration test. [1] https://jira.mesosphere.com/browse/DCOS_OSS-2115?focusedCommentId=161189&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-161189 |
Comment by Pawel Rozlach [ 05/Jul/18 ] |
I have created |
Comment by Sergey Urbanovich (Inactive) [ 08/Jul/18 ] |
> It seems like we have to add mesos sandboxes dirs to artifacts or do we have any other options? May I kindly ask you to create a blocker JIRA for that? Senthil Kumaran would you please provide any updates on this matter? |
Comment by Jan-Philip Gehrcke (Inactive) [ 09/Jul/18 ] |
|
Comment by Karsten Jeschkies (Inactive) [ 09/Jul/18 ] |
I tried to debug test_vip. However, the test deploys 288 apps if I'm not mistaken. These do not finish due to unfulfilled roles on my test cluster. Sergey Urbanovich, what cluster do we require? |
Comment by Adam Dangoor (Inactive) [ 09/Jul/18 ] |
Karsten Jeschkies In case it helps, the integration tests are run on a cluster with one master, two private agents, one public agent. |
Comment by Pawel Rozlach [ 09/Jul/18 ] |
Discussed things briefly with Karsten Jeschkies and Aleksey Dukhovniy:
This should allow us to work around the 288 apps issue that Karsten Jeschkies mentioned a few comments earlier. |
Comment by Gustav Paul (Inactive) [ 11/Jul/18 ] |
Every permutation appears to have some small percentage chance of failing. This is not as simple as finding the one permutation that fails. If we got the logs from the task sandboxes and the journal I think Sergey Urbanovich would be happy to comb through the 36 test cases (as opposed to 144). I believe Tools Infra are going to work on that soon. |
Comment by Karsten Jeschkies (Inactive) [ 12/Jul/18 ] |
Here are the logs from 100 runs: test_vip.tar.gz. Five runs failed:

› rg "====.*failed" tar/test_vip*.log
tar/test_vip_34.log
7676:============ 1 failed, 30 passed, 1042 warnings in 2292.29 seconds =============
tar/test_vip_28.log
7881:============ 1 failed, 31 passed, 1068 warnings in 2315.99 seconds =============
tar/test_vip_41.log
8614:============ 1 failed, 34 passed, 1159 warnings in 2416.24 seconds =============
tar/test_vip_77.log
6411:============= 1 failed, 21 passed, 887 warnings in 2149.93 seconds =============
tar/test_vip_94.log
4648:============= 1 failed, 13 passed, 659 warnings in 1877.69 seconds =============
Three container pod tests fail with

63       error_msg = 'Status was {}: {}'.format(self._info['status'], self._info.get('message', 'no message'))
64 >     assert self._info['status'] == 'STABLE', error_msg
65 E     AssertionError: Status was DEGRADED: no message
66 E     assert 'DEGRADED' == 'STABLE'
67 E     - DEGRADED
68 E     + STABLE

two others with

58       self._info = r.json()
59 >     assert self._info['app']['tasksHealthy'] == self.app['instances']
60 E     assert 0 == 1
Comment by Karsten Jeschkies (Inactive) [ 12/Jul/18 ] |
See the sandboxes and the diagnostics.zip. One failed app was integration-test-vip-user-host-proxy-8639690e02544ddf91c5258a9ffce698.tar.gz |
Comment by Jan-Philip Gehrcke (Inactive) [ 12/Jul/18 ] |
I love the development that I see here. Thank you everyone. |
Comment by Sergey Urbanovich (Inactive) [ 12/Jul/18 ] |
Logs start from 2018-07-11 12:29:01 on leader nodes in the diagnostic bundle, it's test_vip_60. I've checked a failure from test_vip_94.log. integration-test-vip-user-user-proxy-9a6a11d810294a42b9008a408fc63ffd failed to start on 10.0.2.170, UCR container on dcos overlay network.

2018-07-12 00:15:00: I0712 00:15:00.342406  2267 containerizer.cpp:2006] Checkpointing container's forked pid 25922 to '/var/lib/mesos/slave/meta/slaves/7b81ef39-41f7-4906-b5fb-b5f11c0d4c5b-S3/frameworks/7b81ef39-41f7-4906-b5fb-b5f11c0d4c5b-0001/executors/integration-test-vip-user-user-proxy-9a6a11d810294a42b9008a408fc63ffd.9d5a6549-8568-11e8-b41a-aae668466825/runs/251be515-f13e-4ae0-b333-2cae5baa1bd0/pids/forked.pid'
2018-07-12 00:24:59: I0712 00:24:59.834357  2262 slave.cpp:6792] Terminating executor 'integration-test-vip-user-user-proxy-9a6a11d810294a42b9008a408fc63ffd.9d5a6549-8568-11e8-b41a-aae668466825' of framework 7b81ef39-41f7-4906-b5fb-b5f11c0d4c5b-0001 because it did not register within 10mins
2018-07-12 00:35:22: I0712 00:35:22.863657  2266 slave.cpp:3633] Asked to kill task integration-test-vip-user-user-proxy-9a6a11d810294a42b9008a408fc63ffd.9d5a6549-8568-11e8-b41a-aae668466825 of framework 7b81ef39-41f7-4906-b5fb-b5f11c0d4c5b-0001 |
Comment by Mergebot [ 13/Jul/18 ] |
@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3071 (Title: Change Adminrouter access_log logging facility to daemon [Backport 1.10], Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Jul/18 ] |
@gpaul overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/2866 (Title: Increase the limit on worker_connections to 10K, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Jie Yu (Inactive) [ 13/Jul/18 ] |
Vinod Kone can you have someone from the Mesos team take a look? Looks like the executor cannot register within 10min. |
Comment by Karsten Jeschkies (Inactive) [ 16/Jul/18 ] |
Here are the stats we use for our loops.

Failing test cases:

› for f in tar/*.xml; do echo $(xmlstarlet sel -t -v "/testsuite/testcase[failure]/@name" "$f"); done | sort | uniq -c
   1 test_vip[Container.MESOS-Network.USER-Network.USER]
   1 test_vip[Container.NONE-Network.USER-Network.HOST]
   1 test_vip[Container.POD-Network.BRIDGE-Network.USER]
   1 test_vip[Container.POD-Network.USER-Network.HOST]
   1 test_vip[Container.POD-Network.USER-Network.USER]

Sergey Urbanovich, do you see any pattern in the network types?

Unique error causes:

› for f in tar/*.xml; do echo $(xmlstarlet sel -t -v "/testsuite/testcase/failure/@message" "$f"); done | sort | uniq -c
   3 AssertionError: Status was DEGRADED: no message assert 'DEGRADED' == 'STABLE' - DEGRADED + STABLE
   2 assert 0 == 1
Comment by Mergebot [ 16/Jul/18 ] |
@alexr overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2997 (Title: Split test_ee_signal and improve debug logging on failure., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 16/Jul/18 ] |
@gpaul overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3064 (Title: [1.12/master] Use ngx.timer.every() for the AR cache update, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 16/Jul/18 ] |
@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3032 (Title: Add an integration test for auto load cgroups subsystems and container-specific cgroups mounts., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Sergey Urbanovich (Inactive) [ 16/Jul/18 ] |
Karsten Jeschkies I'd say that 2 out of 5 could be related to a network issue. I've recently rewritten the whole mesos polling in dcos-net and added some logs. The patch will be merged with the next master train. I would like to wait for some time and collect new test failures with those logs and the sandbox data. At the moment I don't see any marathon-related issues. |
Comment by Karsten Jeschkies (Inactive) [ 17/Jul/18 ] |
Sergey Urbanovich, thanks for the feedback. Senthil Kumaran, would it be possible to set up a loop for DC/OS master to gather the data constantly? The comments by the merge bot are hard to analyze and come from pull requests, which distorts the results. |
Comment by Pawel Rozlach [ 17/Jul/18 ] |
Karsten Jeschkies I already created a Jira (DCOS-17519) for that and tried to get lots of different people to notice it, but so far it has been ignored. |
Comment by Senthil Kumaran (Inactive) [ 17/Jul/18 ] |
> Senthil Kumaran, would it be possible to set up a loop for DC/OS master to gather the data constantly?
Karsten Jeschkies - Yes, I have it today. It is good that we are focusing on this problem; let us not lose momentum on this. |
Comment by Mergebot [ 18/Jul/18 ] |
@alexr overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2702 (Title: dcos-checks: bump for cockroachdb ranges check and enable config, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Jul/18 ] |
@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2975 (Title: [master] Provide Adminrouter URL for IAM access, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Jul/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3062 (Title: Adds dataDir to ucr bridge cni configuration, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Senthil Kumaran (Inactive) [ 19/Jul/18 ] |
> would it be possible to setup a loop for DC/OS master to gather the data constantly? Karsten Jeschkies / Sergey Urbanovich Let's keep an eye on this - https://teamcity.mesosphere.io/viewType.html?buildTypeId=DcOs_Enterprise_Test_Inte_TestVipExclusive&branch_DcOs_Enterprise_Test_Inte=%3Cdefault%3E&tab=buildTypeStatusDiv
This is test_vip-exclusive: it is going to exercise only `pytest -k test_vip` every 3 hours, and if the test step fails, the cluster won't be deleted. Let's monitor this one.
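A rough sketch of what that periodic job amounts to (illustrative only; the TeamCity build configuration linked above is the source of truth, and the test path here is assumed to match the open-source suite):

import subprocess
import sys
import time

# Run only the test_vip permutations every 3 hours; stop on the first failure
# so the cluster is left around for inspection.
while True:
    result = subprocess.run(
        ["pytest", "-k", "test_vip", "open_source_tests/test_networking.py"])
    if result.returncode != 0:
        sys.exit("test_vip failed; leaving the cluster up for debugging")
    time.sleep(3 * 60 * 60)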
|
Comment by Gustav Paul (Inactive) [ 19/Jul/18 ] |
First failures: |
Comment by Sergey Urbanovich (Inactive) [ 20/Jul/18 ] |
Senthil Kumaran In that job all tests on ucr are failing consistently. It doesn't sound like the test_vip flakiness. |
Comment by Senthil Kumaran (Inactive) [ 23/Jul/18 ] |
Hey Sergey Urbanovich - You are right, the UCR failure is unrelated to this. I am investigating it further here. If this is broken in master, then it is being observed only in the AWS Onprem w/ Static Backend and Security Strict test suite. Further investigation is in progress in https://jira.mesosphere.com/browse/DCOS-39700, as we don't want to merge those failures with the flaky behavior of test_vip. |
Comment by Mergebot [ 24/Jul/18 ] |
@alexr overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3133 (Title: gen/calc: normalize check timeouts, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 30/Jul/18 ] |
@alexr overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3102 (Title: Enable Mesos jemalloc and memory profiling support in DC/OS, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 31/Jul/18 ] |
@alexr overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3123 (Title: Enable Mesos jemalloc and memory profiling support in DC/OS, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 31/Jul/18 ] |
@branden overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3175 (Title: [1.10] Mergebot Automated Train PR - 2018-Jul-31-10-00, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3120 (Title: Add Telegraf as a DC/OS component, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 02/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3164 (Title: [1.11] Mergebot Automated Train PR - 2018-Aug-02-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 03/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3174 (Title: [master] Mergebot Automated Train PR - 2018-Aug-02-23-24, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Senthil Kumaran (Inactive) [ 07/Aug/18 ] |
Hi Karsten Jeschkies / Sergey Urbanovich - I hope you noticed that the sandbox logs are now being collected on these jobs (done as explained in https://jira.mesosphere.com/browse/DCOS-39211). Moreover, we have a periodic execution of just the test_vip test case that has been showing a consistent pattern of intermittent failures here - https://teamcity.mesosphere.io/viewType.html?buildTypeId=FooBar_DcOs_Enterprise_Test_Inte_TestVipExclusive&tab=buildTypeHistoryList&branch_DcOs_Enterprise_Test_Inte=1.12.DCOS-39700.t1 I hope that is useful for further debugging.
|
Comment by Mergebot [ 07/Aug/18 ] |
@timweidner overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3226 (Title: [1.10] packages/java: Update Java to 8u181 version, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3227 (Title: [master] Mergebot Automated Train PR - 2018-Aug-07-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Senthil Kumaran (Inactive) [ 09/Aug/18 ] |
I am setting the priority to High. It has had "Blocker" status for a long time, yet we have not blocked any releases due to this bug. |
Comment by Mergebot [ 09/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3233 (Title: [Backport][1.11] locks the glide version to fix build issue *, Branch: *1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Senthil Kumaran (Inactive) [ 10/Aug/18 ] |
Hello, we have been tracking this issue as a flaky bug/task. Please make sure that metadata such as Priority and Issue Type reflect the status accurately. If this is a frequently observed flake, please set the priority to Blocker or High. |
Comment by Mergebot [ 10/Aug/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3185 (Title: packages/bouncer: replace `dig` for leader detection with python only implementation, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Adam Dangoor (Inactive) [ 10/Aug/18 ] |
Sergey Urbanovich Karsten Jeschkies - it looks like recently reported failures show a different failure from the one in the description. Is this because of work done to make the error clearer? In particular:

self = <test_networking.MarathonPod object at 0x7efdc4194b00>
dcos_api_session = <dcos_test_utils.enterprise.EnterpriseApiSession object at 0x7efdfb957d30>

    @retrying.retry(
        wait_fixed=5000,
        stop_max_delay=20 * 60 * 1000,
        retry_on_result=lambda res: res is False)
    def wait(self, dcos_api_session):
        r = dcos_api_session.marathon.get('/v2/pods/{}::status'.format(self.id))
        assert_response_ok(r)
        self._info = r.json()
        error_msg = 'Status was {}: {}'.format(self._info['status'], self._info.get('message', 'no message'))
>       assert self._info['status'] == 'STABLE', error_msg
E       AssertionError: Status was DEGRADED: no message
E       assert 'DEGRADED' == 'STABLE'
E       - DEGRADED
E       + STABLE

I will ask for an override against this issue, and I'd ask that, if possible, you change the description of this issue.
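For what it's worth, a sketch of how the pod wait could report more than "no message" when the status is DEGRADED - assuming the ::status response also carries a list of instances with their own status fields (the field names here are illustrative and not verified against the Marathon API):

import retrying

@retrying.retry(
    wait_fixed=5000,
    stop_max_delay=20 * 60 * 1000,
    retry_on_result=lambda res: res is False)
def wait_for_stable(dcos_api_session, pod_id):
    # Hypothetical standalone variant of MarathonPod.wait that includes
    # per-instance states in the assertion message.
    r = dcos_api_session.marathon.get('/v2/pods/{}::status'.format(pod_id))
    r.raise_for_status()
    info = r.json()
    instances = [(i.get('id'), i.get('status')) for i in info.get('instances', [])]
    error_msg = 'Status was {}: {} (instances: {})'.format(
        info['status'], info.get('message', 'no message'), instances)
    assert info['status'] == 'STABLE', error_msg
|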
Comment by Mergebot [ 10/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3228 (Title: [master] Mergebot Automated Train PR - 2018-Aug-09-23-45, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 10/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3259 (Title: [1.10] Mergebot Automated Train PR - 2018-Aug-09-23-45, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 10/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3259 (Title: [1.10] Mergebot Automated Train PR - 2018-Aug-09-23-45, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Karsten Jeschkies (Inactive) [ 13/Aug/18 ] |
This test has multiple flakes; see my comment. Since test_vip is not split up and covers a lot of DC/OS, this JIRA has become a pool of all sorts of flake reports. Our (i.e. Aleksey Dukhovniy's and my) suggestion was
This way we would know when an override is appropriate or not. AFAIK 1. is not going to happen.
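To illustrate the splitting suggestion (this is not an actual patch, and the case names are made up), the permutations could carry explicit pytest IDs or live in separate test functions, so that a failure - and the corresponding override - names one specific container/network combination instead of just "test_vip":

import pytest

# Illustrative only: with explicit IDs, a flake in UCR-on-overlay is reported
# and tracked separately from, say, Docker bridge networking.
VIP_CASES = [
    pytest.param('MESOS', 'USER', 'USER', id='ucr-overlay-to-overlay'),
    pytest.param('DOCKER', 'BRIDGE', 'HOST', id='docker-bridge-to-host'),
    pytest.param('NONE', 'HOST', 'USER', id='host-to-overlay'),
]

@pytest.mark.parametrize('container,vip_net,proxy_net', VIP_CASES)
def test_vip_case(container, vip_net, proxy_net):
    # The real test body would deploy the origin and proxy apps and probe the VIP.
    assert None not in (container, vip_net, proxy_net)
|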
Comment by Mergebot [ 13/Aug/18 ] |
@jp overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3248 (Title: [1.11] changelog: Add Java update note, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Aug/18 ] |
@drozhkov overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3210 (Title: Bump ui to 1.11+v1.17.0, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Aug/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3268 (Title: [master] Mergebot Automated Train PR - 2018-Aug-13-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Adam Dangoor (Inactive) [ 15/Aug/18 ] |
Karsten Jeschkies Can you suggest an alternative for someone with a PR that hits this flake? |
Comment by Mergebot [ 15/Aug/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/2998 (Title: Test that a configurable permissions cache is used by various authorizers, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 15/Aug/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3221 (Title: Implement bootstrap methods for telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 15/Aug/18 ] |
@cprovencher overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3274 (Title: [master] Mergebot Automated Train PR - 2018-Aug-14-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 16/Aug/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3250 (Title: Implement bootstrap methods for telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 20/Aug/18 ] |
@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3206 (Title: setup.py: specify all files in ./pkgpanda/docker/dcos-builder, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 21/Aug/18 ] |
@cprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3273 (Title: [master] Mergebot Automated Train PR - 2018-Aug-20-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 21/Aug/18 ] |
@cprovencher overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3313 (Title: [1.10] Mergebot Automated Train PR - 2018-Aug-21-10-00, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 23/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3295 (Title: [WIP] Bump cosmos testing version, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 24/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3265 (Title: Bumping dcos-test-utils version, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 24/Aug/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3329 (Title: [master] Mergebot Automated Train PR - 2018-Aug-23-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 24/Aug/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3329 (Title: [master] Mergebot Automated Train PR - 2018-Aug-23-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 27/Aug/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3310 (Title: [DCOS-39776] Remove disabled security mode, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 27/Aug/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3346 (Title: [master] Mergebot Automated Train PR - 2018-Aug-27-23-27, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 27/Aug/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3344 (Title: [1.11] Mergebot Automated Train PR - 2018-Aug-27-19-19, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 28/Aug/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3321 (Title: [1.11] Mergebot Automated Train PR - 2018-Aug-27-19-20, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 28/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3321 (Title: [1.11] Mergebot Automated Train PR - 2018-Aug-27-19-20, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 28/Aug/18 ] |
@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3148 (Title: Add timestamp for dmesg, distro version, timedatectl and systemd unit status to diag bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Sergey Urbanovich (Inactive) [ 28/Aug/18 ] |
Here is a really good example for the mesos team: Mesos couldn't start a UCR container on the host network. Vinod Kone, could someone help us with this? Please check the Artifacts tab for logs and mesos sandboxes.

2018-08-28 18:14:39: I0828 18:14:39.182575 13694 containerizer.cpp:2006] Checkpointing container's forked pid 28601 to '/var/lib/mesos/slave/meta/slaves/55852c35-cc19-415d-a747-9dcfb7472e9d-S1/frameworks/55852c35-cc19-415d-a747-9dcfb7472e9d-0001/executors/integration-test-628cbd1f301b4c07bd6946cf4eb35168.39e5ccf3-aaee-11e8-a0d8-fe211abb3180/runs/a6ebb482-8bcf-4524-9b6b-0b91b3150efb/pids/forked.pid'
2018-08-28 18:24:38: I0828 18:24:38.591409 13695 slave.cpp:6790] Terminating executor 'integration-test-628cbd1f301b4c07bd6946cf4eb35168.39e5ccf3-aaee-11e8-a0d8-fe211abb3180' of framework 55852c35-cc19-415d-a747-9dcfb7472e9d-0001 because it did not register within 10mins |
Comment by Mergebot [ 29/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3346 (Title: [master] Mergebot Automated Train PR - 2018-Aug-27-23-27, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 30/Aug/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3362 (Title: Disable a watchdog for stuck processes, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 30/Aug/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3359 (Title: [1.10] Disable a watchdog for stuck processes, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 31/Aug/18 ] |
@jp overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3370 (Title: Fix the release create stage., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 31/Aug/18 ] |
@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3372 (Title: [master] Mergebot Automated Train PR - 2018-Aug-31-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Jan-Philip Gehrcke (Inactive) [ 03/Sep/18 ] |
Highly relevant discussion: https://github.com/dcos/dcos/pull/1801 Also see |
Comment by Mergebot [ 04/Sep/18 ] |
@drozhkov overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3382 (Title: chore(dcos-ui): bump DC/OS UI dcos-ui/master+v2.19.4, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 04/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3359 (Title: Bumping marathon to 1.7.111, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 04/Sep/18 ] |
@branden overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3347 (Title: [WIP] Add plugins to Telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 04/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3352 (Title: [master] Mergebot Automated Train PR - 2018-Sep-03-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 05/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3365 (Title: adminrouter: authentication architecture adjustments (WIP), Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 05/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3351 (Title: [1.11] Bump Mesos to nightly 1.5.x 19d17ce, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 05/Sep/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3366 (Title: [WIP] Add plugins to Telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 05/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3366 (Title: [WIP] Add plugins to Telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 05/Sep/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3366 (Title: [master] Mergebot Automated Train PR - 2018-Sep-05-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Vinod Kone (Inactive) [ 06/Sep/18 ] |
Gilbert Song and Qian Zhang will triage this. |
Comment by Mergebot [ 07/Sep/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3254 (Title: Prevent dcos-history leaking auth tokens, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Qian Zhang (Inactive) [ 07/Sep/18 ] |
For the Docker containerizer's case (i.e., the test `test_networking.test_vip[Container_DOCKER-Network_BRIDGE-Network_HOST]`), I checked the stderr of the Docker executor and found an error:

...
E0830 00:09:37.303499 2428 executor.cpp:385] Failed to inspect container 'mesos-eaa4f455-0a2c-47ff-bf98-8bd0ad243740': Unable to create container: Unable to find Id in container
[2018-08-30 00:09:37,745] INFO: HTTP server is starting, port: 3511, test-UUID: '0d4176ad55894360907e0e4ea6ce0f81'
...

So the Docker executor had already launched the Docker container, but the output of `docker inspect` did not include the container's ID. This is weird; I have never seen this issue before. |
Comment by Qian Zhang (Inactive) [ 07/Sep/18 ] |
For the Mesos containerizer's cases, after checking the logs, I found there are actually two different cases.

Case 1: In the stderr of the executor, I see only one message:

Failed to synchronize with agent (it's probably exited)

This is an error which could happen when reading a pipe.

Case 2: The stderr of the executor is empty, and in the agent log I see:

2018-09-03 20:30:06: I0903 20:30:06.007843 13710 cni.cpp:952] Bind mounted '/proc/13189/ns/net' to '/run/mesos/isolators/network/cni/d955d3cb-099e-496a-87a9-fc89ef3567ef/ns' for container d955d3cb-099e-496a-87a9-fc89ef3567ef
2018-09-03 20:50:13: I0903 20:50:13.872481 13713 cni.cpp:1383] Got assigned IPv4 address '172.31.254.22/24' from CNI network 'mesos-bridge' for container d955d3cb-099e-496a-87a9-fc89ef3567ef

So it took 20 minutes for the CNI isolator to get an IP for the container, which is weird. |
Comment by Karsten Jeschkies (Inactive) [ 10/Sep/18 ] |
Qian Zhang, do you think it would make sense to track these issues separately? This flake still shows up as a single one in Carter Gawron's summaries even though there are multiple independent issues. |
Comment by Carter Gawron [ 10/Sep/18 ] |
Anything we can do to split this up and resolve it would be great. We have overridden this issue 119 times. That's ~40 hours' worth of work just doing that.
|
Comment by Senthil Kumaran (Inactive) [ 10/Sep/18 ] |
Karsten - Those multiple independent issues can be tracked separately, with this issue made dependent on them. We have already done that for this ticket; please see the issue links. Once the core issues are fixed and the flakiness is resolved, we should close this issue. I hope you do not mean that we should close this issue and track those independent issues instead. That won't help much IMO. We are close to resolution on this problem, and it will be great to get across the finish line with this.
|
Comment by Qian Zhang (Inactive) [ 10/Sep/18 ] |
I suspect this issue (at least the Mesos containerizer's case) may be caused by the FD leak bug that Gilbert recently fixed in Mesos; that fix landed in the DC/OS master branch 5 days ago. If this issue happens again, this ticket will be automatically updated by Mergebot with a new comment, right? I will keep monitoring this ticket and see if there is anything different if this issue happens again. |
Comment by Mergebot [ 10/Sep/18 ] |
@charlesprovencher overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3416 (Title: [1.11] Mergebot Automated Train PR - 2018-Sep-10-16-49, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 10/Sep/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3416 (Title: [1.11] Mergebot Automated Train PR - 2018-Sep-10-16-49, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 11/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3414 (Title: [1.10] Backported detailed resource logging for some allocator errors in Mesos., Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 11/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3371 (Title: Add wait command to dcos-docker instructions, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Qian Zhang (Inactive) [ 11/Sep/18 ] |
This test failed again for the Mesos containerizer case (see the last comment added by MergeBot) in another place. This time in the executor's stderr, I see the task has been started successfully and the health check returned 200 which is also good. [2018-09-07 23:18:36,926] INFO: HTTP server is starting, port: 12830, test-UUID: '0979a2280fc3431e9904885603a0c810' [2018-09-07 23:18:50,273] INFO: REQ: 127.0.0.1 "GET /ping HTTP/1.1" 200 - I0907 23:18:50.315387 9 checker_process.cpp:1140] HTTP health check for task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' returned: 200 I0907 23:18:50.315495 9 executor.cpp:350] Received task health update, healthy: true But the weird thing is, agent did not receive any status updates for this task from the executor. $ grep integration-test-0979a2280fc3431e9904885603a0c810 ~/Downloads/dcos-mesos-slave.service 2018-09-07 23:18:34: I0907 23:18:34.559394 16711 slave.cpp:2035] Got assigned task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:18:34: I0907 23:18:34.559952 16711 slave.cpp:2409] Authorizing task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:18:34: I0907 23:18:34.560616 16711 slave.cpp:2852] Launching task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:18:34: I0907 23:18:34.560703 16711 paths.cpp:745] Creating sandbox '/var/lib/mesos/slave/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f' for user 'root' 2018-09-07 23:18:34: I0907 23:18:34.561393 16711 paths.cpp:748] Creating sandbox '/var/lib/mesos/slave/meta/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f' 2018-09-07 23:18:34: I0907 23:18:34.561604 16711 slave.cpp:9015] Launching executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 with resources [{"allocation_info":{"role":"slave_public"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/slave/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f' 2018-09-07 23:18:34: I0907 23:18:34.561944 16711 slave.cpp:3530] Launching container e56b8602-0de7-4b57-bc61-c839a28e554f for executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:18:34: I0907 23:18:34.562489 16711 slave.cpp:3049] Queued task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of 
framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:18:34: I0907 23:18:34.603900 16711 containerizer.cpp:2022] Checkpointing container's forked pid 28352 to '/var/lib/mesos/slave/meta/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f/pids/forked.pid' 2018-09-07 23:18:35: I0907 23:18:35.235574 16710 slave.cpp:4824] Got registration for executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 from executor(1)@10.10.0.145:33052 2018-09-07 23:18:35: I0907 23:18:35.263344 16708 slave.cpp:3262] Sending queued task 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' to executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 at executor(1)@10.10.0.145:33052 2018-09-07 23:38:38: I0907 23:38:38.775454 16713 slave.cpp:3636] Asked to kill task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:38:39: I0907 23:38:39.933745 16710 slave.cpp:6310] Executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 exited with status 0 2018-09-07 23:38:39: I0907 23:38:39.933826 16710 slave.cpp:5290] Handling status update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 from @0.0.0.0:0 2018-09-07 23:38:39: E0907 23:38:39.933982 16710 slave.cpp:5621] Failed to update resources for container e56b8602-0de7-4b57-bc61-c839a28e554f of executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' running task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 on status update for terminal task, destroying container: Container not found 2018-09-07 23:38:39: I0907 23:38:39.934042 16710 task_status_update_manager.cpp:328] Received task status update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:38:39: I0907 23:38:39.934334 16710 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:38:39: I0907 23:38:39.934468 16712 slave.cpp:5782] Forwarding the update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 to master@10.10.0.104:5050 2018-09-07 23:38:39: I0907 23:38:39.950619 16712 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task 
integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:38:39: I0907 23:38:39.950681 16712 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_FAILED (Status UUID: 4fe40aba-2210-464d-ae76-dc1c61614ac8) for task integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727 of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 2018-09-07 23:38:39: I0907 23:38:39.950841 16712 slave.cpp:6408] Cleaning up executor 'integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' of framework e4956377-6a5b-4a83-9277-7f35da39387e-0000 at executor(1)@10.10.0.145:33052 2018-09-07 23:38:39: I0907 23:38:39.951190 16712 gc.cpp:95] Scheduling '/var/lib/mesos/slave/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f' for gc 1.9999889926163days in the future 2018-09-07 23:38:39: I0907 23:38:39.951225 16712 gc.cpp:95] Scheduling '/var/lib/mesos/slave/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for gc 1.99998899182815days in the future 2018-09-07 23:38:39: I0907 23:38:39.951248 16712 gc.cpp:95] Scheduling '/var/lib/mesos/slave/meta/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727/runs/e56b8602-0de7-4b57-bc61-c839a28e554f' for gc 1.99998899146963days in the future 2018-09-07 23:38:39: I0907 23:38:39.951269 16712 gc.cpp:95] Scheduling '/var/lib/mesos/slave/meta/slaves/e4956377-6a5b-4a83-9277-7f35da39387e-S1/frameworks/e4956377-6a5b-4a83-9277-7f35da39387e-0000/executors/integration-test-0979a2280fc3431e9904885603a0c810.578be3f4-b2f4-11e8-afd5-92e466048727' for gc 1.99998899119111days in the future Only one task status (`TASK_FAILED`) was handled by agent for this task, but I suspect that status update was generated by the agent itself rather than sent from executor. It looks like executor cannot send any status updates to the agent. |
Comment by Karsten Jeschkies (Inactive) [ 12/Sep/18 ] |
Well, as I said, Carter's weekly summary did not surface these details. Maybe it should not. However, as an engineer, this ticket becomes very hard to follow. Mergebot is generating a lot of noise, and the core issues being investigated are not obvious from looking at this ticket. Also, the only two open related issues are
As Aleksey Dukhovniy and I mentioned before to Pawel Rozlach and Fabricio de Sousa Nascimento, the test should be split up, as should this ticket. There is no way we can help and get ahead of this if the rest of the company thinks that test_vip is one flaky test. It is not. Anyways, kudos to Qian for diving into this. |
Comment by Jan-Philip Gehrcke (Inactive) [ 12/Sep/18 ] |
The override command data shows that the test_vip instability has hurt us really badly in the past two months (more than any other instability) and justifies assembling a "tiger team" of individual domain experts which focuses on finding and addressing the individual causes of the test_vip instability. From the looks of it we almost have such a team (comprising Aleksey Dukhovniy, Sergey Urbanovich, Karsten Jeschkies, Qian Zhang, ...), but I think we should make this a first-class effort and make sure that they can focus. CC Artem Harutyunyan Chandler Hoisington.
While some might indeed think of test_vip as a single flaky test, you can be sure that others (like me) know how diverse and mean the test_vip instability is. We know that it runs many more Marathon apps than other tests, which is why, statistically, it suffers from even minor instabilities around app and task launches. And getting to the bottom of the individual, independent causes is indeed what we must focus on. I propose:
|
Comment by Mergebot [ 12/Sep/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3426 (Title: [master] Mergebot Automated Train PR - 2018-Sep-12-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Qian Zhang (Inactive) [ 12/Sep/18 ] |
This test failed again, but it seems there are no logs? |
Comment by Mergebot [ 12/Sep/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3357 (Title: [Backport] [1.11] bump mesos-module to include the fix for coreos 1800.7.0, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3406 (Title: exhibitor package: bump ZooKeeper to 3.4.13 release, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Sep/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3431 (Title: chore: bump dcos-ui v1.10+v1.10.9-rc3, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Sep/18 ] |
@jonathangiddy overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3428 (Title: exhibitor package: bump ZooKeeper to 3.4.13 release, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Sep/18 ] |
@gauripowale overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3437 (Title: [1.11] Mergebot Automated Train PR - 2018-Sep-13-11-00, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Qian Zhang (Inactive) [ 13/Sep/18 ] |
I created an OSS ticket (https://issues.apache.org/jira/browse/MESOS-9231) to track the Docker containerizer issue, and I will try to manually reproduce the UCR issue with `dcos-launch`. |
Comment by Mergebot [ 13/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3404 (Title: [master] Mergebot Automated Train PR - 2018-Sep-12-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Sep/18 ] |
@kapil overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3350 (Title: Bumped Mesos SHA for dc/os 1.11 container cleanup EBUSY fix., Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/Sep/18 ] |
@jp overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3447 (Title: [1.11] Bump Mesos to nightly 1.5.x 5a7ad47, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3434 (Title: openssl: bump to 1.0.2p, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/Sep/18 ] |
@gpaul overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3443 (Title: openssl: bump to 1.0.2p, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/Sep/18 ] |
@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3443 (Title: openssl: bump to 1.0.2p, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Gustav Paul (Inactive) [ 17/Sep/18 ] |
Any idea why we no longer artifact the sandbox logs? We were tracking collection of sandbox logs in https://jira.mesosphere.com/browse/DCOS-39211, which is resolved, yet I don't see any sandbox logs in any of today's overrides. This test is unbelievably flaky, and we're about to GA a release while this test (which exercises our stack end-to-end) is failing several times per day. As far as I understand the current status, there is no way to make progress without the sandbox logs. Senthil Kumaran Charles Provencher Patrick Crews Carter Gawron Please help! |
Comment by Senthil Kumaran (Inactive) [ 17/Sep/18 ] |
Hi Gustav Paul - The sandbox logs are collected. Also, the general trend for us is to "improve" and collect more logs, not to regress on what we gather. On the latest failures:

1) *teamcity/dcos/test/aws/onprem/static* - The DC/OS installation itself had failed. It can be seen from the Build Logs in the console: `+ ./dcos-launch wait` kept waiting for the cluster to come up and it never came up. Transient network issues? (perhaps). Would journald logs on the bootstrap node help in addition to console logs? (That would be an addition, but we seem to be able to gather what happened from the console logs here.) Since DC/OS didn't come up, we don't have master or sandbox logs here.

2) teamcity/dcos/test/dcos-docker/static - Collection of logs for this has not been added yet. Only the enterprise side was added by Charles recently, and the steps need to be copied over to the Open status too. This is in progress as we are trying to make sure all status checks have a consistent set of logs. (https://jira.mesosphere.com/browse/DCOS-41749)

In this specific scenario, re-triggering teamcity/dcos/test/aws/onprem/static must have helped, and Qian is assisting us with the mesos bug here and has gotten access to valuable logs so far. HTH. |
Comment by Senthil Kumaran (Inactive) [ 17/Sep/18 ] |
Gustav Paul - If you look at any other failure that is an actual test_vip failure, and not a cluster creation failure (which shouldn't be linked to this ticket), you will find the sandbox logs available - e.g. https://teamcity.mesosphere.io/viewLog.html?buildId=1210700&buildTypeId=DcOs_Open_Test_IntegrationTest_AwsOnpremWStaticBackend&tab=artifacts |
Comment by Gustav Paul (Inactive) [ 17/Sep/18 ] |
Thanks Senthil Kumaran! I'm still confused though, for example this build from yesterday: That is an Enterprise strict-mode build from yesterday, test_vip failed, and I don't see the sandbox logs, while I do see them for the build you linked (awesome, btw). Do you perhaps mean that the Enterprise builds don't collect sandbox logs yet but the OSS builds do? |
Comment by Senthil Kumaran (Inactive) [ 17/Sep/18 ] |
Hi Gustav Paul - That was a miss on our side (Tools Infra). The job that you linked should have sandbox logs collected; it looks like we failed to add it to the Docker job on Enterprise. I have added a comment on that task, https://jira.mesosphere.com/browse/DCOS_OSS-3738, asking for it to be reopened and completed. The epic DCOS-41749 tracks making sure all relevant logs are made available consistently for the TC jobs. |
Comment by Gustav Paul (Inactive) [ 17/Sep/18 ] |
Thanks Senthil Kumaran, the log collection effort is fiddly, but I believe it's going to pay for itself a thousand times over. |
Comment by Mergebot [ 17/Sep/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3452 (Title: 1.12.0 beta2 train, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3436 (Title: Fix 500 responses from v0 metrics API, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3455 (Title: [1.11] packages/bootstrap: Do not remove permissions from dcos_marathon and dcos_metronome service accounts, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@drozhkov overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3462 (Title: Bump ui to v1.22.0, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3242 (Title: Add LDAP_GROUP_IMPORT_LIMIT_SECONDS Bouncer configuration variable, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3411 (Title: Enabled GC of nested container sandboxes by the Mesos agent., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3448 (Title: Fix expected sha1 value for the rewrite_amd64_en-US.msi installer, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3448 (Title: Fix expected sha1 value for the rewrite_amd64_en-US.msi installer, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3457 (Title: [1.10] Mergebot Automated Train PR - 2018-Sep-19-10-00, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@jonathangiddy overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3464 (Title: (1.12) Fix 500 responses from v0 metrics API, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Sep/18 ] |
@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3472 (Title: Add more context to vip test app names, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 20/Sep/18 ] |
@kapil overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3477 (Title: [1.12] Bump Mesos to nightly 1.7.x 06eb5ba, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Qian Zhang (Inactive) [ 21/Sep/18 ] |
I reproduced a failed Mesos containerizer case in a DC/OS cluster launched with `dcos-docker`. It was caused by a container which was stuck in the `ISOLATING` state; here are the agent logs for that container:
Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.343425 1275 paths.cpp:745] Creating sandbox '/var/lib/mesos/slave/slaves/453c7b65-353f-4944-9b49-d5dcbba2e6f5-S2/frameworks/453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001/executors/integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003/runs/85809953-a904-4823-9279-a46b023be09a' for user 'root' Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.345252 1275 paths.cpp:748] Creating sandbox '/var/lib/mesos/slave/meta/slaves/453c7b65-353f-4944-9b49-d5dcbba2e6f5-S2/frameworks/453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001/executors/integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003/runs/85809953-a904-4823-9279-a46b023be09a' Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.347165 1275 slave.cpp:8997] Launching executor 'integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003' of framework 453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001 with resources [{"allocation_info":{"role":"slave_public"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/slave/slaves/453c7b65-353f-4944-9b49-d5dcbba2e6f5-S2/frameworks/453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001/executors/integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003/runs/85809953-a904-4823-9279-a46b023be09a' Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.348870 1275 slave.cpp:3530] Launching container 85809953-a904-4823-9279-a46b023be09a for executor 'integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003' of framework 453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001 Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.352116 1275 containerizer.cpp:1282] Starting container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.354837 1275 provisioner.cpp:545] Provisioning image rootfs '/var/lib/mesos/slave/provisioner/containers/85809953-a904-4823-9279-a46b023be09a/backends/overlay/rootfses/02148442-f072-46e2-8809-b43f982e784d' for container 85809953-a904-4823-9279-a46b023be09a using overlay backend Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.360311 1275 containerizer.cpp:3120] Transitioning the state of container 85809953-a904-4823-9279-a46b023be09a from PROVISIONING to PREPARING Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.375891 1279 memory.cpp:478] Started listening for OOM events for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.376188 1279 memory.cpp:590] Started listening on 'low' memory pressure events for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.376269 1279 memory.cpp:590] Started listening on 'medium' memory pressure events for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 
mesos-agent[1270]: I0921 07:21:39.376343 1279 memory.cpp:590] Started listening on 'critical' memory pressure events for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.381487 1275 cpu.cpp:92] Updated 'cpu.shares' to 204 (cpus 0.2) for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.381837 1275 cpu.cpp:112] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.382069 1275 memory.cpp:198] Updated 'memory.soft_limit_in_bytes' to 64MB for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.393718 1275 memory.cpp:227] Updated 'memory.limit_in_bytes' to 64MB for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.422675 1279 secret.cpp:309] 0 secrets have been resolved for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.521315 1277 switchboard.cpp:316] Container logger module finished preparing container 85809953-a904-4823-9279-a46b023be09a; IOSwitchboard server is not required Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.534044 1275 linux_launcher.cpp:492] Launching container 85809953-a904-4823-9279-a46b023be09a and cloning with namespaces CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNET Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.800432 1277 containerizer.cpp:2046] Checkpointing container's forked pid 20039 to '/var/lib/mesos/slave/meta/slaves/453c7b65-353f-4944-9b49-d5dcbba2e6f5-S2/frameworks/453c7b65-353f-4944-9b49-d5dcbba2e6f5-0001/executors/integration-test-6de4f1fbc9644f80993b6170e9e432f0.fb2f1952-bd6e-11e8-9fcb-70b3d5800003/runs/85809953-a904-4823-9279-a46b023be09a/pids/forked.pid' Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.805660 1277 containerizer.cpp:3120] Transitioning the state of container 85809953-a904-4823-9279-a46b023be09a from PREPARING to ISOLATING Sep 21 07:21:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:39.814954 1279 cni.cpp:962] Bind mounted '/proc/20039/ns/net' to '/run/mesos/isolators/network/cni/85809953-a904-4823-9279-a46b023be09a/ns' for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:41 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:41.238293 1281 cni.cpp:1394] Got assigned IPv4 address '172.31.254.185/24' from CNI network 'mesos-bridge' for container 85809953-a904-4823-9279-a46b023be09a Sep 21 07:21:41 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:21:41.239543 1281 cni.cpp:1102] Unable to find DNS nameservers for container 85809953-a904-4823-9279-a46b023be09a, using host '/etc/resolv.conf' Sep 21 07:31:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:31:39.350037 1276 containerizer.cpp:2457] Destroying container 
85809953-a904-4823-9279-a46b023be09a in ISOLATING state
Sep 21 07:31:39 dcos-e2e-4ab303d7-436b-4b07-8b78-43dd9abacd34-agent-0 mesos-agent[1270]: I0921 07:31:39.350167 1276 containerizer.cpp:3120] Transitioning the state of container 85809953-a904-4823-9279-a46b023be09a from ISOLATING to DESTROYING

So the container was stuck in the `ISOLATING` state for 10 minutes and then the containerizer tried to destroy it, but the destroy can never finish since we need to wait for the isolators to finish isolating. So there must be an isolator whose `isolate()` method never returned. I will add more logs in the isolators and try to figure out which isolator caused this issue.
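To make this kind of hang easier to spot in future log bundles, here is a rough sketch (assuming journald-style agent logs like the excerpt above; this is not part of any existing tooling) that measures how long each container spent between containerizer state transitions:

import re
import sys
from datetime import datetime

# Print how long each container sat in a state before the next
# "Transitioning the state of container X from A to B" message.
LINE_RE = re.compile(
    r'(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}).*Transitioning the state of container '
    r'(?P<cid>\S+) from (?P<src>\w+) to (?P<dst>\w+)')

last_seen = {}  # container id -> (state entered, timestamp)
with open(sys.argv[1], errors='replace') as log:
    for line in log:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group('ts').replace('  ', ' '), '%b %d %H:%M:%S')
        cid = m.group('cid')
        if cid in last_seen:
            state, since = last_seen[cid]
            print(cid, 'spent', (ts - since).total_seconds(), 'seconds in', state)
        last_seen[cid] = (m.group('dst'), ts)
|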
Comment by Mergebot [ 24/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3461 (Title: [master] Mergebot Automated Train PR - 2018-Sep-19-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 24/Sep/18 ] |
Pull Request, https://github.com/dcos/dcos/pull/3475, associated with the JIRA ticket was merged into DC/OS 1.12.0 |
Comment by Mergebot [ 25/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3330 (Title: Mesos modules: increase network timeout, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 25/Sep/18 ] |
@gpaul overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3505 (Title: [1.11] Add more data to diagnostics bundle, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 25/Sep/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3507 (Title: [BACKPORT] Mesos modules Increased IAM timeout, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 27/Sep/18 ] |
@philip overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3519 (Title: [1.12] Telegraf fixes, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 27/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3515 (Title: [master] Mergebot Automated Train PR - 2018-Sep-26-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 27/Sep/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3512 (Title: [master] Mergebot Automated Train PR - 2018-Sep-26-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 28/Sep/18 ] |
@philip overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3533 (Title: Marathon remove precheck on single node 1.11, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 28/Sep/18 ] |
@philip overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3531 (Title: Marathon remove precheck on single node 1.11, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Oct/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3536 (Title: New config.yaml to support Windows Build Artifacts in Separate S3 Bucket, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Matthias Eichstedt (Inactive) [ 02/Oct/18 ] |
DCOS-19619 also suffers from TaskGroup containers stuck in STARTING. I've linked it as a duplicate, but there are no sandboxes available to verify that the root cause is the same. |
Comment by Mergebot [ 02/Oct/18 ] |
@drozhkov overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3523 (Title: chore(dcos-ui): bump DC/OS UI v2.24.4, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 02/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3539 (Title: [1.12][DCOS-42419] Add UCR Support for package registry by adding v2 schema 1 manifests, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 02/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3553 (Title: 1.12: Pass ssl_keystore_password via MARATHON_ environment variables, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 03/Oct/18 ] |
@jonathangiddy overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3567 (Title: [master] packages/bouncer: bump bouncer, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 04/Oct/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3503 (Title: bump mesos-dns to bring in changes for mesos state endpoint, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 04/Oct/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3504 (Title: bump mesos-dns to bring in changes for mesos state endpoint, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 05/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3530 (Title: Marathon remove precheck on single node Master, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/Oct/18 ] |
@branden overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3571 (Title: [1.12] Grant containers dir ownership to dcos_telegraf, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 09/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3578 (Title: [1.12] Add root capabilities to dcos-diagnostics, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 11/Oct/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3626 (Title: Skip test_packaging_api, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 11/Oct/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3627 (Title: Skip test_packaging_api, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 11/Oct/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3607 (Title: [1.11] Bump dcos-net, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 11/Oct/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3608 (Title: [1.10] Bump navstar, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 12/Oct/18 ] |
@timweidner overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3629 (Title: [1.12] Backport tweidner/adangoor/fix-mesos-api-test-flake, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 12/Oct/18 ] |
@branden overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3610 (Title: (1.12) Assert system clock is synced before starting dcos-exhibitor, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 15/Oct/18 ] |
@timweidner overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3575 (Title: Handle exceptions during Metronome startup, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 16/Oct/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3625 (Title: Bump to the newest Metronome, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 16/Oct/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3653 (Title: [1.12] Bump dcos-net, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/Oct/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3667 (Title: 1.11 train 10/17, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3662 (Title: [master] Mergebot Automated Train PR - 2018-Oct-17-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3662 (Title: [master] Mergebot Automated Train PR - 2018-Oct-17-12-01, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 17/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3661 (Title: [1.12] Mergebot Automated Train PR - 2018-Oct-17-12-01, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3645 (Title: [1.12] Update dcos-diagnostics, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@jp overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3642 (Title: Update dcos-diagnostics, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3642 (Title: Update dcos-diagnostics, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@gpaul overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3648 (Title: maintenance_mode is enabled by default in 1.8 (1.13), Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3649 (Title: [1.10] Bump Mesos to nightly 1.4.x 82df2a4, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3654 (Title: Bump dcos-net, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3670 (Title: [1.12] Always prefer to serve schema 2 over schema 1 docker manifest, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3673 (Title: [master] Mergebot Automated Train PR - 2018-Oct-18-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3652 (Title: [master] Mergebot Automated Train PR - 2018-Oct-18-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 18/Oct/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3665 (Title: packages/dcos-integration-test/test_tls: Enable dcos-net TLS tests, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Oct/18 ] |
@jp overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3638 (Title: chore(dcos-ui): bump package to 1.11+v1.24.0, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3582 (Title: Do not pull overlay data when overlay is disable, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3568 (Title: [1.11] Do not pull overlay data when overlay is disable, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 19/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3654 (Title: bump marathon 1.6.654, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 20/Oct/18 ] |
@klueska overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3645 (Title: Bump dcos-log, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 20/Oct/18 ] |
@klueska overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 22/Oct/18 ] |
@klueska overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3645 (Title: Bump dcos-log, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 23/Oct/18 ] |
@skumaran overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3682 (Title: [1.11] Bump Mesos to nightly 1.5.x 2ead30d, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Adam Dangoor (Inactive) [ 23/Oct/18 ] |
What does it mean that this is "In Progress" but not assigned? |
Comment by Mergebot [ 23/Oct/18 ] |
@timweidner overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3692 (Title: [DCOS-43342] Retry reserving disk in Mesos v0 scheduler test., Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 23/Oct/18 ] |
@greg overrode teamcity/dcos/test/aws/cloudformation/simple status of dcos/dcos/pull/3590 (Title: [1.11] Add Mesos patches to ensure TEARDOWN is sent in v1 Java shim., Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 24/Oct/18 ] |
@greg overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3591 (Title: [1.10] Add Mesos patches to ensure TEARDOWN is sent in v1 Java shim., Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Greg Mann (Inactive) [ 25/Oct/18 ] |
Seems like we may have multiple failure modes leading to this test failure. Here are the Mesos agent logs from a repro I just attained of test_vip[Container.POD-Network.USER-Network.BRIDGE], filtered for the task and container ID of the test_vip task: Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.486881 2084 slave.cpp:2035] Got assigned task group containing tasks [ integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 ] for framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.488046 2084 slave.cpp:2409] Authorizing task group containing tasks [ integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 ] for framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.489056 2084 slave.cpp:8469] Authorizing framework principal 'dcos_marathon' to launch task integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.500901 2084 slave.cpp:2852] Launching task group containing tasks [ integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 ] for framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.500946 2084 paths.cpp:745] Creating sandbox '/var/lib/mesos/slave/slaves/00dc552a-2133-45b6-b2f1-15651df01139-S1/frameworks/00dc552a-2133-45b6-b2f1-15651df01139-0001/executors/instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001/runs/a06b3776-7b56-4ebc-9926-144ae795877a' for user 'nobody' Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.501479 2084 paths.cpp:748] Creating sandbox '/var/lib/mesos/slave/meta/slaves/00dc552a-2133-45b6-b2f1-15651df01139-S1/frameworks/00dc552a-2133-45b6-b2f1-15651df01139-0001/executors/instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001/runs/a06b3776-7b56-4ebc-9926-144ae795877a' Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.501626 2084 slave.cpp:8997] Launching executor 'instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001' of framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 with resources [{"allocation_info":{"role":"slave_public"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"disk","scalar":{"value":10.0},"type":"SCALAR"},{"allocation_info":{"role":"slave_public"},"name":"ports","ranges":{"range":[{"begin":13463,"end":13463}]},"type":"RANGES"}] in work directory 
'/var/lib/mesos/slave/slaves/00dc552a-2133-45b6-b2f1-15651df01139-S1/frameworks/00dc552a-2133-45b6-b2f1-15651df01139-0001/executors/instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001/runs/a06b3776-7b56-4ebc-9926-144ae795877a' Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.501796 2084 jwt_secret_generator.cpp:71] Generated token 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJjaWQiOiJhMDZiMzc3Ni03YjU2LTRlYmMtOTkyNi0xNDRhZTc5NTg3N2EiLCJlaWQiOiJpbnN0YW5jZS1pbnRlZ3JhdGlvbi10ZXN0LWJmODUxN2VjNTcxMTQ1NDY5MWY2ZjFjMjgxODRmYTA3LmI2YmY2Zjg5LWQ4ODEtMTFlOC05YmYzLTcwYjNkNTgwMDAwMSIsImZpZCI6IjAwZGM1NTJhLTIxMzMtNDViNi1iMmYxLTE1NjUxZGYwMTEzOS0wMDAxIn0.im3hKnkvU-ztJIBU8-BLfRjHLzxP0-7BRg_egQNphO8' for principal '{"claims":{"cid":"a06b3776-7b56-4ebc-9926-144ae795877a","fid":"00dc552a-2133-45b6-b2f1-15651df01139-0001","eid":"instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001"}}' using secret (base64) 'e0s2LVp0ZDNDdklXYDQkekx0Km0tTiQ/dzNvNzYoOVE0RXJEIVQ5Ul9UOHlHNlY7S09zS019SkVkanA0KTBOM2ZtXnQ9WVBHckR2anhFbTs3fH1zTkA9Y1pzcUwpYWs+YFlgV25QcUU7JGZISDBELSNePT1HIWNuPT8/QjMhfCRiSGZnST9qVT9jZGRDfT56QmxCaFlKMTcrV2Z8S2g/N3dYOSpXaV5fYXRNc3NRc3h8WUR3aWVaUUJuY0FNQG9gdlpvZlNVSDVBLVdqX3dOWmQ4dE8tVmFfbXItY3lgY0UwYGs0XmNuVzJmJGgpaVp1cW8kIXV8OTdPcGd3bUBQe0F+fXR2fEZGVSlUflcqZ0tWUHdvQnxPWWxqemFWPlJDNGhpaVJOMStFdStDbzhSWVhSWFRXfFFFUnBKME0/KHo=' Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.502132 2084 slave.cpp:3049] Queued task group containing tasks [ integration-test-bf8517ec5711454691f6f1c28184fa07.instance-b6bf6f89-d881-11e8-9bf3-70b3d5800001.app-bf8517ec5711454691f6f1c28184fa07 ] for executor 'instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001' of framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.503170 2084 slave.cpp:3530] Launching container a06b3776-7b56-4ebc-9926-144ae795877a for executor 'instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001' of framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.504715 2085 containerizer.cpp:1282] Starting container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.505230 2085 containerizer.cpp:3124] Transitioning the state of container a06b3776-7b56-4ebc-9926-144ae795877a from PROVISIONING to PREPARING Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.519476 2081 memory.cpp:478] Started listening for OOM events for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.519562 2081 memory.cpp:590] Started listening on 'low' memory pressure events for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.519753 2081 memory.cpp:590] Started listening on 'medium' memory pressure events for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.519860 2081 memory.cpp:590] Started 
listening on 'critical' memory pressure events for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.532474 2082 cpu.cpp:92] Updated 'cpu.shares' to 102 (cpus 0.1) for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.532487 2086 memory.cpp:198] Updated 'memory.soft_limit_in_bytes' to 32MB for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.532569 2082 cpu.cpp:112] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.532605 2086 memory.cpp:227] Updated 'memory.limit_in_bytes' to 32MB for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.535212 2082 secret.cpp:309] 0 secrets have been resolved for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.634297 2080 switchboard.cpp:316] Container logger module finished preparing container a06b3776-7b56-4ebc-9926-144ae795877a; IOSwitchboard server is not required Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.637706 2087 linux_launcher.cpp:492] Launching container a06b3776-7b56-4ebc-9926-144ae795877a and cloning with namespaces CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNET Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.648129 2086 containerizer.cpp:2046] Checkpointing container's forked pid 25844 to '/var/lib/mesos/slave/meta/slaves/00dc552a-2133-45b6-b2f1-15651df01139-S1/frameworks/00dc552a-2133-45b6-b2f1-15651df01139-0001/executors/instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001/runs/a06b3776-7b56-4ebc-9926-144ae795877a/pids/forked.pid' Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.649010 2086 containerizer.cpp:3124] Transitioning the state of container a06b3776-7b56-4ebc-9926-144ae795877a from PREPARING to ISOLATING Oct 25 18:13:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:46.690057 2080 cni.cpp:960] Bind mounted '/proc/25844/ns/net' to '/run/mesos/isolators/network/cni/a06b3776-7b56-4ebc-9926-144ae795877a/ns' for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:47 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:47.922582 2080 cni.cpp:1394] Got assigned IPv4 address '172.31.254.22/24' from CNI network 'mesos-bridge' for container a06b3776-7b56-4ebc-9926-144ae795877a Oct 25 18:13:47 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:13:47.924751 2083 cni.cpp:1100] Unable to find DNS nameservers for container a06b3776-7b56-4ebc-9926-144ae795877a, using host '/etc/resolv.conf' Oct 25 18:23:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:23:46.504267 2085 slave.cpp:6793] Terminating executor 'instance-integration-test-bf8517ec5711454691f6f1c28184fa07.b6bf6f89-d881-11e8-9bf3-70b3d5800001' of 
framework 00dc552a-2133-45b6-b2f1-15651df01139-0001 because it did not register within 10mins Oct 25 18:23:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:23:46.504505 2085 containerizer.cpp:2457] Destroying container a06b3776-7b56-4ebc-9926-144ae795877a in ISOLATING state Oct 25 18:23:46 dcos-e2e-fa3ba6a6-56ea-4369-a0dc-a9e45e03aaf8-agent-0 mesos-agent[2076]: I1025 18:23:46.504531 2085 containerizer.cpp:3124] Transitioning the state of container a06b3776-7b56-4ebc-9926-144ae795877a from ISOLATING to DESTROYING

It looks like, in this particular case, the container was again stuck in the ISOLATING state. |
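For anyone reproducing this, a minimal sketch of the kind of log filtering used above (dump the mesos-agent journal and keep only lines mentioning a given container ID) is shown below. It is a hypothetical helper, not part of the DC/OS test suite; the dcos-mesos-slave.service unit name is assumed to be the agent unit on the node being inspected.

import subprocess
import sys


def agent_log_lines_for(container_id, unit="dcos-mesos-slave.service"):
    # Dump the agent journal and keep only lines that mention the container ID.
    out = subprocess.run(
        ["journalctl", "-u", unit, "--no-pager", "-o", "short-iso"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if container_id in line]


if __name__ == "__main__":
    for line in agent_log_lines_for(sys.argv[1]):
        print(line)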
Comment by Mergebot [ 25/Oct/18 ] |
@gaston overrode teamcity/dcos/test/azure/arm status of dcos/dcos/pull/3679 (Title: bump marathon to 1.5.12, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Qian Zhang (Inactive) [ 25/Oct/18 ] |
Thanks, Greg Mann. I think that is MESOS-9334; we will fix it soon. |
Comment by Mergebot [ 30/Oct/18 ] |
@philip overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3696 (Title: [1.11] Merge dependent tests into one big scenario, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 30/Oct/18 ] |
@philip overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3589 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 30/Oct/18 ] |
@klueska overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3663 (Title: bump dcos-log, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 30/Oct/18 ] |
@klueska overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3644 (Title: Bump dcos-log, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 31/Oct/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3673 (Title: [master] Mergebot Automated Train PR - 2018-Oct-23-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 31/Oct/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3680 (Title: [1.12] Mergebot Automated Train PR - 2018-Oct-19-12-00, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Nov/18 ] |
@gauripowale overrode teamcity/dcos/test/dcos-docker/static status of dcos/dcos/pull/3693 (Title: Add fetch_cluster_logs.bash, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Nov/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3695 (Title: [master] Mergebot Automated Train PR - 2018-Oct-23-12-00, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Nov/18 ] |
@philip overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3735 (Title: (1.12) Fix Telegraf dcos_statsd plugin race condition, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Nov/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3727 (Title: Add fetch_cluster_logs.bash, Branch: 1.10) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 01/Nov/18 ] |
@alex overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3731 (Title: [1.12] Bump Mesos to nightly 1.7.x cb07b69, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 06/Nov/18 ] |
@jonathangiddy overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3712 (Title: Mh/java 8u192 1.12, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 06/Nov/18 ] |
@jonathangiddy overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3712 (Title: Mh/java 8u192 1.12, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 06/Nov/18 ] |
@jonathangiddy overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3681 (Title: packages/java: Update to 8u192, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 06/Nov/18 ] |
@charlesprovencher overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3760 (Title: Add missing error check, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 06/Nov/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3760 (Title: Add missing error check, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Nov/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3750 (Title: Fix TLS handshake, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Nov/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3743 (Title: [1.11] Bump dcos-net, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Nov/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3743 (Title: [1.11] Bump dcos-net, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Nov/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3743 (Title: [1.11] Bump dcos-net, Branch: 1.11) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 07/Nov/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3771 (Title: 1.12 train 11/06/2018, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/Nov/18 ] |
@klueska overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/Nov/18 ] |
@gauripowale overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3776 (Title: Bump cosmos-enterprise and package registry and add a new integration test with spark fwk, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/Nov/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3734 (Title: [1.12] Bump dcos-net, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 08/Nov/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3734 (Title: [1.12] Bump dcos-net, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 09/Nov/18 ] |
@jonathangiddy overrode teamcity/dcos/test/aws/onprem/static status of dcos/dcos/pull/3744 (Title: Collect ZooKeeper Metrics using DC/OS Telegraf, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Nov/18 ] |
@alex overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Nov/18 ] |
@alex overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Nov/18 ] |
@alex overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3603 (Title: Add external Mesos master/agent logs in the bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Nov/18 ] |
@branden overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3430 (Title: Add SELinux details to diagnostics bundle, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 13/Nov/18 ] |
@charlesprovencher overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3777 (Title: Adding required ending forwardslash to download_url, Branch: master) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 14/Nov/18 ] |
@jp overrode teamcity/dcos/test/dcos-e2e/docker/static/strict status of mesosphere/dcos-enterprise/pull/3812 (Title: [1.12] Mergebot Automated Train PR - 2018-Nov-14-02-38, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Mergebot [ 16/Nov/18 ] |
@sergeyurbanovich overrode teamcity/dcos/test/aws/onprem/static/strict status of mesosphere/dcos-enterprise/pull/3821 (Title: [1.12] Bump dcos-net, Branch: 1.12) with the failure noted in this JIRA. Here are the TeamCity failure Logs for reference. |
Comment by Jan-Philip Gehrcke (Inactive) [ 26/Nov/18 ] |
We have not had an override for 10 days. This is the longest period of silence in many months, while the DC/OS pull request throughput remained roughly constant. We can therefore conclude that the rate at which the underlying instabilities create problems has been reduced significantly (probably by more than one order of magnitude, although that is hard to quantify precisely from such a short observation window). This is a major success: we have effectively addressed all instabilities resulting in this symptom. I think it's a good time to close this ticket (after about a year!). If we ever observe a test_vip instability again, we should track the symptom(s) in separate JIRA ticket(s). |
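One rough way to sanity-check how much can be concluded from a 10-day silence, assuming overrides arrive as a Poisson process, is sketched below. The "about one override per day" baseline is an illustrative assumption, not a number taken from this ticket.

import math

old_rate_per_day = 1.0   # assumed earlier override rate (illustrative only)
silent_days = 10         # observed period with zero overrides

# Probability of 10 silent days if the rate had not changed at all:
p_if_unchanged = math.exp(-old_rate_per_day * silent_days)

# Largest new rate that still has >= 5% chance of producing 10 silent days:
upper_bound_rate = -math.log(0.05) / silent_days

print(f"P(10 silent days at old rate): {p_if_unchanged:.2e}")
print(f"95% upper bound on new rate:   {upper_bound_rate:.2f} overrides/day")

Under that assumption a 10-day silence is strong evidence of a real reduction, but by itself it only bounds the new rate at roughly 0.3 overrides/day, which is why a longer observation window is needed to confirm the order-of-magnitude estimate.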
Comment by Jan-Philip Gehrcke (Inactive) [ 26/Nov/18 ] |
For posterity, the following graph shows the evolution of the override command rate for |
Comment by Jan-Philip Gehrcke (Inactive) [ 26/Nov/18 ] |
Closing. Thanks to everyone who helped fix the underlying instabilities (most of which were in DC/OS, not in the test method itself!) |
Comment by Jan-Philip Gehrcke (Inactive) [ 15/Dec/18 ] |
The story went on with DCOS-45799 and DCOS-46220, but we seem to have it under control! |