[DCOS_OSS-4575] Add timeout while trying to recover overlay Created: 06/Dec/18  Updated: 05/Jun/19  Resolved: 27/Mar/19

Status: Resolved
Project: DC/OS
Component/s: networking
Affects Version/s: DC/OS 1.10.9, DC/OS 1.11.8, DC/OS 1.12.0
Fix Version/s: DC/OS 1.12.4, DC/OS 1.13.0

Type: Bug
Priority: Medium
Reporter: Deepak Goel
Assignee: Sergey Urbanovich (Inactive)
Resolution: Done  
Labels: networking
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Relates
Team: DELETE Networking Team
Sprint: Networking: RI-10 Sprint 38, Networking: RI-10 Sprint 39, Networking: RI-11 Sprint 40, Networking: RI-11 Sprint 41, Networking: RI-12 Sprint 42
Story Points: 13

 Description   

While debugging COPS-4167, it was discovered that the Mesos overlay master has no timeout [1] when recovering the overlay. As a result, the overlay master sometimes hangs at the recovery stage, and manual intervention is required to bring it out of this state. A similar implementation in Mesos has a timeout [2].

[1] https://github.com/dcos/dcos-mesos-modules/blob/master/overlay/master.cpp#L1521
[2] https://github.com/apache/mesos/blob/master/src/master/registrar.cpp#L342
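
To illustrate the idea of bounding the recovery wait, here is a minimal sketch that uses only the C++ standard library. It is not the code from either linked file: the actual module is built on libprocess, and `OverlayState`, `RecoveryStatus`, and `awaitRecovery` are hypothetical names introduced for this example. The point is only that a caller waiting on recovery gets control back after a deadline instead of blocking indefinitely.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>

// Hypothetical stand-in for whatever state the overlay master recovers.
struct OverlayState {
  std::string data;
};

enum class RecoveryStatus { Recovered, TimedOut };

// Waits at most `timeout` for `recovery` to complete. On success the
// recovered value is stored in `*result`. On timeout the caller gets
// control back instead of blocking forever and can fail or retry.
template <typename Rep, typename Period>
RecoveryStatus awaitRecovery(
    std::future<OverlayState>& recovery,
    const std::chrono::duration<Rep, Period>& timeout,
    OverlayState* result)
{
  assert(result != nullptr);
  assert(recovery.valid());

  // A future that is not ready (or is deferred) counts as timed out.
  if (recovery.wait_for(timeout) != std::future_status::ready) {
    return RecoveryStatus::TimedOut;
  }

  *result = recovery.get();
  return RecoveryStatus::Recovered;
}
```

A caller that receives `RecoveryStatus::TimedOut` can then log the failure and retry or abort, which is the step that otherwise requires manual intervention.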



 Comments   
Comment by Sergey Urbanovich (Inactive) [ 03/Jan/19 ]

https://github.com/dcos/dcos-mesos-modules/pull/82
https://github.com/dcos/dcos/pull/4298/files#diff-1a4c148ff2c418d6ff5e14ec904acacbR9

Comment by Sergey Urbanovich (Inactive) [ 20/Feb/19 ]

[master]
https://github.com/dcos/dcos-mesos-modules/pull/98

[1.12]
https://github.com/dcos/dcos-mesos-modules/pull/99

Comment by Sergey Urbanovich (Inactive) [ 21/Feb/19 ]

It is really challenging to backport my patch to the 1.11 and, especially, the 1.10 branches.

$ git diff --stat origin/1.11..origin/1.12 -- overlay tests/overlay_tests.cpp
 overlay/agent.cpp       | 211 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
 overlay/agent.hpp       |  17 +++++++---
 overlay/master.cpp      |  36 +++++++++++++++------
 overlay/network.hpp     |  58 +++++++++++++++++-----------------
 overlay/overlay.proto   |   8 +++++
 tests/overlay_tests.cpp | 181 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------
 6 files changed, 361 insertions(+), 150 deletions(-)
$ git diff --stat origin/1.10..origin/1.12 -- overlay tests/overlay_tests.cpp
 overlay/agent.cpp       | 256 ++++++++++++++++++++++++-------------
 overlay/agent.hpp       |  17 ++-
 overlay/master.cpp      | 848 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
 overlay/messages.proto  |  11 ++
 overlay/network.hpp     | 423 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 overlay/overlay.proto   |  35 ++++-
 tests/overlay_tests.cpp | 660 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 7 files changed, 1899 insertions(+), 351 deletions(-)

Comment by Lisa Gunn (Inactive) [ 05/Jun/19 ]

This ticket is not currently flagged to be included in the 1.12.4 release notes. If this ticket should be published in the Release Notes, then do the following:

  1. Click Edit, select Release Notes, then set Include in Release Notes = Yes.
  2. Write a brief Description about the root cause, symptoms, and fix information.
  3. (Optionally) Add the label `external` to help identify issues that are user-facing.

If this ticket is internal-facing, please explicitly set Include in Release Notes = No and add the label `internal` (indicating that this is not user-facing information).

Comment by Lisa Gunn (Inactive) [ 05/Jun/19 ]

Set the Include in Release Notes = Yes flag. DRAFT content:

  • Adds a timeout to the Mesos network overlay module to prevent the overlay master from getting stuck in RECOVERING mode (COPS-4167, COPS-4747, DCOS_OSS-4575, DCOS-47930).

Generated at Wed May 18 08:31:06 CDT 2022 using JIRA 7.8.4#78004-sha1:5704c55c9196a87d91490cbb295eb482fa3e65cf.