[DCOS_OSS-2362] Lashup: Gradually remove unrelated nodes. Created: 10/Apr/18  Updated: 04/Dec/19  Resolved: 04/Dec/19

Status: Resolved
Project: DC/OS
Component/s: networking
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Medium
Reporter: Justin Lee (Inactive)
Assignee: Deepak Goel
Resolution: Won't Do  
Labels: issuetype:improvement, networking
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by DCOS_OSS-4165 Duplicated Cryptographic Cluster ID s... Resolved
Relates
Team: DELETE Networking Team
Story Points: 13

 Description   

Lashup uses a gossip protocol to keep track of nodes that it shares information with.

It does not currently remove nodes that are no longer present.

This may cause issues in the following situation:

Cluster X: Node A, B, C

Cluster Y: Node D, E, F

Remove Node C from Cluster X. Several days later, create a new node in Cluster Y with the same IP address that Node C had.

Nodes A and B still have Node C's IP address in their gossip state, so they will continue to think they're supposed to talk to the node at that address, and the two clusters will get bridged.
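
To make the failure mode concrete, here is a toy simulation of the scenario. It is only a sketch: it assumes each node's gossip view is a plain set of IP addresses and that a gossip exchange merges both views. The gossip() helper and the views dictionary are made up for illustration and are not Lashup's API.

```python
# Toy model: a node's gossip view is the set of addresses it knows about.
# gossip() is a hypothetical helper, not part of Lashup.

def gossip(views, a, b):
    """Exchange between addresses a and b: both end up with the union."""
    merged = views[a] | views[b]
    views[a] = set(merged)
    views[b] = set(merged)

# Cluster X (A, B, C) and Cluster Y (D, E, F), each fully converged.
views = {ip: {"A", "B", "C"} for ip in "ABC"}
views.update({ip: {"D", "E", "F"} for ip in "DEF"})

# Node C is removed from Cluster X, but A and B never forget its address.
del views["C"]

# Days later a new node joins Cluster Y and gets the old address "C".
views["C"] = {"C", "D", "E", "F"}
for ip in "DEF":
    views[ip].add("C")

# A still believes "C" is a peer and gossips with it, merging the views.
gossip(views, "A", "C")
# Ordinary gossip inside each cluster then spreads the merged view.
gossip(views, "A", "B")
gossip(views, "C", "D")

# Addresses from Cluster Y that have leaked into Node A's view.
bridged = views["A"] & {"D", "E", "F"}
```

In this model a single exchange between Node A and the reused address is enough for every Cluster Y address to show up in Cluster X's views, and the reverse.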

 

Proposed solution:

  • On every node, at some (random?) interval T, look at Mesos state.
  • For each node IP that is in the local lashup gossip state but not in Mesos state, perform the following:
    • Mark it as 'absent' (or something)
    • After X intervals of length T, perform the following:
      • Remove it from the local gossip list
      • Stop talking with it and block inbound connections from it (blacklist it)
      • Stop advertising it to other gossip neighbors
      • Do not propagate the removal (to prevent inadvertent removal from the other cluster)
    • After some period of A * X * T, remove it from the blacklist (to support situations where the node might be added back to the same cluster).
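
The steps above could be sketched roughly as follows. This is a minimal illustration, not Lashup code: the Pruner class and its field names are hypothetical, the Mesos state is passed in as a plain set of IPs, and time is counted in whole intervals of T. Two details are assumptions the proposal does not settle: a peer that reappears in Mesos state has its absence counter reset, and the A * X * T blacklist period is counted from the moment of removal.

```python
from dataclasses import dataclass, field

@dataclass
class Pruner:
    """Hypothetical per-node bookkeeping for the proposal above."""
    x: int  # intervals a peer may stay absent before removal (X)
    a: int  # the blacklist lasts a * x intervals (A)
    gossip_peers: set = field(default_factory=set)
    absent_for: dict = field(default_factory=dict)  # ip -> intervals absent
    blacklist: dict = field(default_factory=dict)   # ip -> intervals left

    def tick(self, mesos_ips):
        """Run once per interval T with the set of IPs in Mesos state."""
        mesos_ips = set(mesos_ips)
        # Expire blacklist entries so a node can rejoin the same cluster.
        for ip in list(self.blacklist):
            self.blacklist[ip] -= 1
            if self.blacklist[ip] <= 0:
                del self.blacklist[ip]
        # Assumption: a peer that reappears in Mesos state is no longer absent.
        for ip in list(self.absent_for):
            if ip in mesos_ips:
                del self.absent_for[ip]
        # Count absences; after x intervals remove and blacklist the peer.
        for ip in list(self.gossip_peers - mesos_ips):
            self.absent_for[ip] = self.absent_for.get(ip, 0) + 1
            if self.absent_for[ip] >= self.x:
                self.gossip_peers.discard(ip)
                del self.absent_for[ip]
                self.blacklist[ip] = self.a * self.x
                # Nothing is sent to neighbors: the removal stays local.

    def accepts(self, ip):
        """Inbound connections from blacklisted addresses are refused."""
        return ip not in self.blacklist

    def advertised(self):
        """Peers this node would advertise to its gossip neighbors."""
        return set(self.gossip_peers)
```

For example, with x=2 and a=3, a peer missing from Mesos state for two consecutive ticks is dropped from the local list and refused for the next six ticks. After that it may connect again.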

Or something like this.

Thoughts?



 Comments   
Comment by cbuben [ 23/Aug/18 ]

We appear to be affected by the "cluster bridging" issue.  I know nothing about the implementation details or capabilities of lashup, so this is rife with assumptions - apologies in advance.

A group cleanup mechanism sounds important/required, yes.  But isn't unintended bridging fundamentally caused by the absence of a group communication authentication mechanism, whereby received communications can be authenticated as originating from a current member of the cluster?

Seems like encryption or signing of group communications with a cluster-scoped secret would eliminate the possibility of unintended bridging.  Does this capability already exist?  Would this approach make sense?
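
For illustration only, signing with a cluster-scoped secret might look like the sketch below. It says nothing about whether Lashup has such a capability. The sign() and verify() helpers and the message layout (a 32-byte tag followed by the payload) are hypothetical, and distributing the secret to cluster members is assumed to happen elsewhere.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256

def sign(secret: bytes, payload: bytes) -> bytes:
    """Prefix the payload with an HMAC-SHA256 tag keyed by the cluster secret."""
    tag = hmac.new(secret, payload, hashlib.sha256).digest()
    return tag + payload

def verify(secret: bytes, message: bytes):
    """Return the payload if the tag matches this cluster's secret, else None."""
    tag, payload = message[:TAG_LEN], message[TAG_LEN:]
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    return payload if hmac.compare_digest(tag, expected) else None
```

A node in Cluster X that receives a message signed with Cluster Y's secret would get None from verify() and could drop it before touching its gossip state.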

Comment by Deepak Goel [ 27/Aug/18 ]

cbuben You are right. The current implementation implicitly relies on reachability to establish group membership, and there is no notion of explicit group membership, which leads to the "cluster bridging" issue (I liked this name). Authentication would be one way of solving it. Another way would be a notion of cluster identity that is known to each member of the group.
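
As a rough sketch of the cluster identity idea: every gossip message carries the sender's cluster ID, and a receiver ignores membership information from any other cluster. The GossipMessage type and merge_if_same_cluster() are hypothetical names used only for this example. Note that an ID check alone guards against accidental bridging but does not authenticate the sender.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GossipMessage:
    """Hypothetical message shape; the field names are assumptions."""
    cluster_id: str
    sender_ip: str
    members: frozenset

def merge_if_same_cluster(local_cluster_id, local_members, msg):
    """Merge the sender's member list only when the cluster IDs match."""
    if msg.cluster_id != local_cluster_id:
        # Message from another cluster: keep the local view unchanged.
        return set(local_members)
    return set(local_members) | set(msg.members)
```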

Comment by Justin Lee (Inactive) [ 05/Sep/18 ]

I vote cluster identity, as a minimum. We already have a cluster cryptographic ID - we could continue to utilize this.

Cluster authentication would be ideal, maybe as a second part of this request.
