Details

    • Type: Task
    • Status: Open
    • Priority: Medium
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: zookeeper
    • Labels:

      Description

      Our current ZooKeeper restore method does not guarantee a consistent state after restoring from a backup. The problem is that the backup contains ephemeral nodes, and we currently restore everything, INCLUDING those ephemeral nodes.

      Quick explanation of how ZooKeeper sessions and ephemeral nodes work:
      http://zookeeper-user.578899.n2.nabble.com/Why-are-ephemeral-nodes-written-to-disk-td7583403.html

      This can cause weird behavior, which is nicely outlined in the following article:
      https://www.elastic.co/blog/zookeeper-backup-a-treatise

      The restore was implemented this way because Mesosphere support engineers did ZooKeeper backups like this in the past. The article also describes what would be necessary to reach a consistent state after restoration.

      Deleting all ephemeral nodes from the backup before restoring (or even when taking the backup) would suffice to reach a consistent backup/restore procedure.
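
      A minimal sketch of that filtering step, not the actual backup tooling. It assumes the backup has already been parsed into a mapping from znode path to a record carrying the owner session id from the znode's stat (0 for persistent nodes). The `BackupNode` type and `strip_ephemeral_nodes` function are hypothetical names used only for illustration, and newer node types such as container or TTL nodes are not considered.

```python
from typing import Dict, NamedTuple


class BackupNode(NamedTuple):
    """Hypothetical in-memory form of one backed-up znode (an assumption)."""
    data: bytes
    ephemeral_owner: int  # owning session id; 0 means the znode is persistent


def strip_ephemeral_nodes(backup: Dict[str, BackupNode]) -> Dict[str, BackupNode]:
    """Return a copy of the backup that contains only persistent znodes.

    Ephemeral znodes cannot have children, so dropping them never
    orphans another node in the tree.
    """
    return {
        path: node
        for path, node in backup.items()
        if node.ephemeral_owner == 0
    }


# Example backup: one ephemeral node owned by a (made-up) session id.
backup = {
    "/app": BackupNode(b"config", 0),
    "/app/leader": BackupNode(b"host-1", 0x1000000000001),
    "/app/members": BackupNode(b"", 0),
}
cleaned = strip_ephemeral_nodes(backup)
print(sorted(cleaned))
```

      The same predicate could be applied either when taking the backup or just before restoring it; doing it at backup time keeps stale session state out of the stored artifact entirely.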

      There has been prior work in this regard by a former Mesosphere engineer:
      https://github.com/phunt/zk-txnlog-tools

      This script parses transaction logs. We would need to utilize/extend it to delete all the ephemeral nodes.

              People

              • Assignee:
                Unassigned
                Reporter:
                timweidner Tim Weidner (Inactive)
                Team:
                Mesosphere
                Watchers:
                Dominik Dary, Jan-Philip Gehrcke (Inactive), jongiddy, Tim Weidner (Inactive)
