The MarathonScheduler currently does not handle the slaveLost callback other than by logging it. As janisz pointed out below, there might be value in performing an explicit reconciliation upon receiving this message. A simple approach would be to trigger an explicit reconciliation for all tasks located on an agent that was reported unreachable.

      In big clusters, it's questionable whether this simple approach is helpful if we don't know how many agents will be reported unreachable. We also need to understand under which circumstances Mesos would not reliably send status updates for the tasks on the unreachable agent. If we receive a task status update anyway, there's no need to trigger a reconciliation.
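
      The simple approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Marathon's actual implementation: `TrackedTask`, `AgentLostReconciler`, `onAgentLost`, and the `alreadyUpdated` set are hypothetical names, and the `Consumer` stands in for the driver call that requests explicit reconciliation (with the Mesos Java bindings this would presumably wrap `SchedulerDriver.reconcileTasks`). Tasks that already received a status update are skipped, since no reconciliation is needed for them.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for a tracked task; not one of Marathon's real types.
final class TrackedTask {
    final String taskId;
    final String agentId;

    TrackedTask(String taskId, String agentId) {
        this.taskId = taskId;
        this.agentId = agentId;
    }
}

// Hypothetical helper that a slaveLost handler could delegate to.
final class AgentLostReconciler {
    // Assumption: stands in for the driver call that requests explicit
    // reconciliation for the given task IDs.
    private final Consumer<List<String>> requestReconciliation;

    AgentLostReconciler(Consumer<List<String>> requestReconciliation) {
        this.requestReconciliation = requestReconciliation;
    }

    // Collects the tasks located on the lost agent, skips those for which a
    // status update has already arrived, and requests explicit reconciliation
    // for the remainder. Returns the task IDs that were submitted.
    List<String> onAgentLost(String lostAgentId,
                             Collection<TrackedTask> knownTasks,
                             Set<String> alreadyUpdated) {
        List<String> toReconcile = new ArrayList<>();
        for (TrackedTask task : knownTasks) {
            if (task.agentId.equals(lostAgentId) && !alreadyUpdated.contains(task.taskId)) {
                toReconcile.add(task.taskId);
            }
        }
        // Avoid sending an empty request; nothing on this agent needs reconciling.
        if (!toReconcile.isEmpty()) {
            requestReconciliation.accept(toReconcile);
        }
        return toReconcile;
    }
}
```

      Because the sketch sends one request per lost agent, a large outage would still produce many requests, so the concern about big clusters remains open here.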

      Note that the slaveLost message itself is not reliably delivered. From the related Mesos docs:

      Invoked when a slave has been determined unreachable (e.g.,
      machine failure, network partition). Most frameworks will need to
      reschedule any tasks launched on this slave on a new slave.
      NOTE: This callback is not reliably delivered. If a host or
      network failure causes messages between the master and the
      scheduler to be dropped, this callback may not be invoked.

      Original Text:
      Currently Marathon does not handle the agent lost message. When it receives this message, it should perform reconciliation for the tasks that resided on that agent.

      It's probably worthwhile to improve our reconciliation doc to suggest that
      frameworks do explicit reconciliation on slave lost messages as well.




            • Assignee:
              janisz janisz
              daltonmatos, janisz