      I'm seeing serious problems with agent removal and unreachable strategy handling in the latest stable marathon and mesos versions, 1.4.3 and 1.1.1 respectively.

      I started by making sure all of our app definitions included the new unreachable strategy settings. To do that, I ran a quick script to PATCH all applications via the API with the following aggressive unreachable strategy:

        "unreachableStrategy": {
          "inactiveAfterSeconds": 60,
          "expungeAfterSeconds": 90
        }
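
      For reference, the bulk update described above could be sketched roughly as follows. This is not the actual script I ran, just a minimal illustration. It assumes marathon's REST API is reachable at a base URL you pass in (the host name in the usage note is hypothetical), that GET /v2/apps returns an "apps" list, and that the version in use accepts a partial update via PATCH on /v2/apps/<app id>. The method is a parameter in case only PUT is accepted on your version.

```python
import json
import urllib.request

# The aggressive strategy from this report.
UNREACHABLE_STRATEGY = {"inactiveAfterSeconds": 60, "expungeAfterSeconds": 90}


def build_update_request(base_url, app_id, method="PATCH"):
    """Build (but do not send) a partial-update request for one app."""
    body = json.dumps({"unreachableStrategy": UNREACHABLE_STRATEGY}).encode("utf-8")
    # Marathon app ids start with "/", e.g. "/dummy-app".
    url = base_url.rstrip("/") + "/v2/apps/" + app_id.lstrip("/")
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def update_all_apps(base_url, method="PATCH"):
    """Fetch every app id and apply the strategy to each one."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/v2/apps") as resp:
        apps = json.load(resp)["apps"]
    for app in apps:
        req = build_update_request(base_url, app["id"], method)
        with urllib.request.urlopen(req) as resp:
            print(app["id"], resp.status)
```

      Calling update_all_apps("http://marathon.example:8080") would then walk every app. Each update triggers a deployment, so on a large cluster it may be worth pausing between requests.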

      All of our health checks are MESOS_HTTP; none of them are the HTTP checks that marathon runs itself anymore.

      Steps to reproduce:

      • Leave marathon's `reconciliation_interval` command line setting unset (it defaults to 10 minutes)
      • Have a cluster with at least two slaves/agents, so one can be shut off/terminated
      • Launch a few dummy apps in marathon into the cluster with the above unreachable strategy settings
      • Terminate/disconnect a slave (kill -USR1 <mesos_slave_pid>; sleep 10; systemctl stop mesos-slave)
      • Notice that although marathon indicates "0 of 1 running" or similar, the app continues to show "Healthy" for the health check even though it isn't reachable at all. It won't restart until the next time marathon runs task reconciliation at the earliest, which with the default interval can mean waiting up to 10 minutes.
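
      To spot the mismatch from the last step without watching the UI, something like the sketch below could be run against the "app" object returned by GET /v2/apps/<app id>. It assumes the response carries the "instances", "tasksRunning" and "tasksHealthy" counters and that they reflect the same state the UI shows; I haven't verified the latter for this failure mode. The function and key names in the result are made up for illustration.

```python
def summarize_app(app):
    """Flag the inconsistent state described above from one app object."""
    instances = app.get("instances", 0)
    running = app.get("tasksRunning", 0)
    healthy = app.get("tasksHealthy", 0)
    return {
        # Fewer tasks running than requested, e.g. "0 of 1 running".
        "under_provisioned": running < instances,
        # More tasks counted as healthy than are actually running.
        "health_stale": healthy > running,
    }
```

      For the state described above (one instance requested, none running, one still counted healthy), both flags come back True.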

      It also seems like only the "expungeAfterSeconds" setting is being honored; I'm not sure "inactiveAfterSeconds" is actually working. That is only a hunch, though, and I haven't been able to prove it yet.

      A not-so-great workaround is to set marathon's reconciliation interval to a very low value, like 60 seconds, which seems to get things back in sync more quickly. You can also force marathon to reconcile by simply restarting it; it then seems to honor the unreachable settings sooner.

      Let me know if I can provide any other useful info!


