Marathon / MARATHON-7828

Resident tasks get killed after the first restart due to unhealthy state if the first health check fails



      We have resident tasks that take a long time to load and become ready. We have a grace period configured for them, so that when a task starts it can load its state and all health check failures are ignored during that period. But if the task is restarted, or otherwise becomes unhealthy, the grace period is ignored and the consecutive failures counter is not reset until the first successful check.


      We looked into the source code and found that in `HealthCheckActor`, items are never removed from `healthByInstanceId`. This means that `HealthCheckActor` keeps the state for an instance id forever and never resets the `firstSuccess` field of the `Health` class. This especially affects resident tasks, which keep their instance id across task restarts. A resident task that retains its instance id and is restarted after failing health checks therefore starts with the old failure counters already in place, even though it was just started.
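
The behavior described above can be sketched with a simplified model. This is not Marathon's actual code: only `healthByInstanceId`, `Health` and `firstSuccess` come from the report, while `HealthTracker`, `onSuccess`, `onFailure`, `onTaskRestarted` and the millisecond timestamps are hypothetical names chosen for illustration. The sketch assumes that a failure is ignored only while the task is inside its grace period and has never had a successful check.

```scala
// Simplified model of the reported behavior, NOT Marathon's implementation.
// Only Health, firstSuccess and healthByInstanceId are names from the report.
final case class Health(
    firstSuccess: Option[Long] = None, // timestamp (ms) of the first passing check
    consecutiveFailures: Int = 0
)

final class HealthTracker(gracePeriodMs: Long, maxConsecutiveFailures: Int) {
  // State is keyed by instance id; resident tasks reuse the same id after a restart.
  private var healthByInstanceId = Map.empty[String, Health]

  // Assumption: failures are ignored only inside the grace period
  // AND only while no successful check has been recorded yet.
  private def ignoreFailure(h: Health, startedAt: Long, now: Long): Boolean =
    now < startedAt + gracePeriodMs && h.firstSuccess.isEmpty

  def onSuccess(id: String, now: Long): Unit = {
    val h = healthByInstanceId.getOrElse(id, Health())
    val updated = h.copy(
      firstSuccess = h.firstSuccess.orElse(Some(now)),
      consecutiveFailures = 0
    )
    healthByInstanceId = healthByInstanceId.updated(id, updated)
  }

  /** Records a failed check; returns true if the task should be killed. */
  def onFailure(id: String, startedAt: Long, now: Long): Boolean = {
    val h = healthByInstanceId.getOrElse(id, Health())
    if (ignoreFailure(h, startedAt, now)) false
    else {
      val updated = h.copy(consecutiveFailures = h.consecutiveFailures + 1)
      healthByInstanceId = healthByInstanceId.updated(id, updated)
      updated.consecutiveFailures >= maxConsecutiveFailures
    }
  }

  // Hypothetical fix: forget the stale entry when the task behind an id restarts,
  // so the new task gets a fresh Health and its grace period applies again.
  def onTaskRestarted(id: String): Unit =
    healthByInstanceId = healthByInstanceId - id
}
```

In this model, a task that passes one check and later fails twice keeps `firstSuccess` set and `consecutiveFailures = 2`. If it is restarted under the same instance id without `onTaskRestarted` being called, the first failed check of the new task is counted as the third failure even though it falls inside the grace period. With the reset, that same check is ignored.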


      Theoretically, this can even lead to OOM exceptions for applications whose instances are added and removed very often, since the hash map never shrinks.
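
One possible way to bound the map's size is to drop entries for instances that no longer exist. The following is a minimal sketch under that assumption. `HealthMapPruning` and `purge` are hypothetical names and are not part of Marathon.

```scala
object HealthMapPruning {
  // Hypothetical helper: keep health state only for instance ids that are still active.
  def purge[H](
      healthByInstanceId: Map[String, H],
      activeInstanceIds: Set[String]
  ): Map[String, H] =
    healthByInstanceId.filter { case (id, _) => activeInstanceIds.contains(id) }
}
```

Calling such a helper whenever the set of active instances changes would keep the map proportional to the number of live instances. It would not help resident tasks on its own, because their instance id stays active across restarts, so their entry would still need an explicit reset.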




            • Assignee:
              Ken Sipe
              logarithm
              ( DO NOT USE ) Orchestration Team
              egor-ryashin, Ken Sipe, logarithm, xanec
            • Watchers:
              4


              • Created: