I think a configurable wait time would also help. Currently the only wait setting applies after the attempt to disable the service: wait_period_after_service_update is the time between the service being disabled and the evacuation attempt. Introducing something like wait_period_before_marking_service_as_disabled would help, because it gives the nova API time to notice that there are no heartbeats from nova-compute and mark the service down. AFAIK the heartbeat period is something like 60 seconds, so right now the outcome depends on when the failure happens and how fast Masakari responds. I think combining the above with this idea would help avoid the race condition.
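To make the idea concrete, here is a rough sketch of what such a wait could look like. This is not Masakari code: wait_for_service_down and is_service_down are hypothetical names, and is_service_down stands in for whatever call would ask the nova API whether it already reports nova-compute as down. The wait_period argument plays the role of the proposed wait_period_before_marking_service_as_disabled option, which does not exist today.

```python
import time
from typing import Callable


def wait_for_service_down(
    is_service_down: Callable[[], bool],
    wait_period: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until nova reports the service down or wait_period expires.

    is_service_down is a hypothetical callback (an assumption for this
    sketch) that returns True once the nova API marks nova-compute down.
    Returns True if the service was seen down, False on timeout.
    """
    deadline = clock() + wait_period
    while True:
        if is_service_down():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Timed out: the caller decides whether to disable anyway.
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

The workflow would call this before disabling the service, then continue with the existing wait_period_after_service_update step. Returning early as soon as the service is seen down means the extra wait only costs time when nova has not caught up yet.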