masakari is unable to disable nova-compute of failed node if nova-compute is in active-state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
masakari | Triaged | High | suzhengwei |
Bug Description
I just ran into an issue with masakari failing to disable a nova-compute agent.
Steps to reproduce:
* kill the pacemaker_remote process on the physical compute node
* pacemaker will fence the physical compute node via IPMI
* masakari will detect that the host is down
* masakari will *instantly* try to disable the nova-compute agent
* nova-api will return 504, because the agent is still in the "active" state
* masakari will fail the workflow
Expected:
* masakari successfully disables the nova-compute agent
Logs:
Masakari: http://
Nova: http://
I already created a bug report in nova ( https:/
As a workaround I moved the "sleep" that sits between "disable nova-agent" and "wait for nova-agent down" to the top, so masakari waits for nova to detect the new agent state before it tries to disable the agent.
Is this a valid fix?
Is there a reason why the sleep was added after "disable nova-agent"?
Should we add some retries to the nova-api call instead?
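To make the retry idea concrete, here is a minimal sketch of what wrapping the nova-api call could look like. It is not masakari's actual code: `GatewayTimeout` and `call_with_retries` are hypothetical names, and the retry count and delay are placeholder values.

```python
import time


class GatewayTimeout(Exception):
    """Hypothetical stand-in for the error raised when nova-api answers 504."""


def call_with_retries(func, retries=3, delay=1.0, sleep=time.sleep):
    """Call func(); on GatewayTimeout, wait `delay` seconds and try again.

    At most `retries` extra attempts are made after the first call.
    The last GatewayTimeout is re-raised if every attempt fails.
    """
    attempt = 0
    while True:
        try:
            return func()
        except GatewayTimeout:
            if attempt >= retries:
                raise
            attempt += 1
            sleep(delay)
```

The disable call would then be passed in as `func`, for example `call_with_retries(lambda: disable_agent(host))`, where `disable_agent` stands for whatever function actually talks to nova-api. Retries alone only help if nova marks the agent as down within `retries * delay` seconds.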
Changed in masakari:
status: Confirmed → Triaged
importance: Undecided → High
Thanks for reporting this; I am seeing the same behaviour and was about to report/discuss as well.
I believe the order got mixed up and the wait should really be there to let nova discover that the host is down. I actually think there should be both a wait time (to postpone the call a bit) and active status polling, so that masakari can act as quickly as possible rather than rely on preset times.
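A rough sketch of the "wait, then poll" idea, under stated assumptions: `wait_until_down` is a hypothetical helper, `get_state` is any callable that returns the agent state as nova currently sees it, the string `"down"` is assumed to be the value reported for a dead agent, and all timings are placeholders rather than masakari defaults.

```python
import time


def wait_until_down(get_state, initial_wait=5.0, poll_interval=1.0,
                    timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Return True once get_state() reports "down", False on timeout.

    initial_wait postpones the first check; after that the state is
    polled every poll_interval seconds until `timeout` seconds pass.
    """
    sleep(initial_wait)
    deadline = clock() + timeout
    while True:
        if get_state() == "down":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

The workflow would only try to disable the agent after this returns True, so the call is issued as soon as nova has noticed the failure instead of after a fixed sleep.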