masakari is unable to disable nova-compute of failed node if nova-compute is in active-state

Bug #1887756 reported by Fabian Zimmermann
This bug affects 3 people
Affects: masakari
Status: Triaged
Importance: High
Assigned to: suzhengwei
Milestone: none listed

Bug Description

I just ran into an issue where masakari fails to disable a nova-compute agent.

Steps to reproduce:

* kill the pacemaker_remote process on the physical compute node
* pacemaker fences the physical compute node via IPMI
* masakari detects that the host is down
* masakari *immediately* tries to disable the nova-compute agent
* nova-api returns 504, because the agent is still in the "active" state
* masakari fails the workflow

Expected:

* masakari successfully disables the nova-compute agent

Logs:
Masakari: http://paste.openstack.org/show/wwiBbs9gyImcJm311wJM/
Nova: http://paste.openstack.org/show/Vm2KQvxAJI2bd3S3iEUR/

I already created a bug report in nova ( https://bugs.launchpad.net/nova/+bug/1887751 ), but I think masakari should handle this case a bit smarter.

As a workaround, I moved the "sleep" that sat between "disable nova-agent" and "wait for nova-agent down" to the top of the workflow, so masakari waits for nova to detect the new agent state before it tries to disable the agent.

Is this a valid fix?

Is there a reason why the sleep was added after "disable nova-agent"?

Should we add some retries to the nova-api-call instead/additionally?
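
To make the retry question concrete, here is a minimal sketch of what retrying the disable call could look like. It is not masakari's code: `GatewayTimeout` and `call_with_retries` are hypothetical names used only for illustration, and `func` stands in for whatever call masakari makes to nova-api.

```python
import time


class GatewayTimeout(Exception):
    """Hypothetical stand-in for an HTTP 504 returned by nova-api."""


def call_with_retries(func, attempts=5, delay=1.0, sleep=time.sleep):
    """Call func(); on GatewayTimeout wait `delay` seconds and try again.

    Re-raises the last GatewayTimeout once `attempts` calls have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except GatewayTimeout:
            if attempt == attempts:
                raise
            sleep(delay)
```

Retries alone would not remove the race, they would only give nova more time to notice that the agent is gone, so the number of attempts and the delay would have to cover nova's detection window.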

Radosław Piliszek (yoctozepto) wrote :

Thanks for reporting this; I am seeing the same behaviour and was about to report/discuss as well.
I believe the order got mixed up: the wait should really be for nova to discover that the host is down. I actually think there should be both a wait time (to postpone the call a bit) and active status polling, so that masakari can act as quickly as possible rather than rely on preset times.
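
A rough sketch of the "wait, then poll" idea follows. It assumes a hypothetical `get_state` callable that returns the nova-compute service state ("up" or "down"); the function name, parameters, and default values are illustrative and are not taken from masakari.

```python
import time


def wait_for_service_down(get_state, initial_delay=5.0, poll_interval=2.0,
                          timeout=60.0, sleep=time.sleep,
                          clock=time.monotonic):
    """Return True once get_state() reports "down", False on timeout.

    get_state is a hypothetical callable that asks nova for the service state.
    """
    # Short fixed wait first, so nova has a chance to notice the failure.
    sleep(initial_delay)
    deadline = clock() + timeout
    while True:
        if get_state() == "down":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

With this shape the workflow proceeds as soon as nova reports the service as down, and the preset time only acts as an upper bound.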

Changed in masakari:
status: New → Confirmed
suzhengwei (sue.sam) wrote :

I think this is an issue with host fencing. The failed host was not powered off when pacemaker detected the failure.
A nova service has two similar flags: 'status' (enabled/disabled) and 'state' (up/down). The masakari-engine host recovery workflow only disables nova-compute, so the nova-compute 'status' is 'disabled' while its 'state' is still 'up' when the instances are being evacuated. That is why it returned the 504 error.
If nova-compute is forced down, the evacuations will continue to execute, but a split-brain problem exists.
Host fencing (poweroff) in the masakari-engine workflow was discussed at the Wallaby PTG, and it could completely solve the problem.
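
The distinction between the two flags can be illustrated with a small sketch. `ComputeService` and `ready_for_evacuation` are hypothetical names made up for this example, and the check reflects only the description in this comment, not nova's actual logic.

```python
from dataclasses import dataclass


@dataclass
class ComputeService:
    status: str  # "enabled" or "disabled": administrative flag
    state: str   # "up" or "down": liveness as seen by nova


def ready_for_evacuation(service):
    """True only if the service is disabled AND nova already sees it as down."""
    return service.status == "disabled" and service.state == "down"


# The situation described above: disabled by masakari, but still reported up.
print(ready_for_evacuation(ComputeService(status="disabled", state="up")))
```

Forcing the service down would flip the second flag without proving that the host is really off, which is the split-brain risk mentioned above.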

Changed in masakari:
status: Confirmed → Triaged
importance: Undecided → High
suzhengwei (sue.sam) wrote :
Changed in masakari:
assignee: nobody → suzhengwei (sue.sam)