masakari

masakari is unable to disable nova-compute of failed node if nova-compute is in active-state

Bug #1887756 reported by Fabian Zimmermann on 2020-07-16

This bug report is a duplicate of: Bug #1883465: Masakari fails to evacuate instance when nova-compute is not down already.

This bug affects 3 people

Affects     Status     Importance   Assigned to   Milestone
masakari    Triaged    High         suzhengwei

Bug Description

I just ran into an issue where masakari fails to disable a nova-compute agent.

Steps to reproduce:

* kill the pacemaker_remote process on the physical compute node
* pacemaker fences the physical compute node via IPMI
* masakari detects that the host is down
* masakari *immediately* tries to disable the nova-compute agent
* nova-api returns a 504, because the agent is still in the "active" state
* masakari fails the workflow

Expected:

* masakari successfully disables the nova-compute agent

Logs:
Masakari: http://paste.openstack.org/show/wwiBbs9gyImcJm311wJM/
Nova: http://paste.openstack.org/show/Vm2KQvxAJI2bd3S3iEUR/

I already created a bug report in nova ( https://bugs.launchpad.net/nova/+bug/1887751 ), but I think masakari should handle this case a bit smarter.

As a workaround, I moved the "sleep" that sits between "disable nova-agent" and "wait for nova-agent down" to the top of the workflow, so masakari waits for nova to detect the new agent state before it tries to disable the agent.

Is this a valid fix?

Is there a reason why the sleep was added after "disable nova-agent"?

Should we add some retries to the nova-api call instead, or in addition?
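
To make the retry idea concrete, here is a minimal sketch of a retry wrapper with exponential backoff. It is not taken from the masakari code base: `GatewayTimeout` is a hypothetical stand-in for whatever exception the client raises on the 504, and `func` stands for the "disable nova-agent" call, which is passed in rather than shown.

```python
import time


class GatewayTimeout(Exception):
    """Hypothetical stand-in for the HTTP 504 error returned by nova-api."""


def call_with_retries(func, attempts=5, delay=2.0, backoff=2.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are used up.

    Only GatewayTimeout is retried; any other exception propagates at once.
    The pause between attempts grows by the factor `backoff` each time.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except GatewayTimeout:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff
```

A caller would wrap the disable step, for example `call_with_retries(lambda: disable_agent(host))`, where `disable_agent` is again a placeholder name. Retries only mask the timing problem; they do not replace waiting for nova to notice that the agent is gone.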

Radosław Piliszek (yoctozepto) wrote on 2020-07-16:

Thanks for reporting this; I am seeing the same behaviour and was about to report/discuss as well.
I believe the order got mixed up: the wait should really happen so that nova can discover that the host is down. I actually think there should be both a wait time (to postpone the action a bit) and active status polling, so that masakari can act as quickly as possible rather than rely on preset times.
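
The combination of an initial wait and active polling could look roughly like the sketch below. It is an illustration only, not masakari code: `get_state` is a hypothetical callable that returns the nova-compute service state ("up" or "down") for the failed host, and the default timings are arbitrary.

```python
import time


def wait_for_state(get_state, wanted="down", initial_delay=0.0, interval=1.0,
                   timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Return True once get_state() == wanted, False if the timeout expires.

    initial_delay postpones the first check; after that the state is polled
    every `interval` seconds until `timeout` seconds have passed.
    """
    if initial_delay > 0:
        sleep(initial_delay)
    deadline = clock() + timeout
    while True:
        if get_state() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With this shape the workflow proceeds as soon as nova reports the service as down, and the timeout only bounds the worst case.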

Changed in masakari:
status:	New → Confirmed

suzhengwei (sue.sam) wrote on 2020-10-30:

I think this is an issue with host fencing. The failed host was not powered off when pacemaker detected the failure.
The nova service has two similar flags: 'status' (enabled/disabled) and 'state' (up/down). The masakari-engine host recovery workflow only disables nova-compute, so while instances are being evacuated the nova-compute 'status' is 'disabled' but its 'state' is still 'up'. That is why the 504 error was returned.
If nova-compute is forced down, the evacuations will continue to execute, but a split-brain problem remains.
Host fencing (power-off) in the masakari-engine workflow was discussed at the Wallaby PTG, and it could solve the problem completely.
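
The difference between the two flags can be illustrated with a small model. This is a simplified sketch based only on the description in this comment, not nova's actual data model or evacuation check; the class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ComputeService:
    host: str
    status: str  # "enabled" or "disabled": administrative flag
    state: str   # "up" or "down": liveness as seen by nova


def can_evacuate_from(service):
    """Per the comment above, evacuation needs the service state to be down.

    Disabling the service changes only `status`, so it is not sufficient.
    """
    return service.state == "down"


# Disabled but still reported as up: the situation described in this bug.
print(can_evacuate_from(ComputeService("compute-1", "disabled", "up")))
# Disabled and down: evacuation can proceed.
print(can_evacuate_from(ComputeService("compute-1", "disabled", "down")))
```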

Radosław Piliszek (yoctozepto) on 2021-04-25