masakari

Error 504 when disabling a nova-compute service recently down

Bug #1923058 reported by Nautik on 2021-04-08

This bug report is a duplicate of: Bug #1883465: Masakari fails to evacuate instance when nova-compute is not down already.

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
masakari	Triaged	High	Unassigned

Bug Description

By default, the first task executed when a notification for a host being down is received is disable_compute_service_task.

When a host is down but still seen as up by the nova control plane (it can take up to 60 seconds for the failure to be reported), any attempt to disable its nova-compute service ends with a timeout and an error.

This bug was first reported against nova (see: https://bugs.launchpad.net/nova/+bug/1920977 ), but apparently this is expected behavior on the nova side.

As answered in the nova bug report, since Train the expected approach is to use the force-down API call instead of the disable service call (this is what the RH OSP instanceHA feature does: https://github.com/ClusterLabs/fence-agents/pull/303).
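To make the difference concrete, here is a rough sketch of what a force-down request could look like. It only builds the request and does not send it. It assumes compute API microversion 2.53 or later, where a service is updated with PUT /os-services/{service_id}; the function name and the service id used below are made up for illustration, and this is not what masakari currently does.

```python
import json

# Assumption: microversion 2.53+, where services are addressed by UUID.
COMPUTE_API_VERSION = "2.53"


def build_force_down_request(service_id, forced_down=True):
    """Return (method, path, headers, body) for a nova force-down call.

    Hypothetical helper: it only describes the HTTP request. Sending it
    (and adding the X-Auth-Token header) is left to the caller.
    """
    headers = {
        "Content-Type": "application/json",
        "OpenStack-API-Version": "compute " + COMPUTE_API_VERSION,
    }
    body = json.dumps({"forced_down": bool(forced_down)})
    return "PUT", "/os-services/" + service_id, headers, body


# Example with a placeholder service id (not a real one).
method, path, headers, body = build_force_down_request("example-service-uuid")
print(method, path, body)
```

The same call with forced_down=False would clear the flag once the host is back, which the disable service call does not cover.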

This raises another question. As described in the nova api-ref (https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down), force-down should be used only when we are sure that the host has been fenced. Since fencing is handled by an external service by default (pacemaker/corosync), can masakari know that the host was fenced correctly before forcing it down?

One idea for now would be to drop the task and keep only the "wait 60 seconds" part, but I am missing the history of why this task exists.
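As an illustration of the "wait" idea, the fixed delay could also be a poll that returns as soon as nova reports the service as down. This is only a sketch: wait_until_down and the get_state callable are hypothetical names, and get_state is assumed to return the "state" field ("up" or "down") that nova lists for the service.

```python
import time


def wait_until_down(get_state, timeout=60.0, interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns "down" or the timeout expires.

    Returns True if the service was seen as down, False on timeout.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == "down":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example with a fake state source that reports "down" on the third poll.
states = iter(["up", "up", "down"])
print(wait_until_down(lambda: next(states), timeout=60, interval=0))
```

Whether the following tasks should still run when the poll times out is a separate decision that this sketch does not make.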

Radosław Piliszek (yoctozepto) on 2021-04-25

Changed in masakari:
status:	New → Triaged
importance:	Undecided → High
