Error 504 when disabling a nova-compute service recently down

Bug #1923058 reported by Nautik
Affects   Status   Importance  Assigned to  Milestone
masakari  Triaged  High        Unassigned

Bug Description

By default, the first task executed when a notification is received for a host being down is disable_compute_service_task.

When a host is down but the nova control plane still sees it as up (a failure can take up to 60 seconds to be reported), any attempt to disable the nova-compute service ends in a timeout and an error.

This bug was first reported against nova (see https://bugs.launchpad.net/nova/+bug/1920977), but apparently this is expected behavior on the nova side.

As answered in the nova bug report, since Train the expected approach is to use the force-down API call instead of the disable-service call (this is what the RH OSP instanceHA feature does: https://github.com/ClusterLabs/fence-agents/pull/303).
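
For reference, a minimal sketch of what a force-down request could look like, assuming compute API microversion 2.53 or later, where a service is updated by its UUID. The function name and the returned dict layout are hypothetical; this only builds the request and does not send it.

```python
import json

# Assumption: microversion 2.53+, where PUT /os-services/{service_id}
# accepts "forced_down" in the request body.
COMPUTE_MICROVERSION = "2.53"


def build_force_down_request(service_id, forced_down=True):
    """Describe the HTTP request that marks a compute service as forced down.

    service_id is the UUID of the nova-compute service record.
    Hypothetical helper: it returns a plain dict and performs no I/O.
    """
    return {
        "method": "PUT",
        "path": f"/os-services/{service_id}",
        "headers": {
            "Content-Type": "application/json",
            "OpenStack-API-Version": f"compute {COMPUTE_MICROVERSION}",
        },
        "body": json.dumps({"forced_down": forced_down}),
    }
```

The same endpoint is used for disabling a service, so the difference between the two calls is only in the request body.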

This raises another question. As described in the nova api-ref (https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down), force-down should be used only when we are sure that the host has been fenced. Since fencing is handled by an external service by default (pacemaker/corosync), can masakari know that the host was fenced correctly before calling force-down?
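
To make the question concrete, here is a rough sketch of the decision I have in mind. It is not masakari code: fencing_confirmed, force_down and wait are hypothetical callables standing in for whatever mechanism would provide them, and the 60 second default is the reporting delay mentioned above.

```python
def handle_host_failure(host, fencing_confirmed, force_down, wait,
                        wait_seconds=60):
    """Force the host down only when fencing is known to have succeeded.

    All three callables are hypothetical placeholders:
      fencing_confirmed(host) -> bool, True if the host is known to be fenced
      force_down(host)        -> marks the nova-compute service as forced down
      wait(seconds)           -> blocks until nova can report the host as down
    """
    if fencing_confirmed(host):
        # Safe to tell nova the host is down right away.
        force_down(host)
        return "forced_down"
    # Fencing state unknown: fall back to waiting for nova to notice.
    wait(wait_seconds)
    return "waited"
```

If masakari has no way to implement fencing_confirmed, only the second branch remains.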

One idea for now would be to drop the task and keep only the "wait 60 seconds" part, but I am missing the history of why this task exists.

Changed in masakari:
status: New → Triaged
importance: Undecided → High