nova compute service does not reset instance with task_state in rebooting_hard

Bug #1999674 reported by Pierre-Samuel LE STANG
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
When a user asks for a hard reboot of a running instance while nova-compute is unavailable (service stopped or host down), the instance can, under certain conditions, stay in the rebooting_hard task_state after nova-compute starts again.

The issue occurs when the RabbitMQ message-ttl for queued messages is lower than the time nova-compute needs to come back up.
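The condition can be written as a simple timing check. The sketch below is a toy model only, not nova or RabbitMQ code; the function name and its parameters are hypothetical.

```python
def reboot_request_is_lost(message_ttl_s: float, compute_downtime_s: float) -> bool:
    """Return True if the queued reboot request expires before nova-compute is back.

    Toy model: the request is enqueued at t=0 while nova-compute is down,
    the broker drops it after message_ttl_s seconds, and the service starts
    consuming again after compute_downtime_s seconds.
    """
    return message_ttl_s < compute_downtime_s


# With a 60-second TTL, a 90-second outage loses the message; a 30-second one does not.
print(reboot_request_is_lost(60, 90))  # True
print(reboot_request_is_lost(60, 30))  # False
```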

Steps to reproduce
==================

Prerequisites:
* Set a low message-ttl (let's say 60 seconds) in your rabbitmq
* Have a running instance on a host

First case is having a failure on nova-compute service
1/ stop nova compute service on host
2/ ask for a reboot hard: openstack server reboot --hard <instance_id>
3/ wait 60 seconds
4/ start nova compute service
5/ check instance task_state and status

Second case is having a failure on the host
1/ hard shutdown the host (let's say a power supply issue)
2/ ask for a reboot hard: openstack server reboot --hard <instance_id>
3/ wait 60 seconds
4/ restart the host
5/ check instance task_state and status
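
The first case above could be scripted roughly as follows. This is a sketch under assumptions: the systemd unit name `nova-compute` varies by distribution, the `OS-EXT-STS:task_state` column is only shown to sufficiently privileged users, and the script assumes both `systemctl` on the compute host and the `openstack` client are reachable from the same shell. It defaults to a dry run that only prints the commands.

```python
import subprocess
import time

# Assumption: the systemd unit name; it differs between distributions.
COMPUTE_UNIT = "nova-compute"


def build_steps(instance_id: str, wait_s: int = 60):
    """Return the first-case steps as (description, action) tuples.

    An action is either an argv list to execute or an int number of seconds to wait.
    """
    return [
        ("stop nova compute service", ["systemctl", "stop", COMPUTE_UNIT]),
        ("ask for a hard reboot",
         ["openstack", "server", "reboot", "--hard", instance_id]),
        ("wait for the message TTL to expire", wait_s),
        ("start nova compute service", ["systemctl", "start", COMPUTE_UNIT]),
        ("check instance task_state and status",
         ["openstack", "server", "show", instance_id,
          "-c", "status", "-c", "OS-EXT-STS:task_state"]),
    ]


def run(instance_id: str, dry_run: bool = True) -> None:
    """Print each step and, unless dry_run is set, execute it."""
    for description, action in build_steps(instance_id):
        print(f"# {description}")
        if isinstance(action, int):
            print(f"sleep {action}")
            if not dry_run:
                time.sleep(action)
        else:
            print(" ".join(action))
            if not dry_run:
                subprocess.run(action, check=True)


if __name__ == "__main__":
    run("<instance_id>")  # dry run: prints the commands without executing them
```

The wait must exceed the configured message-ttl for the request to be dropped before nova-compute consumes it.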

Expected result
===============
Since the message was lost, we expect nova-compute to reset the state to active so that the user can take other actions on the instance.

Actual result
=============
The instance is stuck in the rebooting_hard task_state and the user is blocked.
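
The expected behaviour can be illustrated with a standalone sketch. It is not nova's code and does not claim to reproduce the merged fix: the `Instance` class, the function name, and the lowercase state strings are hypothetical stand-ins. It only shows the idea of clearing a leftover transient task_state when the compute service starts, in the same way as for other transient states such as pausing and unpausing.

```python
from dataclasses import dataclass
from typing import Optional

# Transient task states named in this report; the string values are illustrative.
TRANSIENT_TASK_STATES = {"pausing", "unpausing", "rebooting_hard"}


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record."""
    uuid: str
    vm_state: str
    task_state: Optional[str]


def reset_stale_task_state(instance: Instance) -> bool:
    """Clear a transient task_state left behind by a lost request.

    Only task_state is touched; vm_state is left as it was.
    Returns True if the instance was modified.
    """
    if instance.task_state in TRANSIENT_TASK_STATES:
        instance.task_state = None
        return True
    return False


stuck = Instance(uuid="<instance_id>", vm_state="active", task_state="rebooting_hard")
reset_stale_task_state(stuck)
print(stuck.task_state, stuck.vm_state)  # None active
```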

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/867807

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/867832

Changed in nova:
status: New → In Progress
sean mooney (sean-k-mooney) wrote :

If we know the compute service is down, we should probably reject this in the API.

If the message bus disconnects, we expect the mandatory flag to detect that at the RPC level.

I have not triaged this fully, but I'm not sure resetting the state is the correct approach.

Pierre-Samuel LE STANG (pslestang) wrote :

We don't have the compute service status in real time, so it's hard to rely on it.

REBOOTING_HARD is also a transient status, so it makes sense to handle it together with the other transient statuses.

Arnaud Morin (arnaud-morin) wrote :

The message bus disconnection appears only after a timeout, so nova-compute will be reported down only after a defined period of time.

If an API call requesting a hard reboot is made in the middle of this, nova sends the message to nova-compute over the message bus.

But if the message TTL is too short, the message can be dropped by the queue system (rabbit) before nova-compute is up again.

In that scenario, the only possible action is to reset-state the instance from an admin context.

We (OVHcloud) are going to carry this patch downstream, but we think it would be nice to consider having it upstream as well.

I don't think we can rely on nova-api knowing that nova-compute is down or that the message bus is disconnected, because this may not always be true.

Moreover, we already reset the state when the status is PAUSING, UNPAUSING, etc. Why not for REBOOTING_HARD?

Is there any better approach you can see?

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/867807
Committed: https://opendev.org/openstack/nova/commit/e5766446e5513aba6a64f6944e50b9effa3bff52
Submitter: "Zuul (22348)"
Branch: master

commit e5766446e5513aba6a64f6944e50b9effa3bff52
Author: Pierre-Samuel Le Stang <email address hidden>
Date: Thu Dec 15 15:54:38 2022 +0100

    Reproducer test of bug #1999674

    This commit aims to show that an active instance which is hard rebooted
    while nova compute is unavailable (service or host down) stays in
    task_state REBOOTING_HARD after nova compute is available again.

    Related-Bug: #1999674
    Change-Id: Ic672ce509ca21715f74931b8a6c6990b1c20ce30

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/867832
Committed: https://opendev.org/openstack/nova/commit/aa3e8fef7b949ec3ddb3c4eaa348eb004593d29e
Submitter: "Zuul (22348)"
Branch: master

commit aa3e8fef7b949ec3ddb3c4eaa348eb004593d29e
Author: Pierre-Samuel Le Stang <email address hidden>
Date: Thu Dec 15 18:30:15 2022 +0100

    Correctly reset instance task state in rebooting hard

    When a user ask for a reboot hard of a running instance while nova compute is
    unavailable (service stopped or host down) it might happens under certain
    conditions that the instance stays in rebooting_hard task_state after
    nova-compute start again. This patch aims to fix that.

    Closes-Bug: #1999674
    Change-Id: I170e390fe4e467898a8dc7df6a446f62941d49ff

Changed in nova:
status: In Progress → Fix Released