OpenStack Compute (nova)

nova compute service does not reset instance with task_state in rebooting_hard

Bug #1999674 reported by Pierre-Samuel LE STANG on 2022-12-14

This bug affects 1 person

Affects                     Status         Importance   Assigned to   Milestone
OpenStack Compute (nova)    Fix Released   Undecided    Unassigned

Bug Description

Description
===========
When a user requests a hard reboot of a running instance while nova-compute is unavailable (service stopped or host down), the instance can, under certain conditions, stay in the rebooting_hard task_state after nova-compute starts again.

The issue occurs when the RabbitMQ message-ttl for queued messages is lower than the time needed to bring nova-compute back up: the reboot request expires in the queue before nova-compute can consume it.

Steps to reproduce
==================

Prerequisites:
* Set a low message-ttl (for example, 60 seconds) in RabbitMQ
* Have a running instance on a host

First case: failure of the nova-compute service
1/ stop the nova-compute service on the host
2/ request a hard reboot: openstack server reboot --hard <instance_id>
3/ wait 60 seconds
4/ start the nova-compute service
5/ check the instance task_state and status

Second case: failure of the host
1/ hard shut down the host (for example, a power supply issue)
2/ request a hard reboot: openstack server reboot --hard <instance_id>
3/ wait 60 seconds
4/ restart the host
5/ check the instance task_state and status
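
The timeline shared by both cases can be modelled in a few lines of Python. This is a toy simulation, not nova or RabbitMQ code: `TTLQueue`, `request_hard_reboot`, `compute_startup`, and `simulate` are hypothetical names, the instance is a plain dict, and the clock is a number passed in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class TTLQueue:
    """Toy stand-in for a queue with a per-queue message TTL."""
    ttl: float  # seconds a message may wait in the queue
    _items: list = field(default_factory=list)

    def publish(self, msg, now):
        self._items.append((now, msg))

    def consume(self, now):
        """Return messages still alive at `now`; expired ones are dropped."""
        alive = [m for (t, m) in self._items if now - t < self.ttl]
        self._items.clear()
        return alive


def request_hard_reboot(instance, queue, now):
    # The API side records the task state, then sends the request to the queue.
    instance["task_state"] = "rebooting_hard"
    queue.publish({"method": "reboot_instance", "uuid": instance["uuid"]}, now)


def compute_startup(instance, queue, now):
    # On start, the compute service only sees messages that survived the TTL.
    for msg in queue.consume(now):
        if msg["method"] == "reboot_instance":
            instance["task_state"] = None  # reboot handled, task state cleared


def simulate(ttl, downtime):
    """Return the task_state observed at step 5 for a given TTL and downtime."""
    instance = {"uuid": "hypothetical-uuid", "vm_state": "active", "task_state": None}
    queue = TTLQueue(ttl=ttl)
    request_hard_reboot(instance, queue, now=0.0)   # step 2, compute is down
    compute_startup(instance, queue, now=downtime)  # step 4, compute is back
    return instance["task_state"]
```

In this model, `simulate(60, 30)` leaves the task state cleared, while `simulate(60, 90)` leaves it at `"rebooting_hard"` because the request expired before anything could consume it.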

Expected result
===============
Since the reboot message was lost, we expect nova-compute to reset the instance state to active so that the user can take other actions on the instance.

Actual result
=============
The instance is stuck in the rebooting_hard task_state and the user is blocked.

OpenStack Infra (hudson-openstack) wrote on 2022-12-15: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/867807

OpenStack Infra (hudson-openstack) wrote on 2022-12-15: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/867832

Changed in nova:
status:	New → In Progress

sean mooney (sean-k-mooney) wrote on 2022-12-16:

If we know the compute service is down, we should probably reject this in the API.

If the message bus disconnects, we expect the mandatory flag to detect that at the RPC level.

I have not triaged this fully, but I'm not sure resetting the state is the correct approach.

Pierre-Samuel LE STANG (pslestang) wrote on 2022-12-16:

We don't have the compute service status in real time, so it's hard to rely on it.

REBOOTING_HARD is also a transient state, so it makes sense to handle it together with the other transient states.

Arnaud Morin (arnaud-morin) wrote on 2023-01-03:

The message bus disconnection appears only after a timeout, so nova-compute will be reported down only after a defined period of time.

If an API call requesting a hard reboot is made in the middle of this period, nova sends the message to nova-compute over the message bus.

But if the message TTL is too short, the message can be dropped by the queue system (RabbitMQ) before nova-compute is up again.

In that scenario, the only possible action is to reset-state the instance from an admin context.

We (OVHcloud) are going to carry this patch downstream, but we think it would be worth having upstream as well.

I don't think we can rely on nova-api knowing that nova-compute is down or that the message bus is disconnected, because this may not always be true.

Moreover, we already reset state when status is PAUSING, UNPAUSING, etc., why not for REBOOTING_HARD?
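
The idea argued for here can be sketched as a startup check that treats `rebooting_hard` like the other transient task states. This is a simplified illustration, not nova's actual code: the function name `reset_stale_task_state` and the dict-based instance are hypothetical, and the set of states is limited to the ones named in this thread.

```python
# Transient task states named in this thread (assumption: string values chosen
# for illustration; the real list handled by nova may be longer).
TRANSIENT_TASK_STATES = {"pausing", "unpausing", "rebooting_hard"}


def reset_stale_task_state(instance):
    """Clear a transient task_state left over from a request that was lost.

    Returns True if the instance was changed, False otherwise.
    """
    if instance.get("task_state") in TRANSIENT_TASK_STATES:
        instance["task_state"] = None
        return True
    return False
```

Called once per instance when the compute service starts, a check like this would leave an instance with `task_state` set to `"rebooting_hard"` free for new user actions, without touching instances in any other task state.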

Is there any better approach you can see?

OpenStack Infra (hudson-openstack) wrote on 2023-12-20: Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/867807
Committed: https://opendev.org/openstack/nova/commit/e5766446e5513aba6a64f6944e50b9effa3bff52
Submitter: "Zuul (22348)"
Branch: master

commit e5766446e5513aba6a64f6944e50b9effa3bff52
Author: Pierre-Samuel Le Stang <email address hidden>
Date: Thu Dec 15 15:54:38 2022 +0100

Reproducer test of bug #1999674

    This commit aims to show that an active instance which is hard rebooted
    while nova compute is unavailable (service or host down) stays in
    task_state REBOOTING_HARD after nova compute is available again.

Related-Bug: #1999674
Change-Id: Ic672ce509ca21715f74931b8a6c6990b1c20ce30

OpenStack Infra (hudson-openstack) wrote on 2024-03-20: Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/867832
Committed: https://opendev.org/openstack/nova/commit/aa3e8fef7b949ec3ddb3c4eaa348eb004593d29e
Submitter: "Zuul (22348)"
Branch: master

commit aa3e8fef7b949ec3ddb3c4eaa348eb004593d29e
Author: Pierre-Samuel Le Stang <email address hidden>
Date: Thu Dec 15 18:30:15 2022 +0100

Correctly reset instance task state in rebooting hard

    When a user asks for a hard reboot of a running instance while nova compute is
    unavailable (service stopped or host down), it might happen under certain
    conditions that the instance stays in rebooting_hard task_state after
    nova-compute starts again. This patch aims to fix that.

Closes-Bug: #1999674
Change-Id: I170e390fe4e467898a8dc7df6a446f62941d49ff

Changed in nova:
status:	In Progress → Fix Released
