OpenStack Compute (nova)

Nova instance stuck in powering-off when rebooting all nodes in cluster

Bug #1593186 reported by Eyal Posener on 2016-06-16

This bug affects 6 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Expired	Undecided	Unassigned

Bug Description

After rebooting all nodes in the cluster, all the instances that were running on the cluster are stuck with Status: ACTIVE, Task state: powering-off, Power state: Crashed.
From the log, it looks like messages sent from the init_host method during nova-compute service start vanish, because the RPC server is started only afterwards.

The manager.init_host method sees an instance with vm_state == vm_states.ACTIVE and vm_power_state in (power_state.SHUTDOWN, power_state.CRASHED). I get the log message "Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 6".
It then calls the api.stop method, which invokes the api.force_stop method, and I see the following log message: "Going to try to stop instance force_stop". This method invokes a stop_instance method through RPC. But the RPC message never reaches the RPC server, which is started only after init_host is called in the service.start method.
Since I am using rabbitmq, the message queues are not yet set up after rebooting the cluster nodes, and the call never reaches its destination.

Afterwards, _sync_instance_power_state sees the powering-off task state and never cleans up the instance state. I get the log message: "During sync_power_state the instance has a pending task (powering-off). Skip."
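
The sequence described above can be illustrated with a toy model. This is a sketch, not nova code: ToyBroker, Instance and the functions below are hypothetical stand-ins, and the broker is assumed to silently drop a cast when no consumer is registered for the topic yet.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

ACTIVE = "active"
POWERING_OFF = "powering-off"
CRASHED = "crashed"


@dataclass
class Instance:
    # Hypothetical, simplified view of an instance record.
    vm_state: str = ACTIVE
    task_state: Optional[str] = None
    power_state: str = CRASHED


@dataclass
class ToyBroker:
    # Assumption: a cast to a topic without a consumer is dropped.
    consumers: Dict[str, Callable[[Instance], None]] = field(default_factory=dict)
    dropped: List[str] = field(default_factory=list)

    def cast(self, topic: str, instance: Instance) -> None:
        handler = self.consumers.get(topic)
        if handler is None:
            self.dropped.append(topic)  # message vanishes
            return
        handler(instance)


def stop_instance(instance: Instance) -> None:
    # What the RPC handler would do if the message arrived.
    instance.vm_state = "stopped"
    instance.task_state = None


def init_host(broker: ToyBroker, instance: Instance) -> None:
    # The instance looks active in the DB but is crashed on the host,
    # so a stop is requested over RPC and the task state is set.
    if instance.vm_state == ACTIVE and instance.power_state == CRASHED:
        instance.task_state = POWERING_OFF
        broker.cast("compute", instance)


def sync_power_state(instance: Instance) -> str:
    # A pending task makes the periodic sync skip the instance.
    if instance.task_state is not None:
        return "skip"
    return "synced"


broker = ToyBroker()
inst = Instance()
init_host(broker, inst)                      # RPC server not started yet
broker.consumers["compute"] = stop_instance  # consumer registered too late
result = sync_power_state(inst)
```

In this model the instance keeps task_state "powering-off" indefinitely, because the only code path that would clear it is the handler for the message that was dropped.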

Nova version is 12.0.0.

Revision history for this message

Shoham Peller (shoham-peller) wrote on 2016-06-16:

#1

So it seems the fix is to switch the order of these:
https://github.com/openstack/nova/blob/master/nova/service.py#L117
https://github.com/openstack/nova/blob/master/nova/service.py#L153

That is, init the RPC server before init_host, since init_host sends messages to itself over RPC.
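
A minimal sketch of the two orderings, using a hypothetical ToyService rather than nova's real Service class. It only assumes that a message cast before the RPC server is started is lost; whether the reordering is safe in nova itself is not established here.

```python
class ToyService:
    """Hypothetical stand-in for a service that casts to itself at startup."""

    def __init__(self):
        self.rpc_started = False
        self.delivered = []
        self.lost = []

    def start_rpc_server(self):
        self.rpc_started = True

    def cast_to_self(self, method):
        # Assumption: casts made before the server is up are lost.
        if self.rpc_started:
            self.delivered.append(method)
        else:
            self.lost.append(method)

    def init_host(self):
        # init_host asks the service to stop a crashed instance via RPC.
        self.cast_to_self("stop_instance")

    def start(self, rpc_first):
        if rpc_first:
            # Proposed order: RPC server first, then init_host.
            self.start_rpc_server()
            self.init_host()
        else:
            # Order described in the bug: init_host first.
            self.init_host()
            self.start_rpc_server()
        return self


current = ToyService().start(rpc_first=False)
proposed = ToyService().start(rpc_first=True)
```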

Eyal Posener (eyal-6) on 2016-06-16

Changed in nova:
assignee:	nobody → Eyal Posener (eyal-6)
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-06-16: Fix proposed to nova (master)

#2

Fix proposed to branch: master
Review: https://review.openstack.org/330556

OpenStack Infra (hudson-openstack) wrote on 2016-06-27: Change abandoned on nova (master)

#3

Change abandoned by Eyal Posener (<email address hidden>) on branch: master
Review: https://review.openstack.org/330556
Reason: Found a bug in the patch.
Working on a new patch.

OpenStack Infra (hudson-openstack) wrote on 2016-06-27: Fix proposed to nova (master)

#4

Fix proposed to branch: master
Review: https://review.openstack.org/334566

OpenStack Infra (hudson-openstack) wrote on 2016-12-09: Change abandoned on nova (master)

#5

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/334566
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Sean Dague (sdague) wrote on 2017-06-23:

#6

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status:	In Progress → New
assignee:	Eyal Posener (eyal-6) → nobody

Sean Dague (sdague) wrote on 2017-06-28:

#7

is this still an issue in master?

Changed in nova:
status:	New → Incomplete

Volodymyr Pushkar (vpushkar) wrote on 2017-07-24:

#8

I have the same issue with the latest Ocata (freshly updated).

Erik McCormick (emccormickva) wrote on 2017-08-06:

#9

Just had this happen to a client whose building kindly switched off the power for several hours with no warning. All instances in Active / powering-off state. nova reset-state --active takes care of it, but it's super frustrating to have to do that in bulk.
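
For the bulk case, a small wrapper around the same command can help. This is a sketch under assumptions: the server IDs are placeholders that you would collect yourself, the nova client is assumed to be on PATH with credentials already loaded, and dry_run defaults to True so nothing is changed until the printed commands have been reviewed.

```python
import subprocess
from typing import Iterable, List


def build_reset_commands(server_ids: Iterable[str]) -> List[List[str]]:
    # One "nova reset-state --active <server>" invocation per instance.
    return [["nova", "reset-state", "--active", sid] for sid in server_ids]


def reset_all(server_ids: Iterable[str], dry_run: bool = True) -> List[List[str]]:
    commands = build_reset_commands(server_ids)
    for cmd in commands:
        if dry_run:
            # Only show what would be run.
            print(" ".join(cmd))
        else:
            # Raises CalledProcessError if the client exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


# Placeholder IDs for illustration only.
planned = reset_all(["server-id-1", "server-id-2"])
```

Keep in mind that reset-state only rewrites the state recorded for the instance; it does not check what the hypervisor is actually doing.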

de1m (wlkalexander) wrote on 2017-08-17:

#10

Yep, I have the same issue in Newton.
I restarted the nova-compute server and one VM is stuck in the poweroff state.

Chenjun Shen (cshen) wrote on 2017-08-22:

#11

me too, same issue in newton.

one vm has task_state in powering-off.

Launchpad Janitor (janitor) wrote on 2017-10-22:

#12

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status:	Incomplete → Expired

vijay (vijayforos) wrote on 2017-12-18:

#13

I have the same issue on Ocata. There doesn't seem to be a way to recover from it without making DB changes.

