Nova instance stuck in powering-off when rebooting all nodes in cluster

Bug #1593186 reported by Eyal Posener
This bug affects 6 people
Affects                    Status    Importance   Assigned to   Milestone
OpenStack Compute (nova)   Expired   Undecided    Unassigned

Bug Description

After rebooting all nodes in the cluster, all the instances that were running on the cluster are stuck in Status: ACTIVE, Task state: powering-off, Power state: Crashed.
From the log, it looks like messages sent from the init_host method during nova-compute service start vanish, because the RPC server is started only afterwards.

The manager.init_host method sees an instance with vm_state == vm_states.ACTIVE and vm_power_state in (power_state.SHUTDOWN, power_state.CRASHED). I get the log message "Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 6".
Then it calls the api.stop method, which invokes the api.force_stop method, and I see the following log message: "Going to try to stop instance force_stop". This method invokes a stop_instance method through RPC. But the RPC message never reaches the RPC server, which is started only after init_host is called in the service.start method.
Since I am using rabbitmq, the message queues are not yet initialized after rebooting the cluster of nodes, and the call never gets to its destination.

Afterwards, _sync_instance_power_state sees the powering-off task state and never cleans up the instance state. I get the log message: "During sync_power_state the instance has a pending task (powering-off). Skip."
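
The skip behaviour can be illustrated with a simplified model. This is a sketch only, not nova's code: `Instance` and `sync_instance_power_state` are hypothetical stand-ins, and the power-state value 6 (crashed) is taken from the log message quoted in this report.

```python
from dataclasses import dataclass
from typing import Optional

# Value as it appears in the quoted log line: current VM power_state 6.
CRASHED = 6


@dataclass
class Instance:
    vm_state: str
    task_state: Optional[str]
    power_state: int


def sync_instance_power_state(instance: Instance, hypervisor_power_state: int) -> str:
    """Hypothetical, simplified stand-in for the periodic power-state sync."""
    if instance.task_state is not None:
        # A pending task is assumed to be in flight, so the record is left alone.
        return "skip"
    instance.power_state = hypervisor_power_state
    return "synced"


# State reported in this bug: ACTIVE / powering-off / Crashed.
stuck = Instance(vm_state="active", task_state="powering-off", power_state=CRASHED)

# Nothing clears task_state because the stop request was lost, so every
# periodic run takes the same branch.
results = [sync_instance_power_state(stuck, CRASHED) for _ in range(3)]
print(results, stuck.task_state)
```

In this model the only way out of the loop is for something else to clear task_state, which is what the lost stop_instance call would have done.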

Nova version is 12.0.0.

Revision history for this message
Shoham Peller (shoham-peller) wrote :

So it seems the fix is to switch the order of these:
https://github.com/openstack/nova/blob/master/nova/service.py#L117
https://github.com/openstack/nova/blob/master/nova/service.py#L153

That is, init the RPC server before init_host, which sends messages to itself over RPC.
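
A toy model of the two orderings may make the idea easier to follow. Everything here is an assumption for illustration: `ToyBroker` and `ToyComputeService` are hypothetical, the broker simply drops messages sent to a queue that has not been declared yet, and none of this is nova or oslo.messaging code. It says nothing about whether the reordering is safe in the real service.

```python
class ToyBroker:
    """Hypothetical in-memory broker: messages for an undeclared queue are dropped."""

    def __init__(self):
        self.queues = {}

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def publish(self, queue, message):
        if queue not in self.queues:
            return False  # no such queue yet: the message is silently lost
        self.queues[queue].append(message)
        return True


class ToyComputeService:
    def __init__(self, broker, host, rpc_first):
        self.broker = broker
        self.topic = "compute." + host
        self.rpc_first = rpc_first
        self.task_state = None

    def init_host(self):
        # Mirrors the report: a crashed ACTIVE instance triggers a stop request
        # that this host sends to itself over RPC.
        self.task_state = "powering-off"
        self.broker.publish(self.topic, "stop_instance")

    def start_rpc_server(self):
        self.broker.declare_queue(self.topic)

    def drain(self):
        # Handle whatever actually reached the queue.
        queue = self.broker.queues.get(self.topic, [])
        while queue:
            if queue.pop(0) == "stop_instance":
                self.task_state = None  # stop handled, task cleared

    def start(self):
        if self.rpc_first:
            self.start_rpc_server()
            self.init_host()
        else:
            self.init_host()
            self.start_rpc_server()
        self.drain()
        return self.task_state


reported_order = ToyComputeService(ToyBroker(), "node1", rpc_first=False).start()
proposed_order = ToyComputeService(ToyBroker(), "node1", rpc_first=True).start()
print(reported_order, proposed_order)
```

With init_host first, the stop request is published before the queue exists and the task state is never cleared; with the RPC server first, the request is queued and handled.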

Eyal Posener (eyal-6)
Changed in nova:
assignee: nobody → Eyal Posener (eyal-6)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/330556

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Eyal Posener (<email address hidden>) on branch: master
Review: https://review.openstack.org/330556
Reason: Found a bug in the patch.
Working on a new patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/334566

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/334566
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → New
assignee: Eyal Posener (eyal-6) → nobody
Revision history for this message
Sean Dague (sdague) wrote :

is this still an issue in master?

Changed in nova:
status: New → Incomplete
Revision history for this message
Volodymyr Pushkar (vpushkar) wrote :

I have the same issue with the latest Ocata (freshly updated).

Revision history for this message
Erik McCormick (emccormickva) wrote :

Just had this happen to a client whose building kindly switched off the power for several hours with no warning. All instances are in Active / powering-off state. `nova reset-state --active` takes care of it, but it's super frustrating to have to do that in bulk.
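
For the bulk case, a small helper can at least generate the per-instance commands. This is a minimal sketch under stated assumptions: `build_reset_commands` is a hypothetical helper, the input is a hypothetical list of dicts with `id`, `vm_state` and `task_state` keys, and how that list is obtained is left out. The commands are only printed for review, not executed.

```python
def build_reset_commands(instances):
    """Return one `nova reset-state --active` command per stuck instance.

    `instances` is a hypothetical list of dicts; only entries that are
    ACTIVE with a powering-off task state are selected.
    """
    commands = []
    for inst in instances:
        if inst["vm_state"] == "active" and inst["task_state"] == "powering-off":
            commands.append(["nova", "reset-state", "--active", inst["id"]])
    return commands


# Made-up sample data for illustration only.
sample = [
    {"id": "uuid-1", "vm_state": "active", "task_state": "powering-off"},
    {"id": "uuid-2", "vm_state": "active", "task_state": None},
    {"id": "uuid-3", "vm_state": "stopped", "task_state": None},
]

commands = build_reset_commands(sample)
for cmd in commands:
    print(" ".join(cmd))
```

Filtering on both vm_state and task_state keeps the reset away from instances that are not in the stuck combination described in this report.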

Revision history for this message
de1m (wlkalexander) wrote :

Yep, I have the same issue in Newton.
I restarted the nova-compute server and one VM is stuck in poweroff state.

Revision history for this message
Chenjun Shen (cshen) wrote :

me too, same issue in newton.

one vm has task_state in powering-off.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Revision history for this message
vijay (vijayforos) wrote :

I have the same issue on Ocata; there doesn't seem to be a way to recover from it without making DB changes.
