Nova instance stuck in powering-off when rebooting all nodes in cluster

Bug #1593186 reported by Eyal Posener on 2016-06-16

This bug report was marked for expiration 0 days ago.

This bug affects 4 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Unassigned

Bug Description

After rebooting all nodes in the cluster, all the instances that were running on the cluster are stuck with Status: ACTIVE, Task state: powering-off, Power state: Crashed.
From the log, it looks like messages sent from the init_host method during nova-compute service start vanish, because the RPC server is started only afterwards.

The manager.init_host method sees an instance with vm_state == vm_states.ACTIVE and vm_power_state in (power_state.SHUTDOWN, power_state.CRASHED). I get the log message "Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 6".
It then calls the api.stop method, which invokes the api.force_stop method, and I see the following log message: "Going to try to stop instance force_stop". This method invokes the stop_instance method through RPC. But the RPC message never reaches the RPC server, which is started only after init_host is called in the service.start method.
Since I am using rabbitmq, the message queues are not yet initialized after the cluster nodes are rebooted, and the call never reaches its destination.

Afterwards, _sync_instance_power_state sees the powering-off task state and never cleans up the instance state. I get the log message: "During sync_power_state the instance has a pending task (powering-off). Skip."
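
To make the sequence above easier to follow, here is a minimal toy model in Python. It is not nova code: the Instance class, init_instance and sync_instance_power_state are hypothetical stand-ins, and only the power-state values 1 and 6 come from the log message quoted above (the SHUTDOWN value is an assumption for illustration).

```python
from dataclasses import dataclass
from typing import Optional

# Power-state codes seen in the log message above
# ("original DB power_state: 1, current VM power_state: 6").
RUNNING = 1
CRASHED = 6
# Assumed value for a cleanly shut-down guest; illustration only.
SHUTDOWN = 4


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record."""
    vm_state: str = "active"
    task_state: Optional[str] = None
    power_state: int = RUNNING


def init_instance(instance, current_power_state, cast_stop):
    """Toy startup check: stop an ACTIVE instance whose guest is down."""
    if instance.vm_state == "active" and current_power_state in (SHUTDOWN, CRASHED):
        # The task state is set before the RPC message is sent.
        instance.task_state = "powering-off"
        cast_stop(instance)  # fire-and-forget, no reply is awaited


def sync_instance_power_state(instance, current_power_state):
    """Toy periodic sync: skips any instance that has a pending task."""
    if instance.task_state is not None:
        return "skip"
    instance.power_state = current_power_state
    return "synced"


# Simulate a cast that goes nowhere: the callback does nothing,
# so nobody ever clears the task state.
inst = Instance()
init_instance(inst, CRASHED, cast_stop=lambda i: None)
result = sync_instance_power_state(inst, CRASHED)
print(inst.task_state, result)
```

In this model the instance keeps task_state "powering-off" indefinitely, because the only code path that would clear it is the stop_instance handler that never receives the message.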

Nova version is 12.0.0.

Shoham Peller (shoham-peller) wrote :

So it seems the fix is to switch the order of these:
https://github.com/openstack/nova/blob/master/nova/service.py#L117
https://github.com/openstack/nova/blob/master/nova/service.py#L153

That is, initialize the RPC server before init_host, which sends messages to itself over RPC.
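
The effect of the ordering can be sketched with a toy broker in Python. Everything here is hypothetical (ToyBroker, ToyService, the topic name), and the broker's behaviour of dropping messages for a topic with no queue is an assumption based on the description above, not a statement about rabbitmq or nova internals. The sketch also does not model any side effects of accepting RPC requests before init_host has finished.

```python
class ToyBroker:
    """Drops messages published to a topic that has no queue yet (assumed behaviour)."""

    def __init__(self):
        self.queues = {}

    def declare_queue(self, topic):
        self.queues.setdefault(topic, [])

    def publish(self, topic, message):
        if topic not in self.queues:
            return False  # nobody is listening, the message is lost
        self.queues[topic].append(message)
        return True


class ToyService:
    """Hypothetical service whose init_host sends a message to its own topic."""

    def __init__(self, broker, topic, rpc_first):
        self.broker = broker
        self.topic = topic
        self.rpc_first = rpc_first
        self.delivered = None

    def init_host(self):
        self.delivered = self.broker.publish(self.topic, "stop_instance")

    def start_rpc_server(self):
        self.broker.declare_queue(self.topic)

    def start(self):
        steps = [self.init_host, self.start_rpc_server]
        if self.rpc_first:
            steps.reverse()  # proposed order: RPC server first
        for step in steps:
            step()
        return self.delivered


# Order described in the bug report: init_host, then the RPC server.
current = ToyService(ToyBroker(), "compute.node1", rpc_first=False).start()
# Order suggested in this comment: RPC server, then init_host.
proposed = ToyService(ToyBroker(), "compute.node1", rpc_first=True).start()
print(current, proposed)
```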

Eyal Posener (eyal-6) on 2016-06-16
Changed in nova:
assignee: nobody → Eyal Posener (eyal-6)
status: New → In Progress

Change abandoned by Eyal Posener on branch: master
Review: https://review.openstack.org/330556
Reason: Found a bug in the patch.
Working on a new patch.

Change abandoned by Sean Dague on branch: master
Review: https://review.openstack.org/334566
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → New
assignee: Eyal Posener (eyal-6) → nobody
Sean Dague (sdague) wrote :

Is this still an issue in master?

Changed in nova:
status: New → Incomplete
Volodymyr Pushkar (vpushkar) wrote :

I have the same issue with the latest Ocata (freshly updated).

Erik McCormick (emccormickva) wrote :

Just had this happen to a client whose building kindly switched off the power for several hours with no warning. All instances were in Active / powering-off state. nova reset-state --active takes care of it, but it's super frustrating to have to do that in bulk.
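
For the bulk case, a small Python wrapper around the same command is one option. This is only a sketch: the helper names and the placeholder IDs are hypothetical, the list of affected servers has to come from somewhere else, and reset-state only changes the recorded state, so the guests still need to be checked afterwards.

```python
import subprocess


def reset_state_commands(server_ids):
    """Build one 'nova reset-state --active <server>' command per server."""
    return [["nova", "reset-state", "--active", sid] for sid in server_ids]


def reset_all(server_ids, run=subprocess.check_call):
    """Run the commands one by one; 'run' can be swapped out for a dry run."""
    for cmd in reset_state_commands(server_ids):
        run(cmd)


# Dry run with placeholder IDs: collect the commands instead of executing them.
issued = []
reset_all(["uuid-1", "uuid-2"], run=issued.append)
for cmd in issued:
    print(" ".join(cmd))
```

Calling reset_all(ids) without the run argument executes the commands for real, so it is worth doing the dry run first.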

de1m (wlkalexander) wrote :

Yep, I have the same issue in Newton.
I restarted the nova-compute server and one VM is stuck in the poweroff state.

Chenjun Shen (cshen) wrote :

Me too, same issue in Newton.

One VM has task_state powering-off.
