Nova instance stuck in powering-off when rebooting all nodes in cluster
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Expired | Undecided | Unassigned |
Bug Description
After rebooting all nodes in the cluster, all the instances that were running on the cluster are stuck with Status: ACTIVE, Task state: powering-off, Power state: Crashed.
From the log it looks like, during nova-compute service start, messages sent from the init_host method vanish, because the RPC server is only started afterwards.
The manager.init_host method sees an instance with vm_state == vm_states.ACTIVE and vm_power_state in (power_
It then calls the api.stop method, which invokes the api.force_stop method, and I see the following log message: "Going to try to stop instance force_stop". This method invokes a stop_instance method through RPC. But the RPC message never reaches the RPC server, which is started only after init_host is called in the service.start method.
Since I am using RabbitMQ, the message queues are not yet initiated after rebooting the cluster of nodes, so the call never reaches its destination.
Afterwards, the _sync_instance_
Nova version is 12.0.0.
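A minimal simulation of the ordering bug described above (this is illustrative code, not actual nova or RabbitMQ code; the Broker and ComputeService classes and the queue name are assumptions). Publishing to a RabbitMQ queue that has not been declared yet silently drops the message, which is exactly what happens to the RPC cast made from init_host before the RPC server starts:

```python
class Broker:
    """Toy stand-in for RabbitMQ: messages to undeclared queues are dropped."""
    def __init__(self):
        self.queues = {}

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def publish(self, queue, msg):
        # If the queue exists, the message is kept; otherwise it is
        # silently dropped, like an unroutable message with no queue bound.
        if queue in self.queues:
            self.queues[queue].append(msg)


class ComputeService:
    """Hypothetical sketch of the two start-up steps in question."""
    def __init__(self, broker):
        self.broker = broker

    def init_host(self):
        # init_host ends up casting an RPC (stop_instance) to this same host.
        self.broker.publish('compute.node1', 'stop_instance')

    def start_rpc_server(self):
        # Only now is the queue declared and consumed.
        self.broker.declare_queue('compute.node1')


# Buggy order (what the report describes): init_host first, RPC server after.
buggy = Broker()
svc = ComputeService(buggy)
svc.init_host()           # published before the queue exists -> dropped
svc.start_rpc_server()
print(buggy.queues['compute.node1'])   # [] -> stop_instance never arrives

# Reversed order: declare/consume the queue first, then init_host.
fixed = Broker()
svc2 = ComputeService(fixed)
svc2.start_rpc_server()
svc2.init_host()
print(fixed.queues['compute.node1'])   # ['stop_instance'] -> delivered
```

With the buggy ordering the stop_instance cast vanishes, so the instance stays in powering-off forever; with the reversed ordering the same cast is delivered.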
Changed in nova:
assignee: nobody → Eyal Posener (eyal-6)
status: New → In Progress
So it seems the fix is to switch the order of these:
https://github.com/openstack/nova/blob/master/nova/service.py#L117
https://github.com/openstack/nova/blob/master/nova/service.py#L153
That is, to initialize the RPC server before init_host, since init_host sends RPC messages to itself.
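A minimal sketch of the proposed reorder (assumed structure, not the actual nova/service.py code; the Service class, queue, and method names here are illustrative). The point is simply that the consumer must be running before init_host casts to it:

```python
import queue
import threading


class Service:
    """Hypothetical service whose start() applies the proposed ordering."""
    def __init__(self):
        self.rpc_queue = queue.Queue()
        self.handled = []

    def _rpc_server_loop(self):
        # Consume RPC messages until a None sentinel arrives.
        while True:
            method = self.rpc_queue.get()
            if method is None:
                break
            self.handled.append(method)   # e.g. dispatch to stop_instance

    def init_host(self):
        # init_host casts an RPC to this very host
        # (api.stop -> api.force_stop -> stop_instance).
        self.rpc_queue.put('stop_instance')

    def start(self):
        # Proposed fix: start the RPC server first, then call init_host,
        # so self-directed messages have a live consumer.
        server = threading.Thread(target=self._rpc_server_loop, daemon=True)
        server.start()
        self.init_host()
        self.rpc_queue.put(None)   # shut the toy consumer down cleanly
        server.join()


svc = Service()
svc.start()
print(svc.handled)   # ['stop_instance'] -> the stop request is processed
```

With the consumer running first, the stop_instance cast made during init_host is handled instead of being lost, which is what the suggested reorder of service.py achieves.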