instance unexpected shutdown when source node startup

Bug #2008876 reported by zhouzhong
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Wishlist
Assigned to: zhou zhong
Milestone: (none)

Bug Description

Description
===========
Before the instance could be shut down, the source node went down, so the stop message remained in the node's queue and could not be consumed until the source node was started again. When the source node came back up, the queued stop operation was executed and the instance's vm_state was changed to STOPPED. The power state synchronization periodic task, _sync_power_states, then picked up the mismatch and triggered the shutdown of the still-running instance.
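
For context, the periodic task reconciles the vm_state recorded in the database against the power state reported by the hypervisor. A minimal sketch of the relevant branch (greatly simplified; the real logic lives in nova/compute/manager.py, _sync_power_states / _sync_instance_power_state, and covers many more cases):

    # Simplified illustration of the sync behaviour described above;
    # names and structure are approximations, not Nova's actual code.
    RUNNING, STOPPED = 'running', 'stopped'

    def sync_power_states(instances, hypervisor):
        for instance in instances:
            actual = hypervisor.get_power_state(instance)
            if instance.vm_state == STOPPED and actual == RUNNING:
                # The database says the user asked for a stop, so the
                # still-running guest gets powered off.  This is the
                # path that shuts the instance down once the stale stop
                # message has flipped vm_state to STOPPED.
                hypervisor.power_off(instance)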

Steps to reproduce
==================
1. Create an instance.
2. Bring down the instance's source node, e.g. by shutting it down.
3. Send a request to stop the instance.
4. Start the source node.
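
A rough scripted version of these steps, as a sketch using openstacksdk (the cloud name and image/flavor/network IDs below are placeholders, and step 2 has to be done out of band):

    # Hypothetical reproduction script; identifiers are placeholders.
    import openstack

    conn = openstack.connect(cloud='mycloud')

    # 1. Create an instance.
    server = conn.compute.create_server(
        name='repro-2008876', image_id='IMAGE_ID',
        flavor_id='FLAVOR_ID', networks=[{'uuid': 'NETWORK_ID'}])
    server = conn.compute.wait_for_server(server)

    # 2. Bring down the compute node hosting the server (out of band,
    #    e.g. power the host off); this is not scriptable via the API.

    # 3. Request a stop while the node is down; the RPC message queues.
    conn.compute.stop_server(server)

    # 4. Start the compute node again and watch the guest get powered
    #    off by _sync_power_states shortly afterwards.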

Expected result
===============
The instance is not shut down by _sync_power_states, and its power state remains RUNNING.

Actual result
=============
The instance is unexpectedly shut down when the source node comes back up.

Environment
===========

zhouzhong (zhouzhong)
summary: - instance unexpected shutdown when node startup
+ instance unexpected shutdown when source node startup
Changed in nova:
status: New → In Progress
zhouzhong (zhouzhong)
Changed in nova:
assignee: nobody → zhouzhong (zhouzhong)
Revision history for this message
zhouzhong (zhouzhong) wrote (last edit):

I have added restrictions to all operations that may be affected by the status of the source node. The specific implementation adds the decorator "@check_instance_host(check_is_up=True)" to those operations.

Uploaded patch at https://review.opendev.org/c/openstack/nova/+/875859
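
For illustration, a decorator along those lines could look roughly like the sketch below, following the check_instance_host pattern in nova/compute/api.py; the actual patch may differ in detail:

    # Sketch of a check_instance_host-style decorator; simplified, not
    # the exact patch contents.
    import functools

    from nova import exception
    from nova import objects


    def check_instance_host(check_is_up=False):
        def outer(function):
            @functools.wraps(function)
            def wrapper(self, context, instance, *args, **kwargs):
                if not instance.host:
                    raise exception.InstanceNotReady(
                        instance_id=instance.uuid)
                if check_is_up:
                    # Refuse the operation while the compute service on
                    # the instance's host is not reported as up.
                    service = objects.Service.get_by_compute_host(
                        context, instance.host)
                    if not self.servicegroup_api.service_is_up(service):
                        raise exception.ServiceUnavailable()
                return function(self, context, instance, *args, **kwargs)
            return wrapper
        return outer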

Not only the instance shutdown operation is affected: when the status of the source node hosting an instance is abnormal, other operations on that instance may also raise unexpected exceptions, so those operations are restricted as well.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I'm not actually sure we should fix this.

Really, the operator should purge the RabbitMQ queue as part of recovering from the failed compute.

We could perhaps provide a nova-manage command for this, but that would be a mini feature, not a bug.
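
For reference, the queue for a failed compute node can be purged with rabbitmqctl purge_queue or programmatically; a minimal sketch using kombu, where the broker URL and queue name are assumptions (Nova's per-host compute queue is conventionally named compute.<hostname>):

    # Hypothetical queue purge; broker URL and queue name are
    # placeholders for the real deployment's values.
    from kombu import Connection

    with Connection('amqp://guest:guest@rabbit-host:5672//') as conn:
        channel = conn.channel()
        # Drop any RPC messages that queued up while the node was down.
        purged = channel.queue_purge('compute.failed-host')
        print('purged %d messages' % purged)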

Changed in nova:
importance: Undecided → Low
importance: Low → Wishlist
zhou zhong (zhouzhongg)
Changed in nova:
assignee: zhouzhong (zhouzhong) → zhou zhong (zhouzhongg)
Revision history for this message
Tobias Urdin (tobias-urdin) wrote (last edit):

Thanks for this bug report; it now makes sense why an instance was unexpectedly powered off when we took back a previously evacuated compute node.

I spent about 2 hours walking through the lifecycle events and power state sync code trying to find what caused this exact issue.

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

This still feels like a design flaw; we should really verify that a message is intended for us before acting on it. Couldn't we somehow verify that the message was addressed to the compute node while it had its current service ID/compute node UUID, and drop the message otherwise?
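
One hypothetical shape for such a check: stamp each RPC message with the service UUID the sender believed owned the instance, and have the receiving compute drop anything stamped for a different incarnation. The sketch below is invented for illustration; this stamping is not an existing Nova mechanism:

    # Invented sketch of the suggested origin check.
    import logging

    LOG = logging.getLogger(__name__)


    class ComputeManager:
        def __init__(self, local_service_uuid):
            # UUID of this incarnation of the compute service record.
            self.local_service_uuid = local_service_uuid

        def stop_instance(self, context, instance, sender_service_uuid):
            # Drop messages stamped for a previous incarnation of this
            # compute service (e.g. queued before a failure/evacuation).
            if sender_service_uuid != self.local_service_uuid:
                LOG.warning('Dropping stale stop request for %s',
                            instance.uuid)
                return
            self._power_off(context, instance)

        def _power_off(self, context, instance):
            ...  # actual power-off logic would go here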

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Just doing a Google search on the log messages shows that this has probably been an issue for a very long time; some even recommend disabling power state sync altogether.

I don't have permission to view the Red Hat issue tracker at https://bugzilla.redhat.com/show_bug.cgi?id=2049487; perhaps this issue was uncovered there but not shared publicly.
