unshelved offloaded instance is unexpectedly stopped during ceph job
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Triaged | Low | Unassigned |
Bug Description
Noticed this during a ceph job tempest run:
2016-11-30 10:23:30.682154 | tempest.
2016-11-30 10:23:30.682250 | -------
2016-11-30 10:23:30.682275 |
2016-11-30 10:23:30.682306 | Captured traceback-1:
2016-11-30 10:23:30.682348 | ~~~~~~~
2016-11-30 10:23:30.682397 | Traceback (most recent call last):
2016-11-30 10:23:30.682467 | File "tempest/
2016-11-30 10:23:30.682522 | self.server_
2016-11-30 10:23:30.682580 | File "tempest/
2016-11-30 10:23:30.682615 | cls.server_id, 'ACTIVE')
2016-11-30 10:23:30.682697 | File "tempest/
2016-11-30 10:23:30.682751 | raise lib_exc.
2016-11-30 10:23:30.682802 | tempest.
2016-11-30 10:23:30.682941 | Details: (ServerActionsT
Tracing the lifecycle of the instance through the n-cpu logs, there is a race:
1. shelve
2016-11-30 09:51:28.926 11001 INFO nova.compute.
2. stopped
2016-11-30 09:51:43.938 11001 INFO nova.virt.
3. unshelve
2016-11-30 09:51:53.803 11001 INFO nova.compute.
4. stop
2016-11-30 09:51:58.921 11001 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1480499503.92, 409d71a2-
2016-11-30 09:51:58.921 11001 INFO nova.compute.
^ is 15 seconds after the instance was stopped during shelve; the delay comes from:
https:/
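For context, here is a minimal sketch of how a driver can delay emitting STOPPED lifecycle events so that transient shutdowns (e.g. during a reboot) are not reported. The 15-second delay matches the gap in the logs above, but the names and mechanism here are illustrative, not nova's actual code:

```python
# Illustrative sketch of delayed lifecycle event emission (not nova's
# actual implementation): STOPPED events are held back for a grace
# period, which is where the 15-second gap in the logs comes from.

import threading

LIFECYCLE_DELAY = 15.0  # seconds; matches the delay seen above

def emit_delayed(event, callback):
    """Deliver 'stopped' events after a delay, everything else at once."""
    if event == 'stopped':
        # Hold the event back; if the domain comes back up quickly the
        # pending timer could be cancelled instead of ever firing.
        timer = threading.Timer(LIFECYCLE_DELAY, callback, args=(event,))
        timer.daemon = True
        timer.start()
        return timer
    callback(event)
    return None
```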
This is probably only a problem in the ceph job because it uses the rbd shallow clone feature, which makes the snapshot taken during shelve offload super fast. It is so fast that by the time the delayed lifecycle event from the shelve/stop in #2 is processed, we have already unshelved the instance and it is ACTIVE with no task_state set, so the compute manager does not think any task is in progress and stops the instance because the libvirt driver says it should be stopped.
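To make the race concrete, here is a self-contained sketch; the handler is modeled loosely on the compute manager's lifecycle event handling, but the class and method bodies are illustrative, not nova's actual code:

```python
# Sketch of the race (illustrative, not nova's code): a delayed STOPPED
# event lands after unshelve has made the instance ACTIVE with no
# task_state, and the power state sync stops it again.

import threading
import time

EVENT_STOPPED = 'stopped'

class FakeComputeManager:
    def __init__(self):
        self.vm_state = 'stopped'     # stopped during shelve (#2 above)
        self.task_state = 'shelving'  # a task is still in flight

    def handle_lifecycle_event(self, event):
        if self.task_state is not None:
            # An operation is in progress; trust the task, not the event.
            return
        if event == EVENT_STOPPED and self.vm_state == 'active':
            # No task in flight and the driver says the domain is down,
            # so the manager "corrects" the instance by stopping it.
            print('stopping an instance the user just unshelved!')
            self.vm_state = 'stopped'

manager = FakeComputeManager()

def deliver_stale_event():
    time.sleep(0.1)  # stands in for the 15 second emission delay
    manager.handle_lifecycle_event(EVENT_STOPPED)

t = threading.Thread(target=deliver_stale_event)
t.start()

# The rbd shallow clone makes the shelve snapshot so fast that
# unshelve completes before the delayed event is processed:
manager.task_state = None
manager.vm_state = 'active'
t.join()  # the stale STOPPED event now stops an ACTIVE instance
```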
The issue here is that a user has taken an overt action (unshelve) against a VM that will cause its state to change from STOPPED while there is still a pending STOPPED event in the queue. In this particular case, unshelve eventually calls driver.spawn, which will attempt to change the state of the VM.
I believe the proper fix is to have the driver.spawn code first clear all pending STOP events from the queue before it attempts to spawn the instance. This ensures that stale events aren't accidentally processed after we have already taken an overt action to change the state; see the sketch below.
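A minimal sketch of that idea, assuming a hypothetical pending-event queue keyed by instance UUID (`DelayedEventQueue` and `clear_pending_stop_events` are made-up names for illustration, not nova's real API):

```python
# Sketch of the proposed fix (hypothetical names, not nova's API):
# before spawning an instance, drop any STOP events still queued for
# it, so a stale event can't power it off after the spawn.

import threading
from collections import defaultdict

EVENT_STOPPED = 'stopped'

class DelayedEventQueue:
    """Pending lifecycle events, keyed by instance UUID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = defaultdict(list)

    def queue(self, uuid, event):
        with self._lock:
            self._pending[uuid].append(event)

    def clear_pending_stop_events(self, uuid):
        # The fix: discard stale STOP events before changing state.
        with self._lock:
            before = self._pending[uuid]
            kept = [e for e in before if e != EVENT_STOPPED]
            self._pending[uuid] = kept
            return len(before) - len(kept)

events = DelayedEventQueue()

def spawn(uuid):
    # Unshelve path: clear leftovers from the earlier shelve/stop
    # first, then actually create and boot the domain.
    dropped = events.clear_pending_stop_events(uuid)
    if dropped:
        print('dropped %d stale STOP event(s) before spawn' % dropped)
    # ... driver.spawn work would happen here ...

events.queue('409d71a2', EVENT_STOPPED)  # left over from shelve
spawn('409d71a2')
```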