unshelved offloaded instance is unexpectedly stopped during ceph job
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Triaged | Low | Unassigned |
Bug Description
Noticed this during a ceph job tempest run:
2016-11-30 10:23:30.682154 | tempest.
2016-11-30 10:23:30.682250 | -------
2016-11-30 10:23:30.682275 |
2016-11-30 10:23:30.682306 | Captured traceback-1:
2016-11-30 10:23:30.682348 | ~~~~~~~
2016-11-30 10:23:30.682397 | Traceback (most recent call last):
2016-11-30 10:23:30.682467 | File "tempest/
2016-11-30 10:23:30.682522 | self.server_
2016-11-30 10:23:30.682580 | File "tempest/
2016-11-30 10:23:30.682615 | cls.server_id, 'ACTIVE')
2016-11-30 10:23:30.682697 | File "tempest/
2016-11-30 10:23:30.682751 | raise lib_exc.
2016-11-30 10:23:30.682802 | tempest.
2016-11-30 10:23:30.682941 | Details: (ServerActionsT
Tracing the lifecycle of the instance through the n-cpu logs, there is a race:
1. shelve
2016-11-30 09:51:28.926 11001 INFO nova.compute.
2. stopped
2016-11-30 09:51:43.938 11001 INFO nova.virt.
3. unshelve
2016-11-30 09:51:53.803 11001 INFO nova.compute.
4. stop
2016-11-30 09:51:58.921 11001 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1480499503.92, 409d71a2-
2016-11-30 09:51:58.921 11001 INFO nova.compute.
^ is 15 seconds after the instance was stopped during shelve; the delay comes from:
https:/
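For context, here is a minimal sketch of how a driver can delay emitting STOPPED lifecycle events so that transient shutdowns (e.g. during a reboot) are not reported. The 15-second delay matches the gap in the logs above, but the names and mechanism here are illustrative, not nova's actual code:

```python
# Illustrative sketch of delayed lifecycle event emission (not nova's
# actual implementation): STOPPED events are held back for a grace
# period, which is where the 15-second gap in the logs comes from.

import threading

LIFECYCLE_DELAY = 15.0  # seconds; matches the delay seen above

def emit_delayed(event, callback):
    """Deliver 'stopped' events after a delay, everything else at once."""
    if event == 'stopped':
        # Hold the event back; if the domain comes back up quickly the
        # pending timer could be cancelled instead of ever firing.
        timer = threading.Timer(LIFECYCLE_DELAY, callback, args=(event,))
        timer.daemon = True
        timer.start()
        return timer
    callback(event)
    return None
```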
This is probably only a problem in the ceph job because it uses the rbd shallow clone feature, which makes the snapshot taken during shelve offload super fast. It is so fast that by the time the delayed lifecycle event from the shelve/stop in #2 is processed, we have already unshelved the instance and it is ACTIVE with no task_state set, so the compute manager does not think any task is in progress and stops the instance because the libvirt driver says it should be stopped.
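To make the race concrete, here is a self-contained sketch; the handler is modeled loosely on the compute manager's lifecycle event handling, but the class and method bodies are illustrative, not nova's actual code:

```python
# Sketch of the race (illustrative, not nova's code): a delayed STOPPED
# event lands after unshelve has made the instance ACTIVE with no
# task_state, and the power state sync stops it again.

import threading
import time

EVENT_STOPPED = 'stopped'

class FakeComputeManager:
    def __init__(self):
        self.vm_state = 'stopped'     # stopped during shelve (#2 above)
        self.task_state = 'shelving'  # a task is still in flight

    def handle_lifecycle_event(self, event):
        if self.task_state is not None:
            # An operation is in progress; trust the task, not the event.
            return
        if event == EVENT_STOPPED and self.vm_state == 'active':
            # No task in flight and the driver says the domain is down,
            # so the manager "corrects" the instance by stopping it.
            print('stopping an instance the user just unshelved!')
            self.vm_state = 'stopped'

manager = FakeComputeManager()

def deliver_stale_event():
    time.sleep(0.1)  # stands in for the 15 second emission delay
    manager.handle_lifecycle_event(EVENT_STOPPED)

t = threading.Thread(target=deliver_stale_event)
t.start()

# The rbd shallow clone makes the shelve snapshot so fast that
# unshelve completes before the delayed event is processed:
manager.task_state = None
manager.vm_state = 'active'
t.join()  # the stale STOPPED event now stops an ACTIVE instance
```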
The issue here is that a user has taken an overt action (unshelve) against a VM that will cause its state to change from STOPPED while there is still a pending STOPPED event in the queue. In this particular case, unshelve eventually calls driver.spawn, which will attempt to change the state of the VM.
I believe the proper fix is to have the driver.spawn code first clear all pending STOP events from the queue before it attempts to spawn the instance. This ensures that stale events aren't accidentally processed after we have already taken an overt action to change the state; see the sketch below.
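A minimal sketch of that idea, assuming a hypothetical pending-event queue keyed by instance UUID (`DelayedEventQueue` and `clear_pending_stop_events` are made-up names for illustration, not nova's real API):

```python
# Sketch of the proposed fix (hypothetical names, not nova's API):
# before spawning an instance, drop any STOP events still queued for
# it, so a stale event can't power it off after the spawn.

import threading
from collections import defaultdict

EVENT_STOPPED = 'stopped'

class DelayedEventQueue:
    """Pending lifecycle events, keyed by instance UUID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = defaultdict(list)

    def queue(self, uuid, event):
        with self._lock:
            self._pending[uuid].append(event)

    def clear_pending_stop_events(self, uuid):
        # The fix: discard stale STOP events before changing state.
        with self._lock:
            before = self._pending[uuid]
            kept = [e for e in before if e != EVENT_STOPPED]
            self._pending[uuid] = kept
            return len(before) - len(kept)

events = DelayedEventQueue()

def spawn(uuid):
    # Unshelve path: clear leftovers from the earlier shelve/stop
    # first, then actually create and boot the domain.
    dropped = events.clear_pending_stop_events(uuid)
    if dropped:
        print('dropped %d stale STOP event(s) before spawn' % dropped)
    # ... driver.spawn work would happen here ...

events.queue('409d71a2', EVENT_STOPPED)  # left over from shelve
spawn('409d71a2')
```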