OpenStack Compute (nova)

Instance sometimes do not fully terminate, causes crash

Bug #956719 reported by Anthony Young on 2012-03-16

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
OpenStack Compute (nova)	Fix Released	High	Anthony Young	OpenStack Compute (nova) 2012.1 "essex"

Bug Description

I can reproduce a situation where instances do not fully terminate. In this situation, instance backing files are deleted, but the domain still exists in libvirt. Because of bug https://bugs.launchpad.net/bugs/955788, nova-compute then crashes on startup.

Steps to reproduce:

> (run devstack)
> cd exercises
> # the following script launches 2 instances in quick succession, and then terminates them after 10 seconds
> curl https://raw.github.com/gist/2048763/898a7dc2348bf994eb4f4a93c299f1096522a824/gistfile1.txt > test.sh
> chmod 755 test.sh

And then:

> ./test.sh

Repeat this command 4-8 times. Then:

> sudo virsh list

Expected:

No domains

Actual:

$ sudo virsh list
Id Name State
----------------------------------
19 instance-00000002 running
29 instance-0000000c running

Then, nova-compute starts spitting this error:

2012-03-15 23:10:09 ERROR nova.manager [-] Error during ComputeManager.update_available_resource: Unexpected error while running command.
Command: qemu-img info /opt/stack/nova/instances/instance-00000002/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-img: Could not open '/opt/stack/nova/instances/instance-00000002/disk': No such file or directory\n"
(nova.manager): TRACE: Traceback (most recent call last):
(nova.manager): TRACE: File "/opt/stack/nova/nova/manager.py", line 155, in periodic_tasks
(nova.manager): TRACE: task(self, context)
(nova.manager): TRACE: File "/opt/stack/nova/nova/compute/manager.py", line 2386, in update_available_resource
(nova.manager): TRACE: self.driver.update_available_resource(context, self.host)
(nova.manager): TRACE: File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 1805, in update_available_resource
(nova.manager): TRACE: 'disk_available_least': self.get_disk_available_least()}
(nova.manager): TRACE: File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 2156, in get_disk_available_least
(nova.manager): TRACE: disk_infos = utils.loads(self.get_instance_disk_info(i_name))
(nova.manager): TRACE: File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 2115, in get_instance_disk_info
(nova.manager): TRACE: out, err = utils.execute('qemu-img', 'info', path)
(nova.manager): TRACE: File "/opt/stack/nova/nova/utils.py", line 240, in execute
(nova.manager): TRACE: cmd=' '.join(cmd))
(nova.manager): TRACE: ProcessExecutionError: Unexpected error while running command.
(nova.manager): TRACE: Command: qemu-img info /opt/stack/nova/instances/instance-00000002/disk
(nova.manager): TRACE: Exit code: 1
(nova.manager): TRACE: Stdout: ''
(nova.manager): TRACE: Stderr: "qemu-img: Could not open '/opt/stack/nova/instances/instance-00000002/disk': No such file or directory\n"
(nova.manager): TRACE:

On the next restart, the nova-compute process crashes.
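
To check whether a host is in this state without waiting for the periodic task to fail, one could compare the domains libvirt reports against the backing files on disk. The following is only a rough sketch: the helper names `parse_virsh_list` and `domains_missing_disk` are hypothetical, and it assumes the `<instances_dir>/<domain name>/disk` layout seen in the error above.

```python
import os


def parse_virsh_list(output):
    """Return the domain names found in the text printed by `virsh list`."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a numeric domain Id; the header row and
        # the dashed separator line do not.
        if len(fields) >= 3 and fields[0].isdigit():
            names.append(fields[1])
    return names


def domains_missing_disk(names, instances_dir):
    """Return the domains whose <instances_dir>/<name>/disk file is absent.

    The path layout is an assumption taken from the qemu-img error in
    this report.
    """
    return [
        name for name in names
        if not os.path.exists(os.path.join(instances_dir, name, "disk"))
    ]
```

Feeding it the output of `sudo virsh list` and `/opt/stack/nova/instances` would list the domains that libvirt still knows about but whose disk has already been deleted.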

Anthony Young (sleepsonthefloor) on 2012-03-16

summary:

- Instance sometimes do not fully terminated, causes crash
+ Instance sometimes do not fully terminate, causes crash

Anthony Young (sleepsonthefloor) on 2012-03-16

description:

updated

Vish Ishaya (vishvananda) on 2012-03-16

Changed in nova:
milestone:	none → essex-rc1
importance:	Undecided → High
status:	New → Triaged

Pádraig Brady (p-draigbrady) wrote on 2012-03-16:

I'm not sure if https://bugs.launchpad.net/nova/+bug/957110 is exactly the same issue,
but the symptom is the same.

The patch provided there only mitigates the symptom,
allowing the compute service to start (and logging an error about the missing disk file).
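
As an illustration of the "log and continue" mitigation described above (this is not the patch from bug 957110, just a sketch of the pattern), a pass over instance disks can record an unreadable disk and move on instead of letting one missing file abort the whole update. Here `total_disk_size` is a hypothetical helper, and `os.path.getsize` stands in for the real `qemu-img info` call.

```python
import logging
import os

LOG = logging.getLogger(__name__)


def total_disk_size(disk_paths, get_size=os.path.getsize):
    """Sum disk sizes, skipping and logging any disk that cannot be read.

    Returns a (total_bytes, skipped_paths) tuple. get_size is a stand-in
    for whatever call inspects the disk; it must raise OSError on failure.
    """
    total = 0
    skipped = []
    for path in disk_paths:
        try:
            total += get_size(path)
        except OSError as exc:
            # A missing backing file is reported, but it does not stop
            # the remaining disks from being counted.
            LOG.error("Skipping disk %s: %s", path, exc)
            skipped.append(path)
    return total, skipped
```

This keeps the service running, but the stale libvirt domain is still there, which is why it only treats the symptom.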

Anthony Young (sleepsonthefloor) on 2012-03-16

Changed in nova:
assignee:	nobody → Anthony Young (sleepsonthefloor)

Vish Ishaya (vishvananda) wrote on 2012-03-16:

Anthony has a fix for this which involves synchronizing launch/terminate/start/stop.

He's just putting in some tests and we should see a review soon.

Changed in nova:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2012-03-16: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/5476

OpenStack Infra (hudson-openstack) wrote on 2012-03-17: Fix merged to nova (master)

Reviewed: https://review.openstack.org/5476
Committed: http://github.com/openstack/nova/commit/fe7055a5bd25bef33fe10f4fee858ad8cd30a6ea
Submitter: Jenkins
Branch: master

commit fe7055a5bd25bef33fe10f4fee858ad8cd30a6ea
Author: Anthony Young <email address hidden>
Date: Fri Mar 16 16:51:03 2012 -0700

Fix run/terminate race conditions.

     * synchronize run,terminate,stop,start on instance_uuid
     * don't surpress error when unfiltering instance, which
       can result in a zombified instance.
     * Fixes bug 956719
     * Remove debug raise

Change-Id: I8b2eaffdabfd5c1a9414adb1b5ed11e4c48711fc

Changed in nova:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2012-03-20

Changed in nova:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2012-04-05

Changed in nova:
milestone:	essex-rc1 → 2012.1
