Instance sometimes do not fully terminate, causes crash

Bug #956719 reported by Anthony Young
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Anthony Young

Bug Description

I can reproduce a situation where instances do not fully terminate. In this situation, instance backing files are deleted, but the domain still exists in libvirt. Then, due to this bug: https://bugs.launchpad.net/bugs/955788 nova-compute will crash on startup.

Steps to reproduce:

> (run devstack)
> cd exercises
> # the following script launches 2 instances in quick succession, and then terminates them after 10 seconds
> curl https://raw.github.com/gist/2048763/898a7dc2348bf994eb4f4a93c299f1096522a824/gistfile1.txt > test.sh
> chmod 755 test.sh

And then:

> ./test.sh

Repeat this command 4-8 times. Then:

> sudo virsh list

Expected:

No domains

Actual:

$ sudo virsh list
 Id Name                 State
----------------------------------
 19 instance-00000002    running
 29 instance-0000000c    running

Then, nova-compute starts spitting this error:

2012-03-15 23:10:09 ERROR nova.manager [-] Error during ComputeManager.update_available_resource: Unexpected error while running command.
Command: qemu-img info /opt/stack/nova/instances/instance-00000002/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-img: Could not open '/opt/stack/nova/instances/instance-00000002/disk': No such file or directory\n"
(nova.manager): TRACE: Traceback (most recent call last):
(nova.manager): TRACE:   File "/opt/stack/nova/nova/manager.py", line 155, in periodic_tasks
(nova.manager): TRACE:     task(self, context)
(nova.manager): TRACE:   File "/opt/stack/nova/nova/compute/manager.py", line 2386, in update_available_resource
(nova.manager): TRACE:     self.driver.update_available_resource(context, self.host)
(nova.manager): TRACE:   File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 1805, in update_available_resource
(nova.manager): TRACE:     'disk_available_least': self.get_disk_available_least()}
(nova.manager): TRACE:   File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 2156, in get_disk_available_least
(nova.manager): TRACE:     disk_infos = utils.loads(self.get_instance_disk_info(i_name))
(nova.manager): TRACE:   File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 2115, in get_instance_disk_info
(nova.manager): TRACE:     out, err = utils.execute('qemu-img', 'info', path)
(nova.manager): TRACE:   File "/opt/stack/nova/nova/utils.py", line 240, in execute
(nova.manager): TRACE:     cmd=' '.join(cmd))
(nova.manager): TRACE: ProcessExecutionError: Unexpected error while running command.
(nova.manager): TRACE: Command: qemu-img info /opt/stack/nova/instances/instance-00000002/disk
(nova.manager): TRACE: Exit code: 1
(nova.manager): TRACE: Stdout: ''
(nova.manager): TRACE: Stderr: "qemu-img: Could not open '/opt/stack/nova/instances/instance-00000002/disk': No such file or directory\n"
(nova.manager): TRACE:

And upon the next restart, the nova-compute process crashes.
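The inconsistent state described above (a domain still known to libvirt while its backing file is gone) could be checked for with something like the following. This is only a rough sketch: the function name `find_orphaned_domains` is hypothetical, the domain names are assumed to be supplied by hand (for example copied from the `sudo virsh list` output), and the `<instances_dir>/<domain>/disk` layout is assumed from the path shown in the traceback.

```python
import os


def find_orphaned_domains(domain_names, instances_dir):
    """Return the domain names whose backing disk file is missing.

    Assumes each instance keeps its disk at <instances_dir>/<name>/disk,
    as in the path from the traceback. Hypothetical helper, not nova code.
    """
    orphaned = []
    for name in domain_names:
        disk_path = os.path.join(instances_dir, name, "disk")
        if not os.path.isfile(disk_path):
            orphaned.append(name)
    return orphaned
```

For the case in this report, it would be called as `find_orphaned_domains(["instance-00000002", "instance-0000000c"], "/opt/stack/nova/instances")`.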

summary: - Instance sometimes do not fully terminated, causes crash
+ Instance sometimes do not fully terminate, causes crash
description: updated
Changed in nova:
milestone: none → essex-rc1
importance: Undecided → High
status: New → Triaged
Pádraig Brady (p-draigbrady) wrote :

I'm not sure if https://bugs.launchpad.net/nova/+bug/957110 is exactly the same issue,
but the symptom is the same.

The patch provided there only mitigates the symptom,
allowing the compute service to start (and logging an error about the missing disk file).
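To illustrate what "mitigating the symptom" could look like, here is a minimal sketch of a disk scan that logs a missing file and carries on instead of raising. It is not the patch from bug 957110: the function name `total_disk_size` is hypothetical, and it uses `os.path.getsize` as a stand-in for the `qemu-img info` call seen in the traceback.

```python
import logging
import os

LOG = logging.getLogger(__name__)


def total_disk_size(disk_paths):
    """Sum the sizes of the disks that exist; log and skip the rest.

    Hypothetical helper: a missing file produces an error log entry
    rather than an exception that would abort the caller.
    """
    total = 0
    for path in disk_paths:
        try:
            total += os.path.getsize(path)
        except OSError as exc:
            LOG.error("Skipping missing or unreadable disk %s: %s", path, exc)
    return total
```

As the comment above says, an approach of this kind only hides the crash; the domain without a disk is still left behind.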

Changed in nova:
assignee: nobody → Anthony Young (sleepsonthefloor)
Vish Ishaya (vishvananda) wrote :

Anthony has a fix for this, which involves synchronizing launch/terminate/start/stop.

He's just putting in some tests and we should see a review soon.

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/5476

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/5476
Committed: http://github.com/openstack/nova/commit/fe7055a5bd25bef33fe10f4fee858ad8cd30a6ea
Submitter: Jenkins
Branch: master

commit fe7055a5bd25bef33fe10f4fee858ad8cd30a6ea
Author: Anthony Young <email address hidden>
Date: Fri Mar 16 16:51:03 2012 -0700

    Fix run/terminate race conditions.

     * synchronize run,terminate,stop,start on instance_uuid
     * don't surpress error when unfiltering instance, which
       can result in a zombified instance.
     * Fixes bug 956719
     * Remove debug raise

    Change-Id: I8b2eaffdabfd5c1a9414adb1b5ed11e4c48711fc
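The first bullet of the commit message, serializing run, terminate, stop and start per instance_uuid, can be illustrated with a per-key lock. The sketch below is not the code from the commit: the decorator `synchronized_on_instance`, the `events` list and the two example operations are hypothetical names, and the sleeps merely stand in for slow work.

```python
import functools
import threading
import time
from collections import defaultdict

# One lock per instance UUID; the guard protects the lock registry itself.
_registry_guard = threading.Lock()
_instance_locks = defaultdict(threading.Lock)


def synchronized_on_instance(func):
    """Serialize calls sharing the same instance_uuid (first argument)."""
    @functools.wraps(func)
    def wrapper(instance_uuid, *args, **kwargs):
        with _registry_guard:
            lock = _instance_locks[instance_uuid]
        with lock:
            return func(instance_uuid, *args, **kwargs)
    return wrapper


events = []  # records the order in which operations begin and end


@synchronized_on_instance
def run_instance(instance_uuid):
    events.append(("run", "begin", instance_uuid))
    time.sleep(0.05)  # stand-in for spawning the domain
    events.append(("run", "end", instance_uuid))


@synchronized_on_instance
def terminate_instance(instance_uuid):
    events.append(("terminate", "begin", instance_uuid))
    time.sleep(0.05)  # stand-in for destroying the domain and its files
    events.append(("terminate", "end", instance_uuid))
```

With this in place, a terminate that arrives while a run for the same UUID is still in progress waits for the run to finish, while operations on different UUIDs do not block each other.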

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: essex-rc1 → 2012.1