OpenStack Compute (Nova)

Instance sometimes do not fully terminate, causes crash

Reported by Anthony Young on 2012-03-16
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Anthony Young

Bug Description

I can reproduce a situation where instances do not fully terminate. In this situation, the instance backing files are deleted, but the domain still exists in libvirt. Then, because of bug https://bugs.launchpad.net/bugs/955788, nova-compute crashes on startup.

Steps to reproduce:

> (run devstack)
> cd exercises
> # the following script launches 2 instances in quick succession, and then terminates them after 10 seconds
> curl https://raw.github.com/gist/2048763/898a7dc2348bf994eb4f4a93c299f1096522a824/gistfile1.txt > test.sh
> chmod 755 test.sh

And then:

> ./test.sh

Repeat this command 4-8 times. Then:

> sudo virsh list

Expected:

No domains

Actual:

$ sudo virsh list
 Id Name State
----------------------------------
 19 instance-00000002 running
 29 instance-0000000c running
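
One way to confirm the inconsistent state is to cross-check the domains that virsh reports against their backing files. The following is only a rough sketch: it assumes the devstack instances directory /opt/stack/nova/instances and a backing file named "disk" (both taken from the error output in this report), and the helper names parse_virsh_list and domains_missing_disk are made up for illustration.

```python
import os


def parse_virsh_list(output):
    """Return the domain names found in the text printed by `virsh list`."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like "19 instance-00000002 running";
        # the header and the dashed separator do not start with a number.
        if len(fields) >= 3 and fields[0].isdigit():
            names.append(fields[1])
    return names


def domains_missing_disk(output, instances_dir="/opt/stack/nova/instances"):
    """Return domains whose <instances_dir>/<name>/disk file no longer exists."""
    return [
        name
        for name in parse_virsh_list(output)
        if not os.path.exists(os.path.join(instances_dir, name, "disk"))
    ]


# Sample input copied from the "Actual" output above.
SAMPLE = """\
 Id Name State
----------------------------------
 19 instance-00000002 running
 29 instance-0000000c running
"""

if __name__ == "__main__":
    # On an affected host, the leftover domains would be listed here.
    print(domains_missing_disk(SAMPLE))
```

In practice the text would come from running `sudo virsh list` rather than from a hard-coded sample.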

Then, nova-compute starts spitting this error:

2012-03-15 23:10:09 ERROR nova.manager [-] Error during ComputeManager.update_available_resource: Unexpected error while running command.
Command: qemu-img info /opt/stack/nova/instances/instance-00000002/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-img: Could not open '/opt/stack/nova/instances/instance-00000002/disk': No such file or directory\n"
(nova.manager): TRACE: Traceback (most recent call last):
(nova.manager): TRACE: File "/opt/stack/nova/nova/manager.py", line 155, in periodic_tasks
(nova.manager): TRACE: task(self, context)
(nova.manager): TRACE: File "/opt/stack/nova/nova/compute/manager.py", line 2386, in update_available_resource
(nova.manager): TRACE: self.driver.update_available_resource(context, self.host)
(nova.manager): TRACE: File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 1805, in update_available_resource
(nova.manager): TRACE: 'disk_available_least': self.get_disk_available_least()}
(nova.manager): TRACE: File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 2156, in get_disk_available_least
(nova.manager): TRACE: disk_infos = utils.loads(self.get_instance_disk_info(i_name))
(nova.manager): TRACE: File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 2115, in get_instance_disk_info
(nova.manager): TRACE: out, err = utils.execute('qemu-img', 'info', path)
(nova.manager): TRACE: File "/opt/stack/nova/nova/utils.py", line 240, in execute
(nova.manager): TRACE: cmd=' '.join(cmd))
(nova.manager): TRACE: ProcessExecutionError: Unexpected error while running command.
(nova.manager): TRACE: Command: qemu-img info /opt/stack/nova/instances/instance-00000002/disk
(nova.manager): TRACE: Exit code: 1
(nova.manager): TRACE: Stdout: ''
(nova.manager): TRACE: Stderr: "qemu-img: Could not open '/opt/stack/nova/instances/instance-00000002/disk': No such file or directory\n"
(nova.manager): TRACE:

And upon the next restart, the nova-compute process crashes.

summary: - Instance sometimes do not fully terminated, causes crash
+ Instance sometimes do not fully terminate, causes crash
description: updated
Changed in nova:
milestone: none → essex-rc1
importance: Undecided → High
status: New → Triaged
Pádraig Brady (p-draigbrady) wrote :

I'm not sure if https://bugs.launchpad.net/nova/+bug/957110 is exactly the same issue,
but the symptom is the same.

The patch provided there only mitigates the symptom,
allowing the compute service to start (and logging an error about the missing disk file).
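
To illustrate that kind of mitigation (let the service keep going and log the missing disk file), here is a minimal standalone sketch. It is not the patch from bug 957110 and not nova code: the function name disk_info_or_none and the injectable run parameter are hypothetical, and it calls qemu-img through Python's subprocess module rather than through nova's utils.execute.

```python
import logging
import subprocess

LOG = logging.getLogger(__name__)


def disk_info_or_none(path, run=subprocess.check_output):
    """Return the output of `qemu-img info <path>`, or None if it fails.

    `run` is injectable (hypothetical parameter) so the function can be
    exercised without qemu-img installed.
    """
    try:
        return run(["qemu-img", "info", path], stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError) as exc:
        # Log and carry on, so one instance with a missing backing file
        # does not abort the whole resource update.
        LOG.error("Could not read disk info for %s: %s", path, exc)
        return None
```

A caller would skip any instance for which this returns None. As noted above, this only hides the symptom; the leftover libvirt domain is still there.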

Changed in nova:
assignee: nobody → Anthony Young (sleepsonthefloor)
Vish Ishaya (vishvananda) wrote :

Anthony has a fix for this, which involves synchronizing launch/terminate/start/stop.

He's just putting in some tests and we should see a review soon.

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/5476
Committed: http://github.com/openstack/nova/commit/fe7055a5bd25bef33fe10f4fee858ad8cd30a6ea
Submitter: Jenkins
Branch: master

commit fe7055a5bd25bef33fe10f4fee858ad8cd30a6ea
Author: Anthony Young <email address hidden>
Date: Fri Mar 16 16:51:03 2012 -0700

    Fix run/terminate race conditions.

     * synchronize run,terminate,stop,start on instance_uuid
     * don't surpress error when unfiltering instance, which
       can result in a zombified instance.
     * Fixes bug 956719
     * Remove debug raise

    Change-Id: I8b2eaffdabfd5c1a9414adb1b5ed11e4c48711fc
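
For readers unfamiliar with the idea of synchronizing operations on instance_uuid, the following standalone sketch shows the general technique: every operation on the same instance takes the same lock, so a terminate cannot interleave with a run that is still in progress. This is an assumption-laden illustration using plain threading locks, not the code from the commit above; lock_for, run_instance and terminate_instance are hypothetical names.

```python
import threading
import time
from collections import defaultdict

# One lock per instance UUID, created on first use. The guard protects
# the dictionary itself while a lock is being looked up or created.
_locks_guard = threading.Lock()
_instance_locks = defaultdict(threading.Lock)


def lock_for(instance_uuid):
    """Return the lock shared by all operations on one instance."""
    with _locks_guard:
        return _instance_locks[instance_uuid]


def run_instance(instance_uuid, log):
    with lock_for(instance_uuid):
        log.append(("run:start", instance_uuid))
        time.sleep(0.001)  # stand-in for the real work
        log.append(("run:end", instance_uuid))


def terminate_instance(instance_uuid, log):
    with lock_for(instance_uuid):
        log.append(("terminate:start", instance_uuid))
        time.sleep(0.001)  # stand-in for the real work
        log.append(("terminate:end", instance_uuid))
```

Operations on different UUIDs still run concurrently, since each UUID has its own lock. This sketch never removes locks for deleted instances, which a long-running service would need to handle.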

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-03-20
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-rc1 → 2012.1