Comment 13 for bug 1254872

Daniel Berrange (berrange) wrote:

I analysed the logs from here, which include the libvirt debug log:

http://logs.openstack.org/85/76685/3/check/check-tempest-dsvm-neutron-full/27b5a9d/logs/

The last successful API call to run against the VM in question (instance-00000021) was virDomainManagedSave. The next call, virDomainCreate, fails with the "timed out" error message, and later calls have similar problems. From the logs I believe the problem is that a pending job is never being ended.
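
For reference, the failing call sequence maps roughly onto the following sketch against the libvirt-python bindings (the connection URI is an assumption; the domain name is the one from the logs):

    # Rough sketch of the API sequence seen in the logs, using the
    # libvirt-python bindings. The URI is assumed, not taken from the logs.
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000021')

    # Suspend: virDomainManagedSave, an async job which stops the
    # domain once it completes.
    dom.managedSave(0)

    # Resume: virDomainCreate, restarting from the managed save image.
    # On the affected libvirt this fails with "Timed out during
    # operation: cannot acquire state change lock".
    try:
        dom.create()
    except libvirt.libvirtError as e:
        print('resume failed:', e)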

Looking through the libvirt git history for changes related to job management, I see an obvious strong candidate:

commit 6948b725e78016e45b846a17b89fafb69965be51
Author: Jiri Denemark <email address hidden>
Date: Wed Dec 14 09:57:07 2011 +0100

    qemu: Fix race between async and query jobs

    If an async job run on a domain will stop the domain at the end of the
    job, a concurrently run query job can hang in qemu monitor and nothing
    can be done with that domain from this point on. An attempt to start
    such domain results in "Timed out during operation: cannot acquire state
    change lock" error.

    However, quite a few things have to happen at the right time... There
    must be an async job running which stops a domain at the end. This race
    was reported with dump --crash but other similar jobs, such as
    (managed)save and migration, should be able to trigger this bug as well.
    While this async job is processing its last monitor command, that is a
    query-migrate to which qemu replies with status "completed", a new
    libvirt API that results in a query job must arrive and stay waiting
    until the query-migrate command finishes. Once query-migrate is done but
    before the async job closes qemu monitor while stopping the domain, the
    other thread needs to wake up and call qemuMonitorSend to send its
    command to qemu. Before qemu gets a chance to respond to this command,
    the async job needs to close the monitor. At this point, the query job
    thread is waiting for a condition that no-one will ever signal so it
    never finishes the job.

This describes precisely the scenario I see in the logs. The fix went into libvirt 0.9.9, which is of course just after the 0.9.8 version Ubuntu ships.
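
To illustrate the "waiting for a condition that no-one will ever signal" part in isolation, here is a simplified model in Python; it is not libvirt code, just an assumed analogue in which the query job blocks on a condition variable for a monitor reply while the async job closes the monitor without waking it:

    # Simplified model of the race, not libvirt code: a "query job"
    # thread waits for a monitor reply on a condition variable, while
    # the "async job" thread closes the monitor without notifying it.
    import threading
    import time

    class Monitor:
        def __init__(self):
            self.cond = threading.Condition()
            self.reply = None
            self.closed = False

        def send(self, cmd):
            # Query job: send a command, then block until a reply arrives.
            with self.cond:
                while self.reply is None:
                    # In the buggy code path nothing ever signals this
                    # condition after close(); the timeout is only here
                    # so the demo terminates.
                    if not self.cond.wait(timeout=2):
                        raise TimeoutError('cannot acquire state change lock')
                return self.reply

        def close(self):
            # Async job (e.g. managed save) stopping the domain: it tears
            # down the monitor but never wakes threads blocked in send().
            with self.cond:
                self.closed = True
                # Bug analogue: missing self.cond.notify_all()

    mon = Monitor()

    def query_job():
        try:
            print('reply:', mon.send('query-block'))
        except TimeoutError as e:
            print('query job stuck:', e)

    t = threading.Thread(target=query_job)
    t.start()
    time.sleep(0.1)   # the async job finishes its final monitor command...
    mon.close()       # ...and closes the monitor before qemu can reply
    t.join()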

Unfortunately there is nothing we can do in nova code to work around this. It will require either Ubuntu to backport this upstream fix to their ancient libvirt builds, or OpenStack to switch to newer libvirt builds.