Comment 31 for bug 734777

In , Daniel (daniel-redhat-bugs) wrote :

Summary of situation wrt "Timed out during operation: cannot acquire state change lock"

There are a few reasons why you might see that error message in RHEL-5:

     1. The QEMU process has hung.

        QEMU won't respond to monitor commands. The API call making the first monitor command will wait forever; any subsequent API calls issuing monitor commands will time out after ~30 seconds with this libvirt error message.

        This is expected behaviour when QEMU has hung.

     2. The QEMU process is working on a very long/slow monitor command.

        The API call making the long monitor command will wait until it (eventually) finishes. Any subsequent API calls wanting to issue monitor commands will wait up to ~30 seconds for the first call to finish, after which they return this libvirt error message.

        This is also expected behaviour when one API call is running a very long monitor command.

     3. Migration is aborted between the 'Prepare' and 'Finish' steps.

        Migration is a 3-phase process. First we 'Prepare' on the target host, acquiring the lock. Then we run the migration on the source host. Finally we 'Finish' on the target host, releasing the lock. If the libvirt client dies/quits halfway through, the lock may never be released. In this case, further monitor commands will return this libvirt error message.

        This is a bug.

     4. Libvirt has a bug in lock handling.

        libvirt might run a monitor command but forget to release the 'state change lock' once it completes. Again, further monitor commands will return this error message.

        This is a bug.
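To illustrate scenarios 1 and 2, here is a minimal Python sketch of a per-guest lock with a timed acquire. It is not libvirt code: the `Guest` class, `run_monitor_command` and `StateChangeLockTimeout` are hypothetical names, and the timeout is shortened from ~30 seconds to 0.2 seconds so the demo finishes quickly.

```python
import threading


class StateChangeLockTimeout(Exception):
    """Raised when the per-guest lock cannot be acquired in time (hypothetical)."""


class Guest:
    """Hypothetical guest object guarding monitor access with one lock."""

    def __init__(self, lock_timeout=30.0):
        self._state_lock = threading.Lock()
        self.lock_timeout = lock_timeout

    def run_monitor_command(self, command):
        # Wait up to lock_timeout seconds for any earlier call to finish.
        if not self._state_lock.acquire(timeout=self.lock_timeout):
            raise StateChangeLockTimeout(
                "Timed out during operation: cannot acquire state change lock"
            )
        try:
            return command()
        finally:
            # Always release, so a failing command cannot leak the lock
            # (the failure mode described in scenario 4).
            self._state_lock.release()


def demo():
    guest = Guest(lock_timeout=0.2)  # shortened timeout, demo only
    started = threading.Event()
    finish = threading.Event()
    results = {}

    def slow_command():
        # Stands in for a hung or very slow monitor command.
        started.set()
        finish.wait()
        return "slow done"

    def first_call():
        results["first"] = guest.run_monitor_command(slow_command)

    worker = threading.Thread(target=first_call)
    worker.start()
    started.wait()  # the first call now holds the lock

    try:
        results["second"] = guest.run_monitor_command(lambda: "quick done")
    except StateChangeLockTimeout as exc:
        results["second"] = str(exc)

    finish.set()  # let the slow command complete
    worker.join()
    results["third"] = guest.run_monitor_command(lambda: "quick done")
    return results
```

Calling `demo()` shows the second call failing with the timeout message while the first call still holds the lock, and a third call succeeding once the first has finished.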

In RHEL-6.2 we have done a number of things to address / mitigate these problems:

 - It is now always possible to destroy a guest, even if the monitor is stuck. This lets you destroy a guest in scenario 1, which is not always possible with RHEL-5 libvirt without restarting libvirtd.

 - Some pieces of code which held the lock for a long time have been refactored to hold it for a much shorter period. This is primarily migration/save/restore/snapshot code. This should address some of the common reasons for seeing this error message.

 - The migration code has been made more robust, to guarantee that all locks are released, even if the migration client aborts/quits without calling Finish.
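One way to picture the migration fix is to tie the lock to the client that took it, so a disconnect releases it just as 'Finish' would. The Python sketch below is an illustration under that assumption, not the actual libvirt implementation; `MigrationTarget`, `on_client_closed` and the client ids are hypothetical.

```python
import threading


class MigrationTarget:
    """Hypothetical target-side guest that tracks which client holds the lock."""

    def __init__(self):
        self._state_lock = threading.Lock()
        self._owner = None  # id of the client that ran 'Prepare', if any

    def prepare(self, client_id):
        # 'Prepare' acquires the lock on the target host.
        if not self._state_lock.acquire(blocking=False):
            raise RuntimeError("cannot acquire state change lock")
        self._owner = client_id

    def finish(self, client_id):
        # 'Finish' releases the lock in the normal case.
        self._release_if_owner(client_id)

    def on_client_closed(self, client_id):
        # Cleanup hook for a client that died or quit before 'Finish'.
        self._release_if_owner(client_id)

    def _release_if_owner(self, client_id):
        # Only the owning client may release; repeated calls are harmless.
        if self._owner == client_id:
            self._owner = None
            self._state_lock.release()

    def is_locked(self):
        return self._state_lock.locked()
```

With this shape, a client that runs `prepare` and then disappears no longer leaves the guest locked, provided the daemon calls `on_client_closed` when the connection goes away.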

So in RHEL-6.2, only scenarios 1 and 2 should remain, and those should occur less frequently, or at least be recoverable by killing the guest in question, without requiring a libvirtd daemon restart.

The changes made in RHEL-6.1/6.2 to deal with this error message required a lot of changes across all areas of the code. It would not be practical to backport these changes to RHEL-5, because of the risk of introducing regressions in other areas.