"cannot acquire state change lock" problems

Bug #734777 reported by Fred van Zwieten
This bug affects 6 people
Affects            Status         Importance   Assigned to   Milestone
libvirt            Won't Fix      Critical
libvirt (Debian)   Fix Released   Unknown
libvirt (Ubuntu)   Invalid        Undecided    Unassigned

Bug Description

Binary package hint: libvirt-bin

When I shut down a guest (tried with both WinXP and RHEL6, using "shutdown" from virsh, from virt-manager, and from within the guest), at the point the guest has shut down ("Halting System..." in RHEL6 terms) the state remains "running". The guest processes are also still there, and I can't kill them. After that, whatever I try to do with those guests, I have to wait at least 1 minute before I get "Timed out during operation: cannot acquire state change lock".

The traceback of a destroy attempt is this:

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/engine.py", line 768, in destroy_domain
    vm.destroy()
  File "/usr/share/virt-manager/virtManager/domain.py", line 1324, in destroy
    self._backend.destroy()
  File "/usr/lib/python2.6/dist-packages/libvirt.py", line 349, in destroy
    if ret == -1: raise libvirtError ('virDomainDestroy() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock
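
For anyone scripting around this, here is a minimal sketch of how a caller could tell this particular failure apart from other destroy errors. It assumes only that the exception's string form contains the message shown in the traceback above. The helper names (is_state_lock_timeout, try_destroy) are hypothetical, and the domain argument is any object with a destroy() method, such as a domain object from the libvirt Python bindings.

```python
# Message text taken from the traceback above.
LOCK_TIMEOUT_TEXT = "cannot acquire state change lock"


def is_state_lock_timeout(exc):
    """Return True if the exception text mentions the state change lock timeout."""
    return LOCK_TIMEOUT_TEXT in str(exc)


def try_destroy(domain):
    """Call domain.destroy() and classify the outcome.

    'domain' is assumed to be any object exposing destroy(); nothing here
    imports libvirt, so the sketch also works with a stand-in object.
    """
    try:
        domain.destroy()
    except Exception as exc:
        if is_state_lock_timeout(exc):
            # The daemon gave up waiting for the per-domain lock.
            return "lock-timeout"
        raise  # Any other failure is passed through unchanged.
    return "destroyed"
```

This only detects the condition. It does not release the lock, so the guest stays stuck until the underlying cause is dealt with.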

Restarting the libvirt-bin service doesn't help.

Others seem to have similar problems: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=602715

The only thing that works reliably is to reboot the host.

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: libvirt-bin 0.8.3-1ubuntu14
Uname: Linux 2.6.37-020637rc2-generic x86_64
Architecture: amd64
Date: Mon Mar 14 12:04:51 2011
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release amd64 (20101007)
ProcEnviron:
 LANGUAGE=en_US:en
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: libvirt

Revision history for this message
In , Douglas (douglas-redhat-bugs) wrote :

Description of problem:

The customer cannot resume a virtual machine; the error is shown below:

# virsh resume v_rhel5_prod
error: Failed to resume domain v_rhel5_prod
error: Timed out during operation: cannot acquire state change lock

Version-Release number of selected component (if applicable):
libvirt-0.8.2-15.el5_6.1.x86_64
libvirt-python-0.8.2-15.el5_6.1.x86_64

Also tried:

# virsh destroy v_rhel5_prod
error: Failed to destroy domain v_rhel5_prod
error: Timed out during operation: cannot acquire state change lock

# virsh start v_rhel5_prod
error: Domain is already active

Additional info:
Even after rebooting the host, the VM remains locked.

Similar issue:
https://bugzilla.redhat.com/show_bug.cgi?id=668438

The debug logs are attached.

Revision history for this message
In , Douglas (douglas-redhat-bugs) wrote :

Created attachment 477730
debug logs

Revision history for this message
In , Daniel (daniel-redhat-bugs) wrote :

Are there any files in /var/lib/libvirt/qemu/save or /var/lib/libvirt/qemu/snapshot? And is the 'libvirt-guests' initscript enabled on boot?

My most likely guess is that a saved guest failed to restore properly on boot.
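
To answer the first question quickly, a small sketch along these lines could list whatever is left in those two directories. The directory paths come from the question above; the function name list_leftover_state is hypothetical, and reading these directories normally requires root.

```python
import os

# Paths as named in the comment above.
DEFAULT_DIRS = (
    "/var/lib/libvirt/qemu/save",
    "/var/lib/libvirt/qemu/snapshot",
)


def list_leftover_state(dirs=DEFAULT_DIRS):
    """Map each directory to a sorted list of its entries.

    A directory that is missing or unreadable maps to None, so the caller
    can tell "empty" apart from "could not look".
    """
    found = {}
    for d in dirs:
        try:
            found[d] = sorted(os.listdir(d))
        except OSError:
            found[d] = None
    return found
```

Calling list_leftover_state() and printing the result shows at a glance whether a stale saved image is lying around for the affected guest.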

Revision history for this message
In , Douglas (douglas-redhat-bugs) wrote :

Hello Daniel,

Sorry for the delay here; the customer decided to move to RHEL 6. They are uploading these files for further analysis. Do you have any suggestions?

Thanks
Douglas

Revision history for this message
In , John (john-redhat-bugs) wrote :

Hi,

I am running libvirtd with kvm on a Debian Squeeze host and I am experiencing the same problem from time to time. I'm using virt-manager to control my virtual machines and sometimes libvirtd runs into problems controlling a kvm domain.

I cannot say exactly when the problem occurs, but it usually happens when I start and stop several virtual machines one after another. For example, I have several virtual machines with test installations for development, and since some of them are running unstable versions (Debian, for example), I start all the VMs once a week to update them, usually not more than two at the same time since my kvm host has only 4GB of RAM. It then usually happens that the state of the virtual machines is not updated in virt-manager, and when trying to start a virtual machine which is powered off, I receive the aforementioned error message.

The problem is always fixed by:

killall -9 libvirtd
rm /var/run/libvirtd.pid
/etc/init.d/libvirt-bin restart

The virtual machines are never affected by this problem, they still continue to run without any problems. It simply seems that libvirtd at some point cannot connect to the kvm host anymore due to a race condition. I'm attaching a screenshot of the error message in virt-manager the last time it happened. In this case, I logged into my Debian Squeeze kvm host over ssh and used X-forwarding to display virt-manager on the MacOS X host. virt-manager was not running on the Mac.

Version numbers:

dpkg -l libvirt\* |grep -e '^ii'
ii libvirt-bin 0.8.3-5 the programs for the libvirt library
ii libvirt0

dpkg -l virt\* |grep -e '^ii'
ii virt-manager 0.8.4-8 desktop application for managing virtual machines
ii virt-viewer 0.2.1-1 Displaying the graphical console of a virtual machine
ii virtinst 0.500.3-2

Regards,

Adrian

Revision history for this message
In , John (john-redhat-bugs) wrote :

Created attachment 479933
Screenshot of virt-manager running on Debian Squeeze (over X-forwarding on MacOSX) when the problem with libvirtd occurred

Revision history for this message
Fred van Zwieten (fvzwieten) wrote :
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for reporting this bug and helping to make Ubuntu better.

I will try to reproduce this later today.

Have you seen this on anything other than 10.10? Is the guest OS always Red Hat?

Revision history for this message
Fred van Zwieten (fvzwieten) wrote :

No, it's also with a WindowsXP client.

I do have some additional observations. When I do a force off while the guest is running, it works (but is hardly desirable). Reboot works, just shutdown does not. After shutdown, the guest is in a state where nothing works, including force off.

Is it possible to have a testing PPA available with the version that has a supposed fix in it (afaik 0.8.3-5)?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 734777] Re: "cannot acquire state change lock" problems

Quoting Fred van Zwieten (<email address hidden>):
> No, it's also with a WindowsXP client.
>
> I do have some additional observations. When I do a force off when the
> guest is running, it works (but is hardly desirable). Reboot works, Just
> shutdown not. After shutdown, the guest is in a state that nothing
> works, including forced off.
>
> Is it possible to have a testing ppa available with the version that has
> a supposed fix in it (afaik 0.8.3-5).

Since you are on 10.10, you could try the server edger's ppa version,
which is close to the upstream version:

https://launchpad.net/~ubuntu-server-edgers/+archive/server-edgers-libvirt?field.series_filter=maverick

Revision history for this message
Fred van Zwieten (fvzwieten) wrote :

OK, tried that, but now I cannot even start any guest:

Error starting domain: Unable to create cgroup for server1.lab.local: No such file or directory

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/engine.py", line 814, in run_domain
    vm.startup()
  File "/usr/share/virt-manager/virtManager/domain.py", line 1296, in startup
    self._backend.create()
  File "/usr/lib/python2.6/dist-packages/libvirt.py", line 330, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirtError: Unable to create cgroup for server1.lab.local: No such file or directory

cgroup-bin 0.36.20-3 is installed

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I wasn't able to reproduce this with an Ubuntu server guest started with libvirt in 10.10. I shut the guest down using 'sudo poweroff'.

Is there any version of fedora (or some other freely downloadable OS) with which you've been able to reproduce this?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Can you give the .xml file for one of the VMs which does not shut down?

(virsh dumpxml VMNAME > uploadme.xml)

Changed in libvirt (Ubuntu):
status: New → Incomplete
Revision history for this message
Fred van Zwieten (fvzwieten) wrote :

Sure. Had to revert to 10.10 libvirt, because virsh from ppa libvirt didn't work.

I have 2 guests:

1. Windows xp
2. RHEL6 (64)

I am now deploying a 64bit ubuntu 10.10 server.

Attached are the XML files for 1 and 2.

Revision history for this message
Fred van Zwieten (fvzwieten) wrote :
Revision history for this message
Fred van Zwieten (fvzwieten) wrote :

Just to be complete:

uname -r <host>
2.6.37-020637rc2-generic

That's not stock 10.10

Revision history for this message
Fred van Zwieten (fvzwieten) wrote :

Ubuntu 10.10 server 64bit: same story.

xml attached

Revision history for this message
Fred van Zwieten (fvzwieten) wrote :

OK, went back to stock 10.10 kernel. All problems solved, it seemed. Sorry for the noise. Case closed.

Changed in libvirt (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
olx69 (ope-linux) wrote :

Same here. I am currently testing LXC. I've mounted cgroup to /cgroups by 'mount none -t cgroup /cgroup'. virt-manager/libvirt worked up until I did that mount.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@olx69 - I don't think that is related. Please open a new bug for your issue.

Revision history for this message
In , John (john-redhat-bugs) wrote :

I had the same problem.

Running RHEL5.6 host machine with latest patches as of today.

# lsb_release -r
Release: 5.6

# uname -a
Linux lark.cs.unc.edu 2.6.18-238.9.1.el5 #1 SMP Fri Mar 18 12:42:39 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa|grep virt
virt-manager-0.6.1-13.el5
libvirt-0.8.2-15.el5_6.3
libvirt-0.8.2-15.el5_6.3
python-virtinst-0.400.3-11.el5
libvirt-python-0.8.2-15.el5_6.3

Installed a RHEL6.0 virtual machine; the install went fine. The install rebooted at the end, and then the virtual machine window and the virt-manager window both hung. I let them sit for 5 minutes or so and then killed them. I have 2 other old virtual machines that continued to run. Tried to start the new machine:

virsh # start lark-virtx
error: Failed to start domain lark-virtx
error: Timed out during operation: cannot acquire state change lock

Stopped libvirtd:

service libvirtd stop

Removed run directory:

rm -rf /var/run/libvirt

Note after the shutdown there was no /var/run/libvirt.pid file

Started libvirtd:

service libvirtd start

Was able to use virt-manager to start the new virtual machine.

Thanks Adrian, that worked!

Revision history for this message
In , John (john-redhat-bugs) wrote :

Hi John,

On a side note: I recently upgraded libvirt from 0.8.2 to 0.9.0 and haven't seen the problem since. So, if you are able to upgrade your libvirtd to version 0.9.0 or newer, I strongly suggest you do so and see whether that permanently fixes the problem for you as well.

It's certainly also nice for the maintainers/developers to know whether the new version fixes the bug and if several people independently claim it does, they will be able to change this bug report to "fixed" =).

Greetings from Norway,

Adrian

Revision history for this message
In , John (john-redhat-bugs) wrote :

Got a chance to play with this again for a few minutes. If I
login and halt a vm or do a Shut Down from virt-manager this
consistently hangs virt-manager and the vnc client window.

If I do a "service libvirtd restart", virt-manager is able to
re-connect again. I do not have to remove any /var/run/libvirt*
files.

I have been running 2 virtual machines for over a year. I just
noticed this problem because I created a new machine and started
seeing virt-manager hanging with the "Timed out during operation: cannot acquire state change lock" error.

Hope an update comes out soon.

Revision history for this message
In , John (john-redhat-bugs) wrote :

John,

as I previously mentioned, the bug has been fixed in the version 0.9.0 and later. But since you are using an older version and cannot easily upgrade, the most reasonable solution would be a backport of the fix, which means that the appropriate lines of code that were changed in 0.9.0 to address this particular problem should also be changed in 0.8.x, however, without changing anything else to make sure that no other, possible new problems are introduced.

I haven't checked the changelog of libvirt 0.9.0, so I don't know which change actually fixed the problem, but I am pretty sure that it can easily be backported and will be backported, since many people are using libvirt 0.8.x on RHEL, for which they have paid support.

Adrian

Revision history for this message
In , John (john-redhat-bugs) wrote :

Yes, just thought having a consistent way, shutting down the system,
to reproduce the problem would be some helpful information.

Changed in libvirt (Debian):
status: Unknown → Fix Released
Revision history for this message
In , Daniel (daniel-redhat-bugs) wrote :

Summary of situation wrt "Timed out during operation: cannot acquire state change lock"

There are a few reasons why you might see that error message in RHEL-5

     1. The QEMU process has hung.

        QEMU won't respond to monitor commands. The API call making the first monitor command will wait forever, any subsequent API calls issuing monitor commands will timeout after ~30 seconds with this libvirt error message.

        This is expected behaviour when QEMU has hung.

     2. The QEMU process is working on a very long/slow monitor command

        The API call making the long monitor command will wait until it (eventually) finishes. Any subsequent API calls wanting to issue monitor commands will wait up to ~30 seconds for the first call to finish, after which they return this libvirt error message.

        This is also expected behaviour when one API call is running a very long monitor command.

     3. Migration is aborted in between the 'Prepare' and 'Finish' step.

        Migration is a 3 phase process. First we 'Prepare' on the target host, acquiring the lock. Then we run on the source host. Finally we 'Finish' on the target host, releasing the lock. If the libvirt client dies/quits half way through, the lock may never be released. In this case, further monitor commands will return this libvirt error message.

        This is a bug.

     4. Libvirt has a bug in lock handling

        libvirt might run a monitor command, but forgets to release the 'state change lock' once complete. Again further monitor commands will return this message.

        This is a bug.
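
As a rough illustration of scenarios 1, 2 and 4, here is a toy Python model of a per-domain lock with a timeout. It is only a sketch under stated assumptions and not libvirt's actual code: the class name StateChangeLock and its methods are hypothetical, and the 30-second default simply mirrors the "~30 seconds" figure mentioned above.

```python
import threading


class StateChangeLock:
    """Toy model of a per-domain lock that callers wait on with a timeout."""

    def __init__(self, timeout=30.0):
        self._lock = threading.Lock()
        self.timeout = timeout

    def begin_job(self):
        # Wait up to 'timeout' seconds for whoever holds the lock.
        if not self._lock.acquire(timeout=self.timeout):
            raise TimeoutError(
                "Timed out during operation: cannot acquire state change lock"
            )

    def end_job(self):
        # If a holder never reaches this call (hung monitor command, or a
        # code path that forgets to release), every later begin_job() fails.
        self._lock.release()
```

In this model, a first caller that runs begin_job() and then blocks forever plays the role of the hung or slow monitor command. Every later caller gets the timeout error until end_job() is finally called, which is why the message keeps reappearing for every operation on the affected guest.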

In RHEL-6.2 we have done a number of things to address / mitigate these problems

 - It is now always possible to destroy a guest, even if the monitor is stuck. This lets you destroy a guest in scenario 1, which is not always possible with RHEL-5 libvirt, without restarting libvirtd.

 - Some pieces of code which held the lock for a long time, have been refactored to hold it for a much shorter period. This is primarily migration/save/restore/snapshot code. This should address some of the common reasons for seeing this error message

 - The migration code has been made more robust, to guarantee that all locks are released, even if migration client aborts/quits without calling Finish.

So in RHEL-6.2, only scenario 1/2 should remain and those should occur less frequently, or at least be recoverable without requiring a libvirtd daemon restart, by killing the guest in question.

The changes made in RHEL-6.1/6.2 to deal with this error message required a lot of changes across all areas of the code. These changes would not be practical to backport to RHEL-5, because of the risk of them introducing regressions in other areas.

Revision history for this message
Janne Snabb (snabb) wrote :

I got hit by this on natty. There seems to be a corresponding Red Hat bug at: https://bugzilla.redhat.com/show_bug.cgi?id=676205

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Janne,

could you please file a new bug using 'ubuntu-bug libvirt', and list any information to help us reproduce it? That should give us more information about what has gone wrong in your particular case. If you can catch it while it is hanging, please also give us the output of
   for p in `pidof libvirtd`; do
     echo -n "$p: "
     cat /proc/$p/cmdline
     cat /proc/$p/cgroup
     cat /proc/$p/status
   done
   for p in `pidof kvm`; do
     echo -n "$p: "
     cat /proc/$p/cmdline
     cat /proc/$p/cgroup
     cat /proc/$p/status
   done
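
If a single readable report is easier to attach, the same information can be gathered with a short Python sketch like the one below. It is an assumption-laden alternative, not a replacement for the loop above: the function names are hypothetical, processes are matched on /proc/<pid>/comm (which may not select exactly the same processes as pidof), and the NUL separators in cmdline are turned into spaces.

```python
import os


def pids_by_name(name, proc_root="/proc"):
    """Return sorted PIDs whose /proc/<pid>/comm equals 'name'."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # Process exited or is not readable; skip it.
    return sorted(pids)


def read_proc_info(pid, proc_root="/proc"):
    """Read cmdline, cgroup and status for one PID into a dict of strings."""
    info = {}
    for name in ("cmdline", "cgroup", "status"):
        path = os.path.join(proc_root, str(pid), name)
        try:
            with open(path, "rb") as fh:
                data = fh.read()
        except OSError as exc:
            info[name] = "<unreadable: %s>" % exc
            continue
        # cmdline uses NUL bytes between arguments; make it printable.
        info[name] = data.replace(b"\0", b" ").decode("utf-8", "replace").strip()
    return info
```

Looping over pids_by_name("libvirtd") and pids_by_name("kvm") and printing read_proc_info(pid) for each gives roughly the same output as the shell version.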

Revision history for this message
In , RHEL (rhel-redhat-bugs) wrote :

Development Management has reviewed and declined this request. You may appeal
this decision by reopening this request.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

@Serge Hallyn

I used 'ubuntu-bug libvirt-bin' to open LP: #882579 about this issue.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

@Janne your issue is really a bug in Linux kernel KVM code being tracked in LP: #795717
https://bugs.launchpad.net/bugs/795717

Revision history for this message
In , nigil (nigil-redhat-bugs) wrote :

I want to add:
When I tried to shut down the VM, it went into the paused state and hung. I could not resume or shut down the VM from the paused state.
[root@lnx132-75 vol_vm_data_disk_f63]# virsh list
 Id Name                 State
----------------------------------
 12 vm2_rhel6_x86_64     paused
 15 vm6_win2003_x86_64   running
 16 vm7_win2008_x86_64   running
 17 vm8_win7_x86         running
 18 vm9_win7_x86_64      running

[root@lnx132-75]# virsh shutdown vm2_rhel6_x86_64
error: Failed to shutdown domain vm2_rhel6_x86_64
error: Timed out during operation: cannot acquire state change lock

[root@lnx132-75]# virsh resume vm2_rhel6_x86_64
error: Failed to resume domain vm2_rhel6_x86_64
error: Timed out during operation: cannot acquire state change lock

[root@lnx132-75]# virsh start vm2_rhel6_x86_64
error: Domain is already active

[root@lnx132-75]# lsb_release -r
Release: 5.8

[root@lnx132-75]# rpm -qa | grep libvirt
libvirt-cim-0.5.8-3.el5
libvirt-0.8.2-25.el5
libvirt-0.8.2-25.el5
libvirt-python-0.8.2-25.el5

Found that the XML of the VM is saved as a .save file.
[root@lnx132-75 save]# pwd
/var/lib/libvirt/qemu/save
[root@lnx132-75 save]# ls
vm2_rhel6_x86_64.save

Removed the .save file and tried to resume/shut down again, but the same issue was observed.

Changed in libvirt:
importance: Unknown → Critical
status: Unknown → Won't Fix