Some Commands return: "Timed out during operation: cannot acquire state change lock"

Bug #882579 reported by nutznboltz
This bug affects 1 person
Affects           Status      Importance  Assigned to  Milestone
libvirt (Ubuntu)  Invalid     Medium      Unassigned
linux (Ubuntu)    Incomplete  Undecided   Unassigned

Bug Description

To reproduce:

On Lucid run:
apt-get source linux-image-2.6.38-12-generic

Examine linux-lts-backport-natty-2.6.38/arch/x86/kvm/x86.c and note that the following patch is missing:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 980317a..0556e05 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2100,8 +2100,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		if (check_tsc_unstable()) {
 			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
 			vcpu->arch.tsc_catchup = 1;
-			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		}
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;

=====

While attempting to use "virsh" commands, this error occurs:

Oct 27 08:52:22 boron libvirtd: 08:52:23.000: 25851: error : qemuDomainObjBeginJobWithDriver:453 : Timed out during operation: cannot acquire state change lock

While the error state is present, certain commands such as "virsh shutdown" and "virsh destroy" elicit that error message and fail to complete. I found that if I kill the kvm guest processes

$ kill -9 $(ps ax | grep '/usr/bin/kvm.*guestname' | awk '{print $1}')

Then "virsh start guestname" would work.
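The recovery sequence above can be sketched as a small script. This is an illustrative sketch, not part of the report: `recover_guest` and the `VIRSH` override are hypothetical names, "guestname" remains a placeholder, and pgrep -f is used in place of the ps|grep pipeline so the pattern cannot accidentally match the grep process itself.

```shell
#!/bin/sh
# Hedged sketch of the manual recovery described above.
# VIRSH is overridable so the sequence can be exercised without libvirt.
VIRSH="${VIRSH:-virsh}"

recover_guest() {
    guest="$1"
    # pgrep -f matches against the full command line of each process
    pids=$(pgrep -f "/usr/bin/kvm.*${guest}" || true)
    if [ -n "$pids" ]; then
        kill -9 $pids    # force-kill the hung kvm process(es)
        sleep 1          # give libvirtd a moment to notice the exit
    fi
    "$VIRSH" start "$guest"
}
```

Running `VIRSH=echo recover_guest guestname` prints the virsh command instead of executing it, which is a convenient dry run.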

I'm not certain that the number of VM guests is the issue; however, this happens more frequently for me on systems with larger numbers of guests. We have one host with 14 guests and another with 53.

There seems to be a corresponding Red Hat bug at:
https://bugzilla.redhat.com/show_bug.cgi?id=676205

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: libvirt-bin 0.8.8-1ubuntu6.5
ProcVersionSignature: Ubuntu 2.6.38-12.51-generic 2.6.38.8
Uname: Linux 2.6.38-12-generic x86_64
Architecture: amd64
Date: Thu Oct 27 09:41:23 2011
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release amd64 (20101007)
ProcEnviron:
 LANGUAGE=en_US:en
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: libvirt
UpgradeStatus: Upgraded to natty on 2011-05-26 (153 days ago)

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Attached the output of

    for p in `pidof libvirtd`; do
        echo -n "$p: "
        cat /proc/$p/cmdline
        cat /proc/$p/cgroup
        cat /proc/$p/status
    done
    for p in `pidof kvm`; do
        echo -n "$p: "
        cat /proc/$p/cmdline
        cat /proc/$p/cgroup
        cat /proc/$p/status
    done

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for filing this new bug.

(I think it is in fact a dup of bug 734777, but I'm not certain.)

I'm afraid I'll simply have to try to reproduce this on a maverick system and debug with gdb. As per the linked Red Hat bug, upstream is not sure what fixed the problem between 0.8.8 and 0.9.0.

If it is feasible for you to upgrade to oneiric, the newer version of libvirt there should bypass this problem; however, that certainly is not a satisfactory solution, especially as this appears to be hitting multiple users.

Changed in libvirt (Ubuntu):
status: New → Confirmed
Revision history for this message
Dave Walker (davewalker) wrote :

This really needs a reproducible test case. Are you able to offer any further guidance on how you managed to encounter this?

Thanks.

Changed in libvirt (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Today two VM hosts running kernel 2.6.38-11, one with 14 guests and another with 53, had scheduled reboots of many of their guests.

On the VM host with 14 guests, 11 rebooted. On the one with 53 guests, 48 rebooted.

VM guests that were scheduled to reboot automatically this morning at 6:30 AM via at(8) hung instead of rebooting.

All of the VM guests were rebooting from kernel 2.6.38-11 into 2.6.38-12.

Not all of the VM guests on the hosts were scheduled to be rebooted.

"virsh shutdown" and "virsh destroy" appeared to always hang.

After using "kill -9" on the PID of the VM guest, "virsh start" for that guest worked every time.

Initially, "virsh list" on the VM hosts would output some VM names and then hang.

After enough VM guests were reset via "kill -9" (but before all of them were) "virsh list" became able to run to completion.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

While the VM guests were hung they were using 100% of the CPU.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

If this issue was seen on Maverick, it is still an issue with Natty.

I could try running tests on an Oneiric system too.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

The system with 53 VM guests:
ProLiant DL585 G5, Four CPU sockets each with Quad-Core AMD Opteron(tm) Processor 8382

The system with 14 VM guests:
Sun Fire X4600 M2, Eight CPU sockets each with Dual-Core AMD Opteron(tm) Processor 8220

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 882579] Re: Some Commands return: "Timed out during operation: cannot acquire state change lock"

Quoting nutznboltz (<email address hidden>):
> While the VM guests were hung they were using 100% of the CPU.

Are 'virsh destroy' and 'virsh shutdown' always what triggers this?

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

I don't think "virsh shutdown" and "virsh destroy" are "triggers".

I think when the guest rebooted something went wrong and subsequent attempts to use "virsh shutdown" and "virsh destroy" fail.

This morning the en masse VM guest reboot was triggered by cron jobs inside the guests.

I tried unsuccessfully to reproduce this on a VM host that only has a single VM guest. I'm still investigating if having several VM guests reboot at the same time is the trigger or not.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

If you look at the last comment in the Red Hat bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=676205#c26

{{ 1. The QEMU process has hung.

        QEMU won't respond to monitor commands. The API call making the first
        monitor command will wait forever, any subsequent API calls issuing
        monitor commands will timeout after ~30 seconds with this libvirt
        error message.

        This is expected behaviour when QEMU has hung. }}

I think this might be the issue I am experiencing.
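One practical way to detect this state (a sketch, not from the report) is to wrap a virsh query in coreutils timeout(1): a healthy guest answers quickly, while a hung QEMU monitor blocks until libvirt's ~30 second job timeout fires. The `hung` helper name is hypothetical and "guestname" is a placeholder.

```shell
# hung DEADLINE "COMMAND": succeed if COMMAND failed to finish in time.
hung() {
    timeout "$1" sh -c "$2" >/dev/null 2>&1
    [ "$?" -eq 124 ]    # timeout(1) exits with 124 when the deadline fires
}

# Example: hung 10 "virsh dominfo guestname" && echo "monitor appears hung"
```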

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Using "kill -9" directly on the /usr/bin/kvm process of the hung VM guest that was using 100% CPU did restore the ability to use "virsh start" on the guest. That is consistent with the comment #26 by Daniel Berrange in the Red Hat ticket about "The QEMU process has hung."

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

I've been able to narrow this down a little more. The number of VM guests is not a factor; this has happened on a VM host with only one guest. The VM guest kernel does not seem to matter either; even VM guests running Linux distros other than Ubuntu are affected.

It seems that the /usr/bin/kvm process goes into an infinite loop during VM guest shutdown.

I'm still investigating this further.

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Probably related thread: "kvm linux guest hanging for minutes at a time"
http://www.spinics.net/lists/kvm/msg59479.html

Notable in that thread:

a post about using "-F" with "trace-cmd" (see also http://www.linux-kvm.org/page/Tracing):
http://www.spinics.net/lists/kvm/msg59716.html

a post which states: "When booting with clocksource=hpet, the system does not hang."
http://www.spinics.net/lists/kvm/msg59837.html

a post which states: "Using the wrong clock easily causes hangs. The system schedules a wakeup in 3 ms, wrong clock causes it to wakeup in 3 years, you get a hang for (3 years - 3 ms). "
http://www.spinics.net/lists/kvm/msg59884.html
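For reference (a sketch, not from the thread): a Linux guest's active clocksource can be read from sysfs, which makes it easy to confirm whether a guest is running on kvm-clock, tsc, or hpet before and after trying the clocksource=hpet boot parameter mentioned in the quoted post. The `current_clocksource` helper name is hypothetical.

```shell
# Report the guest's active clocksource; falls back to "unknown" on
# systems where the sysfs node is absent.
cs_dir=/sys/devices/system/clocksource/clocksource0

current_clocksource() {
    if [ -r "$cs_dir/current_clocksource" ]; then
        cat "$cs_dir/current_clocksource"
    else
        echo unknown
    fi
}

current_clocksource
# Runtime switch (root only, and only if hpet appears in available_clocksource):
#   echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
```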

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

I have one guest that rebooted this morning and is spinning.

Attaching gdb shows very little due to the lack of symbol tables:

(gdb) where
#0 0x00007f671aab3ff3 in select () from /lib/libc.so.6

strace shows a loop:

Process 971 attached - interrupt to quit
select(16, [7 10 13 14 15], [], [], {0, 995193}) = 1 (in [14], left {0, 970413})
read(14, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0"..., 128) = 128
rt_sigaction(SIGALRM, NULL, {0x4a9900, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f671d28c8f0}, 8) = 0
write(8, "\1\0\0\0\0\0\0\0", 8) = 8
read(14, 0x7fff53785d20, 128) = -1 EAGAIN (Resource temporarily unavailable)
timer_gettime(0x5, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0x5, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0
timer_gettime(0x5, {it_interval={0, 0}, it_value={0, 185840}}) = 0
select(16, [7 10 13 14 15], [], [], {1, 0}) = 1 (in [7], left {0, 999997})
read(7, "\1\0\0\0\0\0\0\0", 512) = 8
select(16, [7 10 13 14 15], [], [], {1, 0}) = 1 (in [14], left {0, 999997})
read(14, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0"..., 128) = 128
rt_sigaction(SIGALRM, NULL, {0x4a9900, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f671d28c8f0}, 8) = 0
write(8, "\1\0\0\0\0\0\0\0", 8) = 8
read(14, 0x7fff53785d20, 128) = -1 EAGAIN (Resource temporarily unavailable)
timer_gettime(0x5, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0x5, 0, {it_interval={0, 0}, it_value={0, 28902469}}, NULL) = 0
(repeats)

$ sudo lsof -p 971
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kvm 971 libvirt-qemu cwd DIR 259,0 4096 128 /
kvm 971 libvirt-qemu rtd DIR 259,0 4096 128 /
kvm 971 libvirt-qemu txt REG 259,0 3217592 16817685 /usr/bin/qemu-system-x86_64
kvm 971 libvirt-qemu mem REG 259,0 18520 33597583 /usr/lib/sasl2/libplain.so.2.0.23
kvm 971 libvirt-qemu mem REG 259,0 34904 33597580 /usr/lib/sasl2/libntlm.so.2.0.23
kvm 971 libvirt-qemu mem REG 259,0 43296 33595849 /lib/libcrypt-2.11.1.so
kvm 971 libvirt-qemu mem REG 259,0 18520 33597534 /usr/lib/sasl2/liblogin.so.2.0.23
kvm 971 libvirt-qemu mem REG 259,0 1622304 33596048 /lib/libcrypto.so.0.9.8
kvm 971 libvirt-qemu mem REG 259,0 51752 33597531 /usr/lib/sasl2/libdigestmd5.so.2.0.23
kvm 971 libvirt-qemu mem REG 259,0 22624 33597528 /usr/lib/sasl2/libcrammd5.so.2.0.23
kvm 971 libvirt-qemu mem REG 259,0 18528 33597524 /usr/lib/sasl2/libanonymous.so.2.0.23
kvm 971 libvirt-qemu mem REG 259,0 1494424 187 /usr/lib/libdb-4.8.so
kvm 971 libvirt-qemu mem REG 259,0 22464 33597518 /usr/lib/sasl2/libsasldb.so.2.0.23
kvm 971 libvirt-qemu mem REG 259,0 26840 8762051 /usr/lib/libogg.so.0.6.0
kvm 971 libvirt-qemu mem REG 259,0 182584 56392 /usr/lib/libvorbis.so.0...

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

This report (in German) https://forge.univention.org/bugzilla/show_bug.cgi?id=23258 suggests that when the guest VM is reset by a reboot, the clock jumps backward in time.

Two patches were suggested; both are necessary together:

KVM: x86: Fix kvmclock bug
backported to Natty on Aug 11, 2011, but not yet to linux-lts-backport-natty-2.6.38!
http://patchwork.ozlabs.org/patch/109436/
http://lkml.org/lkml/2011/5/5/436
http://bugs.launchpad.net/bugs/714335

KVM: x86: Fix a possible backwards warp of kvmclock
http://kerneltrap.org/mailarchive/linux-kernel/2010/8/20/4609037
http://choon.net/forum/read.php?21,76764

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

This patch is missing from linux-lts-backport-natty-2.6.38
http://patchwork.ozlabs.org/patch/109436/

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 980317a..0556e05 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2100,8 +2100,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		if (check_tsc_unstable()) {
 			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
 			vcpu->arch.tsc_catchup = 1;
-			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		}
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;
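A quick way to check whether a given tree already carries this patch (an illustrative heuristic, not from the report; `has_kvmclock_fix` is a hypothetical helper): in patched code the line after the tsc_catchup assignment is the closing brace, because the kvm_make_request() call has been moved outside the if block.

```shell
# has_kvmclock_fix FILE: succeed if the line after "tsc_catchup = 1;" is the
# closing brace (patched) rather than the kvm_make_request() call (unpatched).
has_kvmclock_fix() {
    sed -n '/tsc_catchup = 1;/{n;p;}' "$1" | grep -q '}'
}

# Usage: has_kvmclock_fix arch/x86/kvm/x86.c && echo patched || echo missing
```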

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

Please help apply patch http://patchwork.ozlabs.org/patch/109436/ to linux-lts-backport-natty-2.6.38. Thanks!

Changed in libvirt (Ubuntu):
status: Incomplete → Confirmed
description: updated
description: updated
tags: added: lucid testcase
removed: natty running-unity
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 882579

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: natty
Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

This is a bug in the kernel, not libvirt.

Changed in libvirt (Ubuntu):
status: Confirmed → Invalid