libvirtError: Timed out during operation: cannot acquire state change lock

Bug #1254872 reported by Joe Gordon on 2013-11-25
This bug affects 11 people
Affects                       Importance  Assigned to
OpenStack Compute (nova)      High        Unassigned
ubuntu-server-iso-testing     Undecided   Unassigned
libvirt (Ubuntu)              High        Unassigned
libvirt (Ubuntu) Precise      High        Unassigned

Bug Description

libvirtError: Timed out during operation: cannot acquire state change lock

Source:

http://logs.openstack.org/72/58372/1/check/check-tempest-devstack-vm-full/4dd1a60/logs/screen-n-cpu.txt.gz
http://logs.openstack.org/72/58372/1/check/check-tempest-devstack-vm-full/4dd1a60/testr_results.html.gz

Query: "libvirtError: Timed out during operation: cannot acquire state change lock" AND filename:"logs/screen-n-cpu.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJcImxpYnZpcnRFcnJvcjogVGltZWQgb3V0IGR1cmluZyBvcGVyYXRpb246IGNhbm5vdCBhY3F1aXJlIHN0YXRlIGNoYW5nZSBsb2NrXCIgQU5EIGZpbGVuYW1lOlwibG9ncy9zY3JlZW4tbi1jcHUudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODU0MTI3MzA4NzQsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
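As an aside for anyone adapting these queries: the logstash.openstack.org links above are just base64-encoded JSON query descriptors after the '#'. A minimal Python sketch of building and decoding such a link; the field names are read off the URL above, and the descriptor shape is otherwise an assumption, not a documented API:

```python
import base64
import json

# Hypothetical query descriptor mirroring the fields visible in the URL
# fragment above; logstash.openstack.org simply base64-encodes JSON like
# this after the '#'.
query = {
    "search": ('"libvirtError: Timed out during operation: cannot acquire '
               'state change lock" AND filename:"logs/screen-n-cpu.txt"'),
    "fields": [],
    "offset": 0,
    "timeframe": "604800",   # query window in seconds (one week)
    "graphmode": "count",
}

# Encode the descriptor into a shareable URL fragment...
fragment = base64.b64encode(json.dumps(query).encode()).decode()
url = "http://logstash.openstack.org/#" + fragment

# ...and decode it back the same way to inspect someone else's link.
decoded = json.loads(base64.b64decode(fragment))
print(decoded["search"])
```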

[IMPACT]

Queries against a domain which is being worked on asynchronously
can result in hung clients and the inability to do anything
further with that domain.

[TEST CASE]

Set up an OpenStack cluster with heavy VM turnover and querying.

[REGRESSION POTENTIAL]

This patch is cherry-picked from upstream and has been tested using
OpenStack. It has so far shown no regressions.

Joe Gordon (jogo) wrote :

Marking as critical as we are seeing this in the gate

Changed in nova:
importance: Undecided → Critical
milestone: none → icehouse-1
Changed in nova:
status: New → Triaged
Daniel Berrange (berrange) wrote :

What this error message indicates is that two separate threads have made libvirt API calls against the same VM. One of the calls has either hung completely, or is taking a very long time to respond, causing the second API call to report this error message.
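That failure mode can be sketched with plain Python threading. This is a loose, libvirt-free model, not libvirt code: the lock name and error string mirror the report, everything else (functions, timeouts) is illustrative:

```python
import threading
import time

# Illustrative stand-in for libvirt's per-domain job lock; NOT real libvirt
# code, just a model of two API calls colliding on the same domain.
state_change_lock = threading.Lock()

def hung_api_call():
    # The first caller takes the job lock and stalls (e.g. on a stuck
    # monitor command); the stall is simulated here with a sleep.
    with state_change_lock:
        time.sleep(0.5)

def second_api_call(timeout=0.1):
    # The second caller gives up after a timeout instead of waiting
    # forever, surfacing the same error string libvirt reports.
    if state_change_lock.acquire(timeout=timeout):
        state_change_lock.release()
        return "ok"
    return "Timed out during operation: cannot acquire state change lock"

t = threading.Thread(target=hung_api_call)
t.start()
time.sleep(0.05)            # let the first call grab the lock
result = second_api_call()
t.join()
print(result)
```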

Here we're seeing the error message in response to virDomainAttachDeviceFlags. Unfortunately we really need to see what the original API call was, which is hard to find out.

With some libvirt versions there have been bugs which caused this error message, so it's important to know which libvirt version shows this error. Indeed, if the gate runs on hosts with a variety of libvirt/qemu versions, it'd be useful to figure out whether it only occurs on hosts with specific versions of libvirt/qemu.

To diagnose this further likely requires turning up the libvirt logging level.

For Nova I'd suggest setting

  export LIBVIRT_LOG_FILTERS="1:libvirt"
  export LIBVIRT_LOG_OUTPUTS="1:file:/var/log/libvirt/nova-cpu.log"

before starting the nova-cpu processes. This captures client-side traces.

It might also be useful to enable logging on the libvirtd server side by editing /etc/libvirt/libvirtd.conf and setting

  log_filters="1:libvirt 1:qemu"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"

This captures server-side traces.

Changed in nova:
milestone: icehouse-1 → icehouse-2

I see the following 4 scenarios in the last 48 hours:

File "/usr/lib/python2.7/dist-packages/libvirt.py", line 581, in createWithFlags
File "/usr/lib/python2.7/dist-packages/libvirt.py", line 696, in managedSave
File "/usr/lib/python2.7/dist-packages/libvirt.py", line 420, in attachDeviceFlags
File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1352, in suspend

"line 581, in createWithFlags" AND "libvirtError: Timed out during operation" : 15
"line 696, in managedSave" AND "libvirtError: Timed out during operation" : 8
"line 420, in attachDeviceFlags" AND "libvirtError: Timed out during operation" : 12
"line 1352, in suspend" AND "libvirtError: Timed out during operation" : 1

Added the following to capture more logs

https://review.openstack.org/#/c/65834/ - Enable server-side and client-side logs for libvirt
https://review.openstack.org/#/c/65833/ - Capture libvirtd logs

Changed in nova:
milestone: icehouse-2 → icehouse-3
Kashyap Chamarthy (kashyapc) wrote :

This is blocking Gate -- http://status.openstack.org/elastic-recheck/

Trying to reproduce this issue and figure out which *version* of libvirt this is happening with (as Dan noted in comment #2, it's important to know this to identify the root cause).

Kashyap Chamarthy (kashyapc) wrote :

Sean Dague on #openstack-nova confirmed the gate is running libvirt version 0.9.8 (which is a bit ancient, and is what the gate's Ubuntu 12.04 hosts ship).

However, a newer libvirt version is available on Ubuntu 13.10.

Here's the Libvirt OS distribution support matrix -- https://wiki.openstack.org/wiki/LibvirtDistroSupportMatrix

tags: added: libvirt libvirt1x
Matt Riedemann (mriedem) wrote :

Interesting note under section "Upgrade to libvirt-1.0.3+" here:

http://blog.gridcentric.com/bid/318277/Boosting-OpenStack-s-Parallel-Performance

"Prior to version 1.0.3, libvirt had one big lock that serialized request service. A series of patches from Daniel P. Berrangé broke the big lock into numerous fine-grained locks in libvirt-1.0.3."

That seems to fall in line with the timeouts we're seeing here, but moving up would be blocked by bug 1228977.

Joe Gordon (jogo) on 2014-02-07
tags: added: testing
Joe Gordon (jogo) wrote :
Changed in nova:
importance: Critical → High
Ken'ichi Ohmichi (oomichi) wrote :

I faced the same issue on http://logs.openstack.org/97/71197/2/gate/gate-tempest-dsvm-full/7498458/logs/.

The update_available_resource() operation stalled in the n-cpu process while holding the "compute_resources" semaphore for a long time due to this problem.
As a result, all "create a server" operations failed because they could not acquire the "compute_resources" semaphore.
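A rough sketch of that cascade, assuming nothing about Nova's real code beyond the semaphore name: one stalled holder of "compute_resources" starves every create request queued behind it (all names and timeouts here are illustrative):

```python
import threading
import time

# Illustrative model only, not Nova's actual code: one stalled holder of
# the "compute_resources" semaphore blocks every "create a server" call.
compute_resources = threading.Semaphore(1)
failures = []

def stalled_update_available_resource():
    # Holds the semaphore while "stuck" inside a hung libvirt call,
    # simulated here with a sleep.
    with compute_resources:
        time.sleep(0.3)

def create_server(name):
    # Each create gives up if it cannot take the semaphore in time.
    if compute_resources.acquire(timeout=0.05):
        compute_resources.release()
    else:
        failures.append(name)

holder = threading.Thread(target=stalled_update_available_resource)
holder.start()
time.sleep(0.02)  # let the holder take the semaphore first
workers = [threading.Thread(target=create_server, args=(f"vm{i}",))
           for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
holder.join()
print(sorted(failures))
```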

The current e-r query for this problem cannot catch this case, and I have posted a patch for it:
https://review.openstack.org/#/c/74925/

Changed in nova:
milestone: icehouse-3 → none
Attila Fazekas (afazekas) wrote :

It is happening more frequently (see http://status.openstack.org/elastic-recheck/).
I suggest increasing the Importance to Critical.

Jeremy Stanley (fungi) wrote :

As mentioned in bug 1290637, the recent upgrade in Rackspace from linux 3.2.0-59-virtual to 3.2.0-60-virtual seems to correlate with a significant increase in occurrence.

Daniel Berrange (berrange) wrote :

I analysed the logs from here, which include libvirt debug output:

http://logs.openstack.org/85/76685/3/check/check-tempest-dsvm-neutron-full/27b5a9d/logs/

The last successful thing to run against the VM in question (instance-00000021) was a virDomainManagedSave API call. The next API call virDomainCreate fails with the timed out error message. Other later calls have similar problems. From the logs I believe the problem is that a pending job is not getting ended.

Looking in libvirt GIT history for changes related to job management I see this obvious strong candidate

commit 6948b725e78016e45b846a17b89fafb69965be51
Author: Jiri Denemark <email address hidden>
Date: Wed Dec 14 09:57:07 2011 +0100

    qemu: Fix race between async and query jobs

    If an async job run on a domain will stop the domain at the end of the
    job, a concurrently run query job can hang in qemu monitor and nothing
    can be done with that domain from this point on. An attempt to start
    such domain results in "Timed out during operation: cannot acquire state
    change lock" error.

    However, quite a few things have to happen at the right time... There
    must be an async job running which stops a domain at the end. This race
    was reported with dump --crash but other similar jobs, such as
    (managed)save and migration, should be able to trigger this bug as well.
    While this async job is processing its last monitor command, that is a
    query-migrate to which qemu replies with status "completed", a new
    libvirt API that results in a query job must arrive and stay waiting
    until the query-migrate command finishes. Once query-migrate is done but
    before the async job closes qemu monitor while stopping the domain, the
    other thread needs to wake up and call qemuMonitorSend to send its
    command to qemu. Before qemu gets a chance to respond to this command,
    the async job needs to close the monitor. At this point, the query job
    thread is waiting for a condition that no-one will ever signal so it
    never finishes the job.

This describes precisely the scenario I see from the logs. The fix went into libvirt 0.9.9, which is of course just after the 0.9.8 version Ubuntu ships.

Unfortunately there's nothing we can do in Nova code to work around this. It will just require Ubuntu to patch their ancient libvirt builds with this upstream fix, or OpenStack to switch to newer libvirt builds.
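For readers trying to follow the commit message, the lost wakeup it describes can be modelled in a few lines of Python. This is a loose analogy, not libvirt's actual code; the timeout exists only so the demo terminates where real libvirt 0.9.8 would hang forever:

```python
import threading

# Loose analogy of the race in the commit message, NOT libvirt code:
# a query job waits on a condition variable for a monitor reply, but the
# async job tears the monitor down without ever signalling the waiter.
monitor_cond = threading.Condition()
reply_ready = False   # would be set by a monitor reply that never comes

def async_job_closes_monitor():
    # The async job stops the domain and closes the monitor *without*
    # calling monitor_cond.notify_all() -- this is the missing wakeup.
    with monitor_cond:
        pass

def query_job(timeout=0.1):
    # The query job already sent its command and now waits for a reply.
    # Without the timeout (added here only so the demo terminates) it
    # would block forever while holding the domain's job lock, and every
    # later API call would then fail with
    # "cannot acquire state change lock".
    with monitor_cond:
        got_reply = monitor_cond.wait_for(lambda: reply_ready,
                                          timeout=timeout)
    return "reply" if got_reply else "hung: no one will ever signal this condition"

async_job_closes_monitor()
result = query_job()
print(result)
```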

Daniel Berrange (berrange) wrote :

The kernel version of the host is not a major factor here; at most it could affect the timing of threads, making the race condition in libvirt slightly more or less likely.

FYI, in terms of maintained streams the closest to 0.9.8 seems to be 0.9.11: http://libvirt.org/git/?p=libvirt.git;a=heads

Changed in libvirt (Ubuntu):
importance: Undecided → High
status: New → Fix Released
Changed in libvirt (Ubuntu Precise):
importance: Undecided → High
status: New → Triaged
Serge Hallyn (serge-hallyn) wrote :

The test packages at http://people.canonical.com/~serge/libvirt-0.9.8-2ubuntu18 have the commit Daniel mentioned in comment #13. Please let us know if that helps.

Thanks @serge-hallyn - trying it out on this review https://review.openstack.org/#/c/79816/

I have 10+ runs of the gate jobs with 0.9.8+ in https://review.openstack.org/#/c/79816/ and the following logstash query is clean (no errors):

"libvirtError: Timed out during operation: cannot acquire state change lock" AND filename:"logs/screen-n-cpu.txt" AND build_change:"79816"

So, Serge Hallyn / James Page, we are good with the changes in http://people.canonical.com/~serge/libvirt-0.9.8-2ubuntu18/ from the openstack side.

Thanks, I've proposed the package for precise-proposed. In order for
the SRU team to accept it, they'll need an explicit test case filled
in under '[TEST CASE]' in the bug description. Can someone please fill
that in? Is it something like:

virsh start vm1
virsh destroy vm1 & for i in `seq 1 5`; do virsh vncdisplay vm1; done

?

description: updated

@Serge, honestly I don't have a standalone way to recreate this consistently, given the description from Daniel in comment #13 and from the commit itself (6948b725e78016e45b846a17b89fafb69965be51). I'm not sure how to proceed next if a TEST CASE is a must. I can't seem to access https://bugzilla.redhat.com/show_bug.cgi?id=770955, which may have additional information per page [1]

[1] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.2_Technical_Notes/libvirt.html

description: updated

Hello Joe, or anyone else affected,

Accepted libvirt into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libvirt/0.9.8-2ubuntu17.18 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Precise):
status: Triaged → Fix Committed
tags: added: verification-needed

@Serge, @Berrange, @Page, the 0.9.8+ patch has been running in literally all the gate tests today and is holding up well. I did see one instance; see libvirtd.txt.gz under

http://logs.openstack.org/39/71839/5/gate/gate-tempest-dsvm-full/491c26e/logs/

2014-03-20 21:12:57.000+0000: 15216: error : qemuDomainObjBeginJobInternal:789 : Timed out during operation: cannot acquire state change lock.

The patch is definitely holding up well. If you want to see incidents from the last 24 hours, please look at

http://logstash.openstack.org/#eyJzZWFyY2giOiJcImxpYnZpcnRFcnJvcjogVGltZWQgb3V0IGR1cmluZyBvcGVyYXRpb246IGNhbm5vdCBhY3F1aXJlIHN0YXRlIGNoYW5nZSBsb2NrXCIgQU5EIGZpbGVuYW1lOlwibG9ncy9zY3JlZW4tbi1jcHUudHh0XCIiLCJmaWVsZHMiOlsiYnVpbGRfY2hhbmdlIiwiZmlsZW5hbWUiLCJwcm9qZWN0Il0sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiODY0MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzk1MzU1MTk4Mjc0LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

thanks,
dims


Joe Gordon (jogo) wrote :

Related bug: https://bugs.launchpad.net/nova/+bug/1295876

This just popped up around the time https://review.openstack.org/#/c/79816 landed

Serge Hallyn (serge-hallyn) wrote :

It sounds like this patch has improved the robustness but not fixed it 100%; however, it has not regressed anything. Therefore, are there any objections to calling this bug verification-done so that the SRU can complete?

I think in order to push further on robustness, we will need some specific stress-testing scripts, not involving a full OpenStack setup, which can be run against versions 0.9.8, 1.1.1, and 1.2.x.

James Page (james-page) wrote :

I think sufficient verification of this fix has occurred upstream in the OpenStack gate; marking verification-done.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.9.8-2ubuntu17.18

---------------
libvirt (0.9.8-2ubuntu17.18) precise-proposed; urgency=medium

  * d/p/ubuntu/qemu-fix-race-between-async-and-query-jobs.patch: fix a race
    between asynchronous and query jobs. (LP: #1254872)
 -- Serge Hallyn <email address hidden> Tue, 11 Mar 2014 14:14:38 -0500

Changed in libvirt (Ubuntu Precise):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in nova:
status: Triaged → Fix Committed
Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
Ray Chen (chenrano2002) wrote :

I see this issue again on 0.9.8-2ubuntu17.19, which should include the patch. Please take a look:

https://bugs.launchpad.net/nova/+bug/1330098

Federico Iezzi (fiezzi) wrote :

Hi guys, I had exactly the same problem on Ubuntu 12.04, but using the Cloud Archive Havana repository, so the libvirt version is 1.1.1 and the Linux kernel is 3.11.

dpkg -l | grep libvirt
ii libvirt-bin 1.1.1-0ubuntu8.11~cloud0 programs for the libvirt library
ii libvirt0 1.1.1-0ubuntu8.11~cloud0 library for interfacing with different virtualization systems
ii python-libvirt 1.1.1-0ubuntu8.11~cloud0 libvirt Python bindings

uname -a
Linux compute010 3.11.0-24-generic #41~precise1-Ubuntu SMP Wed Jun 11 13:15:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

virsh version
Compiled against library: libvirt 1.1.1
Using library: libvirt 1.1.1
Using API: QEMU 1.1.1
Running hypervisor: QEMU 1.5.0

And from the libvirt log (/var/log/libvirt/libvirtd.log):

2014-09-20 06:10:46.000+0000: 2529: error : qemuDomainObjBeginJobInternal:1027 : Timed out during operation: cannot acquire state change lock
2014-09-20 06:11:23.000+0000: 2530: warning : qemuDomainObjBeginJobInternal:1023 : Cannot start job (query, none) for domain instance-00000ecf; current job is (modify, none) owned by (2533, 0)

Do you have any ideas?

Thanks a lot!
Kind Regards,
Federico

Serge Hallyn (serge-hallyn) wrote :

I think it'd be best to file a new bug providing the relevant debugging information from your systems.

Thierry Carrez (ttx) on 2014-10-16
Changed in nova:
milestone: juno-2 → 2014.2