libvirtError: Timed out during operation: cannot acquire state change lock

Bug #1254872 reported by Joe Gordon on 2013-11-25
This bug affects 11 people
Affects                       Importance  Assigned to
OpenStack Compute (nova)      High        Unassigned
ubuntu-server-iso-testing     Undecided   Unassigned
libvirt (Ubuntu)              High        Unassigned
libvirt (Ubuntu) Precise      High        Unassigned

Bug Description

libvirtError: Timed out during operation: cannot acquire state change lock

Source:

http://logs.openstack.org/72/58372/1/check/check-tempest-devstack-vm-full/4dd1a60/logs/screen-n-cpu.txt.gz
http://logs.openstack.org/72/58372/1/check/check-tempest-devstack-vm-full/4dd1a60/testr_results.html.gz

Query: "libvirtError: Timed out during operation: cannot acquire state change lock" AND filename:"logs/screen-n-cpu.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJcImxpYnZpcnRFcnJvcjogVGltZWQgb3V0IGR1cmluZyBvcGVyYXRpb246IGNhbm5vdCBhY3F1aXJlIHN0YXRlIGNoYW5nZSBsb2NrXCIgQU5EIGZpbGVuYW1lOlwibG9ncy9zY3JlZW4tbi1jcHUudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODU0MTI3MzA4NzQsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
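As an aside for anyone adapting these queries: the logstash.openstack.org links above are just base64-encoded JSON query descriptors after the '#'. A minimal Python sketch of building and decoding such a link; the field names are read off the URL above, and the descriptor shape is otherwise an assumption, not a documented API:

```python
import base64
import json

# Hypothetical query descriptor mirroring the fields visible in the URL
# fragment above; logstash.openstack.org simply base64-encodes JSON like
# this after the '#'.
query = {
    "search": ('"libvirtError: Timed out during operation: cannot acquire '
               'state change lock" AND filename:"logs/screen-n-cpu.txt"'),
    "fields": [],
    "offset": 0,
    "timeframe": "604800",   # query window in seconds (one week)
    "graphmode": "count",
}

# Encode the descriptor into a shareable URL fragment...
fragment = base64.b64encode(json.dumps(query).encode()).decode()
url = "http://logstash.openstack.org/#" + fragment

# ...and decode it back the same way to inspect someone else's link.
decoded = json.loads(base64.b64decode(fragment))
print(decoded["search"])
```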

[IMPACT]

Queries against a domain which is being worked on asynchronously
can result in hung clients and the inability to do anything
further with that domain.

[TEST CASE]

Set up an OpenStack cluster with heavy VM turnover and querying.

[REGRESSION POTENTIAL]

This patch is cherry-picked from upstream and has been tested using
OpenStack. It has so far shown no regressions.

Joe Gordon (jogo) wrote :

Marking as critical as we are seeing this in the gate

Changed in nova:
importance: Undecided → Critical
milestone: none → icehouse-1
Changed in nova:
status: New → Triaged
Daniel Berrange (berrange) wrote :

What this error message indicates is that two separate threads have made libvirt API calls against the same VM. One of the calls has either hung completely, or is taking a very long time to respond, causing the second API call to report this error message.
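That failure mode can be sketched with plain Python threading. This is a loose, libvirt-free model, not libvirt code: the lock name and error string mirror the report, everything else (functions, timeouts) is illustrative:

```python
import threading
import time

# Illustrative stand-in for libvirt's per-domain job lock; NOT real libvirt
# code, just a model of two API calls colliding on the same domain.
state_change_lock = threading.Lock()

def hung_api_call():
    # The first caller takes the job lock and stalls (e.g. on a stuck
    # monitor command); the stall is simulated here with a sleep.
    with state_change_lock:
        time.sleep(0.5)

def second_api_call(timeout=0.1):
    # The second caller gives up after a timeout instead of waiting
    # forever, surfacing the same error string libvirt reports.
    if state_change_lock.acquire(timeout=timeout):
        state_change_lock.release()
        return "ok"
    return "Timed out during operation: cannot acquire state change lock"

t = threading.Thread(target=hung_api_call)
t.start()
time.sleep(0.05)            # let the first call grab the lock
result = second_api_call()
t.join()
print(result)
```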

Here we're seeing the error message in response to virDomainAttachDeviceFlags. Unfortunately we really need to see what the original API call was, which is hard to find out.

With some libvirt versions there have been bugs which caused this error message, so it's important to know which libvirt version shows this error. Indeed, if the gate runs on hosts with a variety of libvirt/qemu versions, it'd be useful to figure out whether it only occurs on hosts with specific versions of libvirt/qemu.

To diagnose this further likely requires turning up the libvirt logging level.

For Nova I'd suggest setting

  export LIBVIRT_LOG_FILTERS="1:libvirt"
  export LIBVIRT_LOG_OUTPUTS="1:file:/var/log/libvirt/nova-cpu.log"

before starting the nova-cpu processes. This captures client-side traces.

It might also be useful to enable logging on the libvirtd server side by editing /etc/libvirt/libvirtd.conf and setting

  log_filters="1:libvirt 1:qemu"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"

This captures server-side traces.

Changed in nova:
milestone: icehouse-1 → icehouse-2

I see the following 4 scenarios in the last 48 hours:

File "/usr/lib/python2.7/dist-packages/libvirt.py", line 581, in createWithFlags
File "/usr/lib/python2.7/dist-packages/libvirt.py", line 696, in managedSave
File "/usr/lib/python2.7/dist-packages/libvirt.py", line 420, in attachDeviceFlags
File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1352, in suspend

"line 581, in createWithFlags" AND "libvirtError: Timed out during operation" : 15
"line 696, in managedSave" AND "libvirtError: Timed out during operation" : 8
"line 420, in attachDeviceFlags" AND "libvirtError: Timed out during operation" : 12
"line 1352, in suspend" AND "libvirtError: Timed out during operation" : 1

Added the following to capture more logs

https://review.openstack.org/#/c/65834/ - Enable server-side and client-side logs for libvirt
https://review.openstack.org/#/c/65833/ - Capture libvirtd logs

Changed in nova:
milestone: icehouse-2 → icehouse-3
Kashyap Chamarthy (kashyapc) wrote :

This is blocking Gate -- http://status.openstack.org/elastic-recheck/

Trying to reproduce this issue and figure out which *version* of libvirt this is happening with (as Dan noted in comment #2, it's important to know this to identify the root cause).

Kashyap Chamarthy (kashyapc) wrote :

Sean Dague on #openstack-nova confirmed the gate is running libvirt version 0.9.8 (which is a bit ancient, and is what the gate's Ubuntu 12.04 hosts ship).

However, a newer libvirt version is available on Ubuntu 13.10.

Here's the Libvirt OS distribution support matrix -- https://wiki.openstack.org/wiki/LibvirtDistroSupportMatrix

tags: added: libvirt libvirt1x
Matt Riedemann (mriedem) wrote :

Interesting note under section "Upgrade to libvirt-1.0.3+" here:

http://blog.gridcentric.com/bid/318277/Boosting-OpenStack-s-Parallel-Performance

"Prior to version 1.0.3, libvirt had one big lock that serialized request service. A series of patches from Daniel P. Berrangé broke the big lock into numerous fine-grained locks in libvirt-1.0.3."

That seems to fall in line with the timeouts we're seeing here, but moving up would be blocked by bug 1228977.

Joe Gordon (jogo) on 2014-02-07
tags: added: testing
Joe Gordon (jogo) wrote :
Changed in nova:
importance: Critical → High
Ken'ichi Ohmichi (oomichi) wrote :

I faced the same issue on http://logs.openstack.org/97/71197/2/gate/gate-tempest-dsvm-full/7498458/logs/.

The update_available_resource() operation stalled in the n-cpu process while holding the "compute_resources" semaphore for a long time due to this problem.
As a result, all "create a server" operations failed because they could not acquire the "compute_resources" semaphore.
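A rough sketch of that cascade, assuming nothing about Nova's real code beyond the semaphore name: one stalled holder of "compute_resources" starves every create request queued behind it (all names and timeouts here are illustrative):

```python
import threading
import time

# Illustrative model only, not Nova's actual code: one stalled holder of
# the "compute_resources" semaphore blocks every "create a server" call.
compute_resources = threading.Semaphore(1)
failures = []

def stalled_update_available_resource():
    # Holds the semaphore while "stuck" inside a hung libvirt call,
    # simulated here with a sleep.
    with compute_resources:
        time.sleep(0.3)

def create_server(name):
    # Each create gives up if it cannot take the semaphore in time.
    if compute_resources.acquire(timeout=0.05):
        compute_resources.release()
    else:
        failures.append(name)

holder = threading.Thread(target=stalled_update_available_resource)
holder.start()
time.sleep(0.02)  # let the holder take the semaphore first
workers = [threading.Thread(target=create_server, args=(f"vm{i}",))
           for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
holder.join()
print(sorted(failures))
```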

The current e-r query for this problem cannot catch this case, and I have posted a patch for it:
https://review.openstack.org/#/c/74925/

Changed in nova:
milestone: icehouse-3 → none
Attila Fazekas (afazekas) wrote :

It is happening more frequently (see http://status.openstack.org/elastic-recheck/).
I suggest increasing the Importance to Critical.

Jeremy Stanley (fungi) wrote :

As mentioned in bug 1290637, the recent upgrade in Rackspace from linux 3.2.0-59-virtual to 3.2.0-60-virtual seems to correlate with a significant increase in occurrence.

Daniel Berrange (berrange) wrote :

I analysed the logs from here, which include libvirt debug output:

http://logs.openstack.org/85/76685/3/check/check-tempest-dsvm-neutron-full/27b5a9d/logs/

The last successful thing to run against the VM in question (instance-00000021) was a virDomainManagedSave API call. The next API call virDomainCreate fails with the timed out error message. Other later calls have similar problems. From the logs I believe the problem is that a pending job is not getting ended.

Looking in libvirt GIT history for changes related to job management I see this obvious strong candidate

commit 6948b725e78016e45b846a17b89fafb69965be51
Author: Jiri Denemark <email address hidden>
Date: Wed Dec 14 09:57:07 2011 +0100

    qemu: Fix race between async and query jobs

    If an async job run on a domain will stop the domain at the end of the
    job, a concurrently run query job can hang in qemu monitor and nothing
    can be done with that domain from this point on. An attempt to start
    such domain results in "Timed out during operation: cannot acquire state
    change lock" error.

    However, quite a few things have to happen at the right time... There
    must be an async job running which stops a domain at the end. This race
    was reported with dump --crash but other similar jobs, such as
    (managed)save and migration, should be able to trigger this bug as well.
    While this async job is processing its last monitor command, that is a
    query-migrate to which qemu replies with status "completed", a new
    libvirt API that results in a query job must arrive and stay waiting
    until the query-migrate command finishes. Once query-migrate is done but
    before the async job closes qemu monitor while stopping the domain, the
    other thread needs to wake up and call qemuMonitorSend to send its
    command to qemu. Before qemu gets a chance to respond to this command,
    the async job needs to close the monitor. At this point, the query job
    thread is waiting for a condition that no-one will ever signal so it
    never finishes the job.

This describes precisely the scenario I see from the logs. The fix went into libvirt 0.9.9, which is of course just after the 0.9.8 version Ubuntu ships.

Unfortunately there's nothing we can do in Nova code to work around this. It will just require Ubuntu to patch their ancient libvirt builds with this upstream fix, or OpenStack to switch to newer libvirt builds.
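For readers trying to follow the commit message, the lost wakeup it describes can be modelled in a few lines of Python. This is a loose analogy, not libvirt's actual code; the timeout exists only so the demo terminates where real libvirt 0.9.8 would hang forever:

```python
import threading

# Loose analogy of the race in the commit message, NOT libvirt code:
# a query job waits on a condition variable for a monitor reply, but the
# async job tears the monitor down without ever signalling the waiter.
monitor_cond = threading.Condition()
reply_ready = False   # would be set by a monitor reply that never comes

def async_job_closes_monitor():
    # The async job stops the domain and closes the monitor *without*
    # calling monitor_cond.notify_all() -- this is the missing wakeup.
    with monitor_cond:
        pass

def query_job(timeout=0.1):
    # The query job already sent its command and now waits for a reply.
    # Without the timeout (added here only so the demo terminates) it
    # would block forever while holding the domain's job lock, and every
    # later API call would then fail with
    # "cannot acquire state change lock".
    with monitor_cond:
        got_reply = monitor_cond.wait_for(lambda: reply_ready,
                                          timeout=timeout)
    return "reply" if got_reply else "hung: no one will ever signal this condition"

async_job_closes_monitor()
result = query_job()
print(result)
```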

Daniel Berrange (berrange) wrote :

The kernel version of the host is not a major factor here; at most it could affect the timing of threads, making the race condition in libvirt slightly more or less likely.

FYI, in terms of maintained streams the closest to 0.9.8 seems to be 0.9.11: http://libvirt.org/git/?p=libvirt.git;a=heads

Changed in libvirt (Ubuntu):
importance: Undecided → High
status: New → Fix Released
Changed in libvirt (Ubuntu Precise):
importance: Undecided → High
status: New → Triaged
Serge Hallyn (serge-hallyn) wrote :

The test packages at http://people.canonical.com/~serge/libvirt-0.9.8-2ubuntu18 have the commit Daniel mentioned in comment #13. Please let us know if that helps.

Thanks @serge-hallyn - trying it out on this review https://review.openstack.org/#/c/79816/

I have 10+ runs of the gate jobs with 0.9.8+ in https://review.openstack.org/#/c/79816/ and the following logstash query is clean (no errors):

"libvirtError: Timed out during operation: cannot acquire state change lock" AND filename:"logs/screen-n-cpu.txt" AND build_change:"79816"

So, Serge Hallyn / James Page, we are good with the changes in http://people.canonical.com/~serge/libvirt-0.9.8-2ubuntu18/ from the openstack side.

Thanks, I've proposed the package for precise-proposed. In order for
the SRU team to accept it, they'll need an explicit test case filled
in under '[TEST CASE]' in the bug description. Can someone please fill
that in? Is it something like:

virsh start vm1
virsh destroy vm1 & for i in `seq 1 5`; do virsh vncdisplay vm1; done

?

description: updated

@Serge, honestly I don't have a standalone way to recreate this consistently, given the description from Daniel in comment #13 and from the commit itself (6948b725e78016e45b846a17b89fafb69965be51). I'm not sure how to proceed next if a TEST CASE is a must. I can't seem to access https://bugzilla.redhat.com/show_bug.cgi?id=770955, which may have additional information per page [1]

[1] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.2_Technical_Notes/libvirt.html

description: updated

Hello Joe, or anyone else affected,

Accepted libvirt into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libvirt/0.9.8-2ubuntu17.18 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Precise):
status: Triaged → Fix Committed
tags: added: verification-needed

@Serge, @Berrange, @Page, the 0.9.8+ patch has been running in literally all the gate tests today and is holding up well. I did see one instance; see libvirtd.txt.gz under

http://logs.openstack.org/39/71839/5/gate/gate-tempest-dsvm-full/491c26e/logs/

2014-03-20 21:12:57.000+0000: 15216: error : qemuDomainObjBeginJobInternal:789 : Timed out during operation: cannot acquire state change lock.

The patch is definitely holding up well. If you want to see incidents from the last 24 hours, please look at

http://logstash.openstack.org/#eyJzZWFyY2giOiJcImxpYnZpcnRFcnJvcjogVGltZWQgb3V0IGR1cmluZyBvcGVyYXRpb246IGNhbm5vdCBhY3F1aXJlIHN0YXRlIGNoYW5nZSBsb2NrXCIgQU5EIGZpbGVuYW1lOlwibG9ncy9zY3JlZW4tbi1jcHUudHh0XCIiLCJmaWVsZHMiOlsiYnVpbGRfY2hhbmdlIiwiZmlsZW5hbWUiLCJwcm9qZWN0Il0sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiODY0MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzk1MzU1MTk4Mjc0LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

thanks,
dims


Joe Gordon (jogo) wrote :

Related bug: https://bugs.launchpad.net/nova/+bug/1295876

This just popped up around the time https://review.openstack.org/#/c/79816 landed

Serge Hallyn (serge-hallyn) wrote :

It sounds like this patch has improved the robustness but not fixed it 100%; however, it has not regressed anything. Therefore, are there any objections to calling this bug verification-done so that the SRU can complete?

I think in order to push further on robustness, we will need some specific stress-testing scripts, not involving a full OpenStack setup, which can be run against versions 0.9.8, 1.1.1, and 1.2.x.

James Page (james-page) wrote :

I think sufficient verification of this fix has occurred upstream in the OpenStack gate; marking verification-done.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.9.8-2ubuntu17.18

---------------
libvirt (0.9.8-2ubuntu17.18) precise-proposed; urgency=medium

  * d/p/ubuntu/qemu-fix-race-between-async-and-query-jobs.patch: fix a race
    between asynchronous and query jobs. (LP: #1254872)
 -- Serge Hallyn <email address hidden> Tue, 11 Mar 2014 14:14:38 -0500

Changed in libvirt (Ubuntu Precise):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in nova:
status: Triaged → Fix Committed
Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
Ray Chen (chenrano2002) wrote :

I see this issue again on 0.9.8-2ubuntu17.19, which should include the patch. Please take a look:

https://bugs.launchpad.net/nova/+bug/1330098

Federico Iezzi (fiezzi) wrote :

Hi guys, I had exactly the same problem on Ubuntu 12.04, but using the Cloud Archive Havana repository, so the libvirt version is 1.1.1 and the Linux kernel is 3.11.

dpkg -l | grep libvirt
ii libvirt-bin 1.1.1-0ubuntu8.11~cloud0 programs for the libvirt library
ii libvirt0 1.1.1-0ubuntu8.11~cloud0 library for interfacing with different virtualization systems
ii python-libvirt 1.1.1-0ubuntu8.11~cloud0 libvirt Python bindings

uname -a
Linux compute010 3.11.0-24-generic #41~precise1-Ubuntu SMP Wed Jun 11 13:15:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

virsh version
Compiled against library: libvirt 1.1.1
Using library: libvirt 1.1.1
Using API: QEMU 1.1.1
Running hypervisor: QEMU 1.5.0

And from the libvirt log (/var/log/libvirt/libvirtd.log):

2014-09-20 06:10:46.000+0000: 2529: error : qemuDomainObjBeginJobInternal:1027 : Timed out during operation: cannot acquire state change lock
2014-09-20 06:11:23.000+0000: 2530: warning : qemuDomainObjBeginJobInternal:1023 : Cannot start job (query, none) for domain instance-00000ecf; current job is (modify, none) owned by (2533, 0)

Do you have any ideas?

Thanks a lot!
Kind Regards,
Federico

Serge Hallyn (serge-hallyn) wrote :

I think it'd be best to file a new bug providing the relevant debugging information from your systems.

Thierry Carrez (ttx) on 2014-10-16
Changed in nova:
milestone: juno-2 → 2014.2