Under load, libvirt fails to start VMs concurrently

Bug #1055658 reported by Adin Scannell
This bug affects 6 people
Affects            Status         Importance   Assigned to   Milestone
libvirt (Ubuntu)   Fix Released   Undecided    Unassigned
Precise            Fix Released   High         Unassigned

Bug Description

==================================
SRU Justification
1. Impact: when starting multiple VMs simultaneously, many may fail to start.
2. Development fix: the attached patch was used upstream to fix the race
3. Stable fix: use the same patch
4. Test case: start many VMs simultaneously and make sure they all start.
5. Regression potential: there should be none, patch was taken straight from
   upstream.
==================================

Symptoms:
When the system is under load and you attempt to start multiple VMs simultaneously via libvirt, many of the VMs fail to start. We see this on an OpenStack compute server when VMs are provisioned rapidly (for automated tests, scaling, etc.). We can easily reproduce it by starting 5-10 VMs on a single server simultaneously, in which case libvirt fails to start about half of them. Even in more modest scenarios, such as VMs being started within a few seconds of each other, this is likely to affect OpenStack compute servers running on Precise from time to time.
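
The reproduction described above can be approximated with a small harness. This is only a sketch under stated assumptions: it shells out to `virsh start`, the helper names `virsh_start` and `start_concurrently` are made up for illustration, and a thread pool only approximates "simultaneously".

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def virsh_start(name):
    """Start one already-defined domain with `virsh start`; True on success."""
    result = subprocess.run(["virsh", "start", name],
                            capture_output=True, text=True)
    return result.returncode == 0


def start_concurrently(names, start=virsh_start):
    """Start every domain from its own thread and return the names that failed."""
    names = list(names)
    if not names:
        return []
    # One worker per domain so that no start request waits behind another.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        results = list(pool.map(start, names))
    return [name for name, ok in zip(names, results) if not ok]
```

For example, `start_concurrently(["test-vm-%d" % i for i in range(10)])` would try to start ten domains (the names are hypothetical and the domains must already be defined) and return the ones that did not come up. The `start` parameter lets the harness be exercised without libvirt installed.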

Diagnosis:
I investigated the source of the problem, and it seems that libvirt has a few double close() problems that have been fixed since 0.9.8. Concurrent tasks in libvirtd have a decent chance of stepping on each other's toes (incorrectly closing a file descriptor that has been reused for some other purpose). Some of these race conditions have a very small window of opportunity, but one problem seems to be more common (particularly when the system is under load and a new qemu process might take a while to start).
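
The descriptor-reuse hazard described above can be shown in a few lines. This is a generic, single-threaded illustration of the bug class, not the libvirt code path and not the upstream patch: "task A" and "task B" are simulated sequentially, and `OwnedFd` is a hypothetical wrapper showing the common "forget the number after closing" idiom.

```python
import os


def stale_close_hits_new_descriptor():
    """Return True if a second close() of a stale number closes someone else's fd."""
    # Task A opens a pipe and keeps a copy of the read end's number.
    a_read, a_write = os.pipe()
    stale = a_read
    os.close(a_read)                  # legitimate close

    # Task B opens its own pipe. The kernel hands out the lowest free
    # number, so in a single-threaded process B's read end reuses A's number.
    b_read, b_write = os.pipe()
    if b_read != stale:               # defensive: no reuse, nothing to show
        for fd in (b_read, b_write, a_write):
            os.close(fd)
        return False

    # Task A closes "its" descriptor a second time: this closes B's pipe end.
    os.close(stale)
    try:
        os.fstat(b_read)
        hit = False
    except OSError:                   # EBADF: B's descriptor is gone
        hit = True

    os.close(a_write)
    os.close(b_write)
    return hit


class OwnedFd:
    """Hypothetical wrapper: reset the number on close so a repeat is a no-op."""

    def __init__(self, fd):
        self.fd = fd

    def close(self):
        if self.fd >= 0:
            os.close(self.fd)
            self.fd = -1
```

In a multi-threaded daemon the same sequence can happen whenever another thread opens a descriptor between the two close() calls, which is why the window widens under load.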

Solution:
The problem has since been fixed in upstream libvirt. The commit message refers to a Red Hat bug (https://bugzilla.redhat.com/show_bug.cgi?id=823716), but I don't have access to view it.

I've attached the upstream patch. I've tested it and it fixes the problem and applies cleanly (with a few offsets). It is a small, low-risk patch. I'm submitting this bug because it would be great to have this fix in the LTS release.

Revision history for this message
Adin Scannell (amscanne) wrote :
Serge Hallyn (serge-hallyn) wrote :

Thanks for reporting this bug, and especially for finding the patch fixing the problem! I assume you are seeing this in 12.04 and not in 12.10? (Just making sure.) If so, I'll go ahead and mark this for SRU.

Adin Scannell (amscanne) wrote :

Yep, this is 12.04.

This patch is included in libvirt 0.9.13, so I believe 12.10 is already covered.

Serge Hallyn (serge-hallyn) wrote :

Thanks, Adin - there is already a libvirt package in precise-proposed awaiting verification. I will push this fix as soon as that package is promoted to -updates.

Changed in libvirt (Ubuntu):
status: New → Fix Released
Changed in libvirt (Ubuntu Precise):
status: New → Triaged
importance: Undecided → High
description: updated
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Adin, or anyone else affected,

Accepted libvirt into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libvirt/0.9.8-2ubuntu17.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: precise
Changed in libvirt (Ubuntu Precise):
status: Triaged → Fix Committed
tags: added: verification-needed
Matt Rae (mattrae) wrote :

Verified in precise proposed using the breaklibvirt.sh from bug:961217. breaklibvirt.sh completes 100 rounds of starting and stopping 4 instances without failures after updating to http://launchpad.net/ubuntu/+source/libvirt/0.9.8-2ubuntu17.5

tags: added: verification-done
removed: verification-needed
David Pottage (david-electric-spoon) wrote :

I have tested the patch on a KVM cluster at my employer and, based on a day or so of testing under heavy load, it fixes the bug for me.

The cluster consists of 7 machines, each running up to 8 VM jobs in parallel. Most are various Windows versions, but some are Ubuntu Lucid. Each job runs for around 2 minutes before being killed and a fresh VM booted in its place, so we have thousands of KVM invocations per day.

Before, we saw around 5% of jobs fail due to libvirt errors (mostly from get_domain_by_name in the Perl bindings). This was reduced to 1.5% by retrying the API call until it worked. It now looks like the failure rate is down to zero with the new version of libvirt.
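
The retry workaround mentioned above could look roughly like the following. This is a hedged sketch in Python rather than the Perl actually used on that cluster; `call_with_retries` and its parameters are hypothetical names, and the wrapped call is whatever lookup was failing intermittently.

```python
import time


def call_with_retries(func, attempts=5, delay=0.5, retry_on=(Exception,)):
    """Call func() until it succeeds or attempts run out; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise                 # out of attempts: surface the failure
            time.sleep(delay)         # brief pause before trying again
```

A retry loop like this only masks the race; as the numbers above suggest, it lowered the failure rate but did not eliminate it, whereas the fixed package did.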

Thank you.

Colin Watson (cjwatson) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 0.9.8-2ubuntu17.5

---------------
libvirt (0.9.8-2ubuntu17.5) precise-proposed; urgency=low

  * add patch Reduce-udevadm-settle-timeout-to-10-seconds.patch (copied from
    Debian tree) to fix 3 minute hang during pool-refresh when using LVM
    backed pools. (LP: #1027987)
  * add upstream patch command-avoid-double-close-bugs to avoid a race when
    starting multiple VMs concurrently. (LP: #1055658)
 -- Serge Hallyn <email address hidden> Wed, 03 Oct 2012 11:48:31 -0500

Changed in libvirt (Ubuntu Precise):
status: Fix Committed → Fix Released