Under load, libvirt fails to start VMs concurrently
1. Impact: when starting multiple VMs simultaneously, many may fail to start.
2. Development fix: the attached patch was used upstream to fix the race.
3. Stable fix: use the same patch.
4. Test case: start many VMs simultaneously and make sure they all start.
5. Regression potential: there should be none, patch was taken straight from
When the system is under load and you attempt to start multiple VMs simultaneously via libvirt, many of the VMs fail to start. We see this on an OpenStack compute server when VMs are provisioned rapidly (for automated tests, scaling, etc.). We can easily reproduce it by starting 5-10 VMs on a single server simultaneously, in which case libvirt fails to start about half of them. Even in more modest scenarios (that is, whenever VMs are started within a few seconds of each other), this is likely to occasionally affect OpenStack compute servers running on precise.
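For anyone who wants to script the reproduction, here is a rough sketch of the test case. It is not the script we used: the helper names `virsh_start` and `start_concurrently` are made up for illustration, and it assumes the named domains are already defined and shut off, since `virsh start` also returns a non-zero status for a domain that is already running.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def virsh_start(name):
    """Return True if `virsh start NAME` exits with status 0."""
    result = subprocess.run(
        ["virsh", "start", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def start_concurrently(names, start_fn=virsh_start):
    """Start every domain in `names` at once; return the names that failed.

    `start_fn` is a hypothetical hook so the helper can be exercised
    without a running libvirtd.
    """
    names = list(names)
    if not names:
        return []
    # One worker per domain, so all start requests are issued together.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        results = list(pool.map(start_fn, names))
    return [name for name, ok in zip(names, results) if not ok]
```

Calling `start_concurrently(["vm1", "vm2", "vm3"])` with your own domain names (these are placeholders) returns the list of domains that did not start; with the fix applied the list should be empty.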
I investigated the source of the problem, and it seems that libvirt has a few double close() problems that have been fixed since 0.9.8. Concurrent tasks in libvirtd have a decent chance of stepping on each other's toes (incorrectly closing a file descriptor that has since been reused for some other purpose). Some of these race conditions have a very small window of opportunity, but one problem seems to be more common (particularly when the system is under load and a new qemu process might take a while to start).
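To illustrate why a double close() is harmful, here is a small stand-alone sketch. It is not libvirt code: it plays both "tasks" in a single thread so that the descriptor reuse is deterministic, and the function name `demonstrate_double_close` is made up for the example. It relies on the POSIX rule that a new descriptor gets the lowest free number.

```python
import os


def demonstrate_double_close():
    """Show a stale second close() hitting a descriptor that was reused.

    Returns (reused, victim_alive).
    """
    # "Task A" opens a pipe and closes its read end (the legitimate close),
    # but keeps the old descriptor number around.
    r, w = os.pipe()
    stale_fd = r
    os.close(r)

    # "Task B" opens a new pipe. The kernel hands out the lowest free
    # number, so the new read end reuses the number task A just released.
    r2, w2 = os.pipe()
    reused = (r2 == stale_fd)

    # Task A's buggy second close: it now closes task B's descriptor.
    try:
        os.close(stale_fd)
    except OSError:
        pass  # the number was not reused, so nothing was closed

    # Check whether task B's descriptor is still valid.
    try:
        os.fstat(r2)
        victim_alive = True
    except OSError:
        victim_alive = False

    # Clean up whatever is still open.
    for fd in (w, w2, r2):
        try:
            os.close(fd)
        except OSError:
            pass
    return reused, victim_alive
```

In libvirtd the two tasks run concurrently, so whether the number gets reused between the two close() calls depends on timing, which fits the failures showing up mainly under load.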
The problem has since been fixed in upstream Libvirt, and the commit message refers to the RedHat bug here (https:/
I've attached the upstream patch. I've tested it and it fixes the problem and applies cleanly (with a few offsets). It is a small, low-risk patch. I'm submitting this bug because it would be great to have this fix in the LTS release.