guest hangs after live migration due to tsc jump
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
QEMU | Fix Released | Undecided | Unassigned |
glusterfs (Ubuntu) | Invalid | Undecided | Unassigned |
Trusty | Confirmed | Undecided | Unassigned |
qemu (Ubuntu) | Fix Released | High | Unassigned |
Trusty | Fix Released | High | Unassigned |
Bug Description
=======
SRU Justification:
1. Impact: guests hang after live migration with 100% cpu
2. Upstream fix: a set of four patches fixes this upstream
3. Stable fix: we have a backport of the four patches into a single patch.
4. Test case: try a set of migrations of different VMs (unfortunately the hang is not 100% reproducible)
5. Regression potential: the patch is not trivial, however the lp:qa-regression-tests testsuite passed 100% with this package.
=======
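Because the hang is not reliably reproducible, the test case above amounts to repeating live migrations back and forth. A minimal sketch of how such a run could be scripted is shown here. It only builds the `virsh migrate --live` command lines and does not execute them. The guest name, the host names and the `qemu+ssh://.../system` connection URIs are assumptions for illustration and are not taken from this report.

```python
import shlex

def migration_commands(domain, host_a, host_b, rounds):
    """Build virsh command lines that bounce `domain` between two hosts.

    All names passed in are hypothetical; adjust the URIs to your setup.
    """
    cmds = []
    src, dst = host_a, host_b
    for _ in range(rounds):
        cmds.append([
            "virsh", "--connect", f"qemu+ssh://{src}/system",
            "migrate", "--live", domain, f"qemu+ssh://{dst}/system",
        ])
        # The next round migrates in the opposite direction.
        src, dst = dst, src
    return cmds

if __name__ == "__main__":
    # Hypothetical guest and host names.
    for cmd in migration_commands("guest1", "server-a", "server-b", 4):
        print(shlex.join(cmd))
```

Each printed line could then be run by hand or passed to `subprocess.run`, checking the guest's responsiveness and dmesg output after every step.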
We have two identical Ubuntu servers running libvirt/kvm/qemu, sharing a Gluster filesystem. Guests can be live migrated between them. However, live migration often leaves the guest stuck at 100% CPU for a while. In that case, the dmesg output for such a guest will show (once it recovers): Clocksource tsc unstable (delta = 662463064082 ns). In this particular example, the guest became responsive again only about 11 minutes (662 seconds) after it was migrated.
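To make the size of the jump easier to read, the delta in that kernel message can be converted from nanoseconds to seconds and minutes. The sketch below is only an illustration: the regular expression is written against the exact message quoted above, and the function name is made up for this example.

```python
import re

# Matches the kernel message quoted in this report.
_DELTA_RE = re.compile(r"Clocksource tsc unstable \(delta = (-?\d+) ns\)")

def tsc_delta_seconds(line):
    """Return the reported delta in seconds, or None if the line does not match."""
    match = _DELTA_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)) / 1e9

if __name__ == "__main__":
    line = "Clocksource tsc unstable (delta = 662463064082 ns)"
    secs = tsc_delta_seconds(line)
    print(f"{secs:.1f} s = {secs / 60:.1f} min")  # 662.5 s = 11.0 min
```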
It seems that newly booted guests do not suffer from this problem; they can be migrated back and forth at will. After a day or so of uptime, the problem becomes apparent. It also seems that migrating from server A to server B causes many more problems than going from B back to A. If necessary, I can do more measurements to quantify these observations.
The VM servers run Ubuntu 13.04 with these packages:
Kernel: 3.8.0-35-generic x86_64
Libvirt: 1.0.2
Qemu: 1.4.0
Gluster-fs: 3.4.2 (libvirt accesses the images via the filesystem, not via libgfapi yet, as the Ubuntu libvirt is not linked against libgfapi).
The interconnect between both machines (both for migration and gluster) is 10GbE.
Both servers are synced to NTP and well within 1 ms of one another.
Guests are either Ubuntu 13.04 or 13.10.
On the guests, the current_clocksource is kvm-clock.
The XML definition of the guests only contains: <clock offset='utc'/>
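For anyone trying to compare their setup, the guest's clocksource mentioned above can be read from sysfs inside the guest. The following is a small sketch, assuming the standard Linux sysfs location; the function name is made up for this example and the `base` parameter exists only so the code can be pointed at another directory.

```python
from pathlib import Path

# Standard Linux sysfs location for clocksource information.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")

def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available

if __name__ == "__main__":
    try:
        current, available = read_clocksources()
        print(f"current: {current}; available: {', '.join(available)}")
    except OSError as exc:
        # For example, not running on Linux or sysfs is not mounted.
        print(f"could not read clocksource information: {exc}")
```

On the guests described here, the first value would be expected to be kvm-clock.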
As far as I can tell from the kvm-clock documentation, it specifically supports live migration, so I'm a bit surprised by these problems. There isn't much information to be found on this issue, although I have found postings by others who seem to have run into the same problem, but without a solution.
---
ApportVersion: 2.14.1-0ubuntu3
Architecture: amd64
DistroRelease: Ubuntu 14.04
Package: libvirt (not installed)
ProcCmdline: BOOT_IMAGE=
ProcVersionSign
Tags: trusty apparmor
Uname: Linux 3.13.0-24-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
_MarkForUpload: True
modified.
mtime.conffile.
Changed in libvirt (Ubuntu):
importance: Low → High
status: New → Triaged
Changed in glusterfs (Ubuntu):
status: New → Invalid
Thanks for reporting this bug. Unfortunately 13.04 (on your servers) is no longer supported. Could you test to see if you can reproduce this on 13.10 or 14.04? If you can, then I'll mark this as also affecting linux (the kernel) and hopefully we can get 'apport-collect 1297218' output from both the host and a guest.