Live migration locks up Linux 3.2-based guests
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
qemu (Ubuntu) | Invalid | High | Unassigned | |
Trusty | Invalid | High | Unassigned | |
Utopic | Won't Fix | High | Unassigned | |
Bug Description
In the thread at http://
I have cherry-picked those three commits (with some considerable fix-up for the first, which may or may not be correct; the others apply cleanly) and built packages locally. Installing them on the migration receiver seems to fix my guest lockups after live migration. I can attach the patches I'm using if someone is able to review my fix-ups to the first one.
My original problem description was:
Somewhere between kernel 3.2 and 3.11 on my VM hosts (yes, I know that narrows
it down a /whole lot/ ...), live migration started killing my Ubuntu precise
(kernel 3.2.x) guests, causing all of their vcpus to go into a busy loop. Once
(and only once) I've observed the guest eventually becoming responsive again,
with a clock nearly 600 years in the future and a negative uptime.
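
As a side note on the "nearly 600 years" figure: an unsigned 64-bit nanosecond counter spans roughly 584.5 years, so one reading consistent with the symptom is that a small negative time delta was interpreted as a huge unsigned value. This is only a guess from the arithmetic, not something confirmed in this report. A minimal sketch of that arithmetic follows; the helper names `ns_to_years` and `wrap_u64` are made up for illustration.

```python
# Assumption (not confirmed by this report): the guest tracks time as a
# 64-bit count of nanoseconds, and a negative delta gets reinterpreted
# as an unsigned value.

NS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, good enough here


def ns_to_years(ns: int) -> float:
    """Convert a nanosecond count to years."""
    return ns / NS_PER_SECOND / SECONDS_PER_YEAR


def wrap_u64(delta_ns: int) -> int:
    """Reinterpret a signed nanosecond delta as an unsigned 64-bit value."""
    return delta_ns % 2**64


# Full range of an unsigned 64-bit nanosecond counter, in years.
full_range_years = ns_to_years(2**64)

# A clock that steps backwards by one second, seen as unsigned.
jump_years = ns_to_years(wrap_u64(-NS_PER_SECOND))

print(f"u64 nanosecond range: {full_range_years:.1f} years")
print(f"apparent jump for a -1 s delta: {jump_years:.1f} years")
```

Under that assumption, any small backwards step would look like a forward jump of just under 585 years, which is in the same ballpark as the observed clock.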
I haven't been able to dig up any previous threads about this problem, so my
gut instinct is that I've configured something wonky. Any pointers toward
/what/ I may have done wrong are appreciated.
It only seems to happen if I've given the guests Nehalem-class CPU features.
My longest-running VMs, from before I started passing through the CPU
capabilities into the guest, seem to migrate without issue.
It also seems to happen reliably when the guest has been running for a while;
it's easily reproducible with guests that have been up ~1 day, and I've
reproduced it in VMs with an uptime of ~20 hours. I haven't yet figured out a
lower-bound, which makes the testing cycle a little longer for me.
The guests that I reliably reproduce this on are Ubuntu 12.04 guests running
the current 3.2 kernel that Canonical distributes. Recent Fedora kernels
(3.14+, IIRC) don't seem to busy-spin this way, though I haven't tested this
case exhaustively, and I haven't written down very good notes for the tests I
have done with Fedora.
The hosts are dual-socket Nehalem Xeons (L5520), currently running Ubuntu 14.04
and the associated 3.13 kernel. I had previously reproduced this with 12.04
running a raring-backport 3.11 kernel as well, but I (seemingly erroneously)
assumed it may have been a qemu userspace discrepancy.
Changed in qemu (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in qemu (Ubuntu Trusty):
status: New → Confirmed
Changed in qemu (Ubuntu Utopic):
status: New → Confirmed
Changed in qemu (Ubuntu Trusty):
importance: Undecided → High
Changed in qemu (Ubuntu Utopic):
importance: Undecided → High
Could you confirm that this is still an issue in vivid?