Live migration locks up Linux 3.2-based guests

Bug #1398718 reported by Matt Mullins
This bug affects 1 person
Affects               Status     Importance  Assigned to  Milestone
qemu (Ubuntu)         Invalid    High        Unassigned
qemu (Ubuntu Trusty)  Invalid    High        Unassigned
qemu (Ubuntu Utopic)  Won't Fix  High        Unassigned

Bug Description

In the thread at http://thread.gmane.org/gmane.comp.emulators.kvm.devel/127042/focus=129294, three commits were identified that fix live migration for qemu 2.0 (at least), which is the version I am using on trusty. I would like the package maintainer to pull these in.

I have cherry-picked those three commits (with some considerable fix-up for the first, which may or may not be correct; the others apply cleanly) and built packages locally. Installing those packages on the migration receiver seems to fix my guest lockups after live migration. I can attach the patches I'm using if someone is able to review my fix-ups to the first one.
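For reference, here is a rough sketch of how I produced the local packages on trusty. The three commit IDs come from the linked thread and are shown only as placeholders here; the unpacked directory name may also differ on other systems:

    # Fetch and unpack the trusty qemu source package:
    apt-get source qemu
    cd qemu-2.0.0*/

    # Export the three fixes from an upstream qemu checkout as one patch
    # (<commit1> and <commit3> are placeholders for the first and last IDs):
    git -C ~/src/qemu format-patch --stdout '<commit1>^..<commit3>' \
        > /tmp/migration-fixes.patch

    # Apply them (the first needed considerable manual fix-up in my case),
    # then rebuild unsigned binary packages:
    patch -p1 < /tmp/migration-fixes.patch
    dpkg-buildpackage -us -uc -b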

My original problem description was:
Somewhere between kernel 3.2 and 3.11 on my VM hosts (yes, I know that narrows
it down a /whole lot/ ...), live migration started killing my Ubuntu precise
(kernel 3.2.x) guests, causing all of their vcpus to go into a busy loop. Once
(and only once) I observed the guest eventually become responsive again,
with a clock nearly 600 years in the future and a negative uptime.
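For anyone checking for the same symptom, the guest's clock state can be inspected from standard sysfs/procfs paths (kvm-clock is the usual clocksource for a 3.2 KVM guest):

    # Run in the guest before and after migration:
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    # expected: kvm-clock
    date; cat /proc/uptime
    # in my one recovered case, the date was centuries ahead and the
    # uptime negative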

I haven't been able to dig up any previous threads about this problem, so my
gut instinct is that I've configured something wonky. Any pointers toward
/what/ I may have done wrong are appreciated.

It only seems to happen if I've given the guests Nehalem-class CPU features.
My longest-running VMs, from before I started passing through the CPU
capabilities to the guest, seem to migrate without issue.
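Concretely, the difference comes down to the CPU model the guest is started with; simplified, hypothetical invocations of the two cases (not my exact command lines):

    # Guests that lock up after migration were started with Nehalem features:
    qemu-system-x86_64 -enable-kvm -cpu Nehalem -m 2048 ...
    # My long-running guests use the default model and migrate fine:
    qemu-system-x86_64 -enable-kvm -cpu qemu64 -m 2048 ...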

It also seems to happen reliably when the guest has been running for a while;
it's easily reproducible with guests that have been up ~1 day, and I've
reproduced it in VMs with an uptime of ~20 hours. I haven't yet figured out a
lower bound, which makes the testing cycle a little longer for me.
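The migration itself is a plain live migration; a simplified example of the kind of invocation involved (hypothetical host and guest names, via libvirt):

    # Live-migrate the guest to the (patched) receiving host:
    virsh migrate --live --verbose precise-guest qemu+ssh://receiver.example/system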

The guests that I reliably reproduce this on are Ubuntu 12.04 guests running
the current 3.2 kernel that Canonical distributes. Recent Fedora kernels
(3.14+, IIRC) don't seem to busy-spin this way, though I haven't tested this
case exhaustively, and I haven't written down very good notes for the tests I
have done with Fedora.

The hosts are dual-socket Nehalem Xeons (L5520), currently running Ubuntu 14.04
and the associated 3.13 kernel. I had previously reproduced this with 12.04
running a raring-backport 3.11 kernel as well, but I (seemingly erroneously)
assumed it might have been a qemu userspace discrepancy.

Changed in qemu (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in qemu (Ubuntu Trusty):
status: New → Confirmed
Changed in qemu (Ubuntu Utopic):
status: New → Confirmed
Changed in qemu (Ubuntu Trusty):
importance: Undecided → High
Changed in qemu (Ubuntu Utopic):
importance: Undecided → High
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Could you confirm that this is still an issue in vivid?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Ping - can you confirm whether this is still an issue in vivid or wily?

Changed in qemu (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
Matt Mullins (mokomull) wrote :

Apologies for the delay. I can no longer reproduce the original issue on the affected hardware, even before upgrading, so I can't confirm the fix in vivid or wily.

I'd be fine with closing this as "can (no longer) reproduce"; maybe my recollection is playing tricks on me.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks. I'm afraid the status for "cannot be reproduced" is "invalid". If you see this happening again, please do re-open this bug.

Changed in qemu (Ubuntu):
status: Incomplete → Invalid
Changed in qemu (Ubuntu Trusty):
status: Confirmed → Invalid
Changed in qemu (Ubuntu Utopic):
status: Confirmed → Won't Fix