Data corruption after block migration (LV->LV)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
QEMU | Fix Released | Undecided | Unassigned |
Bug Description
We quite frequently use the live block migration feature to move VMs between nodes without downtime. These migrations sometimes result in data corruption on the receiving end. It only happens if the VM is actually doing I/O (it doesn't have to be much to trigger the issue).
We use logical volumes and each VM has two disks. We use cache=none for all VM disks.
All guests use virtio (a mix of various Linux distros and Windows 2008 R2).
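For reference, a shared-nothing live migration of this kind is typically started either through libvirt or directly from the QEMU monitor; the hostnames and port below are placeholders, not the reporter's actual setup:

```shell
# Via libvirt: live-migrate the VM and copy its full disks to the
# destination node (no shared storage required).
virsh migrate --live --copy-storage-all myvm qemu+ssh://dest-node/system

# Or directly in the QEMU monitor on the source, with the destination
# QEMU started as "-incoming tcp:0:4444". "-b" requests full block
# migration; "-i" would instead copy only allocated blocks on top of a
# shared base image.
migrate -b tcp:dest-node:4444
```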
We currently have two stacks in use and have seen the issue on both of them:
Fedora - qemu-kvm 0.13
Scientific Linux 6.2 (RHEL derived) - qemu-kvm package 0.12.1.2
Even though we don't run the most recent versions of KVM, I strongly suspect this issue is still unreported and that filing a bug is therefore appropriate. (There doesn't seem to be any similar bug report in Launchpad or Red Hat's Bugzilla, and nothing related in changelogs, release notes, or git commit logs.)
I have no idea where to look or where to start debugging this issue, but if there is any way I can provide useful debug information please let me know.
Hi, I suggest that you try a newer version. There were several fixes that I think went only into 0.14, in particular commit 62155e2b51e3c282ddc30adbb6d7b8d3bb7c386e. RHEL 6.2 doesn't have them. With the fixes, live block migration is much less likely to eat your data.
However, we have also been thinking of deprecating block migration, so we are interested in hearing about your setup. The replacement would be more powerful (it would allow migrating storage separately from the VM), more efficient (the storage and RAM streams would run in parallel on different TCP ports), and easier for us to test and maintain.
However, the new mechanism would be more complicated to set up for migration without shared storage. That is exactly what live block migration does today, and it sounds like your use case requires migration without shared storage. A true replacement for live block migration would likely not be ready in time for the next release (1.2), so its removal would be delayed accordingly.
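To make the "storage and RAM on separate streams" idea concrete, such a replacement could be driven over QMP roughly as follows. This is only a sketch of one plausible NBD-based design, not a committed interface; the device name, hostnames, and ports are illustrative:

```
# On the destination: start QEMU paused with "-incoming tcp:0:4444",
# then export the (empty) target disk over NBD on its own port.
{ "execute": "nbd-server-start",
  "arguments": { "addr": { "type": "inet",
    "data": { "host": "0.0.0.0", "port": "10809" } } } }
{ "execute": "nbd-server-add",
  "arguments": { "device": "drive-virtio-disk0", "writable": true } }

# On the source: mirror the disk into the destination's NBD export...
{ "execute": "drive-mirror",
  "arguments": { "device": "drive-virtio-disk0",
    "target": "nbd:dest-node:10809:exportname=drive-virtio-disk0",
    "sync": "full", "format": "raw", "mode": "existing" } }

# ...wait for the BLOCK_JOB_READY event, then migrate RAM and device
# state over the separate migration port.
{ "execute": "migrate", "arguments": { "uri": "tcp:dest-node:4444" } }
```

The key difference from today's "migrate -b" is that the disk copy runs as an independent block job on its own connection, so it can be started, throttled, and completed separately from the RAM stream.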