Data corruption after block migration (LV->LV)

Bug #1013714 reported by Dennis Krul
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
QEMU
Fix Released
Undecided
Unassigned

Bug Description

We quite frequently use the live block migration feature to move VM's between nodes without downtime. These sometimes result in data corruption on the receiving end. It only happens if the VM is actually doing I/O (doesn't have to be all that much to trigger the issue).

We use logical volumes and each VM has two disks. We use cache=none for all VM disks.

All guests use virtio (a mix of various Linux distro's and Windows 2008R2).

We currently have two stacks in use and have seen the issue on both of them:

Fedora - qemu-kvm 0.13
Scientific Linux 6.2 (RHEL derived) - qemu-kvm package 0.12.1.2

Even though we don't run the most recent versions of KVM I highly suspect this issue is still unreported and that filing a bug is therefore appropriate. (There doesn't seem to be any similar bug report in launchpad or RedHat's bugzilla and nothing related in change logs, release notes and git commit logs.)

I have no idea where to look or where to start debugging this issue, but if there is any way I can provide useful debug information please let me know.

Revision history for this message
Paolo Bonzini (bonzini) wrote :

Hi, I suggest that you try a newer version. There were several fixes that I think went only in 0.14, in particular commit 62155e2b51e3c282ddc30adbb6d7b8d3bb7c386e, commit 62155e2b51e3c282ddc30adbb6d7b8d3bb7c386e, commit 62155e2b51e3c282ddc30adbb6d7b8d3bb7c386e. RHEL6.2 doesn't have them. With the fixes, it's quite less likely that live block migration will eat your data.

However, we were also thinking of deprecating block migration, so we are interesting of hearing about your setup. The replacement would be more powerful (it would allow migrating storage separately from the VM), more efficient (storage and RAM streams would run in parallel on different TCP ports), and easier for us to test and maintain.

However, it would be more complicated to set the new mechanism up for migration without shared storage. This is what live block migration does, and it sounds like your usecase requires migration without shared storage. Likely, a true replacement of live block migration would not be ready in time for the next release (1.2), hence its removal would also be delayed.

Revision history for this message
Dennis Krul (launchpad-themirror) wrote :

Hello Paolo,

Thank you for your quick response!

Did you intend to mention 3 different commits or did you accidentally paste the same commit thrice? ;) I came across that commit but somehow thought it was already included in 0.13. Thanks!

We're of course in no position to ask, but I'll do it anyway: Would you be in a position to add patches for these commits to the qemu-kvm package for RHEL6 (assuming they apply at all)? Or perhaps ask one of the RH package maintainers to do so? We'd be very grateful!

A little bit of background (our use case for using live block migration): We are an ISP and provide virtual private servers on KVM.

The way we see it traditional centralized shared storage introduces one big, expensive and complicated SPOF into a VM platform.

We actually have no problems dealing with the limitations of local storage. For example, we have automated (offline) VM migrations to other hosts when customers need to upgrade and the current host doesn't have enough resources. It would be great if live block migration would be stable enough to do this online to reduce downtime for customers.

We sometimes use live block migration to reduce the server load by migrating off a busy VM. It doesn't really matter if the migration takes a while to complete. We also use it to migrate all VM's off a host in case the hardware is being retired or we need to reinstall.

Live block migration is just not very useful for generic system maintenance, like a reboot for a kernel or firmware update. In that case we simply reboot the host (and most customers don't mind that once in a while).

We would appreciate it if live block migration would not be removed until its superior replacement is ready. We don't mind if it's more complex to work with, as long as it's well documented ;)

Revision history for this message
Dennis Krul (launchpad-themirror) wrote :

Hello Paolo,

We backported most of the block migration fixes from upstream to the RHEL6 qemu-kvm package ourselves and are now unable to reproduce the disk corruption problem anymore. So I guess the issues are (mostly) fixed upstream.

You can close this bug, but I would still appreciate it if you could fix this in the RHEL6 package (other RH customers might appreciate that as well ;). We could even provide the patches if you like.

Revision history for this message
Thomas Huth (th-huth) wrote :

Closing according to comment #3

Changed in qemu:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.