[SRU] libvirt 1.2.12 live-migration corrupts some instances
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned |
Kilo | Fix Released | High | Hua Zhang |
libvirt (Ubuntu) | Fix Released | High | Unassigned |
Bug Description
[Impact]
While memory load is high, libvirt 1.2.12 (kilo) live-migration corrupts some instances
[Test Case]
We can replicate the corruption pretty much at will. The sequence of events to trigger it is:
Create an instance using a cloud image
Start a job running with the following command: "dd if=/dev/urandom of=/var/tmp/mjb.1 bs=4M count=1000"
Live migrate the instance using a command like: "nova live-migration --block-migrate <server-id> <target-
Once the migration has finished, stop the dd job on the instance
Do a "Hard reboot" of the instance (e.g. for OpenStack: nova reboot --hard $INSTANCE)
When the instance boots, file system corruption will be observed and it won't boot correctly
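The sequence above can be written down as a small helper that only builds the commands in order, which may be handy when scripting the reproduction. This is a minimal sketch, not part of the original report: the function name and the `target_host` parameter are hypothetical, and because the destination argument of the live-migration command is truncated in the report it is left optional here.

```python
import shlex


def reproduction_steps(server_id, target_host=None, count=1000):
    """Return (where, command) pairs for the test case, in order."""
    # Guest-side I/O load, copied from the report.
    dd_cmd = ["dd", "if=/dev/urandom", "of=/var/tmp/mjb.1",
              "bs=4M", f"count={count}"]
    # Block live migration. The destination argument is truncated in the
    # report, so target_host is a hypothetical placeholder and may be omitted.
    migrate_cmd = ["nova", "live-migration", "--block-migrate", server_id]
    if target_host:
        migrate_cmd.append(target_host)
    # Hard reboot after the migration has finished and dd has been stopped.
    reboot_cmd = ["nova", "reboot", "--hard", server_id]
    return [
        ("guest", dd_cmd),
        ("host", migrate_cmd),
        ("host", reboot_cmd),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them.
    for where, cmd in reproduction_steps("<server-id>"):
        print(f"[{where}] {shlex.join(cmd)}")
```

Stopping the dd job inside the guest is left as a manual step between the migration and the reboot, since the report does not say how it was stopped.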
[Regression Potential]
[Other Info]
Both libvirt 1.2.13 and libvirt 1.2.16 (liberty) already contain the fix, so this problem only happens on kilo.
The fix is backported from upstream patches. Before commit 80c5f10e, libvirt only polled for the events it was interested in, which could leave a drive mirror that cannot be cancelled and a destination that is not in a consistent state. In that case it is not safe to continue with the migration. Commit 80c5f10e fixes the problem by listening for queued events instead of polling.
http://
http://
http://
http://
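The difference between polling for one event of interest and listening to every queued event can be illustrated with a toy model. This is only a sketch under stated assumptions and is not libvirt code: the class, the disk names, and the state names are all made up for illustration.

```python
import queue


class ToyMirrors:
    """Hypothetical model of several drive-mirror jobs; not libvirt code."""

    def __init__(self, disks):
        self.state = {disk: "running" for disk in disks}
        self.events = queue.Queue()

    def report(self, disk, new_state):
        # Record the new state and queue an event describing the change.
        self.state[disk] = new_state
        self.events.put((disk, new_state))


def poll_disk(mirrors, disk):
    """Polling style: look only at the one disk the caller is waiting on."""
    return mirrors.state[disk]


def listen_all(mirrors):
    """Listening style: consume every queued event, whichever disk it is for."""
    seen = []
    while not mirrors.events.empty():
        seen.append(mirrors.events.get_nowait())
    return seen


mirrors = ToyMirrors(["vda", "vdb"])
mirrors.report("vda", "ready")
mirrors.report("vdb", "failed")

# The poller watching vda sees nothing wrong.
polled = poll_disk(mirrors, "vda")
# The listener sees the failure on vdb and can refuse to continue.
events = listen_all(mirrors)
safe_to_continue = all(state != "failed" for _, state in events)
```

In this toy run the poller reports that its disk is ready, while the listener notices the failed mirror on the other disk and concludes that the migration should not continue.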
BTW, we have completed 20 to 30 live migrations with I/O running and have had no problems, and also tested that other functions continue to work as expected.
summary: |
- Nova live-migration corrupts some instances + libvirt live-migration corrupts some instances |
summary: |
- libvirt live-migration corrupts some instances + libvirt 1.2.12 live-migration corrupts some instances |
description: | updated |
Changed in libvirt (Ubuntu): | |
assignee: | nobody → Hua Zhang (zhhuabj) |
tags: | added: sts-sru |
Changed in libvirt (Ubuntu): | |
importance: | Undecided → High |
Changed in cloud-archive: | |
status: | New → Invalid |
no longer affects: | libvirt (Ubuntu Trusty) |
tags: | added: sts-sru-done removed: sts-sru |
Hi,
so it is writing huge buffers and corrupts on migration - interesting.
Might I ask which release and which qemu/libvirt versions you are on with this?
We have a regular set of migration tests that we started running recently.
One of the cases is a guest with workload that might be critical - I added yours.
Also, what exactly do you mean by "do a hard reboot"?
A reboot from inside the guest, a virsh shutdown followed by a wait and a start, or a virsh destroy followed by a start?
For now I added a normal shutdown, wait-til-gone, start iteration.
In my tests on Xenial (I had to pick one to be fast) I migrated 5 times with and 5 times without workload (including yours now) with live and offline migration and afterwards restarted the guest.
They worked for me, so please share more details.
P.S. No matter what, it was already worth making this part of the regular tests. We will see after the weekend how it went on more architectures and releases.
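For anyone who wants to repeat a similar run, the test matrix described in this comment could be enumerated roughly as follows. This is a sketch only: the function names are hypothetical, and running five iterations for every combination of workload and migration mode is an assumption about how the runs were split. The two restart variants mirror the virsh shutdown/start and virsh destroy/start options asked about above.

```python
from itertools import product


def migration_test_plan(iterations=5):
    """List every (workload, mode, run) combination to exercise."""
    plan = []
    for workload, mode in product((True, False), ("live", "offline")):
        for run in range(1, iterations + 1):
            plan.append({"workload": workload, "mode": mode, "run": run})
    return plan


def restart_commands(guest, forced=False):
    """Commands to restart the guest after a migration.

    forced=False: graceful "virsh shutdown", then "virsh start" once the
    guest is gone (the wait itself is not modelled here).
    forced=True: "virsh destroy", then "virsh start".
    """
    stop = "destroy" if forced else "shutdown"
    return [["virsh", stop, guest], ["virsh", "start", guest]]


if __name__ == "__main__":
    for case in migration_test_plan():
        print(case)
```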