[SRU] libvirt 1.2.12 live-migration corrupts some instances

Bug #1640676 reported by Hua Zhang on 2016-11-10
This bug affects 1 person
Affects                 Importance   Assigned to
Ubuntu Cloud Archive    Undecided    Unassigned
Kilo                    High         Hua Zhang
libvirt (Ubuntu)        High         Unassigned

Bug Description

[Impact]

Under high memory load, live migration with libvirt 1.2.12 (Kilo) corrupts some instances.

[Test Case]

We can replicate the corruption pretty much at will. The sequence of events to trigger it is:

1. Create an instance using a cloud image.
2. In the instance, start a job with the following command: "dd if=/dev/urandom of=/var/tmp/mjb.1 bs=4M count=1000"
3. Live migrate the instance using a command like: "nova live-migration --block-migrate <server-id> <target-hypervisor>"
4. Once the migration has finished, stop the dd job on the instance.
5. Do a hard reboot of the instance (e.g. for OpenStack: nova reboot --hard $INSTANCE).
6. When the instance boots, file system corruption is observed and the instance does not boot correctly.
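
The host-side part of the sequence above can be sketched as a small shell helper. This is only an illustration: the function names and the DRY_RUN switch are made up for this sketch, the server and hypervisor names are placeholders, and the dd job inside the guest still has to be started and stopped by hand. By default the commands are only printed, not executed.

```shell
#!/bin/sh
# Host-side part of the reproduction, as a sketch. The dd job is started
# and stopped inside the guest by hand; that part is not automated here.
# DRY_RUN (hypothetical switch) defaults to 1 so commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Start a block live migration of one instance to a target hypervisor.
# $1 = server id, $2 = target hypervisor
start_block_migration() {
    run nova live-migration --block-migrate "$1" "$2"
}

# Hard reboot the instance. Call this only after the migration has
# finished and the dd job in the guest has been stopped.
# $1 = server id
hard_reboot() {
    run nova reboot --hard "$1"
}
```

For example, "start_block_migration <server-id> <target-hypervisor>" prints the nova command it would run; setting DRY_RUN=0 executes it instead.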

[Regression Potential]

[Other Info]

Both libvirt 1.2.13 and libvirt 1.2.16 (Liberty) already contain the fix, so this problem only happens on Kilo.

The fix is backported from the upstream patches listed below. Before commit 80c5f10e, libvirt only polled for the events it was interested in. This can lead to a drive mirror that cannot be cancelled, which leaves the destination in an inconsistent state; in that case it is not safe to continue with the migration. Commit 80c5f10e fixes the problem by listening for events instead of polling.

http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=80c5f10e865cda0302519492f197cb020bd14a07
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=76c61cdca20c106960af033e5d0f5da70177af0f
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=c37943a0687a8fdb08e6eda8ae4b9f4f43f4f2ed
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=c88b323bf5d5a070c074fda7adc11085f14415ce
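
As a rough command-line analogy for polling versus listening (this is not the patched code path, which lives inside libvirt's QEMU driver), the two styles look like this with virsh. The function names, the DRY_RUN switch, and the domain and disk arguments are placeholders made up for this sketch; by default the commands are only printed.

```shell
#!/bin/sh
# Analogy only: polling a block job versus waiting for a block job event.
# DRY_RUN (hypothetical switch) defaults to 1 so commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Polling style: ask for the current state of the block job on one disk.
# A caller has to repeat this until the job is done.
# $1 = domain, $2 = disk path or target
poll_block_job() {
    run virsh blockjob "$1" "$2" --info
}

# Event style: block until a block-job event arrives for the domain,
# or until the timeout (in seconds) expires.
# $1 = domain, $2 = timeout
wait_block_job_event() {
    run virsh event --domain "$1" --event block-job --timeout "$2"
}
```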

BTW, we have completed 20 to 30 live migrations with I/O running and have had no problems, and also tested that other functions continue to work as expected.

Hua Zhang (zhhuabj) on 2016-11-10
summary: - Nova live-migration corrupts some instances
+ libvirt live-migration corrupts some instances

Hi,
so it is writing huge buffers and corrupts on migration - interesting.
Might I ask which release and which qemu/libvirt versions you are on with this?

We recently started running a regular set of migration tests.
One of the cases is a guest with a workload that might be critical - I added yours.

Also what exactly do you mean by "do a hard reboot"?
Reboot from the guest - virsh shutdown and wait and start - virsh destroy and start?
For now I added a normal shutdown, wait-til-gone,start iteration.

In my tests on Xenial (I had to pick one to be fast) I migrated 5 times with and 5 times without workload (including yours now) with live and offline migration and afterwards restarted the guest.

They worked for me, so please share more details.

P.S. No matter what, it was already worth making this part of the regular tests - we will see after the weekend how that went on more architectures and releases.

Hua Zhang (zhhuabj) on 2016-11-15
summary: - libvirt live-migration corrupts some instances
+ libvirt 1.2.12 live-migration corrupts some instances
description: updated
Changed in libvirt (Ubuntu):
assignee: nobody → Hua Zhang (zhhuabj)
Hua Zhang (zhhuabj) wrote :

Hello paelzer, I have updated the problem description to address your concerns, thanks.

description: updated

Thanks a lot zhhuabj.
I'm looking to provide you a ppa for verification pre-SRU - but currently very blocked by a few other issues - so it might take 1-3 days (estimation).

It already was good to add that kind of workload to my tests as well.
For whatever reason they still don't trigger the issue in my case - to work around this I hope you could verify the ppa for me once it is created - would that be ok?

Changed in libvirt (Ubuntu Trusty):
status: New → Triaged
Changed in libvirt (Ubuntu):
status: New → Fix Released
Changed in libvirt (Ubuntu Trusty):
importance: Undecided → High
Hua Zhang (zhhuabj) wrote :

hi paelzer, sure, it's ok for me

description: updated
Hua Zhang (zhhuabj) on 2016-11-15
tags: added: sts-sru

Thanks zhhuabj for already backporting to a debdiff.
I've seen that your debdiff is for the cloud archive kilo version that you run on.

I checked if the change would apply to trusty (without UCA) as well, but it has a huge amount of delta where manual adaptation of the patch is needed.
That combined with the fact that I can't reproduce the issue on base trusty so far makes me not consider it for trusty atm.

If your test environment has any chance to recreate the same issue on base trusty could you give it a try to verify if that needs a similar fix as well? With some luck the older version isn't affected.
Also if you happen to find some time to even do the backport to trusty I'm willing to do a bunch of extra tests to it on my side.

All that should not stop you, please go forward on getting it approved and sponsored into cloud archive given some more testing as the changes are rather huge.

Hua Zhang (zhhuabj) wrote :

Hi Christian, I have managed to reproduce the problem using libvirt from the UCA Kilo, and it does not exist in the UCA Liberty. I have not tried earlier versions of libvirt (Juno UCA is EOL and Vivid is also EOL), but I think that given that the problem exists in Kilo it needs to be fixed, and as you say the diff for trusty-updates libvirt is far too large to consider. This is a difficult problem to reproduce, perhaps you can share what steps you are taking to reproduce the problem?

summary: - libvirt 1.2.12 live-migration corrupts some instances
+ [SRU] libvirt 1.2.12 live-migration corrupts some instances
Changed in libvirt (Ubuntu):
assignee: Hua Zhang (zhhuabj) → nobody

On Thu, Nov 17, 2016 at 10:29 AM, Hua Zhang <email address hidden> wrote:
[...]

Agree to all you said before - I'm not able to sponsor the fix into UCA -
let me know if you need contacts.

> This is a difficult problem to reproduce, perhaps you can
> share what steps you are taking to reproduce the problem?

Well, as I said before I added tests based on your case and so far was
unable to trigger it on any of the base Distribution releases nor T+Mitaka.
So until anything/anybody is able to trigger it - given the huge delta for
base Trusty I'd keep the Trusty task as incomplete.

Changed in libvirt (Ubuntu Trusty):
status: Triaged → Incomplete
Changed in libvirt (Ubuntu):
importance: Undecided → High
Changed in cloud-archive:
status: New → Invalid

Hello Hua, or anyone else affected,

Accepted libvirt into kilo-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:kilo-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-kilo-needed to verification-kilo-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-kilo-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-kilo-needed
Hua Zhang (zhhuabj) wrote :

I've verified that the new package doesn't break anything and fixes the reported problem.

tags: added: verification-kilo-done
removed: verification-kilo-needed

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package libvirt - 1.2.12-0ubuntu14.4~cloud2
---------------

 libvirt (1.2.12-0ubuntu14.4~cloud2) trusty-kilo; urgency=medium
 .
   * Added d/p/conf-Introduce-helper-to-find-duplicate-device-addre.patch (LP: #1640676)
   * Added d/p/qemuProcessHandleBlockJob-Set-disk-mirrorState-more-.patch (LP: #1640676)
   * Added d/p/qemuProcessHandleBlockJob-Take-status-into-account.patch (LP: #1640676)
   * Added d/p/qemuMigrationDriveMirror-Listen-to-events.patch (LP: #1640676)
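
To check whether a host already runs a package version at or above the fixed one, a version comparison along these lines could be used. This is a sketch: the function names are made up for illustration, and the binary package name libvirt-bin in the usage note is an assumption that is not stated in this report.

```shell
#!/bin/sh
# Version that carries the fix, taken from the changelog entry above.
FIXED_VERSION='1.2.12-0ubuntu14.4~cloud2'

# Print the installed version of a package.
# $1 = binary package name
installed_version() {
    dpkg-query -W -f='${Version}\n' "$1"
}

# Succeed if the given version is at or above the fixed version,
# using Debian version ordering.
# $1 = version string to check
has_fix() {
    dpkg --compare-versions "$1" ge "$FIXED_VERSION"
}
```

For example: has_fix "$(installed_version libvirt-bin)" && echo "fix present"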

Sebastien Bacher (seb128) wrote :

reading the comments it seems like there is nothing left to upload so unsubscribing sponsors

no longer affects: libvirt (Ubuntu Trusty)
Louis Bouchard (louis) on 2017-03-22
tags: added: sts-sru-done
removed: sts-sru