Live Migration fails with error "operation failed: migration job: unexpectedly failed"

Bug #1637438 reported by bugproxy on 2016-10-28
This bug affects 1 person
Affects: libvirt (Ubuntu)
Importance: Undecided
Assigned to: Taco Screen team

Bug Description

Problem Description
========================
Live Migration fails with error "operation failed: migration job: unexpectedly failed"

# virsh migrate avocado-vt-vm1-bala qemu+ssh://9.40.192.182/system --live --verbose
Migration: [ 99 %]error: operation failed: migration job: unexpectedly failed

Contact Information = Balamuruhan S / <email address hidden>

---uname output---
# uname -a
Linux powerkvm4-lp1 4.8.0-26-generic #28-Ubuntu SMP Tue Oct 18 14:41:40 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Tuleta

Steps to Reproduce
=================================
1. Defined and started the guest, with its image placed on an NFS share.
2. Ran the top command inside the guest.
3. Enabled ports 49152:49216 in iptables.
4. Mounted the image location on the destination and started the migration.
5. Migrated from source to destination and back again; the failure occurs after 5-6 to-and-fro migrations.
6. Migration command: virsh migrate avocado-vt-vm1-bala qemu+ssh://9.40.192.182/system --live --verbose
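
The to-and-fro sequence in the steps above could be scripted roughly as follows. This is only a sketch, not the reporter's actual test: SRC_HOST is a placeholder because the report does not give the source address, the helper names migrate_once and ping_pong are made up for illustration, and the iptables rule in the comment is an assumed form of step 3.

```shell
# Sketch: repeat a live migration back and forth between two hosts.
# DOMAIN and DST_HOST come from the report; SRC_HOST is a hypothetical
# placeholder, since the source address is not stated there.
DOMAIN="${DOMAIN:-avocado-vt-vm1-bala}"
SRC_HOST="${SRC_HOST:-source.example}"
DST_HOST="${DST_HOST:-9.40.192.182}"
VIRSH="${VIRSH:-virsh}"   # set VIRSH=echo to print the commands instead of running them

# Step 3 of the report might correspond to a rule like this on both hosts
# (assumed form, not quoted from the report):
#   iptables -I INPUT -p tcp --dport 49152:49216 -j ACCEPT

# Live-migrate DOMAIN from host $1 to host $2.
migrate_once() {
    $VIRSH -c "qemu+ssh://$1/system" migrate "$DOMAIN" \
        "qemu+ssh://$2/system" --live --verbose
}

# Run $1 round trips (source -> destination -> source), stopping at the first failure.
ping_pong() {
    rounds="$1"
    i=1
    while [ "$i" -le "$rounds" ]; do
        migrate_once "$SRC_HOST" "$DST_HOST" || { echo "round $i: forward migration failed" >&2; return 1; }
        migrate_once "$DST_HOST" "$SRC_HOST" || { echo "round $i: return migration failed" >&2; return 1; }
        i=$((i + 1))
    done
}
```

Calling `ping_pong 6` would cover the 5-6 round trips after which the failure was reported; running with `VIRSH=echo` first shows the exact commands without touching any guest.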

Userspace tool common name: virsh (libvirt)

The userspace tool has the following bit modes: ppc64le

Userspace rpm:
# dpkg --list | grep libvirt
ii gir1.2-libvirt-glib-1.0:ppc64el 0.2.3-2 ppc64el GObject introspection files for the libvirt-glib library
ii gir1.2-libvirt-sandbox-1.0 0.5.1+git20151113-3 ppc64el GObject introspection files for the libvirt-sandbox library
ii libvirt-bin 2.1.0-1ubuntu9 ppc64el programs for the libvirt library
ii libvirt-clients 2.1.0-1ubuntu9 ppc64el Programs for the libvirt library
ii libvirt-daemon 2.1.0-1ubuntu9 ppc64el Virtualization daemon
ii libvirt-daemon-system 2.1.0-1ubuntu9 ppc64el Libvirt daemon configuration files
ii libvirt-dev:ppc64el 2.1.0-1ubuntu9 ppc64el development files for the libvirt library
ii libvirt-doc 2.1.0-1ubuntu9 all documentation for the libvirt library
ii libvirt-glib-1.0-0:ppc64el 0.2.3-2 ppc64el libvirt GLib and GObject mapping library
ii libvirt-glib-1.0-dev:ppc64el 0.2.3-2 ppc64el Development files for the libvirt-glib library
ii libvirt-ocaml 0.6.1.2-1build2 ppc64el OCaml bindings for libvirt
ii libvirt-ocaml-dev 0.6.1.2-1build2 ppc64el OCaml bindings for libvirt
ii libvirt-sandbox-1.0-5 0.5.1+git20151113-3 ppc64el Application sandbox toolkit shared library
ii libvirt-sandbox-1.0-dev 0.5.1+git20151113-3 ppc64el Development files for libvirt-sandbox library
ii libvirt-sanlock 2.1.0-1ubuntu9 ppc64el Sanlock plugin for virtlockd
ii libvirt0:ppc64el 2.1.0-1ubuntu9 ppc64el library for interfacing with different virtualization systems
ii munin-libvirt-plugins 0.0.6-1 all Munin plugins using libvirt
ii python-libvirt 2.0.0-1 ppc64el libvirt Python bindings
ii uvtool-libvirt 0~bzr99-0ubuntu2 all Library and tools for using Ubuntu Cloud Images with libvirt

Userspace tool obtained from project website: na

*Additional Instructions for Balamuruhan S / <email address hidden>:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach ltrace and strace of userspace application.

Attachments:

1. Libvirtd debug logs - source and destination
2. Libvirtd journal logs - source and destination
3. qemu log - source and destination

-- Mirroring this bug and seeking distro's advice. More details in bug 143635 or
https://bugzilla.linux.ibm.com/page.cgi?id=track/proxy.html&bug_id=143635 --

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-147993 severity-high targetmilestone-inin---

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → libvirt (Ubuntu)

------- Comment (attachment only) From <email address hidden> 2016-10-28 05:05 EDT-------

------- Comment (attachment only) From <email address hidden> 2016-10-28 05:06 EDT-------

------- Comment (attachment only) From <email address hidden> 2016-11-02 01:21 EDT-------

Hi,
I usually run migration round trips (from->to->from) for three migration types (live, offline and postcopy), each 5 times, so 15 overall.
Then I add workload to the guest to make it busier and run the same again.

That ran fine on all archs recently - so we need to find what might be special on your setup.

I need to understand more of your case; could you share:
1. Is there any workload in the guest other than top?
2. Can you confirm that you run only --live migrations, over and over?
3. Do you have any check after the migration to see if the guest is alive at the target?

It might be that my check for #3 gives it the time it needs, or that doing live, offline, postcopy instead of live, live, live, ... matters here, so I need to understand your case.
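
For illustration, a check of the kind asked about in #3 might look like the sketch below. It is not the check used in the test suite mentioned above; the function name guest_running_at and the POLL_INTERVAL variable are made up here. It only polls `virsh domstate` on the target, which shows that libvirt considers the domain running, not that the guest OS itself is responsive.

```shell
# Sketch: poll the target host until the domain reports "running".
# guest_running_at and POLL_INTERVAL are illustrative names, not part of any tool.
VIRSH="${VIRSH:-virsh}"

# Usage: guest_running_at HOST DOMAIN [TRIES]
# Returns 0 as soon as domstate reports "running", 1 after TRIES failed polls.
guest_running_at() {
    host="$1"
    domain="$2"
    tries="${3:-10}"
    n=0
    while [ "$n" -lt "$tries" ]; do
        state="$($VIRSH -c "qemu+ssh://$host/system" domstate "$domain" 2>/dev/null)"
        [ "$state" = "running" ] && return 0
        sleep "${POLL_INTERVAL:-1}"
        n=$((n + 1))
    done
    return 1
}
```

A test loop could call `guest_running_at 9.40.192.182 avocado-vt-vm1-bala` after each migration and only start the next one once it succeeds, which also adds a short pause between migrations.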

Robie Basak (racb) wrote :

Marking Incomplete pending a response to Christian's questions.

Changed in libvirt (Ubuntu):
status: New → Incomplete


I assume this is still the case with only top running in the guest?
I usually run 3 migration types 5 times (=15) and all work fine.

I now added an extra sequence to do 10 further live migrations (just live) in a tight loop and pushed it to our biweekly tests. But I doubt that will show more.

As I said before we have to identify what is different in your setup.

No offense, but the mass dump that was added is barely helpful for that.
The logs are full, but I have no index other than the name "november 2" to look anything up in them.
At least a time for the testcase would have been helpful, as e.g. on Nov 1st the logs are full of rcu stall debugs.
All issues that I've seen in the logs are repetitive, e.g. the I/O error in syslog happens all over the place, which makes it hard to pinpoint what was related to the case and what just "happens there more often".
The same applies to the libvirt logs, which are usually hard to read anyway: they show no singular issue, and reading through 208153 lines just hoping to find something does not work.
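
One way to get such a time for the testcase is to record timestamps around the failing command, so that the logs can be cut down to that window afterwards. The wrapper below is a minimal sketch; run_with_window is a made-up name, and the journalctl line in the comment assumes the relevant logs are in the systemd journal on the affected hosts.

```shell
# Sketch: record when a test command started and ended, so logs can be
# narrowed to that window later. run_with_window is an illustrative name.
run_with_window() {
    start="$(date '+%Y-%m-%d %H:%M:%S')"
    "$@"
    rc=$?
    end="$(date '+%Y-%m-%d %H:%M:%S')"
    echo "window: $start -> $end (exit $rc)"
    return "$rc"
}

# The printed window can then be passed to journalctl, for example:
#   journalctl --since "<start>" --until "<end>"
```

For example, `run_with_window virsh migrate avocado-vt-vm1-bala qemu+ssh://9.40.192.182/system --live --verbose` would print the window for the failing run, which is far easier to correlate with the libvirtd and qemu logs than a full-day dump.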

Please:
- describe more of your setup
- is that occurring on all architectures?
- could you quickly check if that is still reproducible with other versions of libvirt/qemu (that gets easier once we have newer ones in zesty)
- since the system has so many issues in the log (rather common on test systems, I know), could you reproduce that issue on a different system as well?
- ... so much more, but not just another dump of syslogs

Don't get me wrong, these logs are great once we search for something, but so far neither of us knows what to look for in this particular case.

Keeping incomplete for now.

------- Comment From <email address hidden> 2016-12-12 09:50 EDT-------
Christian,

Earlier today Bala, who opened this issue, rejected it. Making his comment public:

(In reply to comment #21)
> [In reply to comment #18]
>
> I tried migration to and fro 10 times and issue is not observed, migration
> works fine.
>
> # uname -a
> Linux c158f2u09os 4.8.0-040800rc6-generic #201609121119 SMP Mon Sep 12 15:51:37 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
>
> # dpkg -l | grep libvirt
> ii libvirt-bin 2.1.0-1ubuntu9.1 ppc64el programs for the libvirt library
> ii libvirt-clients 2.1.0-1ubuntu9.1 ppc64el Programs for the libvirt library
> ii libvirt-daemon 2.1.0-1ubuntu9.1 ppc64el Virtualization daemon
> ii libvirt-daemon-system 2.1.0-1ubuntu9.1 ppc64el Libvirt daemon configuration files
> ii libvirt-dev:ppc64el 2.1.0-1ubuntu9.1 ppc64el development files for the libvirt library
> ii libvirt-glib-1.0-0:ppc64el 0.2.3-2 ppc64el libvirt GLib and GObject mapping library
> ii libvirt0:ppc64el 2.1.0-1ubuntu9.1 ppc64el library for interfacing with different virtualization systems
> ii python-libvirt 2.0.0-1 ppc64el libvirt Python bindings

OK, thanks a lot for the clarification, lagarcia!
I assume the mass dump of new (actually old) data was just a bugproxy issue then.

With that confirmed, setting to Invalid for now.

For what it is worth, I'll keep the repetitive migrations in my test to provide some extra coverage of this.

Changed in libvirt (Ubuntu):
status: Incomplete → Invalid

------- Comment (attachment only) From <email address hidden> 2016-12-12 11:52 EDT-------
