[Ubuntu1610][Libvirt]Migration fails in ubuntu1610 -> ubuntu1610 with "error: operation failed: job: unexpectedly failed"

Bug #1617214 reported by bugproxy
Affects: libvirt (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Christian Ehrhardt

Bug Description

Problem Description
================================
The issue is observed while live-migrating a guest from one Ubuntu 16.10 host to another (Ubuntu1610 -> Ubuntu1610).

Source Machine:
# virsh migrate avocado-vt-vm1-Bala qemu+ssh://9.40.192.182/system --live --verbose
Migration: [100 %]error: operation failed: job: unexpectedly failed

Destination Machine:

Jul 9 11:55:00 powerkvm4-lp1 libvirtd[63232]: operation failed: job: unexpectedly failed
Jul 9 11:55:00 powerkvm4-lp1 virtlogd[5102]: Cannot open log file: '/var/log/libvirt/qemu/avocado-vt-vm1-Bala.log': Device or resource busy
Jul 9 11:55:00 powerkvm4-lp1 libvirtd[63232]: Cannot open log file: '/var/log/libvirt/qemu/avocado-vt-vm1-Bala.log': Device or resource busy
Jul 9 11:55:00 powerkvm4-lp1 virtlogd[5102]: End of file while reading data: Input/output error

---uname output---
# uname -a
Linux powerkvm4-lp1 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:20 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Habanero

Steps to Reproduce
=============================
1. Defined and started the guest, with its image placed on an NFS share.
2. Ran the top command inside the guest.
3. Enabled ports 49152:49216 in iptables.
4. Mounted the image location on the destination and started the migration.
5. Migration command: virsh migrate avocado-vt-vm1-Bala qemu+ssh://9.40.192.182/system --live --verbose
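
Steps 3 and 5 above can be sketched as the commands below. This is only a sketch: the report does not say on which host, chain, or protocol the ports were opened, so the destination host, the INPUT chain, and TCP are assumptions. The helper names print_firewall_rule and print_migrate_cmd are hypothetical; they only print the commands so that they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch of reproduction steps 3 and 5. Nothing is executed; the helpers
# (hypothetical names) only print the commands for review.

# Step 3: open the migration port range.
# Assumptions: the rule goes on the destination host, INPUT chain, TCP.
print_firewall_rule() {
    printf 'iptables -I INPUT -p tcp --dport %s -j ACCEPT\n' "$1"
}

# Step 5: live-migrate the guest over qemu+ssh, as in the report.
print_migrate_cmd() {
    printf 'virsh migrate %s qemu+ssh://%s/system --live --verbose\n' "$1" "$2"
}

print_firewall_rule 49152:49216
print_migrate_cmd avocado-vt-vm1-Bala 9.40.192.182
```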

Userspace tool common name: virsh (libvirt)

The userspace tool has the following bit modes: ppc64le

Userspace rpm: # dpkg --get-selections | grep -i libvirt
libvirt-bin install
libvirt-clients install
libvirt-daemon install
libvirt-daemon-system install
libvirt-dev:ppc64el install
libvirt0:ppc64el install
python-libvirt install

Logs - SOSreport for source and destination machine

== Comment: #12 - Nitesh Konkar <email address hidden> - 2016-08-11 06:24:14 ==
There are two issues here.

1) The USB keyboard is not sent to the destination as part of migration. Please pull patches 192a53e07c5fefd9dad2f310886209b76dcc5d83, be1a7e6d31f652a6b279c3c9962262ac42c69d0a and 3d3d1dfa into Ubuntu for that.

2) After the above patches are pulled in, cherry-pick patches 5e6143fbccf2e6afb73c3f872ccdafd02fed5d95, 78b9b85c069d1dea60139cff7ee6f6d5ac2f3359, 91a6eacc8f214882c5c67ad84d767bdc0d46b944 and cf3ea0769c54a328733bcb0cd27f546e70090c89.

bugproxy (bugproxy) wrote : SOSreport - source

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-143635 severity-critical targetmilestone-inin---

bugproxy (bugproxy) wrote : SOSreport - destination

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → libvirt (Ubuntu)

Christian Ehrhardt (paelzer) wrote :

Thanks for identifying the related fixes already.
I'm working on some qemu fixes already and put this on my list.

I'll ping you once I have something ready to test.
Would you prefer something ASAP in a PPA, or would you rather wait until I have finished more of the other tasks so that you can test everything in one shot?

Changed in libvirt (Ubuntu):
status: New → Triaged
importance: Undecided → High

Christian Ehrhardt (paelzer) wrote :

Dear Taco team, I don't want to "take this away", but if you don't mind, please feel free to assign it to me.

Steve Langasek (vorlon)
Changed in libvirt (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Christian Ehrhardt (paelzer)

Christian Ehrhardt (paelzer) wrote :

Hi,
I clearly see how the patches referred to as 1) fix issues with the USB keyboard handling.
I reviewed them and thought they were valid patches to be carried as delta. But then I realized that they are already included.

The same goes for the second set.
I read more about them and checked whether they might need further patches, such as "cf3ea076 qemu: process: Append the "shutting down" message using the new APIs", to have any effect.
But then I realized that this, too, is already part of libvirt 2.0. Since 2.1 was recently merged for Yakkety, it has already been released.

So after a while of studying the code, I have come to the conclusion that it is all done already.

The bug was opened on the 26th, but the new libvirt was released on the 25th, just a day before. Your tests may still have been using the old version.
Could you please give the latest package in Yakkety (=2.1.0-1ubuntu4) another try and reopen in case you still face the same issue?

I wonder, though, whether the patches of set 1 are needed in Xenial. If you can, could you also run a 16.04 -> 16.04 test and check whether the issue appears there?

Changed in libvirt (Ubuntu):
status: Triaged → Fix Released

bugproxy (bugproxy) wrote : SOSreport - Source (Oct 09 2016)

------- Comment (attachment only) From <email address hidden> 2016-10-09 10:36 EDT-------

bugproxy (bugproxy) wrote : SOSreport - destination (Oct 09 2016)

------- Comment (attachment only) From <email address hidden> 2016-10-09 10:45 EDT-------

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-18 05:20 EDT-------
== Comment from Bala ==
The issue is still observed in the latest build:

# uname -a
Linux pkvmhab006 4.8.0-19-generic #21-Ubuntu SMP Thu Sep 29 19:43:27 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

# dpkg --list | grep libvirt
ii libvirt-bin 2.1.0-1ubuntu8 ppc64el programs for the libvirt library
ii libvirt-clients 2.1.0-1ubuntu8 ppc64el Programs for the libvirt library
ii libvirt-daemon 2.1.0-1ubuntu8 ppc64el Virtualization daemon
ii libvirt-daemon-system 2.1.0-1ubuntu8 ppc64el Libvirt daemon configuration files
ii libvirt-dev:ppc64el 2.1.0-1ubuntu8 ppc64el development files for the libvirt library
ii libvirt-glib-1.0-0:ppc64el 0.2.3-2 ppc64el libvirt GLib and GObject mapping library
ii libvirt0:ppc64el 2.1.0-1ubuntu8 ppc64el library for interfacing with different virtualization systems
ii python-libvirt 2.0.0-1 ppc64el libvirt Python bindings

Source and destination are both on the latest build.

# virsh migrate avocado-vt-vm1-bala qemu+ssh://9.47.68.198/system --live --verbose
root@9.47.68.198's password:
Migration: [ 99 %]error: operation failed: migration job: unexpectedly failed

Attached Sosreport

Christian Ehrhardt (paelzer) wrote :

Hi vaish123,
the patches you referred to initially are integrated already.
The same migration works for me - so we have to find what is different with your setup so that I can reproduce your issue and check what is going on.

Source log:
2016-10-09 14:12:40.947+0000: initiating migration
2016-10-09T14:13:10.066887Z qemu-system-ppc64: socket_writev_buffer: Got err=104 for (33578/18446744073709551615)

Destination Log:
2016-10-09T14:13:09.845186Z qemu-system-ppc64: terminating on signal 15 from pid 158142
2016-10-09 14:13:10.245+0000: shutting down

That is not a lot of info :-/
The err=104 itself isn't too bad according to https://bugzilla.redhat.com/show_bug.cgi?id=1355662
All other cases like this that I know of reported a more useful pointer to the issue after this message, but your logs have none.
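
To read the two log excerpts above as one sequence, the sketch below merges them into a single timeline. Assumptions: the SRC/DST prefixes and the helper name merge_timeline are added here for illustration, and the comparison is only meaningful if the clocks of both hosts are synchronized, which the report does not confirm. On Linux, errno 104 is ECONNRESET ("Connection reset by peer").

```shell
#!/bin/sh
# Merge tagged log lines from both hosts into one chronological list.
merge_timeline() {
    # libvirt writes "DATE TIME+0000:", qemu writes "DATETTIMEZ". Replace the
    # date/time separator with "T" so that a byte-wise sort on the timestamp
    # field orders both styles chronologically.
    sed 's/^\([A-Z]* [0-9-]*\)[ T]/\1T/' | LC_ALL=C sort -t ' ' -k 2,2
}

merge_timeline <<'EOF'
SRC 2016-10-09 14:12:40.947+0000: initiating migration
SRC 2016-10-09T14:13:10.066887Z qemu-system-ppc64: socket_writev_buffer: Got err=104 for (33578/18446744073709551615)
DST 2016-10-09T14:13:09.845186Z qemu-system-ppc64: terminating on signal 15 from pid 158142
DST 2016-10-09 14:13:10.245+0000: shutting down
EOF
```

If the clocks agree, the destination qemu received signal 15 about 0.2 seconds before the source reported err=104. That would make the connection reset on the source a consequence of the destination being terminated rather than its cause.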

There is a similar bug out there at https://bugzilla.redhat.com/show_bug.cgi?id=1325059
This one was related to overloading the system to the extent that it can't respond anymore, but it was never resolved. In my tests I usually have 50% of the guest vCPUs and all host CPUs busy and things still work fine. Are you running more load, or any special load?

I don't know of a way (is there one?) to get at the logs that are in the journal, so the sosreport isn't sufficient anyway.
I'd ask you to enable debug level logging for the libvirt service on source and target, and for virsh on the source; see https://libvirt.org/logging.html
Run your case again (just the failing migration) and include the logfiles from "/var/log/libvirt/qemu/" as well as the journal entries for libvirt ("journalctl -x -u libvirtd" or on former versions "journalctl -x -u libvirt-bin").
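
A minimal sketch of the debug-logging setup requested above. Assumptions: log_level and log_outputs are libvirtd.conf settings, and LIBVIRT_DEBUG and LIBVIRT_LOG_OUTPUTS are client-side environment variables, as described on the libvirt logging page. The log file paths, the output file name, and the helper name enable_libvirtd_debug are illustrative choices and are not taken from this report.

```shell
#!/bin/sh
# Append debug-logging settings to a libvirtd config file.
# $1: config path (defaults to the usual /etc/libvirt/libvirtd.conf).
enable_libvirtd_debug() {
    conf="${1:-/etc/libvirt/libvirtd.conf}"
    {
        printf 'log_level = 1\n'
        printf 'log_outputs="1:file:/var/log/libvirt/libvirtd.log"\n'
    } >> "$conf"
}

# Afterwards (as root, on source and target; not run here):
#   enable_libvirtd_debug
#   systemctl restart libvirtd        # "libvirt-bin" on former versions
#   journalctl -x -u libvirtd > libvirtd-journal.txt
# For virsh on the source:
#   LIBVIRT_DEBUG=1 LIBVIRT_LOG_OUTPUTS="1:file:/tmp/virsh.log" \
#       virsh migrate <guest> qemu+ssh://<destination>/system --live --verbose
```

Debug logs grow quickly, so it may be worth removing the two appended lines and restarting the service again once the failing migration has been captured.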

As I mentioned, migration on 16.10 (and 16.04) on ppc64el works for me, so we have to find what is different. I see that you drive it via avocado. Could you create a guest without it and migrate that one? I'd be curious to see how that behaves on your system, and it might help to spot the important difference so that we can reproduce it on our side.

Finally, since your initial patches were applied, it would be kind if you could open a new bug starting with your last post and reference it here.
That would help to track those as the separate things they are.

Summary:
1. open new bug with your last post being the start
2. mention the new bug number here, I'll transfer my reply then
3. report on your workload other than the migration itself that could lock up the machine
4. retest with debug level logging enabled
5. report with qemu (files) and libvirt logs (journal)
6. try to migrate a guest not created by avocado

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-10-28 05:00 EDT-------
== Comment from Bala ==

I was able to reproduce the issue manually with the latest build, with debug logs enabled as per the distro comments.

Submitted Bug 147993 which is
https://bugzilla.linux.ibm.com/page.cgi?id=track/proxy.html&bug_id=147993

Christian Ehrhardt (paelzer) wrote :

Hi vaish,
I should see it pop up then once mirrored.
If you happen to know which LP bug it gets mirrored to let me know.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-01-16 01:55 EDT-------
Hi,

> Hi vaish,
> I should see it pop up then once mirrored.
> If you happen to know which LP bug it gets mirrored to let me know.

https://bugzilla.linux.ibm.com/page.cgi?id=track/proxy.html&bug_id=147993

Thank you.

Christian Ehrhardt (paelzer) wrote : Re: [Bug 1617214] Comment bridged from LTC Bugzilla

On Mon, Jan 16, 2017 at 9:01 AM, bugproxy <email address hidden> wrote:

> > If you happen to know which LP bug it gets mirrored to let me know.
>
> https://bugzilla.linux.ibm.com/page.cgi?id=track/proxy.html&bug_id=147993
>

That link is not accessible outside IBM.
Just the Launchpad bug number if possible would be fine.

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-01-16 05:23 EDT-------
Hi,
The Launchpad bug number is:
1637438 - Live Migration fails with error "operation failed: migration job: unexpectedly failed"

Thank you.
