libvirt tunnelled migration fails with "migration job: unexpectedly failed"

Bug #1432630 reported by Lukas Vacek
This bug affects 2 people
Affects            Status      Importance   Assigned to   Milestone
libvirt            Won't Fix   Medium
libvirt (Ubuntu)   Won't Fix   High         Unassigned

Bug Description

There is a bug in libvirt in up-to-date Ubuntu 14.04.2 LTS when live migrating a VM with large storage attached to it: the migration fails with "error: operation failed: migration job: unexpectedly failed". I'm not sure what storage size triggers the bug, but I can migrate a guest with 8GB storage between nodes, while a guest with 30GB storage fails to migrate. This only happens when the --tunnelled parameter is passed to "virsh migrate".

# virsh migrate --live --p2p --copy-storage-inc --tunnelled ubuntuutopic-small "qemu+tcp://lab5/system"
error: operation failed: migration job: unexpectedly failed

#on the other hand, this WORKS OK:
virsh migrate --live --p2p --copy-storage-inc ubuntuutopic "qemu+tcp://lab6/system"

libvirt: 1.2.2-0ubuntu13.1.9
qemu: 2.0.0+dfsg-2ubuntu1.10
linux kernel: 3.13.0-46-generic #79-Ubuntu

Versions are the same on both boxes.

The only changes to libvirtd.conf are to listen on TCP and not to require authentication.

Logs and the domain xml attached.

In , Lukas (lukas-redhat-bugs) wrote :

Created attachment 1002391
domain.xml

There is a bug in libvirt (built from current master: 51f9f03a4ca50b070c0fbfb29748d49f583e15e1) when live migrating a VM with large storage attached to it: the migration fails with "error: operation failed: migration job: unexpectedly failed". I'm not sure what storage size triggers the bug, but a guest with 30GB storage fails to migrate in our test lab. This only happens when the --tunnelled parameter is passed to "virsh migrate".

# virsh migrate --live --p2p --copy-storage-inc --tunnelled ubuntuutopic "qemu+tcp://lab5/system"
error: operation failed: migration job: unexpectedly failed

#on the other hand, this WORKS OK:
virsh migrate --live --p2p --copy-storage-inc ubuntuutopic "qemu+tcp://lab5/system"

libvirt: current master - 51f9f03a4ca50b070c0fbfb29748d49f583e15e1
qemu: 2.0.0+dfsg-2ubuntu1.10
linux kernel: 3.13.0-46-generic #79-Ubuntu

Versions are the same on both boxes.

libvirtd.conf only changed to listen on TCP and not to require authentication.

Logs and the domain xml attached.

Steps to Reproduce:
1. create a new domain on host1 (if you can't reproduce the failure, you might need to create a domain with larger storage)
2. set up host2: precreate an empty qcow2 disk in the corresponding location, and change the libvirtd config to listen on a TCP port
3. on host1, run: virsh migrate --live --p2p --copy-storage-inc --tunnelled GUEST_VM "qemu+tcp://host2/system"

Actual results:
error: operation failed: migration job: unexpectedly failed

Expected results:
migration succeeds just like when --tunnelled is not used

Domain and logs attached.

Lukas Vacek (lukas-vacek) wrote :

I've just tested with the current upstream git master 51f9f03a4ca50b070c0fbfb29748d49f583e15e1 and ran into the same problem. I've raised an issue against upstream libvirt here https://bugzilla.redhat.com/show_bug.cgi?id=1202453 .

In , Lukas (lukas-redhat-bugs) wrote :

Created attachment 1002394
destination libvirtd log

In , Lukas (lukas-redhat-bugs) wrote :

Created attachment 1002395
source node libvirtd log

In , Lukas (lukas-redhat-bugs) wrote :

Created attachment 1002397
destination libvirt/qemu/guest.log

In , Jiri (jiri-redhat-bugs) wrote :

As the error message from the source daemon suggests, the reason is a different way of transferring disk images with p2p vs tunnelled migration. The preferred way is to use NBD, but this is unfortunately impossible with tunnelled migration, so libvirt falls back to the old way of storage migration. I'm not sure how well this older method is supported by the QEMU community, but you can try to raise the issue with them. There doesn't seem to be any bug in libvirt here, except for the lack of NBD support with tunnelled migration, and that is rather a request for a new feature.

In , Lukas (lukas-redhat-bugs) wrote :

Thanks for the quick answer.

Just two things.

1) The migration works fine when --tunnelled is not used. Based on that, I'd assume native QEMU migration works fine.
2) Could libvirt provide better error logs when this failure occurs?

In , Jiri (jiri-redhat-bugs) wrote :

1) There are two implementations of storage migration in QEMU: the old variant ("migrate -b" monitor command) and the new variant using NBD. Using --tunnelled forces libvirt to switch from NBD to the old implementation when asking QEMU to migrate. It's QEMU doing the migration, including storage, in both cases. According to the logs, NBD-based storage migration works fine for you while the old implementation doesn't.

2) The error actually comes from QEMU so unless it provides anything better to us, we can't report it. And there's nothing interesting in the qemu log on destination host, which doesn't make things any better. Can you also check that log file on the source host?
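
The selection rule described in point 1 can be modelled with a small shell sketch. The function name `storage_migration_path` and its output labels are hypothetical; the sketch encodes only the rule stated in this comment (tunnelled migration cannot use NBD, so the old "migrate -b" path is used) and is not libvirt's actual decision logic, which also depends on the libvirt and QEMU versions involved.

```shell
# Hypothetical helper: report which QEMU storage-migration implementation
# would be used, following only the rule stated in the comment above.
# Arguments are the flags that would be passed to "virsh migrate".
storage_migration_path() {
    for flag in "$@"; do
        case "$flag" in
            --tunnelled) echo "old (migrate -b)"; return 0 ;;
        esac
    done
    echo "NBD"
}

storage_migration_path --live --p2p --copy-storage-inc --tunnelled
storage_migration_path --live --p2p --copy-storage-inc
```

With the flags from the original report, the first call reports the old path and the second reports NBD, which matches the failing and working commands respectively.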

In , Lukas (lukas-redhat-bugs) wrote :

Thanks for the clarification.

Lukas Vacek (lukas-vacek) wrote :

According to upstream (see the bug report mentioned in the previous comment), it's not a bug in libvirt but in qemu, so I have added qemu to the affected packages.

Lukas Vacek (lukas-vacek) wrote :

I will raise a separate bug for qemu.

Changed in libvirt (Ubuntu):
status: New → Invalid
Changed in qemu (Ubuntu):
status: New → Invalid

In , Lukas (lukas-redhat-bugs) wrote :

Hi Jiri

I did more tests on my side, and it turns out it may very well be an issue in libvirt, so I have to reopen this issue.

I have done the following tests:

Test A)
1) start libvirt on hostA and hostB
2) start GUEST on hostA
3) create an empty disk on hostB with qemu-img create (might not be necessary with recent enough libvirt)
4) start migration using: virsh migrate --live --p2p --copy-storage-inc --tunnelled GUEST "qemu+tcp://hostB/system"

# at this point migration fails with "unexpectedly failed"

5) Now I stopped libvirt on hostA and hostB
6) I have manually started qemu with -incoming on hostB
7) I have connected via QMP to hostA and executed "migrate blk=true inc=true uri=tcp:10.0.1.31:49152" in qmp-shell

# migration fails with "unexpectedly failed"

At this point I suspected the problem to be in qemu. However, when I do everything manually, the live migration works, i.e.:

Test B)
1) stop libvirt on hostA and hostB
2) start GUEST on hostA
3) create an empty disk on hostB with qemu-img create
4) start qemu with -incoming on hostB
5) connect via QMP to hostA and execute "migrate blk=true inc=true uri=tcp:10.0.1.31:49152" in qmp-shell

migration works!
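
For reference, the qmp-shell line used in both tests corresponds to a QMP "migrate" command with "blk" and "inc" arguments. The sketch below only prints that JSON; the function name `qmp_migrate_json` is hypothetical, and the address and port are the ones quoted in the tests. It assumes a QEMU of this era (2.0); newer QEMU releases may not accept "blk" and "inc".

```shell
# Hypothetical helper: print the QMP command equivalent to the qmp-shell line
#   migrate blk=true inc=true uri=tcp:HOST:PORT
# $1 = destination host, $2 = destination port (the -incoming listener).
qmp_migrate_json() {
    printf '{"execute": "migrate", "arguments": {"uri": "tcp:%s:%s", "blk": true, "inc": true}}\n' "$1" "$2"
}

qmp_migrate_json 10.0.1.31 49152
```

When talking to a raw QMP socket instead of qmp-shell, the "qmp_capabilities" command has to be sent first; qmp-shell does that handshake itself.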

BTW, this issue might be more important than it seems, because OpenStack Nova defaults to tunnelled migration.

Thanks!

Lukas Vacek (lukas-vacek) wrote :

Reopening because the upstream issue has been reopened.

Changed in libvirt (Ubuntu):
status: Invalid → New

In , Jiri (jiri-redhat-bugs) wrote :

(In reply to Lukas Vacek from comment #8)
> Test B)
> 1) stop libvirt on hostA and hostB
> 2) start GUEST on hostA
> 3) create an empty disk on hostB with qemu-img create

I see it now. Another difference between migrating storage using NBD vs. the old way is in this step 3. Current libvirt (as of 1.2.13) will precreate the disk on the destination host but only when NBD is used. If it's not used, the files need to be properly created on the destination before starting migration (I think Nova takes care of this). With older libvirt, the files need to exist even if NBD is used.
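
A minimal sketch of the precreation step for the non-NBD case, assuming a qcow2 image. The helper name `precreate_cmd`, the image path, and the size are placeholders; the helper prints the command rather than running it, so it can be reviewed before being executed on the destination host.

```shell
# Hypothetical helper: print (not run) the qemu-img command that precreates
# an empty qcow2 image on the destination host.
# $1 = image path (should match the disk path in the domain XML)
# $2 = virtual size (should be at least the source disk's virtual size)
precreate_cmd() {
    printf 'qemu-img create -f qcow2 %s %s\n' "$1" "$2"
}

# Placeholder path and size; substitute the values of the source guest.
precreate_cmd /var/lib/libvirt/images/guest.qcow2 30G
```

Running "qemu-img info" on the source image shows the virtual size to use.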

In , Lukas (lukas-redhat-bugs) wrote :

Agreed. But I don't think it's the cause of the issue because I precreate the files exactly the same way in Test A and Test B.

In , Jiri (jiri-redhat-bugs) wrote :

Heh, I'm blind.

Anyway, could you please post the logs I asked for on IRC a few days ago? Turn on debug logs (http://wiki.libvirt.org/page/DebugLogs), run the migration, and attach the libvirtd.log and guest.log files from both source and destination hosts.
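
A sketch of what such a debug-log setup might look like in libvirtd.conf. The filter string is the one quoted in the attachment descriptions in this report; the log_level and log_outputs values, and the helper name `debug_log_conf`, are assumptions based on a common libvirt debug-logging setup rather than anything stated in this thread.

```shell
# Hypothetical helper: print a libvirtd.conf fragment that enables debug
# logging to a file. The log_filters value is copied from this report;
# log_level and log_outputs are assumed values.
debug_log_conf() {
    cat <<'EOF'
log_level = 1
log_filters="3:rpc 3:remote 3:util.json 3:util.event 3:node_device 3:util.object 3:util.netlink 3:access"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
EOF
}

debug_log_conf
```

The fragment would go into /etc/libvirt/libvirtd.conf on both hosts, followed by a restart of libvirtd before rerunning the migration.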

Serge Hallyn (serge-hallyn) wrote :

Could you please list here the qemu bug number?

Changed in libvirt (Ubuntu):
importance: Undecided → High
status: New → Confirmed

In , Lukas (lukas-redhat-bugs) wrote :

Created attachment 1009066
libvirtd source log log_filters="3:rpc 3:remote 3:util.json 3:util.event 3:node_device 3:util.object 3:util.netlink 3:access"

In , Lukas (lukas-redhat-bugs) wrote :

Created attachment 1009067
libvirtd destination log log_filters="3:rpc 3:remote 3:util.json 3:util.event 3:node_device 3:util.object 3:util.netlink 3:access"

In , Lukas (lukas-redhat-bugs) wrote :

Created attachment 1009068
new qemu/guest.log on source

In , Lukas (lukas-redhat-bugs) wrote :

Created attachment 1009070
new qemu/guest.log on destination

In , Lukas (lukas-redhat-bugs) wrote :

First of all, sorry I didn't get to this earlier. We did some reorganizing of our lab env so I could reproduce the test with logs on only now.

It dies with another error now. However, direct qemu migration works as does not-tunnelled libvirt migration.

root@lab1:/var/lib/libvirt# virsh migrate --live --p2p --copy-storage-inc --tunnelled ubuntuutopic "qemu+tcp://lab2/system"
error: Unable to read from monitor: Connection reset by peer

Libvirt debug logs attached.

In , Kashyap (kashyap-redhat-bugs) wrote :

(In reply to Lukas Vacek from comment #16)
> First of all, sorry I didn't get to this earlier. We did some reorganizing
> of our lab env so I could reproduce the test with logs on only now.
>
> It dies with another error now. However, direct qemu migration works as does
> not-tunnelled libvirt migration.
>
> root@lab1:/var/lib/libvirt# virsh migrate --live --p2p --copy-storage-inc
> --tunnelled ubuntuutopic "qemu+tcp://lab2/system"
> error: Unable to read from monitor: Connection reset by peer

Just a side question, can you also reproduce it with qemu+ssh? I was just testing a slight variant of the above CLI yesterday with qemu+ssh on Fedora 22, and it worked:

    $ virsh migrate --verbose --copy-storage-all --p2p --live cvm1 \
        qemu+ssh://root@desthost/system

(NOTE: The above assumes root on src can SSH to dst without any password prompt, so, for testing you might want to quickly create SSH keys with empty passphrase, assuming it's a trusted network.)

> Libvirt debug logs attached.

In , Lukas (lukas-redhat-bugs) wrote :

Just wondering, is qemu+tcp working for you or not?

In , Kashyap (kashyap-redhat-bugs) wrote :

Yes, qemu+tcp is working for me, I tested four variants (refer further
below) with these versions:

    kernel-4.0.0-0.rc5.git4.1.fc22.x86_64
    libvirt-daemon-kvm-1.2.13-2.fc22.x86_64
    qemu-system-x86-2.3.0-0.2.rc1.fc22.x86_64

Config setup
------------
I had this config in destination's libvirtd.conf:

    $ cat /etc/libvirt/libvirtd.conf | grep -v ^$ | grep -v ^#
    listen_tls = 0
    listen_tcp = 1
    auth_tcp = "none"

And started the libvirtd daemon on the destination with:

    $ cat /etc/sysconfig/libvirtd | grep -v ^$ | grep -v ^#
    LIBVIRTD_ARGS="--listen"

Since I'm testing in a trusted network, I also had SSH access (via
public/private keys) to root on the destination host without any password
prompts.

Tests
-----

I just tested three variants of migration with qemu+tcp, successfully:

(1) Native migration, client to two libvirtd servers

    $ virsh migrate --verbose --copy-storage-all \
        --live cvm1 qemu+tcp://kashyapc@devstack3/system

(2) Native migration, client to and peer2peer between, two libvirtd servers

    $ virsh migrate --verbose --copy-storage-all \
         --p2p --live cvm1 qemu+tcp://kashyapc@devstack3/system

(3) Tunnelled migration, client and peer2peer between two libvirtd servers

    $ virsh migrate --verbose --copy-storage-all \
        --p2p --tunnelled --live cvm1 qemu+tcp://kashyapc@devstack3/system

Successful libvirtd log (with debug filter set) for the 3rd variant:

    https://kashyapc.fedorapeople.org/virt/temp/tunnelled-p2p-migration-qemu-tcp-libvirtd-log.txt

Additionally, I also tested the below (without explicit
'--copy-storage-all' flag, it works too):

    $ virsh migrate --verbose --p2p --tunnelled \
        --live cvm1 qemu+tcp://kashyapc@devstack3/system

In , Kashyap (kashyap-redhat-bugs) wrote :

Closing the bug, per comment #19. Feel free to reopen in case you can provide a reliable reproducer with appropriate logs.

In , Lukas (lukas-redhat-bugs) wrote :

I'd like to test with qemu+ssh but after I have provided the debug logs I have downgraded qemu on our lab boxes.

I think it would be best to raise a separate issue for the problem with qemu+ssh.

Thanks,
Lucas

In , Kashyap (kashyap-redhat-bugs) wrote :

(In reply to Kashyap Chamarthy from comment #19)

[. . .]

[Just correcting the terminology for migration scenarios (2) and (3).]

Assuming I'm reading this doc correctly. (Libvirt devs, please correct me if I'm wrong.)

    http://libvirt.org/migration.html#scenarios

> Tests
> -----
>
> I just tested three variants of migration with qemu+tcp, successfully:
>
>
> (1) Native migration, client to two libvirtd servers
>
> $ virsh migrate --verbose --copy-storage-all \
> --live cvm1 qemu+tcp://kashyapc@devstack3/system
>
> (2) Native migration, client to and peer2peer between, two libvirtd servers

The below is called "Native migration, peer2peer between two libvirtd servers"

Refer: http://libvirt.org/migration.html#nativepeer2peer

>
> $ virsh migrate --verbose --copy-storage-all \
> --p2p --live cvm1 qemu+tcp://kashyapc@devstack3/system
>
> (3) Tunnelled migration, client and peer2peer between two libvirtd servers

The below is called "Tunnelled migration, peer2peer between two libvirtd servers". Refer: http://libvirt.org/migration.html#scenariotunnelpeer2peer2

>
> $ virsh migrate --verbose --copy-storage-all \
> --p2p --tunnelled --live cvm1 qemu+tcp://kashyapc@devstack3/system

[. . .]

In , Lukas (lukas-redhat-bugs) wrote :

bump

no longer affects: qemu (Ubuntu)

In , Frank (frank-redhat-bugs) wrote :

Hi,

I ran into the same problem and using qemu+tcp instead of qemu+ssh solved it. However, it took me a lot of hours to figure this out. :(

I'd like to add that the error seems to depend on the VM's workload. I was able to reproduce the error with a higher workload, while the live migration worked fine with a lighter workload.

Best,
Frank

Serge Hallyn (serge-hallyn) wrote :

Per the rh bug, does using qemu+tcp instead of qemu+ssh work around it?

Lukas Vacek (lukas-vacek) wrote :

As far as I can tell, the reports mentioning problems only with qemu+ssh and not qemu+tcp are unrelated. Probably best if you could verify yourself. Thanks.

In , Cole (cole-redhat-bugs) wrote :

It's been a while since the last report. Is anyone still seeing this with more recent libvirt + distro?

In , Cole (cole-redhat-bugs) wrote :

Since there's no response, closing as DEFERRED. But if anyone is still affected with newer libvirt versions, please re-open and we can triage from there.

Changed in libvirt:
importance: Unknown → Medium
status: Unknown → Won't Fix

Christian Ehrhardt (paelzer) wrote :

Following upstream to Won't Fix for now.

Changed in libvirt (Ubuntu):
status: Confirmed → Won't Fix