Volume based live migration aborted unexpectedly

Bug #1524898 reported by Ryan Moats on 2015-12-10
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Lee Yarwood

Bug Description

Volume based live migration is failing during tempest testing in the check and experimental pipelines

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:\%22Live%20Migration%20failure:%20operation%20failed:%20migration%20job:%20unexpectedly%20failed\%22%20AND%20tags:\%22screen-n-cpu.txt\%22

shows 42 failures since 12/8

2015-12-18 15:26:26.411 ERROR nova.virt.libvirt.driver [req-9a4dda34-987c-42a6-a20b-22729cd202e7 tempest-LiveBlockMigrationTestJSON-1501972407 tempest-LiveBlockMigrationTestJSON-1128863260] [instance: b50e1dc8-9b4c-4edb-be0e-8777cae14f01] Live Migration failure: operation failed: migration job: unexpectedly failed
2015-12-18 15:26:26.412 DEBUG nova.virt.libvirt.driver [req-9a4dda34-987c-42a6-a20b-22729cd202e7 tempest-LiveBlockMigrationTestJSON-1501972407 tempest-LiveBlockMigrationTestJSON-1128863260] [instance: b50e1dc8-9b4c-4edb-be0e-8777cae14f01] Migration operation thread notification thread_finished /opt/stack/new/nova/nova/virt/libvirt/driver.py:6094
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 168, in _do_send
    waiter.switch(result)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/new/nova/nova/utils.py", line 1161, in context_wrapper
    return func(*args, **kwargs)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5696, in _live_migration_operation
    instance=instance)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5651, in _live_migration_operation
    CONF.libvirt.live_migration_bandwidth)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1511, in migrateToURI
    if ret == -1: raise libvirtError ('virDomainMigrateToURI() failed', dom=self)
libvirtError: operation failed: migration job: unexpectedly failed
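For context on the traceback: the blocking migrateToURI call is dispatched to a native thread by eventlet.tpool, and the worker's exception is re-raised in the calling green thread, which is why the libvirtError surfaces inside _live_migration_operation. A stdlib-only sketch of that offload-and-reraise pattern (FakeDomain and the URI are made up for illustration, not Nova or libvirt code):

```python
import concurrent.futures

def run_in_worker(func, *args, **kwargs):
    """Run a blocking call in a native thread and re-raise any
    exception in the caller, similar to eventlet.tpool.execute."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # future.result() re-raises the worker's exception here
        return pool.submit(func, *args, **kwargs).result()

class FakeDomain:
    """Stand-in for the libvirt domain object; not real libvirt."""
    def migrateToURI(self, uri, flags=0, dname=None, bandwidth=0):
        raise RuntimeError(
            "operation failed: migration job: unexpectedly failed")

try:
    run_in_worker(FakeDomain().migrateToURI, "qemu+tcp://dest/system")
except RuntimeError as e:
    msg = "Live Migration failure: %s" % e
print(msg)
```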

Matt Riedemann (mriedem) on 2015-12-10
summary: - Volume based live migration fails unexpectedly
+ Volume based live migration aborted unexpectedly
Ryan Moats (rmoats) on 2015-12-10
description: updated
Matt Riedemann (mriedem) wrote :

http://logs.openstack.org/17/255217/5/check/gate-tempest-dsvm-neutron-dvr-multinode-full/15995d2/logs/subnode-2/screen-n-cpu.txt.gz?#_2015-12-10_08_14_34_437

2015-12-10 08:14:34.437 ERROR nova.virt.libvirt.driver [req-e8766399-ceeb-4b6c-82e9-bb5de1d8eed3 tempest-LiveBlockMigrationTestJSON-1334467432 tempest-LiveBlockMigrationTestJSON-138512217] [instance: f21cf8ce-fd9e-4fa3-9cf6-9fb5da8751ca] Live Migration failure: operation failed: migration job: unexpectedly failed
2015-12-10 08:14:34.438 DEBUG nova.virt.libvirt.driver [req-e8766399-ceeb-4b6c-82e9-bb5de1d8eed3 tempest-LiveBlockMigrationTestJSON-1334467432 tempest-LiveBlockMigrationTestJSON-138512217] [instance: f21cf8ce-fd9e-4fa3-9cf6-9fb5da8751ca] Migration operation thread notification thread_finished /opt/stack/new/nova/nova/virt/libvirt/driver.py:6130
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 168, in _do_send
    waiter.switch(result)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/new/nova/nova/utils.py", line 1178, in context_wrapper
    return func(*args, **kwargs)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5732, in _live_migration_operation
    instance=instance)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5700, in _live_migration_operation
    CONF.libvirt.live_migration_bandwidth)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1586, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
libvirtError: operation failed: migration job: unexpectedly failed

From the libvirtd logs:

http://logs.openstack.org/17/255217/5/check/gate-tempest-dsvm-neutron-dvr-multinode-full/15995d2/logs/subnode-2/libvirt/libvirtd.txt.gz#_2015-12-10_08_14_33_802

2015-12-10 08:14:33.802+0000: 12436: debug : qemuMonitorIOProcess:396 : QEMU_MONITOR_IO_PROCESS: mon=0x7f70100095e0 buf={"return": {"status": "failed"}, "id": "libvirt-108"}

QEMU log for the domain:

http://logs.openstack.org/17/255217/5/check/gate-tempest-dsvm-neutron-dvr-multinode-full/15995d2/logs/libvirt/qemu/instance-0000000a.txt.gz

From logstash it looks like this started on 12/8, and is isolated to the OVH and BlueBox node...


tags: added: libvirt live-migration
tags: added: volumes
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Daniel Berrange (berrange) wrote :

The odd thing here is the sequence of events from QEMU

We see a normal sequence of 'query-migrate' monitor calls showing ongoing progress until only about 300 KB remain:

2015-12-10 08:14:30.282+0000: 12440: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7f70100095e0 msg={"execute":"query-migrate","id":"libvirt-106"}
2015-12-10 08:14:30.284+0000: 12436: debug : qemuMonitorJSONIOProcessLine:177 : QEMU_MONITOR_RECV_REPLY: mon=0x7f70100095e0 reply={"return": {"expected-downtime": 138, "status": "active", "setup-time": 1, "total-time": 4524, "ram": {"total": 67514368, "remaining": 344064, "mbps": 102.161541, "transferred": 58021706, "duplicate": 3962, "dirty-pages-rate": 431, "skipped": 0, "normal-bytes": 58519552, "normal": 14287}}, "id": "libvirt-106"}

Then we see the STOP event, which shows QEMU is about to switch to the target host

2015-12-10 08:14:30.301+0000: 12436: debug : qemuMonitorJSONIOProcessLine:172 : QEMU_MONITOR_RECV_EVENT: mon=0x7f70100095e0 event={"timestamp": {"seconds": 1449735270, "microseconds": 301459}, "event": "STOP"}

We can see from the status that only 200 KB are left at this point:

2015-12-10 08:14:30.335+0000: 12440: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7f70100095e0 msg={"execute":"query-migrate","id":"libvirt-107"}
2015-12-10 08:14:33.749+0000: 12436: debug : qemuMonitorJSONIOProcessLine:177 : QEMU_MONITOR_RECV_REPLY: mon=0x7f70100095e0 reply={"return": {"expected-downtime": 144, "status": "active", "setup-time": 1, "total-time": 7988, "ram": {"total": 67514368, "remaining": 204800, "mbps": 97.487385, "transferred": 58813786, "duplicate": 3962, "dirty-pages-rate": 431, "skipped": 0, "normal-bytes": 58662912, "normal": 14322}}, "id": "libvirt-107"}

Now the very strange thing - we see a RESUME event, which shows QEMU has started running the source once more - the switchover to the target host has failed:

2015-12-10 08:14:33.761+0000: 12436: debug : qemuMonitorJSONIOProcessLine:172 : QEMU_MONITOR_RECV_EVENT: mon=0x7f70100095e0 event={"timestamp": {"seconds": 1449735273, "microseconds": 760967}, "event": "RESUME"}

The migrate status confirms the failure

2015-12-10 08:14:33.801+0000: 12440: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7f70100095e0 msg={"execute":"query-migrate","id":"libvirt-108"}
2015-12-10 08:14:33.802+0000: 12436: debug : qemuMonitorJSONIOProcessLine:177 : QEMU_MONITOR_RECV_REPLY: mon=0x7f70100095e0 reply={"return": {"status": "failed"}, "id": "libvirt-108"}
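The libvirtd debug lines above embed the raw QMP reply after "reply=", so the migration status and remaining RAM can be pulled out mechanically. A small sketch (the regex is mine, not from any libvirt tooling), shown against the libvirt-106 reply quoted earlier:

```python
import json
import re

LINE = ('2015-12-10 08:14:30.284+0000: 12436: debug : '
        'qemuMonitorJSONIOProcessLine:177 : QEMU_MONITOR_RECV_REPLY: '
        'mon=0x7f70100095e0 reply={"return": {"expected-downtime": 138, '
        '"status": "active", "setup-time": 1, "total-time": 4524, '
        '"ram": {"total": 67514368, "remaining": 344064, "mbps": 102.161541, '
        '"transferred": 58021706, "duplicate": 3962, "dirty-pages-rate": 431, '
        '"skipped": 0, "normal-bytes": 58519552, "normal": 14287}}, '
        '"id": "libvirt-106"}')

def parse_query_migrate(line):
    """Extract (status, remaining RAM in bytes) from a libvirtd
    QEMU_MONITOR_RECV_REPLY debug line; None if it is not a reply."""
    m = re.search(r'reply=(\{.*\})\s*$', line)
    if not m:
        return None
    ret = json.loads(m.group(1))['return']
    return ret.get('status'), ret.get('ram', {}).get('remaining')

print(parse_query_migrate(LINE))  # ('active', 344064)
```

The failed run's final reply carries no 'ram' section at all, so the same parser would return ('failed', None).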

Looking at the libvirtd logs I see this QEMU version

Debian-2.0.0+dfsg-2ubuntu1.21

Looking in the QEMU code for version 2.0.0, there are only two ways this can happen - one is an I/O failure on the migration data stream (which there is no sign of here) and the other is a failure to stop VM CPUs.

The only reason we can fail to stop VM CPUs is if QEMU fails to flush outstanding guest I/O operations.

There have been some issues related to I/O and migration fixed in QEMU over time since 2.0.0 was released. So my money is probably on this being a QEMU bug. AFAIK, Nova and libvirt are both operating correctly.
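To restate the control flow Daniel describes, here is a hedged Python paraphrase of the QEMU 2.0 completion step (not the actual C code in QEMU's migration path; the callables are stand-ins):

```python
import errno

def migration_completion(stop_cpus_and_flush, send_remaining_state):
    """Paraphrase of the two failure modes described above: the job
    fails either when stopping the CPUs (which must flush outstanding
    guest I/O) fails, or on an I/O error on the migration stream."""
    ret = stop_cpus_and_flush()      # vm_stop -> ... -> bdrv_flush_all()
    if ret < 0:
        return 'failed'              # e.g. ret == -errno.EIO, i.e. -5
    if not send_remaining_state():   # I/O failure on the stream itself
        return 'failed'
    return 'completed'

# A flush failure aborts the switchover and the source resumes:
print(migration_completion(lambda: -errno.EIO, lambda: True))  # failed
```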

Matt Riedemann (mriedem) wrote :

This fails primarily on OVH nodes which are the slowest ones we have in the community infra system, so that indicates there is a speed component to this.

description: updated

Looking at today's failures, which involve a much more modern QEMU, we've got a similar case:

http://logs.openstack.org/55/328055/12/check/gate-tempest-dsvm-multinode-live-migration/e4d241f/logs/subnode-2/libvirt/libvirtd.txt.gz#_2016-06-28_12_51_58_804
http://logs.openstack.org/55/328055/12/check/gate-tempest-dsvm-multinode-live-migration/e4d241f/logs/libvirt/qemu/instance-00000003.txt.gz

a) 12:51:58.804+0000: receives a query-migrate reply saying all is fine
b) 12:51:58.908+0000: receives a STOP from the source qemu saying it's nearly finished
c) 12:52:01.358+0000: ! receives a query-migrate reply saying migration has failed
d) 12:52:01.377760Z qemu-system-x86_64: load of migration failed: Input/output error
e) 12:52:01.360+0000: "event": "RESUME"
f) 12:52:03.136445Z qemu-system-x86_64: terminating on signal 15 from pid 18359

(d) comes from the destination.
OK, so migration failed - but we can't really tell who blinked first; the destination gives the I/O error probably because the source migration stream went away from its point of view (probably!), either due to a network issue or the source dying. Given that we see the migration fail slightly earlier on the source, it's probably the source dying by itself - but unfortunately with no error in the source qemu log before it quits.

We need some debugging in the version of qemu that's run - unfortunately the Ubuntu qemu packages don't seem to have any of qemu's tracing turned on either.

Daniel Berrange (berrange) wrote :

This change is attempting to replace the system-provided QEMU with a custom build containing some debug hacks, to try to gather more data:

https://review.openstack.org/#/c/335549/

Daniel Berrange (berrange) wrote :

We now have a job which has failed with the custom QEMU present

http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/

See the source QEMU log:

http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/logs/subnode-2/libvirt/qemu/instance-00000003.txt.gz

migration_completion: global_state_store ret=0
migration_completion: vm_stop_force_state ret=-5

This message with ret=-5 is reporting the return value of the 'vm_stop' API call, which in turn is reporting do_vm_stop which in turn is reporting the value from bdrv_flush_all(). The latter reports errno values, and -5 corresponds to EIO.

So, we're failing the migration switch over because the block driver was unable to flush outstanding I/O for some unknown reason.
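The errno arithmetic can be checked directly with Python's stdlib; -5 is indeed -EIO:

```python
import errno
import os

ret = -5  # as reported by the vm_stop_force_state debug line above
err = -ret
print(errno.errorcode[err])  # EIO
print(os.strerror(err))      # "Input/output error" on Linux
```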

Daniel Berrange (berrange) wrote :

The failure from QEMU happens shortly before 14:28:15

If we look in syslog at this same time we see I/O errors:

http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/logs/subnode-2/syslog.txt.gz#_Jun_30_14_28_12

Jun 30 14:28:11 ubuntu-xenial-2-node-ovh-gra1-2121639-75684 kernel: connection1:0: detected conn error (1020)
Jun 30 14:28:12 ubuntu-xenial-2-node-ovh-gra1-2121639-75684 iscsid[537]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jun 30 14:28:15 ubuntu-xenial-2-node-ovh-gra1-2121639-75684 kernel: blk_update_request: I/O error, dev sda, sector 0
Jun 30 14:28:15 ubuntu-xenial-2-node-ovh-gra1-2121639-75684 iscsid[537]: connection1:0 is operational after recovery (1 attempts)

Yes, the same is true on yesterday's run:

http://logs.openstack.org/55/328055/12/check/gate-tempest-dsvm-multinode-live-migration/e4d241f/logs/syslog.txt.gz

Jun 28 12:51:56 ubuntu-xenial-2-node-ovh-bhs1-2053574 iscsid[507]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)

(There is a libvirt assert shortly after that, but it looks like iscsi falls over first:
Jun 28 12:52:01 ubuntu-xenial-2-node-ovh-bhs1-2053574 libvirtd[1769]: libvirtd: malloc.c:3723: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.)
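When scanning syslog across CI runs for these events, the iscsid line has a stable shape; a small sketch that pulls out the connection id, error code, and error name (the regex is mine, not from any OpenStack tooling):

```python
import re

# One of the syslog lines quoted above
SYSLOG = ("Jun 28 12:51:56 ubuntu-xenial-2-node-ovh-bhs1-2053574 iscsid[507]: "
          "Kernel reported iSCSI connection 1:0 error (1020 - "
          "ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)")

ISCSI_ERR = re.compile(r'iscsid\[\d+\]: Kernel reported iSCSI connection '
                       r'(\S+) error \((\d+) - (\w+)')

m = ISCSI_ERR.search(SYSLOG)
print(m.groups())  # ('1:0', '1020', 'ISCSI_ERR_TCP_CONN_CLOSE')
```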

Daniel Berrange (berrange) wrote :

The syslog on the host running the target QEMU also sees an iscsi error a few seconds before the source host got one

http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/logs/syslog.txt.gz

Jun 30 14:28:08 ubuntu-xenial-2-node-ovh-gra1-2121639 kernel: connection1:0: detected conn error (1020)
Jun 30 14:28:09 ubuntu-xenial-2-node-ovh-gra1-2121639 iscsid[525]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)

Interestingly, this host is the one actually running the tgtd iscsi server and cinder

Daniel Berrange (berrange) wrote :

Interestingly, we also see these iscsi connection errors on CI runs which don't fail. It looks like they're happening at a slightly different time though, so QEMU is lucky enough not to be doing I/O at that moment and thus doesn't fail.

Dan: On both sides? Of course that's the optimistic view; the pessimistic view
is that it did the I/O and the error didn't get reported.

Lee Yarwood (lyarwood) wrote :

This could be due to both nodes using the same initiator name (/etc/iscsi/initiatorname.iscsi). I can see that the src host is using iqn.1993-08.org.debian:01:d492516748 from the n-cpu.log [1], but I can't seem to find a way of confirming what the dest is using.

[1] http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/logs/subnode-2/screen-n-cpu.txt.gz?#_2016-06-30_14_27_53_898

Kashyap Chamarthy (kashyapc) wrote :

Lee points out on IRC (#openstack-nova) that the Ubuntu Xenial machines have a pre-populated 'InitiatorName':

-----
$ wget https://cloud-images.ubuntu.com/xenial/current/xenial-server-cloudimg-amd64-disk1.img
$ virt-cat -a xenial-server-cloudimg-amd64-disk1.img /etc/iscsi/initiatorname.iscsi
## DO NOT EDIT OR REMOVE THIS FILE!
## If you remove this file, the iSCSI daemon will not start.
## If you change the InitiatorName, existing access control lists
## may reject this initiator. The InitiatorName must be unique
## for each iSCSI initiator. Do NOT duplicate iSCSI InitiatorNames.
InitiatorName=iqn.1993-08.org.debian:01:56aff7b418d
-----

And that it ought to be auto-generated to get a unique name per host, by providing "GenerateName=yes" (Source: https://wiki.debian.org/SAN/iSCSI/open-iscsi)
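A hedged sketch of what per-host generation amounts to: a random hex suffix on the distribution prefix. (This mimics only the shape of the name; the real open-iscsi tool for this is iscsi-iname, and its algorithm may differ.)

```python
import random

def generate_initiator_name(prefix="iqn.1993-08.org.debian:01"):
    """Sketch only: a random 12-hex-digit suffix per host, so two
    nodes built from the same image no longer share an initiator."""
    suffix = ''.join(random.choice('0123456789abcdef') for _ in range(12))
    return '%s:%s' % (prefix, suffix)

name = generate_initiator_name()
print(name)
```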

Lee Yarwood (lyarwood) wrote :

I've added two more DEBUG statements to capture both the connection_info dict used by connect_volume and each host's connector/initiator, using danpb's topic below:

https://review.openstack.org/#/q/topic:mig-debug

Matt Riedemann (mriedem) wrote :

Maybe we should try using in-qemu iscsi to see if we still have the same issue, or at least see if it fails with a better error message. The package is available for xenial, which the live migration job is using now:

http://packages.ubuntu.com/xenial/qemu-block-extra

The qemu native iscsi support was added to nova in kilo:

https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/qemu-built-in-iscsi-initiator.html

Otherwise I'd say let's skip the volume-backed live migration test(s) in the live migration job so we can try and stabilize that job (the tests would still run in the non-voting multinode job).

Reviewed: https://review.openstack.org/353002
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2f68cd3e9f47ca5bb864c26c1b0b123918197ef4
Submitter: Jenkins
Branch: master

commit 2f68cd3e9f47ca5bb864c26c1b0b123918197ef4
Author: Timofey Durakov <email address hidden>
Date: Tue Aug 9 19:35:23 2016 +0300

    Skipping test_volume_backed_live_migration for live_migration job

    For non-shared storage live-migration job is unstable due to
    https://bugs.launchpad.net/nova/+bug/1524898. This patch temporary
    exclude this test from tests to be run.

    Related-Bug: #1524898

    Change-Id: I87559b8e90855837ad7b5f291bd9a8fc646421d3

Matt Riedemann (mriedem) wrote :

We're testing against the ubuntu cloud archive on xenial nodes now so we have libvirt 2.5.0 and qemu 2.8. We should re-enable the test_volume_backed_live_migration test to see if anything has improved.

Sean Dague (sdague) on 2017-06-28
Changed in nova:
status: Confirmed → In Progress
Matt Riedemann (mriedem) wrote :

We're now running against pike UCA for queens changes:

http://logs.openstack.org/63/508163/8/check/legacy-tempest-dsvm-multinode-live-migration/ad23f3d/logs/dpkg-l.txt.gz

ii libvirt-bin 3.6.0-1ubuntu5~cloud0 amd64 programs for the libvirt library

ii qemu-system 1:2.10+dfsg-0ubuntu3~cloud0 amd64 QEMU full system emulation binaries

So maybe another attempt at enabling volume-backed live migration testing is in order.

Fix proposed to branch: master
Review: https://review.openstack.org/536105

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.openstack.org/536105
Reason: Missed the changes already posted for this.

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.openstack.org/536105

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/459321
