Volume based live migration aborted unexpectedly

Bug #1524898 reported by Ryan Moats
This bug affects 2 people
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Unassigned

Bug Description

Volume based live migration is failing during tempest testing in the check and experimental pipelines

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:\%22Live%20Migration%20failure:%20operation%20failed:%20migration%20job:%20unexpectedly%20failed\%22%20AND%20tags:\%22screen-n-cpu.txt\%22

shows 42 failures since 12/8
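
For reference, the same count can be reproduced against the Elasticsearch API behind that dashboard. A rough sketch in Python - the endpoint path is an assumption (elastic-recheck issues a similar query_string query under the hood):

-----
import requests

query = ('message:"Live Migration failure: operation failed: '
         'migration job: unexpectedly failed" '
         'AND tags:"screen-n-cpu.txt"')

# _count returns just the number of matching documents
resp = requests.post(
    'http://logstash.openstack.org/elasticsearch/_count',  # assumed endpoint
    json={'query': {'query_string': {'query': query}}},
    timeout=30)
resp.raise_for_status()
print(resp.json()['count'])
-----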

2015-12-18 15:26:26.411 ERROR nova.virt.libvirt.driver [req-9a4dda34-987c-42a6-a20b-22729cd202e7 tempest-LiveBlockMigrationTestJSON-1501972407 tempest-LiveBlockMigrationTestJSON-1128863260] [instance: b50e1dc8-9b4c-4edb-be0e-8777cae14f01] Live Migration failure: operation failed: migration job: unexpectedly failed
2015-12-18 15:26:26.412 DEBUG nova.virt.libvirt.driver [req-9a4dda34-987c-42a6-a20b-22729cd202e7 tempest-LiveBlockMigrationTestJSON-1501972407 tempest-LiveBlockMigrationTestJSON-1128863260] [instance: b50e1dc8-9b4c-4edb-be0e-8777cae14f01] Migration operation thread notification thread_finished /opt/stack/new/nova/nova/virt/libvirt/driver.py:6094
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 168, in _do_send
    waiter.switch(result)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/new/nova/nova/utils.py", line 1161, in context_wrapper
    return func(*args, **kwargs)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5696, in _live_migration_operation
    instance=instance)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5651, in _live_migration_operation
    CONF.libvirt.live_migration_bandwidth)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1511, in migrateToURI
    if ret == -1: raise libvirtError ('virDomainMigrateToURI() failed', dom=self)
libvirtError: operation failed: migration job: unexpectedly failed

Matt Riedemann (mriedem)
summary: - Volume based live migration fails unexpectedly
+ Volume based live migration aborted unexpectedly
Ryan Moats (rmoats)
description: updated
Revision history for this message
Matt Riedemann (mriedem) wrote :

http://logs.openstack.org/17/255217/5/check/gate-tempest-dsvm-neutron-dvr-multinode-full/15995d2/logs/subnode-2/screen-n-cpu.txt.gz?#_2015-12-10_08_14_34_437

2015-12-10 08:14:34.437 ERROR nova.virt.libvirt.driver [req-e8766399-ceeb-4b6c-82e9-bb5de1d8eed3 tempest-LiveBlockMigrationTestJSON-1334467432 tempest-LiveBlockMigrationTestJSON-138512217] [instance: f21cf8ce-fd9e-4fa3-9cf6-9fb5da8751ca] Live Migration failure: operation failed: migration job: unexpectedly failed
2015-12-10 08:14:34.438 DEBUG nova.virt.libvirt.driver [req-e8766399-ceeb-4b6c-82e9-bb5de1d8eed3 tempest-LiveBlockMigrationTestJSON-1334467432 tempest-LiveBlockMigrationTestJSON-138512217] [instance: f21cf8ce-fd9e-4fa3-9cf6-9fb5da8751ca] Migration operation thread notification thread_finished /opt/stack/new/nova/nova/virt/libvirt/driver.py:6130
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 168, in _do_send
    waiter.switch(result)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/new/nova/nova/utils.py", line 1178, in context_wrapper
    return func(*args, **kwargs)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5732, in _live_migration_operation
    instance=instance)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5700, in _live_migration_operation
    CONF.libvirt.live_migration_bandwidth)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1586, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
libvirtError: operation failed: migration job: unexpectedly failed

From the libvirtd logs:

http://logs.openstack.org/17/255217/5/check/gate-tempest-dsvm-neutron-dvr-multinode-full/15995d2/logs/subnode-2/libvirt/libvirtd.txt.gz#_2015-12-10_08_14_33_802

2015-12-10 08:14:33.802+0000: 12436: debug : qemuMonitorIOProcess:396 : QEMU_MONITOR_IO_PROCESS: mon=0x7f70100095e0 buf={"return": {"status": "failed"}, "id": "libvirt-108"}

QEMU log for the domain:

http://logs.openstack.org/17/255217/5/check/gate-tempest-dsvm-neutron-dvr-multinode-full/15995d2/logs/libvirt/qemu/instance-0000000a.txt.gz

From logstash it looks like this started on 12/8, and is isolated to the OVH and BlueBox node...


tags: added: libvirt live-migration
tags: added: volumes
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Daniel Berrange (berrange) wrote :

The odd thing here is the sequence of events from QEMU

We see a normal sequence of 'query-migrate' monitor calls showing ongoing progress until we only have ~300 KB left:

2015-12-10 08:14:30.282+0000: 12440: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7f70100095e0 msg={"execute":"query-migrate","id":"libvirt-106"}
2015-12-10 08:14:30.284+0000: 12436: debug : qemuMonitorJSONIOProcessLine:177 : QEMU_MONITOR_RECV_REPLY: mon=0x7f70100095e0 reply={"return": {"expected-downtime": 138, "status": "active", "setup-time": 1, "total-time": 4524, "ram": {"total": 67514368, "remaining": 344064, "mbps": 102.161541, "transferred": 58021706, "duplicate": 3962, "dirty-pages-rate": 431, "skipped": 0, "normal-bytes": 58519552, "normal": 14287}}, "id": "libvirt-106"}

Then we see the STOP event, which shows QEMU is about to switch to the target host

2015-12-10 08:14:30.301+0000: 12436: debug : qemuMonitorJSONIOProcessLine:172 : QEMU_MONITOR_RECV_EVENT: mon=0x7f70100095e0 event={"timestamp": {"seconds": 1449735270, "microseconds": 301459}, "event": "STOP"}

We can see from the status that only 200 KB are left at this point:

2015-12-10 08:14:30.335+0000: 12440: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7f70100095e0 msg={"execute":"query-migrate","id":"libvirt-107"}
2015-12-10 08:14:33.749+0000: 12436: debug : qemuMonitorJSONIOProcessLine:177 : QEMU_MONITOR_RECV_REPLY: mon=0x7f70100095e0 reply={"return": {"expected-downtime": 144, "status": "active", "setup-time": 1, "total-time": 7988, "ram": {"total": 67514368, "remaining": 204800, "mbps": 97.487385, "transferred": 58813786, "duplicate": 3962, "dirty-pages-rate": 431, "skipped": 0, "normal-bytes": 58662912, "normal": 14322}}, "id": "libvirt-107"}

Now the very strange thing - we see a RESUME event, which shows QEMU has started running the source once more - the switchover to the target host has failed:

2015-12-10 08:14:33.761+0000: 12436: debug : qemuMonitorJSONIOProcessLine:172 : QEMU_MONITOR_RECV_EVENT: mon=0x7f70100095e0 event={"timestamp": {"seconds": 1449735273, "microseconds": 760967}, "event": "RESUME"}

The migrate status confirms the failure

2015-12-10 08:14:33.801+0000: 12440: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7f70100095e0 msg={"execute":"query-migrate","id":"libvirt-108"}
2015-12-10 08:14:33.802+0000: 12436: debug : qemuMonitorJSONIOProcessLine:177 : QEMU_MONITOR_RECV_REPLY: mon=0x7f70100095e0 reply={"return": {"status": "failed"}, "id": "libvirt-108"}

Looking at the libvirtd logs I see this QEMU version

Debian-2.0.0+dfsg-2ubuntu1.21

Looking in the QEMU code for version 2.0.0, there are only two ways this can happen - one is an I/O failure on the migration data stream (which there is no sign of here) and the other is a failure to stop VM CPUs.

The only reason we can fail to stop VM CPUs is if QEMU fails to flush outstanding guest I/O operations.

There have been some issues related to I/O and migration fixed in QEMU over time since 2.0.0 was released. So my money is probably on this being a QEMU bug. AFAIK, Nova and libvirt are both operating correctly.
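
For anyone repeating this analysis, a minimal sketch (assuming the libvirtd debug log format quoted above) that pulls the QEMU monitor traffic out of the log so the query-migrate / STOP / RESUME sequence is easy to follow:

-----
import re
import sys

# matches the QEMU_MONITOR_SEND_MSG / QEMU_MONITOR_RECV_REPLY /
# QEMU_MONITOR_RECV_EVENT lines as they appear in the excerpts above
PATTERN = re.compile(
    r'^(?P<ts>.+?): \d+: debug : '
    r'qemuMonitor\w+:\d+ : '
    r'(?P<kind>QEMU_MONITOR_(?:SEND_MSG|RECV_REPLY|RECV_EVENT)): '
    r'.*?(?:msg|reply|event)=(?P<payload>\{.*\})')

def monitor_traffic(path):
    with open(path) as log:
        for line in log:
            m = PATTERN.match(line)
            if m:
                yield m.group('ts'), m.group('kind'), m.group('payload')

if __name__ == '__main__':
    for ts, kind, payload in monitor_traffic(sys.argv[1]):
        print('%s %-22s %s' % (ts, kind, payload[:120]))
-----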

Revision history for this message
Matt Riedemann (mriedem) wrote :

This fails primarily on OVH nodes, which are the slowest ones we have in the community infra system, so that indicates there is a speed component to this.

description: updated
Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

Looking at today's failures, which involve a much more modern qemu, we've got a similar case:

http://logs.openstack.org/55/328055/12/check/gate-tempest-dsvm-multinode-live-migration/e4d241f/logs/subnode-2/libvirt/libvirtd.txt.gz#_2016-06-28_12_51_58_804
http://logs.openstack.org/55/328055/12/check/gate-tempest-dsvm-multinode-live-migration/e4d241f/logs/libvirt/qemu/instance-00000003.txt.gz

a) 12:51:58.804+0000: receives a query-migrate reply saying all is fine
b) 12:51:58.908+0000: receives a STOP from the source qemu saying it's nearly finished
c) 12:52:01.358+0000: ! receives a query-migrate reply saying migration has failed
d) 12:52:01.377760Z qemu-system-x86_64: load of migration failed: Input/output error
e) 12:52:01.360+0000: "event": "RESUME"
f) 12:52:03.136445Z qemu-system-x86_64: terminating on signal 15 from pid 18359

(d) comes from the destination.
OK, so migration failed - but we can't really tell who blinked first; the destination gives the I/O error probably because the source migration went away from its point of view (probably!), either due to a network issue or the source dying. Given that we see the migration fail slightly earlier on the source, it's probably the source dying by itself - but unfortunately there is no error in the source qemu log before it quits.

We need some debugging in the version of qemu that's run - unfortunately the Ubuntu qemu packages don't seem to have any of qemu's tracing turned on either.
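
One way to eyeball the "who blinked first" ordering is to merge the timestamped lines from the source and destination logs into a single stream. A rough sketch, assuming UTC timestamps at the start of each line as in the excerpts above:

-----
import re
import sys

TS = re.compile(r'^(\d{4}-\d{2}-\d{2})[ T]([\d:.]+)')

def stamped(path, label):
    with open(path) as f:
        for line in f:
            m = TS.match(line)
            if m:
                # normalise "YYYY-MM-DDTHH:MM:SS" and "YYYY-MM-DD HH:MM:SS"
                # to one sortable key (both logs are in UTC)
                yield '%s %s' % (m.group(1), m.group(2)), label, line.rstrip()

if __name__ == '__main__':
    src, dst = sys.argv[1], sys.argv[2]
    for key, label, line in sorted(list(stamped(src, 'SRC')) + list(stamped(dst, 'DST'))):
        print('%s %s' % (label, line))
-----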

Revision history for this message
Daniel Berrange (berrange) wrote :

This change attempts to replace the system-provided QEMU with a custom build containing some debug hacks to try to gather more data:

https://review.openstack.org/#/c/335549/

Revision history for this message
Daniel Berrange (berrange) wrote :

We now have a job which has failed with the custom QEMU present

http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/

See the source QEMU log:

http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/logs/subnode-2/libvirt/qemu/instance-00000003.txt.gz

migration_completion: global_state_store ret=0
migration_completion: vm_stop_force_state ret=-5

This message with ret=-5 is reporting the return value of the 'vm_stop' API call, which in turn reports do_vm_stop(), which in turn reports the value from bdrv_flush_all(). The latter reports errno values, and -5 corresponds to EIO.

So, we're failing the migration switch over because the block driver was unable to flush outstanding I/O for some unknown reason.
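
A quick sanity check of that errno mapping:

-----
import errno
import os

ret = -5                        # value reported by vm_stop_force_state above
print(errno.errorcode[-ret])    # EIO
print(os.strerror(-ret))        # Input/output error
-----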

Revision history for this message
Daniel Berrange (berrange) wrote :

The failure from QEMU happens shortly before 14:28:15

If we look in syslog at this same time we see I/O errors:

http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/logs/subnode-2/syslog.txt.gz#_Jun_30_14_28_12

Jun 30 14:28:11 ubuntu-xenial-2-node-ovh-gra1-2121639-75684 kernel: connection1:0: detected conn error (1020)
Jun 30 14:28:12 ubuntu-xenial-2-node-ovh-gra1-2121639-75684 iscsid[537]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Jun 30 14:28:15 ubuntu-xenial-2-node-ovh-gra1-2121639-75684 kernel: blk_update_request: I/O error, dev sda, sector 0
Jun 30 14:28:15 ubuntu-xenial-2-node-ovh-gra1-2121639-75684 iscsid[537]: connection1:0 is operational after recovery (1 attempts)
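
A rough sketch for automating that correlation - scan a syslog file for iSCSI/block errors within a window around the failure time (the timestamp format matches the excerpt; the window size and year are assumptions):

-----
from datetime import datetime, timedelta
import sys

KEYWORDS = ('iSCSI', 'ISCSI_ERR', 'conn error', 'I/O error', 'blk_update_request')

def syslog_errors(path, around, window=timedelta(seconds=30), year=2016):
    lo, hi = around - window, around + window
    with open(path) as f:
        for line in f:
            try:
                # syslog lines start with e.g. "Jun 30 14:28:11"
                ts = datetime.strptime('%d %s' % (year, line[:15]),
                                       '%Y %b %d %H:%M:%S')
            except ValueError:
                continue
            if lo <= ts <= hi and any(k in line for k in KEYWORDS):
                yield line.rstrip()

if __name__ == '__main__':
    failure = datetime(2016, 6, 30, 14, 28, 15)   # time of the QEMU failure above
    for line in syslog_errors(sys.argv[1], failure):
        print(line)
-----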

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

Yes, the same is true of yesterday's failure:

http://logs.openstack.org/55/328055/12/check/gate-tempest-dsvm-multinode-live-migration/e4d241f/logs/syslog.txt.gz

Jun 28 12:51:56 ubuntu-xenial-2-node-ovh-bhs1-2053574 iscsid[507]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)

(There is a libvirt assert shortly after that, but it looks like iscsi falls over first:
Jun 28 12:52:01 ubuntu-xenial-2-node-ovh-bhs1-2053574 libvirtd[1769]: libvirtd: malloc.c:3723: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.)

Revision history for this message
Daniel Berrange (berrange) wrote :

The syslog on the host running the target QEMU also sees an iscsi error a few seconds before the source host got one

http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/logs/syslog.txt.gz

Jun 30 14:28:08 ubuntu-xenial-2-node-ovh-gra1-2121639 kernel: connection1:0: detected conn error (1020)
Jun 30 14:28:09 ubuntu-xenial-2-node-ovh-gra1-2121639 iscsid[525]: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)

Interestingly, this host is the one actually running the tgtd iscsi server and cinder

Revision history for this message
Daniel Berrange (berrange) wrote :

Interestingly, we also see these iscsi connection errors on CI runs which don't fail. It looks like they're happening at a slightly different time though, so QEMU is lucky enough not to be doing I/O at that moment and thus doesn't fail.

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

Dan: On both sides? Of course that's the optimistic view; the pessimistic view
is that it did the I/O and the error didn't get reported.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

This could be due to both nodes using the same initiator (/etc/iscsi/initiatorname.iscsi). I can see that the src host is using iqn.1993-08.org.debian:01:d492516748 from the n-cpu.log [1], but I can't seem to find a way of confirming what the dest is using.

[1] http://logs.openstack.org/03/335903/1/check/gate-tempest-dsvm-multinode-live-migration/5bae103/logs/subnode-2/screen-n-cpu.txt.gz?#_2016-06-30_14_27_53_898
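
As an illustration, a minimal sketch for comparing the initiator names across hosts (you would run this, or the equivalent grep, on both the primary node and subnode-2):

-----
def initiator_name(path='/etc/iscsi/initiatorname.iscsi'):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('InitiatorName='):
                return line.split('=', 1)[1]
    return None

# run on each host; if both print the same IQN the two hosts are sharing an
# initiator identity, which is the suspected problem here
print(initiator_name())
-----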

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Lee points out on IRC (#openstack-nova) that the Ubuntu Xenial machines have a pre-populated 'InitiatorName':

-----
$ wget https://cloud-images.ubuntu.com/xenial/current/xenial-server-cloudimg-amd64-disk1.img
$ virt-cat -a xenial-server-cloudimg-amd64-disk1.img /etc/iscsi/initiatorname.iscsi
## DO NOT EDIT OR REMOVE THIS FILE!
## If you remove this file, the iSCSI daemon will not start.
## If you change the InitiatorName, existing access control lists
## may reject this initiator. The InitiatorName must be unique
## for each iSCSI initiator. Do NOT duplicate iSCSI InitiatorNames.
InitiatorName=iqn.1993-08.org.debian:01:56aff7b418d
-----

And that the name ought to be auto-generated, so that each host gets a unique one, by providing "GenerateName=yes" (Source: https://wiki.debian.org/SAN/iSCSI/open-iscsi)
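
As an illustration only (the proper fix is the Debian GenerateName=yes mechanism, or running iscsi-iname, at image build/boot time), a sketch of what giving each node a unique InitiatorName might look like - the suffix format below is purely illustrative:

-----
import uuid

# illustrative suffix only; iscsi-iname is the canonical generator
unique_iqn = 'iqn.1993-08.org.debian:01:%s' % uuid.uuid4().hex[:12]

with open('/etc/iscsi/initiatorname.iscsi', 'w') as f:
    f.write('## Auto-generated unique initiator name\n')
    f.write('InitiatorName=%s\n' % unique_iqn)

print(unique_iqn)
-----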

Revision history for this message
Lee Yarwood (lyarwood) wrote :

I've added two more DEBUG statements to capture both the connection_info dict used by connect_volume and each host's connectors/initiators, using danpb's topic below:

https://review.openstack.org/#/q/topic:mig-debug

Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe we should try using in-qemu iscsi to see if we still have the same issue, or at least see if it fails with a better error message. The package is available for xenial, which the live migration job is using now:

http://packages.ubuntu.com/xenial/qemu-block-extra

The qemu native iscsi support was added to nova in kilo:

https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/qemu-built-in-iscsi-initiator.html

Otherwise I'd say let's skip the volume-backed live migration test(s) in the live migration job so we can try and stabilize that job (the tests would still run in the non-voting multinode job).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/353002
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2f68cd3e9f47ca5bb864c26c1b0b123918197ef4
Submitter: Jenkins
Branch: master

commit 2f68cd3e9f47ca5bb864c26c1b0b123918197ef4
Author: Timofey Durakov <email address hidden>
Date: Tue Aug 9 19:35:23 2016 +0300

    Skipping test_volume_backed_live_migration for live_migration job

    For non-shared storage live-migration job is unstable due to
    https://bugs.launchpad.net/nova/+bug/1524898. This patch temporary
    exclude this test from tests to be run.

    Related-Bug: #1524898

    Change-Id: I87559b8e90855837ad7b5f291bd9a8fc646421d3

Revision history for this message
Matt Riedemann (mriedem) wrote :

We're testing against the Ubuntu Cloud Archive on xenial nodes now, so we have libvirt 2.5.0 and qemu 2.8. We should re-enable the test_volume_backed_live_migration test to see if anything has improved.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/459321

Sean Dague (sdague)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

We're now running against pike UCA for queens changes:

http://logs.openstack.org/63/508163/8/check/legacy-tempest-dsvm-multinode-live-migration/ad23f3d/logs/dpkg-l.txt.gz

ii libvirt-bin 3.6.0-1ubuntu5~cloud0 amd64 programs for the libvirt library

ii qemu-system 1:2.10+dfsg-0ubuntu3~cloud0 amd64 QEMU full system emulation binaries

So maybe another attempt at enabling volume-backed live migration testing is in order.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/536105

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.openstack.org/536105
Reason: Missed the changes already posted for this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.openstack.org/536105

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/459321

Changed in nova:
assignee: Lee Yarwood (lyarwood) → melanie witt (melwitt)
Revision history for this message
Lee Yarwood (lyarwood) wrote :

nova-multinode-live-migration is going to start running these again so I think we can close this out now?

https://zuul.opendev.org/t/openstack/build/7b1e7671fd42449088293faecf0e0da7

Lee Yarwood (lyarwood)
Changed in nova:
status: In Progress → Invalid
assignee: melanie witt (melwitt) → nobody
