(libvirt) live migration fails on source host due to "Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed."

Bug #1706377 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Seeing this here:

http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/new/screen-n-cpu.txt.gz#_2017-07-24_16_41_40_410

2017-07-24 16:41:40.410 31027 ERROR nova.compute.manager [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Unexpected error during post live migration at destination host.: InstanceNotFound: Instance 39285c11-48d7-4ae6-be42-5b49826ea380 could not be found.
2017-07-24 16:41:40.411 31027 DEBUG nova.compute.manager [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Checking state _get_power_state /opt/stack/new/nova/nova/compute/manager.py:1142
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] Exception during message handling: InstanceNotFound: Instance 39285c11-48d7-4ae6-be42-5b49826ea380 could not be found.
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/exception_wrapper.py", line 76, in wrapped
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server self.force_reraise()
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/exception_wrapper.py", line 67, in wrapped
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/utils.py", line 864, in decorated_function
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 199, in decorated_function
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 5713, in post_live_migration_at_destination
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server 'destination host.', instance=instance)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server self.force_reraise()
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 5708, in post_live_migration_at_destination
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server block_device_info)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 7117, in post_live_migration_at_destination
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server guest = self._host.get_guest(instance)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 534, in get_guest
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return libvirt_guest.Guest(self.get_domain(instance))
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 555, in get_domain
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server raise exception.InstanceNotFound(instance_id=instance.uuid)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server InstanceNotFound: Instance 39285c11-48d7-4ae6-be42-5b49826ea380 could not be found.
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server

This is in a Pike CI job.

It fails here:

https://github.com/openstack/nova/blob/87a0143267743e884c60f3c93f80d8fdea441322/nova/virt/libvirt/driver.py#L7117

The TODO after that line suggests we should be able to expect that the guest domain has already been persisted on the destination host by libvirtd at this point.
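
For context, the get_guest() call that raises here essentially asks the destination libvirtd whether it knows about the domain. A standalone equivalent of that lookup, using the raw libvirt python bindings rather than Nova's Host/Guest wrappers, would look roughly like the sketch below (illustrative only; the instance name is hypothetical):

# Illustrative sketch of the destination-side lookup that fails in this bug,
# using the raw libvirt python bindings rather than Nova's wrappers.
import libvirt

def domain_exists(instance_name, uri='qemu:///system'):
    """Return True if libvirtd on this host knows about the domain."""
    conn = libvirt.open(uri)
    try:
        conn.lookupByName(instance_name)  # raises libvirtError if not found
        return True
    except libvirt.libvirtError:
        # This is the situation in the traceback above: the migration was
        # reported complete on the source, but qemu crashed, so the domain
        # never landed on the destination.
        return False
    finally:
        conn.close()

if __name__ == '__main__':
    print(domain_exists('instance-00000011'))  # hypothetical domain name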

I'm pretty sure this is a duplicate, but in this case something actually fails on the source side once we consider the live migration complete, so the guest never actually ends up on the destination host.

This is the resulting error back on the source host when post live migration at destination fails:

http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/subnode-2/old/screen-n-cpu.txt.gz#_2017-07-24_16_41_40_557

And right before the source calls post live migration at destination, I see this error in the source host logs:

http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/subnode-2/old/screen-n-cpu.txt.gz#_2017-07-24_16_41_39_717

2017-07-24 16:41:39.717 19889 ERROR nova.virt.libvirt.driver [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Live Migration failure: Unable to read from monitor: Connection reset by peer
2017-07-24 16:41:39.718 19889 DEBUG nova.virt.libvirt.driver [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Migration operation thread notification thread_finished /opt/stack/old/nova/nova/virt/libvirt/driver.py:6527
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 168, in _do_send
    waiter.switch(result)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/old/nova/nova/utils.py", line 1087, in context_wrapper
    return func(*args, **kwargs)
  File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 6179, in _live_migration_operation
    instance=instance)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 6172, in _live_migration_operation
    bandwidth=CONF.libvirt.live_migration_bandwidth)
  File "/opt/stack/old/nova/nova/virt/libvirt/guest.py", line 623, in migrate
    destination, params=params, flags=flags)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1674, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer

And right before that, it looks like the driver considered the live migration complete:

http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/subnode-2/old/screen-n-cpu.txt.gz#_2017-07-24_16_41_39_702

2017-07-24 16:41:39.702 19889 INFO nova.virt.libvirt.driver [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Migration operation has completed
2017-07-24 16:41:39.702 19889 INFO nova.compute.manager [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] _post_live_migration() is started..
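
For context on that "Migration operation has completed" message: the monitor loop in the driver infers completion from the libvirt domain job state, roughly along the lines of the simplified sketch below (not the actual Nova monitor code; it ignores progress tracking, timeouts and abort handling):

# Simplified sketch of polling the libvirt job state during a live migration.
import time

import libvirt

def wait_for_migration_job(dom, poll_interval=0.5):
    """Block until the migration job on `dom` is no longer running."""
    # `dom` would be the virDomain being migrated, e.g. conn.lookupByName(...)
    while True:
        info = dom.jobInfo()  # [jobType, timeElapsed, ...]
        if info[0] == libvirt.VIR_DOMAIN_JOB_NONE:
            # No job any more: the migration either completed or failed.
            # The gap between this point and migrateToURI3() actually
            # returning is where the qemu crash reports itself, which is
            # why "completed" gets logged ~15ms before the failure above.
            return
        time.sleep(poll_interval)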

And the failure is in the qemu domain log:

http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/subnode-2/libvirt/qemu/instance-00000011.txt.gz

2017-07-24 16:41:39.286+0000: initiating migration
qemu-system-x86_64: /build/qemu-orucB6/qemu-2.8+dfsg/block/io.c:1514: bdrv_co_pwritev: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
2017-07-24 16:41:39.699+0000: shutting down, reason=crashed

That assertion failure was actually reported back in April in bug 1685340. From Kashyap's analysis there, it sounded like something was trying to write to the image on the source host while it was being migrated.

I'm not sure whether there is something we can do on the source host to detect this or to work around it.
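
As far as detecting it after the fact goes, the only clear fingerprint is in the per-domain qemu log on the source host, so something like the quick sketch below can at least flag affected instances (it assumes the default log directory /var/log/libvirt/qemu/; adjust for your deployment):

# Scan per-domain qemu logs for the assertion signature from this bug.
import glob

SIGNATURE = "Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed"

def find_crashed_domains(log_dir='/var/log/libvirt/qemu'):
    hits = []
    for path in glob.glob('%s/*.log' % log_dir):
        with open(path, errors='replace') as f:
            if SIGNATURE in f.read():
                hits.append(path)
    return hits

if __name__ == '__main__':
    for path in find_crashed_domains():
        print(path)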

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :
summary: (libvirt) Unexpected error during post live migration at destination
- host.: InstanceNotFound
+ host.: InstanceNotFound - Assertion `!(bs->open_flags &
+ BDRV_O_INACTIVE)' failed.
summary: - (libvirt) Unexpected error during post live migration at destination
- host.: InstanceNotFound - Assertion `!(bs->open_flags &
- BDRV_O_INACTIVE)' failed.
+ (libvirt) live migration fails on source host due to "Assertion
+ `!(bs->open_flags & BDRV_O_INACTIVE)' failed."
Revision history for this message
Matt Riedemann (mriedem) wrote :

Similar issue here:

http://logs.openstack.org/10/490110/2/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/6c1da1c/logs/subnode-2/libvirt/qemu/instance-00000003.txt.gz

/build/qemu-orucB6/qemu-2.8+dfsg/nbd/server.c:nbd_co_receive_request():L1135: reading from socket failed

Revision history for this message
Adam Vinsh (adam-vinsh) wrote :

Hey all.
Just started up a big live-migration project in our Pike-based cluster. Testing showed that migrations were working well. However, after a batch of about 200 instances, 3 have now lost customer data. This manifests as a migration going into an error state, as described in this bug. We then re-try the migration to the same destination and it completes, but the customer then reports that all of their data has been wiped back to the base image they first deployed months ago on the affected instances. My guess is that the local disk image isn't fully copied over before it's deleted at the source, so nova just boots from the base image.

Indeed in libvirt logs for these failures is this signature:
bdrv_co_pwritev: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.

Does anyone working on this have any current info on how we can fix this in our Pike cluster? The same issue hasn't happened in the Queens cluster. We can't yet upgrade the Pike cluster for various reasons, and working through those reasons itself requires a bunch of live migration to begin with.

-Adam

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Adam, I'm not aware of any nova changes since Pike that have resolved this issue, but I can say anecdotally that I don't think we're seeing this in the upstream CI system anymore since we started using newer versions of libvirt/qemu in our CI jobs (I'd probably have to dig into the details of when this was reported and which libvirt/qemu versions were in use versus when we stopped seeing it, since that was likely the result of using a newer Ubuntu Cloud Archive release or Ubuntu Bionic nodes).

It would be helpful if you posted which versions of libvirt/qemu you're using on your pike and queens clusters so we could compare with what we're using in the upstream CI jobs for stable/pike and stable/queens (and beyond).
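
As a side note, the libvirt and hypervisor (qemu/kvm) versions can also be read straight from the libvirt API on each compute host, which makes comparing the clusters with the CI nodes easy (a quick sketch using the python bindings, run per host):

# Print the libvirt and qemu/kvm versions on this compute host via the
# libvirt python bindings; libvirt encodes versions as
# major * 1,000,000 + minor * 1,000 + release.
import libvirt

def _format(ver):
    return '%d.%d.%d' % (ver // 1000000, (ver % 1000000) // 1000, ver % 1000)

def main():
    conn = libvirt.open('qemu:///system')
    try:
        print('libvirt: %s' % _format(conn.getLibVersion()))
        print('qemu/kvm: %s' % _format(conn.getVersion()))
    finally:
        conn.close()

if __name__ == '__main__':
    main()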

Revision history for this message
Adam Vinsh (adam-vinsh) wrote :

Thanks for the update Matt.
I figured it was some bad combination of those packages and nova.
We run the same version of libvirt/qemu in our Queens cluster, which hasn't seen the problem that happens in the Pike cluster.

Many customers in our cloud use very heavy API hitters such as Terraform, and the versions of libvirt in Xenial don't have the fixes needed to keep libvirt from constantly crashing under those use cases. So we had to pull in the Bionic version of libvirt, because 16.04 doesn't have all the backports. See: https://bugs.launchpad.net/cloud-archive/+bug/1789659

Ubuntu 16.04
4.4.0-150-generic #176-Ubuntu SMP Wed May 29 18:56:26 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

libvirt-bin 4.0.0-1ubuntu8
libvirt-clients 4.0.0-1ubuntu8
libvirt-daemon 4.0.0-1ubuntu8
libvirt-daemon-driver-storage-rbd 4.0.0-1ubuntu8
libvirt-daemon-system 4.0.0-1ubuntu8
libvirt0:amd64 4.0.0-1ubuntu8
python-libvirt 4.0.0-1

ipxe-qemu 1.0.0+git-20180124.fbe8c52d-0ubuntu2.2~cloud0
ipxe-qemu-256k-compat-efi-roms 1.0.0+git-20150424.a25a16d-0ubuntu2~cloud0 all
qemu 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-block-extra:amd64 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-kvm 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-slof 20151103+dfsg-1ubuntu1.1 all
qemu-system 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-system-arm 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-system-common 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-system-mips 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-system-misc 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-system-ppc 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-system-s390x 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64
qemu-system-sparc 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64 QEMU full system emulation binaries (sparc)
qemu-system-x86 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64 QEMU full system emulation binaries (x86)
qemu-user 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64 QEMU user mode emulation binaries
qemu-user-binfmt 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64 QEMU user mode binfmt registration for qemu-user
qemu-utils 1:2.11+dfsg-1ubuntu7.13~cloud0 amd64 QEMU utilities

Revision history for this message
Adam Vinsh (adam-vinsh) wrote :

Correction:
The Pike cluster with the issue has qemu version 1:2.10+dfsg-0ubuntu3.8~cloud1.
The Queens cluster without the issue has qemu version 1:2.11+dfsg-1ubuntu7.13~cloud0.
