libvirt.libvirtError: internal error: missing block job data for disk 'vda'

Bug #1901739 reported by Lee Yarwood
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Lee Yarwood
Ussuri
Undecided
Unassigned
Victoria
Undecided
Unassigned

Bug Description

Description
===========

nova-grenade-multinode has failed a number of times, due to tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_block_migration_paused test failures with the following logs:

https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_30d/759831/4/gate/nova-grenade-multinode/30d8eb1/job-output.txt

2020-10-27 16:59:26.667763 | primary | 2020-10-27 16:59:26.667 | tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_block_migration_paused[id-1e107f21-61b2-4988-8f22-b196e938ab88]
2020-10-27 16:59:26.669590 | primary | 2020-10-27 16:59:26.669 | ---------------------------------------------------------------------------------------------------------------------------------[..]
testtools.matchers._impl.MismatchError: 'ubuntu-bionic-rax-iad-0021082674' != 'ubuntu-bionic-rax-iad-0021082634': Live Migration failed. Migrations list for Instance

https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_30d/759831/4/gate/nova-grenade-multinode/30d8eb1/logs/screen-n-cpu.txt

Oct 27 16:59:14.123461 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: ERROR nova.virt.libvirt.driver [-] [instance: 4d564e22-8ba4-48fb-ac93-27bea660fd77] Live Migration failure: internal error: missing block job data for disk 'vda': libvirt.libvirtError: internal error: missing block job data for disk 'vda'
Oct 27 16:59:14.123702 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: DEBUG nova.virt.libvirt.driver [-] [instance: 4d564e22-8ba4-48fb-ac93-27bea660fd77] Migration operation thread notification {{(pid=9691) thread_finished /opt/stack/new/nova/nova/virt/libvirt/driver.py:9416}}
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: Traceback (most recent call last):
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: timer()
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: cb(*args, **kw)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/eventlet/event.py", line 175, in _do_send
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: waiter.switch(result)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/eventlet/greenthread.py", line 221, in main
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: result = function(*args, **kwargs)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/opt/stack/new/nova/nova/utils.py", line 661, in context_wrapper
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: return func(*args, **kwargs)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 9070, in _live_migration_operation
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: LOG.error("Live Migration failure: %s", e, instance=instance)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: self.force_reraise()
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: six.reraise(self.type_, self.value, self.tb)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: raise value
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 9063, in _live_migration_operation
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: bandwidth=CONF.libvirt.live_migration_bandwidth)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 681, in migrate
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: destination, params=params, flags=flags)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 190, in doit
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: result = proxy_call(self._autowrap, f, *args, **kwargs)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 148, in proxy_call
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: rv = execute(f, *args, **kwargs)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 129, in execute
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: six.reraise(c, e, tb)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: raise value
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 83, in tworker
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: rv = meth(*args, **kwargs)
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line 1941, in migrateToURI3
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: raise libvirtError('virDomainMigrateToURI3() failed')
Oct 27 16:59:14.135971 ubuntu-bionic-rax-iad-0021082634 nova-compute[9691]: libvirt.libvirtError: internal error: missing block job data for disk 'vda'

https://api.us-east.open-edge.io:8080/swift/v1/AUTH_e02c11e4e2c24efc98022353c88ab506/zuul_opendev_logs_30d/759831/4/gate/nova-grenade-multinode/30d8eb1/logs/libvirt/libvirtd.txt

2020-10-27 16:59:13.867+0000: 31617: debug : qemuMonitorJSONIOProcessEvent:183 : handle BLOCK_JOB_READY handler=0x7f9e065ffe70 data=0x55b1b50ec200
2020-10-27 16:59:13.867+0000: 31617: debug : qemuMonitorEmitBlockJob:1523 : mon=0x7f9e1804d570
2020-10-27 16:59:13.867+0000: 31617: debug : qemuProcessHandleBlockJob:944 : Block job for device drive-virtio-disk0 (domain: 0x7f9e18007920,instance-0000001e) type 2 status 3
2020-10-27 16:59:13.867+0000: 31617: debug : virObjectEventDispose:124 : obj=0x55b1b511a380
2020-10-27 16:59:13.867+0000: 31617: debug : virObjectEventDispose:124 : obj=0x55b1b5113060
2020-10-27 16:59:13.867+0000: 31620: debug : qemuDomainObjExitMonitorInternal:7655 : Exited monitor (mon=0x7f9e1804d570 vm=0x7f9e18007920 name=instance-0000001e)
2020-10-27 16:59:13.867+0000: 31620: debug : qemuDomainObjEndJob:7516 : Stopping job: async nested (async=migration out vm=0x7f9e18007920 name=instance-0000001e)
2020-10-27 16:59:13.867+0000: 31620: error : qemuMigrationSrcNBDStorageCopyReady:512 : internal error: missing block job data for disk 'vda'
2020-10-27 16:59:13.867+0000: 31620: debug : qemuMigrationSrcNBDCopyCancel:704 : Cancelling drive mirrors for domain instance-0000001e
2020-10-27 16:59:13.867+0000: 31620: debug : qemuMigrationSrcNBDCopyCancelled:619 : All disk mirrors are gone

Steps to reproduce
==================

Run nova-grenade-multinode.

Expected result
===============

tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_block_migration_paused succeeds.

Actual result
=============

tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_block_migration_paused fails.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   https://review.opendev.org/#/c/759831/

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   libvirt + QEMU

2. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

3. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

melanie witt (melwitt)
tags: added: gate-failure
Revision history for this message
melanie witt (melwitt) wrote :
Revision history for this message
Lee Yarwood (lyarwood) wrote :

This might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1713009

Talking to the libvirt/QEMU devs at the time the actual test of live migrating a paused instance back and forth (so it never runs on the dest before being migrated back to the src) is pointless. We might want to consider removing it from Tempest.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

So, this seems to be due to back-n-forth live migration (without shared storage) of a paused instance — I don't see what is the purpose of this (although, I can make up an imaginary rationale.)

That said, the error makes sense ... due to the missing block device data the 'drive-mirror' (which does the copy) got cancelled, and in turn the NBD copy got cancelled too.

Still digging the cause, though.

(Since the Gate logs might go away, I've collected the important source and destination libvirt/QEMU logs here: https://kashyapc.fedorapeople.org/LP-1901739/)

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

(Answering my own question, one use-case for the back-n-forth LM is: You migrate a paused a instance; it succeeds. But then you realize: "oh, I actually need to do some maintenance work on this destination, so I need to migrate this VM _back_ to the source host again".)

        - - -

The said libvirt error:

    "internal error: missing block job data for disk ..."

... was introduced in this libvirt commit:

    https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=103a4245ae
    — qemu: blockjob: Pass job into qemuBlockJobUpdateDisk and rename it

I'm checking with the libvirt developers for some more input. As this is related to libvirt Block Layer interface with NBD.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Lee,

On the potentionally related RHEL bug[1], I don't think that's the case here. As we've seen on IRC, there's no assertion failure in the QEMU instance logs (neither source, nor in destination).

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1713009 — [QEMU] Back-n-forth live migration of a paused VM results in QEMU crash with: "bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed."

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: Confirmed → In Progress
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

So, the source and destination libvirt are still using libvirt-5.4.0 (i.e. Ubuntu "Bionic"), which uses legacy "-drive" approach.

Peter from libvirt says libvirt-6.0.0 (which uses the modern "-blockdev" approach) should fix it ... and sure enough, Lee tells me on IRC that we have Zuul-v3 live migration job that is on libvirt-6.0.0, and is NOT hitting this error.

So we're going to flip this job ("nova-live-migration") as well to Ubuntu "Focal", which has libvirt-6.0.0.

Maybe CLOSED --> Fixed-In-Next-Release?

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Leave the bug open, I've referenced it in the change moving us over to Focal that I'll also land on stable/victoria to avoid us hitting the same issue there.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/752557
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4ac4a04d1843b0450e8d6d80189ce3e85253dcd0
Submitter: Zuul
Branch: master

commit 4ac4a04d1843b0450e8d6d80189ce3e85253dcd0
Author: Lee Yarwood <email address hidden>
Date: Thu Sep 17 20:59:39 2020 +0100

    zuul: Replace nova-live-migration with zuulv3 jobs

    This change removes the original nova-live-migration job and replaces it
    directly with the new Focal based zuulv3 native job.

    The nova-dsvm-multinode-base base job is no longer used and so also
    removed as part of this change.

    Note that this new nova-live-migration job does not yet contain any
    ceph coverage like the original, this is still pending and will be
    completed early in the W cycle.

    This change is being merged ahead of this to resolve bug #1901739, a
    known QEMU -drive issue caused by the previous jobs use of libvirt 5.4.0
    as provided by Ubuntu Bionic. The fix here being the migration to Ubuntu
    Focal based jobs and libvirt 6.0.0 that now defaults to using QEMU
    -blockdev.

    Closes-Bug: #1901739
    Change-Id: Ib342e2d3c395830b4667a60de7e492d3b9de2f0a

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
Lee Yarwood (lyarwood) wrote :

I've moved this back to in-progress as we had to reintroduce the bionic base job as it is still used by our grenade multinode job. https://review.opendev.org/#/c/742056/ will move this and allow us to close this out again.

Changed in nova:
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.1.0

This issue was fixed in the openstack/nova 22.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

Lee Yarwood (lyarwood)
Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/795433

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/795436

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/795433
Reason: Abandoning as this was squashed into the "supermegasquash" patch: https://review.opendev.org/c/openstack/nova/+/795432

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/nova/+/795432
Committed: https://opendev.org/openstack/nova/commit/341ba7aa175a0a082fec6e5360ae3afa2596ca95
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 341ba7aa175a0a082fec6e5360ae3afa2596ca95
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 6 11:00:49 2020 +0000

    [CI] Fix gate by using zuulv3 live migration and grenade jobs

    This patch is a combination of several legacy-to-zuulv3 job patches to
    unblock the gate: with the latest Ceph release the legacy grenade jobs
    started to fail with the following erros (back till ussuri):

    'Error EPERM: configuring pool size as 1 is disabled by default.'

    The patch contains almost a clean backport of the job configuration.

    Conflicts:
        gate/live_migration/hooks/run_tests.sh
        roles/run-evacuate-hook/tasks/main.yaml

    NOTE(melwitt): The conflict is because change
    I67255fa1b919a27e92028da95d71ddd4bf53edc1 (lower-constraints: Bump
    packaging to 20.4) is not in Ussuri.

    NOTE(lyarwood): An additional change was required to the
    run-evacuate-hook as we are now running against Bionic based hosts.
    These hosts only have a single libvirtd service running so stop and
    start only this during an evacuation run.

    List of included patches:

    1. zuul: Start to migrate nova-live-migration to zuulv3

    2. zuul: Replace nova-live-migration with zuulv3 jobs

       Closes-Bug: #1901739
       Change-Id: Ib342e2d3c395830b4667a60de7e492d3b9de2f0a
       (cherry picked from commit 4ac4a04d1843b0450e8d6d80189ce3e85253dcd0)
       (cherry picked from commit 478be6f4fbbbc7b05becd5dd92a27f0c4e8f8ef8)

    3. zuul: Replace grenade and nova-grenade-multinode with grenade-multinode

       Change-Id: I02b2b851a74f24816d2f782a66d94de81ee527b0
       (cherry picked from commit 91e53e4c2b90ea57aeac4ec522dd7c8c54961d09)
       (cherry picked from commit c45bedd98d50af865d727b7456c974c8e27bff8b)
       (cherry picked from commit 2af08fb5ead8ca1fa4d6b8ea00f3c5c3d26e562c)

    Change-Id: Ibbb3930a6e629e93a424b3ae048f599f11923be3
    (cherry picked from commit 1c733d973015999ee692ed48fb10a282c50fdc49)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "melanie witt <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/795435
Reason: Trying https://review.opendev.org/c/openstack/nova/+/795435 instead.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "melanie witt <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/795436
Reason: Trying https://review.opendev.org/c/openstack/nova/+/795435 instead.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 21.2.2

This issue was fixed in the openstack/nova 21.2.2 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.