virDomainGetBlockJobInfo fails during swap_volume as disk '$disk' not found in domain

Bug #1929710 reported by Lee Yarwood
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Triaged
Medium
Unassigned

Bug Description

Description
===========

The error handling around swap_volume is missing the following failure when calling virDomainGetBlockJobInfo() after the entire device is detached by QEMU (?) after it encounters a failure during the block copy job that at first pauses and then somehow resumes:

https://8a5fc27780098c5ee1bc-3ac81d180a9c011938b2cbb0293272f3.ssl.cf5.rackcdn.com/790660/5/gate/nova-next/e915ed4/controller/logs/screen-n-cpu.txt

May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver [None req-7cfcd661-29d4-4cc3-bc54-db0e7fed1a6e tempest-TestVolumeSwap-1841575704 tempest-TestVolumeSwap-1841575704-project-admin] Failure rebasing volume /dev/sdb on vdb.: libvirt.libvirtError: invalid argument: disk 'vdb' not found in domain
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver Traceback (most recent call last):
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2107, in _swap_volume
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver while not dev.is_job_complete():
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 800, in is_job_complete
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver status = self.get_job_info()
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 707, in get_job_info
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver status = self._guest._domain.blockJobInfo(self._disk, flags=0)
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 190, in doit
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver result = proxy_call(self._autowrap, f, *args, **kwargs)
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 148, in proxy_call
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver rv = execute(f, *args, **kwargs)
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 129, in execute
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver six.reraise(c, e, tb)
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver raise value
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 83, in tworker
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver rv = meth(*args, **kwargs)
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages/libvirt.py", line 985, in blockJobInfo
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver raise libvirtError('virDomainGetBlockJobInfo() failed')
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver libvirt.libvirtError: invalid argument: disk 'vdb' not found in domain
May 26 09:49:47.314813 ubuntu-focal-vexxhost-ca-ymq-1-0024823853 nova-compute[114649]: ERROR nova.virt.libvirt.driver

https://zuul.opendev.org/t/openstack/build/e915ed4aeb9346bba83910bd79e9502b/log/controller/logs/libvirt/libvirtd_log.txt

2021-05-26 09:49:40.189+0000: 79419: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7fc4bc07e7d0 msg={"execute":"blockdev-add","arguments":{"node-name":"libvirt-4-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-4-storage"},"id":"libvirt-375"}^M

2021-05-26 09:49:46.154+0000: 79422: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7fc4bc07e7d0 msg={"execute":"blockdev-add","arguments":{"node-name":"libvirt-5-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-5-storage"},"id":"libvirt-379"}^M

2021-05-26 09:49:46.165+0000: 79422: debug : qemuMonitorBlockdevMirror:3112 : jobname=copy-vdb-libvirt-4-format, persistjob=1, device=libvirt-4-format, target=libvirt-5-format, bandwidth=0, granularity=0, buf_size=0, shallow=0

2021-05-26 09:49:46.167+0000: 79417: debug : qemuProcessHandleJobStatusChange:1002 : job 'copy-vdb-libvirt-4-format'(domain: 0x7fc4b416b0e0,instance-0000000b) state changed to 'created'(1)

2021-05-26 09:49:46.167+0000: 79417: debug : qemuProcessHandleJobStatusChange:1002 : job 'copy-vdb-libvirt-4-format'(domain: 0x7fc4b416b0e0,instance-0000000b) state changed to 'running'(2)

2021-05-26 09:49:46.763+0000: 79417: debug : qemuProcessHandleJobStatusChange:1002 : job 'copy-vdb-libvirt-4-format'(domain: 0x7fc4b416b0e0,instance-0000000b) state changed to 'paused'(3)

2021-05-26 09:49:46.763+0000: 79417: debug : qemuProcessHandleJobStatusChange:1002 : job 'copy-vdb-libvirt-4-format'(domain: 0x7fc4b416b0e0,instance-0000000b) state changed to 'running'(2)

2021-05-26 09:49:46.841+0000: 79417: debug : qemuProcessHandleDeviceDeleted:1362 : Device virtio-disk1 removed from domain 0x7fc4b416b0e0 instance-0000000b

2021-05-26 09:49:47.457+0000: 79417: debug : qemuProcessHandleJobStatusChange:1002 : job 'copy-vdb-libvirt-4-format'(domain: 0x7fc4b416b0e0,instance-0000000b) state changed to 'aborting'(8)

2021-05-26 09:49:47.458+0000: 79417: debug : qemuProcessHandleJobStatusChange:1002 : job 'copy-vdb-libvirt-4-format'(domain: 0x7fc4b416b0e0,instance-0000000b) state changed to 'concluded'(9)

2021-05-26 09:49:47.459+0000: 79417: debug : qemuProcessHandleJobStatusChange:1002 : job 'copy-vdb-libvirt-4-format'(domain: 0x7fc4b416b0e0,instance-0000000b) state changed to 'null'(11)

Steps to reproduce
==================

$ cat queries/virDomainGetBlockJobInfo.yaml
query: >
 message:"virDomainGetBlockJobInfo() failed" AND
 tags:"screen-n-cpu.txt"

$ elastic-recheck-query queries/virDomainGetBlockJobInfo.yaml
total hits: 6
build_branch
  100% master
build_change
  50% 786588
  50% 792322
build_hostids
  50% 1b47a855be51bba01ac6d5e6fdc4859bc17ebe2c8faaeb83392f8ff3 79fb0487675c0137b7ac30f24b5de71c70afb836e46746de770fa0c0
  50% 33381c047c348ffefebf6b10cb7f0473c2359757d0bf11cc101eec54 33381c047c348ffefebf6b10cb7f0473c2359757d0bf11cc101eec54
build_name
  100% nova-next
build_node
  100% ubuntu-focal
build_queue
  100% check
build_status
  100% FAILURE
build_zuul_url
  100% N/A
filename
  100% controller/logs/screen-n-cpu.txt
log_url
  50% https://89bc735e8a094e3d60b7-4f6db7cd5400cfa66e1c80fde6bd4076.ssl.cf1.rackcdn.com/792322/1/check/nova-next/de697b4/controller/logs/screen-n-cpu.txt
  50% https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_035/786588/6/check/nova-next/0357703/controller/logs/screen-n-cpu.txt
loglevel
  100% ERROR
module
  33% nova.compute.manager
  33% nova.virt.libvirt.driver
  33% oslo_messaging.rpc.server
node_provider
  50% ovh-bhs1
  50% rax-iad
port
  50% 48014
  50% 58238
project
  100% openstack/nova
syslog_pid
  50% 107528
  50% 108261
syslog_program
  50% ubuntu-focal-ovh-bhs1-0024748800 nova-compute
  50% ubuntu-focal-rax-iad-0024745546 nova-compute
tags
  100% screen-n-cpu.txt screen oslofmt
voting
  100% 1
zuul_attempts
  100% 1
zuul_executor
  50% ze01.opendev.org
  50% ze07.opendev.org

Expected result
===============
swap_volume at least fails correctly leaving the original device attached.

Actual result
=============
swap_volume fails and the original device appears detached from the device.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   master

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   libvirt + QEMU (no KVM in the gate)

2. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   images_type=default=qcow2

3. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

Tags: gate-failure
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/793219

Revision history for this message
Lee Yarwood (lyarwood) wrote :

I've added the QEMU project directly to this bug to see if anyone can help us understand what the underlying block job failure is within QEMU and why it then appears to remove the entire device from the instance causing libvirt and Nova to fallover.

description: updated
Revision history for this message
melanie witt (melwitt) wrote :

Just a note that the QEMU project is moving their bug tracking to gitlab [1] and will automatically migrate launchpad bugs, but it might be more expedient to open an issue on their gitlab tracker.

[1] https://bugs.launchpad.net/qemu/+bug/1914282/comments/3

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/793219
Committed: https://opendev.org/openstack/nova/commit/d5ed968826895d362f4f2aa21decfdebb9b1fd84
Submitter: "Zuul (22348)"
Branch: master

commit d5ed968826895d362f4f2aa21decfdebb9b1fd84
Author: Lee Yarwood <email address hidden>
Date: Wed May 26 19:27:45 2021 +0100

    zuul: Skip swap_volume tests as part of nova-next

    The volume update or swap_volume API has long been a source of gate
    failures within Nova. Most recently we've seen increased instability
    when running the temptest.api.compute.admin.test_volume_swap tests as
    part of the nova-next job as documented in bug #1929710.

    This change temporarily removes the failing test from the nova-next job
    while the underlying issue is identified lower in the virt stack.

    Change-Id: Ib56a034fb08e309981d0b4553b8cee8d16b10152
    Related-Bug: #1929710

Lee Yarwood (lyarwood)
no longer affects: qemu
Changed in nova:
importance: Undecided → Medium
Lee Yarwood (lyarwood)
Changed in nova:
status: New → Triaged
assignee: nobody → Lee Yarwood (lyarwood)
Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → nobody
Revision history for this message
melanie witt (melwitt) wrote :

QEMU issue opened for this bug is at https://gitlab.com/qemu-project/qemu/-/issues/287

Revision history for this message
Alexey Stupnikov (astupnikov) wrote (last edit ):
Download full text (7.4 KiB)

This problem was reproduced in CI for one of my stable/wallaby patches:
https://zuul.opendev.org/t/openstack/build/3ae58b4f95d7442fb9110c4044acee2b/logs
https://1f24fdc3bf9479288b3a-225501c9cce0e8ba4bd0f4849071211d.ssl.cf1.rackcdn.com/888333/1/check/nova-next/3ae58b4/compute1/logs/libvirt/qemu/instance-00000011_log.txt
https://1f24fdc3bf9479288b3a-225501c9cce0e8ba4bd0f4849071211d.ssl.cf1.rackcdn.com/888333/1/check/nova-next/3ae58b4/compute1/logs/libvirt/libvirtd_log.txt
https://1f24fdc3bf9479288b3a-225501c9cce0e8ba4bd0f4849071211d.ssl.cf1.rackcdn.com/888333/1/check/nova-next/3ae58b4/compute1/logs/screen-n-cpu.txt
https://1f24fdc3bf9479288b3a-225501c9cce0e8ba4bd0f4849071211d.ssl.cf1.rackcdn.com/888333/1/check/nova-next/3ae58b4/job-output.txt

Extracts:

Jul 13 08:35:22.230036 np0034667036 nova-compute[54149]: WARNING nova.virt.libvirt.driver [None req-7222880d-af15-4a90-b16d-24bdf5763e09 None None] Received event <DeviceRemovedEvent: 1689237322.2269797, dd46f9ba-2352-4da6-be3a-ecf68757e3c0 => virtio-disk1> from libvirt but the driver is not waiting for it; ignored.
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver [None req-473da26b-0641-4cda-84ef-9e451e5eddb9 tempest-TestVolumeSwap-869256485 tempest-TestVolumeSwap-869256485-project-admin] Failure rebasing volume /dev/sda on vdb.: libvirt.libvirtError: invalid argument: disk 'vdb' not found in domain
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver Traceback (most recent call last):
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2128, in _swap_volume
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver while not dev.is_job_complete():
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 810, in is_job_complete
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver status = self.get_job_info()
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 717, in get_job_info
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver status = self._guest._domain.blockJobInfo(self._disk, flags=0)
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 190, in doit
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver result = proxy_call(self._autowrap, f, *args, **kwargs)
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 148, in proxy_call
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver rv = execute(f, *args, **kwargs)
Jul 13 08:35:22.242368 np0034667036 nova-compute[54149]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python3.8/dist-packages...

Read more...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.