Second DEVICE_DELETED event missing during virtio-blk disk device detach

Bug #1894804 reported by Lee Yarwood on 2020-09-08
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
QEMU
Undecided
Unassigned
qemu (Ubuntu)
Undecided
Unassigned

Bug Description

We are in the process of moving OpenStack CI across to use 20.04 Focal as the underlying OS and encountering the following issue in any test attempting to detach disk devices from running QEMU instances.

We can see a single DEVICE_DELETED event raised to libvirtd for the /machine/peripheral/virtio-disk1/virtio-backend device but we do not see a second event for the actual disk. As a result the device is still marked as present in libvirt but QEMU reports it as missing in subsequent attempts to remove the device.

The following log snippets can also be found in the following pastebin that's slightly easier to gork:

http://paste.openstack.org/show/797564/

https://review.opendev.org/#/c/746981/ libvirt: Bump MIN_{LIBVIRT,QEMU}_VERSION and NEXT_MIN_{LIBVIRT,QEMU}_VERSION

https://zuul.opendev.org/t/openstack/build/4c56def513884c5eb3ba7b0adf7fa260 nova-ceph-multistore

https://zuul.opendev.org/t/openstack/build/4c56def513884c5eb3ba7b0adf7fa260/log/controller/logs/dpkg-l.txt

ii libvirt-daemon 6.0.0-0ubuntu8.3 amd64 Virtualization daemon
ii libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.3 amd64 Virtualization daemon QEMU connection driver
ii libvirt-daemon-system 6.0.0-0ubuntu8.3 amd64 Libvirt daemon configuration files
ii libvirt-daemon-system-systemd 6.0.0-0ubuntu8.3 amd64 Libvirt daemon configuration files (systemd)
ii libvirt-dev:amd64 6.0.0-0ubuntu8.3 amd64 development files for the libvirt library
ii libvirt0:amd64 6.0.0-0ubuntu8.3 amd64 library for interfacing with different virtualization systems
[..]
ii qemu-block-extra:amd64 1:4.2-3ubuntu6.4 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-slof 20191209+dfsg-1 all Slimline Open Firmware -- QEMU PowerPC version
ii qemu-system 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries
ii qemu-system-arm 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries (arm)
ii qemu-system-common 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-data 1:4.2-3ubuntu6.4 all QEMU full system emulation (data files)
ii qemu-system-mips 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries (mips)
ii qemu-system-misc 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries (miscellaneous)
ii qemu-system-ppc 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries (ppc)
ii qemu-system-s390x 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries (s390x)
ii qemu-system-sparc 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries (sparc)
ii qemu-system-x86 1:4.2-3ubuntu6.4 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:4.2-3ubuntu6.4 amd64 QEMU utilities

https://zuul.opendev.org/t/openstack/build/4c56def513884c5eb3ba7b0adf7fa260/log/controller/logs/libvirt/qemu/instance-0000003a_log.txt

2020-09-07 19:29:55.021+0000: starting up libvirt version: 6.0.0, package: 0ubuntu8.3 (Marc Deslauriers <email address hidden> Thu, 30 Jul 2020 06:40:28 -0400), qemu version: 4.2.0Debian 1:4.2-3ubuntu6.4, kernel: 5.4.0-45-generic, hostname: ubuntu-focal-ovh-bhs1-0019682147
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
HOME=/var/lib/libvirt/qemu/domain-86-instance-0000003a \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-86-instance-0000003a/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-86-instance-0000003a/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-86-instance-0000003a/.config \
QEMU_AUDIO_DRV=none \
/usr/bin/qemu-system-x86_64 \
-name guest=instance-0000003a,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-86-instance-0000003a/master-key.aes \
-machine pc-i440fx-4.2,accel=tcg,usb=off,dump-guest-core=off \
-cpu qemu64 \
-m 128 \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 0d59f238-daef-40d4-adf9-a4fa24c35231 \
-smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=21.1.0,serial=0d59f238-daef-40d4-adf9-a4fa24c35231,uuid=0d59f238-daef-40d4-adf9-a4fa24c35231,family=Virtual Machine' \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=39,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-object secret,id=libvirt-3-storage-secret0,data=zT+XibedVJZM2du1+PXpIXHMVJ9a0pVcKihOtCGwlB0=,keyid=masterKey0,iv=536Lfw+nsyvDhFBTOQG4zA==,format=base64 \
-blockdev '{"driver":"rbd","pool":"vms","image":"0d59f238-daef-40d4-adf9-a4fa24c35231_disk","server":[{"host":"158.69.70.115","port":"6789"}],"user":"cinder","auth-client-required":["cephx","none"],"key-secret":"libvirt-3-storage-secret0","node-name":"libvirt-3-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-3-format","read-only":false,"cache":{"direct":false,"no-flush":false},"driver":"raw","file":"libvirt-3-storage"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-3-format,id=virtio-disk0,bootindex=1,write-cache=on \
-object secret,id=libvirt-2-storage-secret0,data=SO9AgCCTvkBBMYHZe+LVzoCF4GUNgvBtkFwRRIji7WI=,keyid=masterKey0,iv=MzGu/h2Api4mMG9lL8hvdg==,format=base64 \
-blockdev '{"driver":"rbd","pool":"volumes","image":"volume-04dd79b2-3c05-4492-b1d7-7969d24df768","server":[{"host":"158.69.70.115","port":"6789"}],"user":"cinder","auth-client-required":["cephx","none"],"key-secret":"libvirt-2-storage-secret0","node-name":"libvirt-2-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","cache":{"direct":false,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=libvirt-2-format,id=virtio-disk1,write-cache=on,serial=04dd79b2-3c05-4492-b1d7-7969d24df768 \
-object secret,id=libvirt-1-storage-secret0,data=lhbR9+ewiXiaf3dKoQWP3bk6hlLMLRXnbhh9ZkjZ9dQ=,keyid=masterKey0,iv=WWHpGuOHkwXqxlLxGUqpcA==,format=base64 \
-blockdev '{"driver":"rbd","pool":"vms","image":"0d59f238-daef-40d4-adf9-a4fa24c35231_disk.config","server":[{"host":"158.69.70.115","port":"6789"}],"user":"cinder","auth-client-required":["cephx","none"],"key-secret":"libvirt-1-storage-secret0","node-name":"libvirt-1-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":true,"cache":{"direct":false,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device ide-cd,bus=ide.0,unit=0,drive=libvirt-1-format,id=ide0-0-0,write-cache=on \
-netdev tap,fd=41,id=hostnet0 \
-device virtio-net-pci,host_mtu=1400,netdev=hostnet0,id=net0,mac=fa:16:3e:4d:bb:0b,bus=pci.0,addr=0x3 \
-add-fd set=2,fd=43 \
-chardev pty,id=charserial0,logfile=/dev/fdset/2,logappend=on \
-device isa-serial,chardev=charserial0,id=serial0 \
-vnc 0.0.0.0:3 \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
-object rng-random,id=objrng0,filename=/dev/urandom \
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x6 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
char device redirected to /dev/pts/1 (label charserial0)

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4c5/746981/5/check/nova-ceph-multistore/4c56def/testr_results.html

tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume

2020-09-07 19:30:13,764 100285 INFO [tempest.lib.common.rest_client] Request (ServerRescueNegativeTestJSON:_run_cleanups): 202 DELETE https://158.69.70.115/compute/v2.1/servers/0d59f238-daef-40d4-adf9-a4fa24c35231/os-volume_attachments/04dd79b2-3c05-4492-b1d7-7969d24df768 1.261s
2020-09-07 19:30:13,764 100285 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Mon, 07 Sep 2020 19:30:12 GMT', 'server': 'Apache/2.4.41 (Ubuntu)', 'content-length': '0', 'content-type': 'application/json', 'openstack-api-version': 'compute 2.1', 'x-openstack-nova-api-version': '2.1', 'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', 'x-openstack-request-id': 'req-502a0106-3eb9-4d42-9dd4-c43ba89187b6', 'x-compute-request-id': 'req-502a0106-3eb9-4d42-9dd4-c43ba89187b6', 'connection': 'close', 'status': '202', 'content-location': 'https://158.69.70.115/compute/v2.1/servers/0d59f238-daef-40d4-adf9-a4fa24c35231/os-volume_attachments/04dd79b2-3c05-4492-b1d7-7969d24df768'}
        Body: b''

# First attempt to detach the device by n-cpu

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4c5/746981/5/check/nova-ceph-multistore/4c56def/controller/logs/screen-n-cpu.txt (gzipped)

29957 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [None req-502a0106-3eb9-4d42-9dd4-c43ba89187b6 tempest-ServerRescueNegativeTestJSON-73411582 tempest-ServerRescueNegativeTestJSON-73411582] detach device xml: <disk type="network" de
29958 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
29959 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
29960 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
29961 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
29962 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
29963 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
29964 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
29965 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>
29966 Sep 07 19:30:14.185403 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: {{(pid=92697) detach_device /opt/stack/nova/nova/virt/libvirt/guest.py:510}}

# DEVICE_DELETED only raised for /machine/peripheral/virtio-disk1/virtio-backend

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4c5/746981/5/check/nova-ceph-multistore/4c56def/controller/logs/libvirt/libvirtd_log.txt (gzipped)

329344 2020-09-07 19:30:14.165+0000: 65559: debug : qemuDomainObjEnterMonitorInternal:9869 : Entering monitor (mon=0x7f769405e470 vm=0x7f768c0df0b0 name=instance-0000003a)
329345 2020-09-07 19:30:14.165+0000: 65559: debug : qemuMonitorDelDevice:2848 : devalias=virtio-disk1
329346 2020-09-07 19:30:14.165+0000: 65559: debug : qemuMonitorDelDevice:2850 : mon:0x7f769405e470 vm:0x7f768c0df0b0 fd:39
329347 2020-09-07 19:30:14.165+0000: 65559: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f769405e470 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-367"}^M
329348 fd=-1
329349 2020-09-07 19:30:14.165+0000: 65555: info : qemuMonitorIOWrite:450 : QEMU_MONITOR_IO_WRITE: mon=0x7f769405e470 buf={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-367"}^M
329350 len=79 ret=79 errno=0
329351 2020-09-07 19:30:14.168+0000: 65555: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-367"}]
329352 2020-09-07 19:30:14.168+0000: 65555: info : qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f769405e470 reply={"return": {}, "id": "libvirt-367"}
329353 2020-09-07 19:30:14.168+0000: 65559: debug : qemuDomainObjExitMonitorInternal:9892 : Exited monitor (mon=0x7f769405e470 vm=0x7f768c0df0b0 name=instance-0000003a)
329354 2020-09-07 19:30:14.201+0000: 65555: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1599507014, "microseconds": 201037}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}]
329355 2020-09-07 19:30:14.208+0000: 65555: info : qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: mon=0x7f769405e470 event={"timestamp": {"seconds": 1599507014, "microseconds": 201037}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}
329356 2020-09-07 19:30:14.208+0000: 65555: debug : qemuMonitorJSONIOProcessEvent:181 : mon=0x7f769405e470 obj=0x55dd95d0cba0
329357 2020-09-07 19:30:14.208+0000: 65555: debug : qemuMonitorEmitEvent:1198 : mon=0x7f769405e470 event=DEVICE_DELETED
329358 2020-09-07 19:30:14.208+0000: 65555: debug : qemuProcessHandleEvent:549 : vm=0x7f768c0df0b0
329359 2020-09-07 19:30:14.208+0000: 65555: debug : virObjectEventNew:631 : obj=0x55dd95d3bf60
329360 2020-09-07 19:30:14.208+0000: 65555: debug : qemuMonitorJSONIOProcessEvent:205 : handle DEVICE_DELETED handler=0x7f7691732840 data=0x55dd95eae3c0
329361 2020-09-07 19:30:14.208+0000: 65555: debug : qemuMonitorJSONHandleDeviceDeleted:1287 : missing device in device deleted event

# Second attempt to detach the device by n-cpu

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4c5/746981/5/check/nova-ceph-multistore/4c56def/controller/logs/screen-n-cpu.txt (gzipped)

30046 Sep 07 19:30:19.192548 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG oslo.service.loopingcall [None req-502a0106-3eb9-4d42-9dd4-c43ba89187b6 tempest-ServerRescueNegativeTestJSON-73411582 tempest-ServerRescueNegativeTestJSON-73411582] Waiting for function nova.virt.libvirt.gu
30047 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [-] detach device xml: <disk type="network" device="disk">
30048 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
30049 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
30050 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
30051 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
30052 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
30053 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
30054 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
30055 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>
30056 Sep 07 19:30:19.194846 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: {{(pid=92697) detach_device /opt/stack/nova/nova/virt/libvirt/guest.py:510}}

# DeviceNotFound raised by QEMU

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4c5/746981/5/check/nova-ceph-multistore/4c56def/controller/logs/libvirt/libvirtd_log.txt (gzipped)

332479 2020-09-07 19:30:19.196+0000: 65560: debug : qemuDomainObjBeginJobInternal:9416 : Starting job: job=modify agentJob=none asyncJob=none (vm=0x7f768c0df0b0 name=instance-0000003a, current job=none agentJob=none async=none)
332480 2020-09-07 19:30:19.196+0000: 65560: debug : qemuDomainObjBeginJobInternal:9470 : Started job: modify (async=none vm=0x7f768c0df0b0 name=instance-0000003a)
332481 2020-09-07 19:30:19.196+0000: 65560: debug : qemuDomainObjEnterMonitorInternal:9869 : Entering monitor (mon=0x7f769405e470 vm=0x7f768c0df0b0 name=instance-0000003a)
332482 2020-09-07 19:30:19.196+0000: 65560: debug : qemuMonitorDelDevice:2848 : devalias=virtio-disk1
332483 2020-09-07 19:30:19.196+0000: 65560: debug : qemuMonitorDelDevice:2850 : mon:0x7f769405e470 vm:0x7f768c0df0b0 fd:39
332484 2020-09-07 19:30:19.196+0000: 65560: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f769405e470 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-369"}^M
332485 fd=-1
332486 2020-09-07 19:30:19.196+0000: 65555: info : qemuMonitorIOWrite:450 : QEMU_MONITOR_IO_WRITE: mon=0x7f769405e470 buf={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-369"}^M
332487 len=79 ret=79 errno=0
332488 2020-09-07 19:30:19.197+0000: 65555: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"id": "libvirt-369", "error": {"class": "DeviceNotFound", "desc": "Device 'virtio-disk1' not found"}}]
332489 2020-09-07 19:30:19.197+0000: 65555: info : qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f769405e470 reply={"id": "libvirt-369", "error": {"class": "DeviceNotFound", "desc": "Device 'virtio-disk1' not found"}}
332490 2020-09-07 19:30:19.197+0000: 65560: debug : qemuDomainObjExitMonitorInternal:9892 : Exited monitor (mon=0x7f769405e470 vm=0x7f768c0df0b0 name=instance-0000003a)
332491 2020-09-07 19:30:19.197+0000: 65560: debug : qemuDomainDeleteDevice:128 : Detaching of device virtio-disk1 failed and no event arrived

# n-cpu continues to retry the detach

30245 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [-] detach device xml: <disk type="network" device="disk">
30246 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
30247 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
30248 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
30249 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
30250 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
30251 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
30252 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
30253 Sep 07 19:30:26.209322 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>

30276 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [-] detach device xml: <disk type="network" device="disk">
30277 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
30278 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
30279 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
30280 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
30281 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
30282 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
30283 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
30284 Sep 07 19:30:42.028517 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>

30356 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [-] detach device xml: <disk type="network" device="disk">
30357 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
30358 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
30359 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
30360 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
30361 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
30362 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
30363 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
30364 Sep 07 19:30:53.232072 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>

30381 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [-] detach device xml: <disk type="network" device="disk">
30382 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
30383 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
30384 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
30385 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
30386 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
30387 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
30388 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
30389 Sep 07 19:31:06.239532 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>

30478 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [-] detach device xml: <disk type="network" device="disk">
30479 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
30480 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
30481 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
30482 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
30483 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
30484 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
30485 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
30486 Sep 07 19:31:21.369016 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>

30796 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [-] detach device xml: <disk type="network" device="disk">
30797 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
30798 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
30799 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
30800 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
30801 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
30802 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
30803 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
30804 Sep 07 19:31:42.590535 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>

31050 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: DEBUG nova.virt.libvirt.guest [-] detach device xml: <disk type="network" device="disk">
31051 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
31052 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <source protocol="rbd" name="volumes/volume-04dd79b2-3c05-4492-b1d7-7969d24df768">
31053 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <host name="158.69.70.115" port="6789"/>
31054 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </source>
31055 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <target dev="vdb" bus="virtio"/>
31056 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <serial>04dd79b2-3c05-4492-b1d7-7969d24df768</serial>
31057 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
31058 Sep 07 19:32:01.613201 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: </disk>

# n-cpu eventually gives up trying to detach the device

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4c5/746981/5/check/nova-ceph-multistore/4c56def/controller/logs/screen-n-cpu.txt (gzipped)

31102 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall [-] Dynamic interval looping call 'oslo_service.loopingcall.RetryDecorator.__call__.<locals>._func' failed: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable t
31103 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall Traceback (most recent call last):
31104 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 150, in _run_loop
31105 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
31106 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 428, in _func
31107 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall return self._sleep_time
31108 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
31109 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall self.force_reraise()
31110 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
31111 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall six.reraise(self.type_, self.value, self.tb)
31112 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
31113 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall raise value
31114 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 407, in _func
31115 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall result = f(*args, **kwargs)
31116 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 489, in _do_wait_and_retry_detach
31117 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall raise exception.DeviceDetachFailed(
31118 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.
31119 Sep 07 19:32:06.850434 ubuntu-focal-ovh-bhs1-0019682147 nova-compute[92697]: ERROR oslo.service.loopingcall

Kashyap Chamarthy (kashyapc) wrote :

Just to expand a little more on the "second DEVICE_DELETED" event:

Apparently this is an "oddity" of VirtIO devices where we see _two_ DEVICE_DELETED events upon hot-unplugging a device. So, ideally, we should see the following two events, but we're only seeing the first one:

(1) "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}

(2) "event": "DEVICE_DELETED", "data": {"device": "virtio-disk1", "path": "/machine/peripheral/virtio-disk1"}}

Kashyap Chamarthy (kashyapc) wrote :

Attaching the QMP requests and responses filtered out from the giant libvirtd.log.

This file was generated using:

    $> grep -Ei '(QEMU_MONITOR_SEND_MSG|QEMU_MONITOR_RECV_)' libvirtd.log |&
       tee QMP_requests_and_replies_from_libvirtd.log

(And then I `gzip`ed it; compressed - 3.1MB; uncompressed -39MB)

Daniel Berrange (berrange) wrote :

Note that there are many VMs running, so it is important to isolate logs for just one of them, by matching on mon=0xXXXXXXXX.

For example considering this

$ grep 0x7f021c0b9c70 libvirtd.log | grep -E 'QEMU_MONITOR_(RECV_REPLY|RECV_EVENT|SEND_MSG)'

We can see the "device_del" request:

2020-09-03 19:53:47.114+0000: 65331: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f021c0b9c70 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-367"}
2020-09-03 19:53:47.116+0000: 65328: info : qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f021c0b9c70 reply={"return": {}, "id": "libvirt-367"}

and then the first DEVICE_DELETED event:

2020-09-03 19:53:52.539+0000: 65328: info : qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: mon=0x7f021c0b9c70 event={"timestamp": {"seconds": 1599162832, "microseconds": 539721}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}

The second, important, DEVICE_DELETED event never arrives. OpenStack hopelessly just retries device_del forever more.

Note in normal case the first DEVICE_DELETED event (for "/machine/peripheral/virtio-disk1/virtio-backend") gets emitted from VCPU thread context. The second DEVICE_DELETED event (for "/machine/peripheral/virtio-disk1/") is emitted from the RCU thread, and that one is missing here. THe second one is the event that libvirt needs in order to determine that unplug is complete.

sean mooney (sean-k-mooney) wrote :

hum i was hoping to indicate this affect focal in some what but not sure how to do that but this issue
happens with the ubunutu 20.04 version of qemu 4.2
it does not seam to happen with the centos 8 build of the same qemu so i dont know if there is a delta in packages or if its just a case that in the openstack ci we have more testing based on ubuntu 20.04 and its appearing there more as a result.

affects: ubuntu → qemu (Ubuntu)
Download full text (9.4 KiB)

Hi,
interesting bug report - thanks lee, Kashyap and Sean - as well as Danil for taking a look already.

If this would always fail no unplugs would work ever which I knew can't be right as I test that. So we need to find what is different...

@Openstack people - is that reliably triggering at the very first hot-detach for you or more like "happening in one of a million detaches triggered by the CI"?

First I simplified the case from "massive openstack CI" to just a few commands and small XMLs...
So I tried to cut ceph out of the equation here and compared hot remove between Bionic and Focal qemu/libvirt when removing a local image file.

$ qemu-img create -f qcow2 /var/lib/uvtool/libvirt/images/testdisk.qcow 20m
Disk:
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/uvtool/libvirt/images/testdisk.qcow'/>
      <target dev='vdc' bus='virtio'/>
    </disk>

The systems have just this guest each and the log debug config is like:
 /etc/libvirt/libvirtd.conf:
 > log_filters="3:qemu 3:libvirt 4:object 2:json 4:event 3:util"
 > log_outputs="2:file:/var/log/libvirtd.log"
 $ rm /var/log/libvirtd.log
 $ systemctl restart libvirtd

Detach via:
 $ virsh detach-disk f-on-b vdc --live

I compared the monitor streams in those two cases but they both deliver DEVICE_DELETED for the
/machine/peripheral/virtio-disk2/virtio-backend as well as /machine/peripheral/virtio-disk2.

@Jamespage could you get me a system with at least one auxiliary ceph disk that I can attach/detach to check if this might be in any way ceph specific?

Details in case we need them for comparison later on ...:

## Bionic ##

libvirt checks devices before removal:

2020-09-21 11:00:19.287+0000: 5755: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7fdf94001430 msg={"execute":"qom-list","arguments":{"path":"/machine/peripheral"},"id":"libvirt-12"}
2020-09-21 11:00:19.288+0000: 5686: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7fdf94001430 reply={"return": [{"name": "virtio-disk2", "type": "child<virtio-blk-pci>"}, {"name": "virtio-disk1", "type": "child<virtio-blk-pci>"}, {"name": "virtio-disk0", "type": "child<virtio-blk-pci>"}, {"name": "video0", "type": "child<cirrus-vga>"}, {"name": "serial0", "type": "child<isa-serial>"}, {"name": "balloon0", "type": "child<virtio-balloon-pci>"}, {"name": "net0", "type": "child<virtio-net-pci>"}, {"name": "usb", "type": "child<piix3-usb-uhci>"}, {"name": "type", "type": "string"}], "id": "libvirt-12"}

Then send removal command

2020-09-21 11:00:48.672+0000: 5691: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7fdf94001430 msg={"execute":"device_del","arguments":{"id":"virtio-disk2"},"id":"libvirt-13"}
2020-09-21 11:00:48.673+0000: 5686: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7fdf94001430 reply={"return": {}, "id": "libvirt-13"}
020-09-21 11:00:48.721+0000: 5686: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7fdf94001430 event={"timestamp": {"seconds": 1600686048, "microseconds": 720908}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk2/virtio-back...

Read more...

As outlined before the time from device_del to the DEVICE_DELETED is indeed "a bit longer" in Focal (from ~0.1s to ~6s), but other than that I couldn't find anything else yet.

We were wondering due to [1] in a related bug if it might be dependent to (over)load.
I was consuming Disk and CPU in a controlled fashion (stress-ng --cpu 8 --io 4 --hdd 4 1x in the container the load is in and 1x on the Host). On that I was running attach/detach loops but in all of a two hundred retries it was "->device_del <-DEVICE_DELETED <-DEVICE_DELETED ->blockdev-del ->blockdev-del".

James Page was doing tests (thanks) with real Openstack and Ceph but couldn't - so far -trigger the issue either. He will likely later update the bug as well, but is currently trying to ramp up concurrency to more likely hit the issue.

https://bugs.launchpad.net/nova/+bug/1882521/comments/8

James Page (james-page) wrote :

I'm not able to reproduce at this point in time - as the version of libvirt/qemu is the same I deployed an Ussuri cloud on focal and ran block device add/remove with concurrency for several hundred iterations but I was not able to trigger a detach failure.

Download full text (7.6 KiB)

Thanks James, but now I'm unsure where to go from here as it isn't reproducible with many tries at different scales that James and I did.

@Sean/Lee
Since you wondered if it might be due to Ubuntu Delta on top of 4.2 - there are two things we could compare Ubuntu's qemu to then:
1. qemu 4.2 as released by upstream
2. qemu 4.2 as build in centos (they have some delta as well)
Not sure I can provide you #2 easily and #1 will always need a bit of delta to build&integrate correctly. All of that would be doable still, but in general if I'd provide you PPA builds could you try those in your environment to !reliably! trigger the issue allowing us to play bisect-ping-pong? The word "reliable" is important here as we'd need to sort out builds/patches by a reliable yes/no on each step.

@Sean/Lee
- which "the centos 8 build of the same qemu" version is that exactly? I might take a look at comparing the patches applied. We are at 4.2.1 already, the latest I found there was 4.2.0-29.el8.3.x86_64.rpm which already is the advanced-virt version (otherwise 2.12 based).

@Sean/Lee
All of the following suggested approaches depend on the question if you can test this reliably with different qemu PPA builds:
- qemu 4.2 (as upstream) vs usual Ubuntu build -> find the offending patch in our delta
- test Ubuntu 20.10 which has qemu 5.0 and libvirt 6.6 -> if fixed there find by which change
- qemu 4.2 Ubuntu vs qemu 4.2 as in CentOS (but build for Ubuntu) -> if the latter works better then let us find by which (set of) patches.

I was checking the Delta on Centos8 advanced-virt qemu 4.2 as that was reported
to (maybe) work better. I was comparing which patches are applied, that are no
on Ubuntu and which of those might be related.
Among several individual fixes for some issues the biggest patch sets
are feature backports for Enhanced LUKS/backup/snapshot handling,
multifd migration/cancel, block-mirror, HMAT changes, virtiofs, qemu-img zero
write, arm time handling and some related build time self tests.
Due to the nature of these changes some affect the block handling by affecting block/job/hotplug. But they might only do so by accident, nothing is clearly for addressing the issue that
was reported. And even of the list below most seem unrelated - so as Sean assumed maybe it just isn't exercised on centos enough to be seen there?

eca0f3524a4eb57d03a56b0cbcef5527a0981ce4 backup: don't acquire aio_context in backup_clean
58226634c4b02af7b10862f7fbd3610a344bfb7f backup: Improve error for bdrv_getlength() failure
958a04bd32af18d9a207bcc78046e56a202aebc2 backup: Make sure that source and target size match
7b8e4857426f2e2de2441749996c6161b550bada block: Add flags to bdrv(_co)_truncate()
92b92799dc8662b6f71809100a4aabc1ae408ebb block: Add flags to BlockDriver.bdrv_co_truncate()
087ab8e775f48766068e65de1bc99d03b40d1670 block: always fill entire LUKS header space with zeros
8c6242b6f383e43fd11d2c50f8bcdd2bba1100fc block-backend: Add flags to blk_truncate()
564806c529d4e0acad209b1e5b864a8886092f1f block-backend: Reorder flush/pdiscard function definitions
0abf2581717a19d9749d5c2ff8acd0ac203452c2 block/backup-top: Don't acquire context while dropping top
1de6b45fb5c1489b45...

Read more...

Changed in qemu (Ubuntu):
status: New → Incomplete
Lee Yarwood (lyarwood) wrote :

@Christian Yes we can test this in OpenStack CI with PPAs if you could provide Bionic based builds.

Better yet I could even try to setup a git bisect locally if you have steps to build from a source tree somewhere?

@James Did you try using OpenStack CI sized hosts (8 vCPUs, 8GB of RAM) and the QEMU virt_type? That was the most reliable way I found of hitting this and running the `tox -e full` tempest target to envoke a full tempest run.

Lee Yarwood (lyarwood) wrote :

> @Christian Yes we can test this in OpenStack CI with PPAs if you
> could provide Bionic based builds.

I'm not sure what I was thinking here but we *can* use Focal based builds of QEMU in CI now.

Lee Yarwood (lyarwood) wrote :

While testing >=v5.0.0 on Fedora 32 I've stumbled across the following change in QEMU:

https://github.com/qemu/qemu/commit/cce8944cc9efab47d4bf29cfffb3470371c3541b

The change suggests that previously parallel requests to detach a PCIe device (such as virtio-blk) from QEMU would result in the original request being cancelled. We are hitting the above check and error with v5.0.0 in OpenStack CI that made me wonder if we are hitting the original issue in this bug where QEMU is < v5.0.0?

Lee Yarwood (lyarwood) wrote :
Download full text (3.8 KiB)

Here's an example from QMP_requests_and_replies_from_libvirtd.log.gz attached earlier to the bug by Kashyap. We can see two device_del commands prior to only seeing a single event emitted by QEMU:

http://paste.openstack.org/show/798607/

# zgrep 0x7f768806eaf0 QMP_requests_and_replies_from_libvirtd.log.gz | egrep '(device_del|DEVICE_DELETED)'
2020-09-07 19:50:11.315+0000: 65558: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-367"}
2020-09-07 19:50:17.535+0000: 65557: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-369"}
2020-09-07 19:50:18.138+0000: 65555: info : qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: mon=0x7f768806eaf0 event={"timestamp": {"seconds": 1599508218, "microseconds": 138046}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}
2020-09-07 19:50:24.549+0000: 65556: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-371"}
2020-09-07 19:50:33.561+0000: 65556: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-373"}
2020-09-07 19:50:44.572+0000: 65557: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-375"}
2020-09-07 19:50:57.583+0000: 65558: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-377"}
2020-09-07 19:51:12.595+0000: 65556: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-379"}
2020-09-07 19:51:30.548+0000: 65557: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-381"}
2020-09-07 19:51:51.804+0000: 65559: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-383"}
2020-09-07 19:53:33.286+0000: 65559: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-385"}
2020-09-07 19:53:40.310+0000: 65558: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-387"}
2020-09-07 19:53:49.353+0000: 65556: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-389"}
2020-09-07 19:54:00.771+0000: 65560: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f768806eaf0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-391"}
2020-09-07 19:54:15.737+0000: 65559: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7...

Read more...

Thanks Lee,
builds are most reliably done in Lauchpad IMHO and the Ubuntu Delta is a quilt stack which isn't as easily bisectable.
If we end up searching not between Ubuntu Delta I can help to get you qemu built from git for that.
But for some initially probing which range of changes we want to actually look at let me provide you PPAs.

Try #1 in PPA [1] is the fix you referred [2] to that avoids the double delete to cause unexpected results. It will also add the warnings you have seen. So giving it a test should be a great first try.

Let me know what the results with that are.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4294/+packages
[2]: https://github.com/qemu/qemu/commit/cce8944cc9efab47d4bf29cfffb3470371c3541b

Lee Yarwood (lyarwood) wrote :
Download full text (11.7 KiB)

Thanks Christian,

I'll confirm that we see error from [2] with that PPA shortly.

In the meantime I've removed the device detach retry logic from OpenStack Nova and now always see two DEVICE_DELETED events raised by QEMU after a single device_del request from libvirt:

http://paste.openstack.org/show/798637/

# zgrep 'debug.*"event": "DEVICE_DELETED"' libvirtd_log.txt.gz
2020-10-01 21:37:33.180+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588253, "microseconds": 179881}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}]
2020-10-01 21:37:33.242+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588253, "microseconds": 242565}, "event": "DEVICE_DELETED", "data": {"device": "virtio-disk1", "path": "/machine/peripheral/virtio-disk1"}}]

2020-10-01 21:40:17.062+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588417, "microseconds": 62704}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}]
2020-10-01 21:40:17.114+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588417, "microseconds": 113888}, "event": "DEVICE_DELETED", "data": {"device": "virtio-disk1", "path": "/machine/peripheral/virtio-disk1"}}]

2020-10-01 21:40:20.911+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588420, "microseconds": 911560}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}]
2020-10-01 21:40:20.985+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588420, "microseconds": 985753}, "event": "DEVICE_DELETED", "data": {"device": "virtio-disk1", "path": "/machine/peripheral/virtio-disk1"}}]

2020-10-01 21:42:53.528+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588573, "microseconds": 528330}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/net1/virtio-backend"}}]
2020-10-01 21:42:53.583+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588573, "microseconds": 583063}, "event": "DEVICE_DELETED", "data": {"device": "net1", "path": "/machine/peripheral/net1"}}]

2020-10-01 21:44:21.156+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588661, "microseconds": 155903}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/net1/virtio-backend"}}]
2020-10-01 21:44:21.214+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588661, "microseconds": 213714}, "event": "DEVICE_DELETED", "data": {"device": "net1", "path": "/machine/peripheral/net1"}}]

2020-10-01 21:45:55.422+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1601588755, "microseconds": 422238}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/net1/virtio-backend"}}]
2020-10-01 21:45:55.479+0000: 55842: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 160...

I was seeing in my manual tests that the delete->Event was taking a bit longer, maybe your retry logic kicks in before it is complete now. And due to that the problem takes place.

So with the PPA you should get:
a) a warning, but no more the cancelled/undefined behavior on double delete
b) can tune your retry logic to no more trigger the warning messages

So I wasn't sure from the logs - is the fix in the PPA good to overcome the issue completely or is it missing something?

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers