Failing device detachments on Focal: "Unable to detach the device from the live config"

Bug #1882521 reported by Dr. Jens Harbott
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Cinder
Undecided
Unassigned
OpenStack Compute (nova)
High
Lee Yarwood
Train
Undecided
Lee Yarwood
Ussuri
Undecided
Lee Yarwood
Victoria
Undecided
Unassigned

Bug Description

The following tests are failing consistently when deploying devstack on Focal in the CI, see https://review.opendev.org/734029 for detailed logs:

tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume
tempest.api.compute.volumes.test_attach_volume.AttachVolumeMultiAttachTest.test_resize_server_with_multiattached_volume
tempest.api.compute.servers.test_server_rescue.ServerStableDeviceRescueTest.test_stable_device_rescue_disk_virtio_with_volume_attached
tearDownClass (tempest.api.compute.servers.test_server_rescue.ServerStableDeviceRescueTest)

Sample extract from nova-compute log:

Jun 08 08:48:24.384559 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: DEBUG oslo.service.loopingcall [-] Exception which is in the suggested list of exceptions occurred while invoking function: nova.virt.libvirt.guest.Guest.detach_device_with_retry.<locals>._do_wait_and_retry_detach. {{(pid=82495) _func /usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py:410}}
Jun 08 08:48:24.384862 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: DEBUG oslo.service.loopingcall [-] Cannot retry nova.virt.libvirt.guest.Guest.detach_device_with_retry.<locals>._do_wait_and_retry_detach upon suggested exception since retry count (7) reached max retry count (7). {{(pid=82495) _func /usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py:416}}
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall [-] Dynamic interval looping call 'oslo_service.loopingcall.RetryDecorator.__call__.<locals>._func' failed: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall Traceback (most recent call last):
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 150, in _run_loop
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 428, in _func
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall return self._sleep_time
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall self.force_reraise()
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall six.reraise(self.type_, self.value, self.tb)
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall raise value
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall File "/usr/local/lib/python3.8/dist-packages/oslo_service/loopingcall.py", line 407, in _func
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall result = f(*args, **kwargs)
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 453, in _do_wait_and_retry_detach
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall raise exception.DeviceDetachFailed(
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.
Jun 08 08:48:24.388855 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: ERROR oslo.service.loopingcall
Jun 08 08:48:24.390684 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: WARNING nova.virt.block_device [None req-8af75b5f-2587-4ce7-9523-d2902eb45a38 tempest-ServerRescueNegativeTestJSON-1578800383 tempest-ServerRescueNegativeTestJSON-1578800383] [instance: 76f86b1f-8b11-44e6-b718-eda3e7e18937] Guest refused to detach volume 6b0cac03-d6d4-48ae-bf56-06de389c0869: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

There are also some warnings like this

Jun 08 08:37:13.280300 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: DEBUG oslo_concurrency.processutils [None req-bc4635cd-81f9-435d-b60b-fdd64dffa958 None None] Running cmd (subprocess): /usr/bin/python3 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/001ba6d1-94e8-4108-9b2c-5ee92df425b9/disk --force-share --output=json {{(pid=82495) execute /usr/local/lib/python3.8/dist-packages/oslo_concurrency/processutils.py:371}}
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]: Exception ignored in: <function _after_fork at 0x7f0fb37e3b80>
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]: Traceback (most recent call last):
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]: File "/usr/lib/python3.8/threading.py", line 1454, in _after_fork
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]: assert len(_active) == 1
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]: AssertionError:

but I'm not sure whether they might be related, something different, or just purely cosmetical. They also appear for other services like c-vol, so not nova-specific. See also https://bugs.launchpad.net/bugs/1863021

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

I cannot spot the clear issue from nova or cinder flow of failed volume. Issue happening during the test cleanup when attached volume from rescued server is requested for detach

tempest log:

Test performed all operation till unrescue the server and server became Active with same attachment it had before rescue.

- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/job-output.txt#28627

Now cleanup started and test requested the volume detach. It waiting for volume status in 'detaching' and after lot of time volume
status changed to 'in-use'

- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/job-output.txt#29157-29168

Trying to find the flow of same volume in cinder log:

volume is created fine:

- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-c-vol.txt#8021

iscsi target is set correctly

- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-c-vol.txt#8150

attachment is updated fine
- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-c-vol.txt#8268

Detach request

n-api log: req-baec6727-9502-40d0-b4c4-f4e0e56973e0
- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-n-api.txt#32946

-c-api
- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-c-api.txt#22785

detaching of volume is completed
- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-c-api.txt#22798

removed the iscsi target
- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-c-vol.txt#11495

In n-cpu log i am seeing another test trying to detach volume from mountpoint /dev/vdb and failing but it is different server so not sure
if that is any issue?

Another test req-4542a8aa-6deb-434c-9905-99d3eb44f029 tempest-AttachVolumeMultiAttachTest-1857486939

- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-n-cpu.txt#49802

failing tests req-baec6727-9502-40d0-b4c4-f4e0e56973e0 tempest-ServerStableDeviceRescueTest-543821642
- https://zuul.opendev.org/t/openstack/build/9290c83e18a741a5bdab4e28de5eedb7/log/controller/logs/screen-n-cpu.txt#49785

Changed in nova:
importance: Undecided → High
Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

adding cinder also if something from cinder side during detachment.

Revision history for this message
sean mooney (sean-k-mooney) wrote :
Download full text (5.4 KiB)

we have a similar bug downstream against queens in the other direction alos
https://bugzilla.redhat.com/show_bug.cgi?id=1838786

in the downstream bug the failure is on attach
libvirtError: Requested operation is not valid: target vdf already exists

where as upstream we have "nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config."

in both caes its as if the live domain view and nova's/cinder'd view or the domain are getting out of sync with each other

the call trace upstream traceback for detach looks like this

 Traceback (most recent call last):
   File "/opt/stack/nova/nova/virt/block_device.py", line 328, in driver_detach
     virt_driver.detach_volume(context, connection_info, instance, mp,
   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2004, in detach_volume
     wait_for_detach = guest.detach_device_with_retry(guest.get_disk,
   File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 425, in detach_device_with_retry
     _try_detach_device(conf, persistent, live)
   File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 414, in _try_detach_device
     ctx.reraise = True
   File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
     self.force_reraise()
   File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
     six.reraise(self.type_, self.value, self.tb)
   File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
     raise value
   File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 387, in _try_detach_device
     self.detach_device(conf, persistent=persistent, live=live)
   File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 475, in detach_device
     self._domain.detachDeviceFlags(device_xml, flags=flags)
   File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 190, in doit
     result = proxy_call(self._autowrap, f, *args, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 148, in proxy_call
     rv = execute(f, *args, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 129, in execute
     six.reraise(c, e, tb)
   File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
     raise value
   File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 83, in tworker
     rv = meth(*args, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/libvirt.py", line 1408, in detachDeviceFlags
     if ret == -1: raise libvirtError ('virDomainDetachDeviceFlags() failed', dom=self)
 libvirt.libvirtError: device not found: no target device vdb

and the down stream traceback for attach looks like

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5433, in _attach_volume
    do_driver_attach=True)
  File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 46, in wrapped
    ret_val = method(obj, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/virt/block_device.py", line 624, in attach
    virt_driver, do_driver_attach)
  File "/usr/lib/python2.7/site-packages...

Read more...

Revision history for this message
Lee Yarwood (lyarwood) wrote :

I've marked this as confirmed and bumped the importance given the impending move to Focal for all testing in the Victoria release.

C#4 is unrelated, I've continued to see detach issues on Bionic while using QEMU 4.0.0 and libvirt v5.0.0 however I've been unable to reproduce outside of upstream CI.

Each time this has been hit however it appears that the Guest OS (cirros) isn't able to react to the ACPI request to detach the disk device. This could simply be another case of the instances needing to be given more resources for these requests to be served quickly enough to satisfy Nova or relaxing the timeout within Nova.

Changed in nova:
importance: High → Critical
assignee: nobody → Lee Yarwood (lyarwood)
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/749827

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/749929

Revision history for this message
Lee Yarwood (lyarwood) wrote : Re: Failing device detachments on Focal

So I'm even more convinced this is a host resource (I'd assume vCPU?) issue as I can reproduce this consistently on a 20.04 virtual machine with the same resources as our CI instances when running `$ tox -e full` in tempest, `$ tox -e full-serial` doesn't reproduce the issue at all. I'm going to try limiting things with --concurrency and try a slightly larger guest to see if that helps.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Some notes based my understanding (and from a brief chat with libvirt/QEMU developers):

- DEVICE_DELETED is the event that QEMU sends to libvirt, *once* the device was removed by the guest, so that libvirt can clean-up. So if we see DEVICE_DELETED that means the device was successfully detached from QEMU's point of view (therefore, from the guest's PoV, too)

- The presence of the '/sys/module/pci_hotplug/' directory in the guest confirms that it is capable of handling hotplug/hotunplug events. (And Lee confirmed on IRC that the CirrOS guest _does_ have this directory)

So, if you _can_ see DEVICE_DELETED, then it sounds like the problem is somewhere _else_ than the guest OS.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Aside:

Also, a libvirt dev suggested to get the communication between libvirt and the QEMU monitor. We should have them in the CI by default.

(To manually enable: https://kashyapc.fedorapeople.org/virt/openstack/request-nova-libvirt-qemu-debug-logs.txt)

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Combing through the libvirtd.log (https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c3a/734029/2/check/devstack-platform-focal/c3ab542/controller/logs/libvirt/libvirtd_log.txt),

(1) QEMU sent the 'device_del' command (to detach the device):

---
2020-09-03 20:01:52.711+0000: 65330: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f021c0f3b30 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-372"}
---

(2) The reply was 'success'; good:

---
2020-09-03 20:01:52.714+0000: 65328: info : qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f021c0f3b30 reply={"return": {}, "id": "libvirt-372"}
---

(3) And QEMU even emits the event DEVICE_DELETED:

---
2020-09-03 20:01:53.019+0000: 65328: info : qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: mon=0x7f021c0f3b30 event={"timestamp": {"seconds": 1599163313, "microseconds": 18874},
 "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}
---

So far so good ...

(4) ... but then, we see this "missing device in device deleted event"

---
2020-09-03 20:01:53.019+0000: 65328: debug : qemuMonitorJSONIOProcessEvent:205 : handle DEVICE_DELETED handler=0x7f0230572840 data=0x55d556edf3c0
2020-09-03 20:01:53.019+0000: 65328: debug : qemuMonitorJSONHandleDeviceDeleted:1287 : missing device in device deleted event
---

I'm not entirely sure of the significance (or lack thereof) of the above.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Ignore the less interesting comment#11, the more "interesting" bit from the libvirtd.log is here:

libvirt asks QEMU to execute 'device_del' (i.e. to detach the device from the guest):

---
2020-09-03 19:58:35.441+0000: 65331: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f021c0b9c70 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-399"}
 fd=-1
---

But the reply from QEMU is failure to detach the device:

---
2020-09-03 19:58:35.443+0000: 65328: info : qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f021c0b9c70 reply={"id": "libvirt-399", "error": {"class": "DeviceNotFound", "d
esc": "Device 'virtio-disk1' not found"}}

[...]

93e20 name=instance-0000007a)
2020-09-03 19:58:35.443+0000: 65331: debug : qemuDomainDeleteDevice:128 : Detaching of device virtio-disk1 failed and no event arrived
---

I learn the above "... no event arrived" means QEMU didn't tell libvirt which device was deleted.

I still don't have a robust answer to the root cause yet.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

After talking to libvirt/QEMU folks yesterday I've raised the following bug:

Second DEVICE_DELETED event missing during virtio-blk disk device detach
https://bugs.launchpad.net/qemu/+bug/1894804

I'm trying to reproduce this on Fedora today to also raise this downstream in bugzilla.

tags: added: gate-failure
tags: added: victoria-rc-potential
Lee Yarwood (lyarwood)
Changed in nova:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/752654

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/752654
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=57ac83d4d71c903bbaf88a9b6b88a86916c1767c
Submitter: Zuul
Branch: master

commit 57ac83d4d71c903bbaf88a9b6b88a86916c1767c
Author: Lee Yarwood <email address hidden>
Date: Fri Sep 18 10:45:21 2020 +0100

    releasenote: Add known issue for bug #1894804

    Related-Bug: #1882521
    Change-Id: Ib9059dde41b0a07144ffa26552577308b1ffc9e1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/755526

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/755799

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.opendev.org/755526

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/755799
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dd1e6d4b0cee465fd89744e306fcd25228b3f7cc
Submitter: Zuul
Branch: master

commit dd1e6d4b0cee465fd89744e306fcd25228b3f7cc
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 2 15:11:25 2020 +0100

    libvirt: Increase incremental and max sleep time during device detach

    Bug #1894804 outlines how DEVICE_DELETED events were often missing from
    QEMU on Focal based OpenStack CI hosts as originally seen in bug
     #1882521. This has eventually been tracked down to some undefined QEMU
    behaviour when a new device_del QMP command is received while another is
    still being processed, causing the original attempt to be aborted.

    We hit this race in slower OpenStack CI envs as n-cpu rather crudely
    retries attempts to detach devices using the RetryDecorator from
    oslo.service. The default incremental sleep time currently being tight
    enough to ensure QEMU is still processing the first device_del request
    on these slower CI hosts when n-cpu asks libvirt to retry the detach,
    sending another device_del to QEMU hitting the above behaviour.

    Additionally we have also seen the following check being hit when
    testing with QEMU >= v5.0.0. This check now rejects overlapping
    device_del requests in QEMU rather than aborting the original:

    https://github.com/qemu/qemu/commit/cce8944cc9efab47d4bf29cfffb3470371c3541b

    This change aims to avoid this situation entirely by raising the default
    incremental sleep time between detach requests from 2 seconds to 10,
    leaving enough time for the first attempt to complete. The overall
    maximum sleep time is also increased from 30 to 60 seconds.

    Future work will aim to entirely remove this retry logic with a libvirt
    event driven approach, polling for the the
    VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED and
    VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED events before retrying.

    Finally, the cleanup of unused arguments in detach_device_with_retry is
    left for a follow up change in order to keep this initial change small
    enough to quickly backport.

    Closes-Bug: #1882521
    Related-Bug: #1894804
    Change-Id: Ib9ed7069cef5b73033351f7a78a3fb566753970d

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/757305

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/757306

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/757307

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/758971

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/victoria)

Reviewed: https://review.opendev.org/757305
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4819f694b2e2d5688fdac7e850f1a6c592253d6b
Submitter: Zuul
Branch: stable/victoria

commit 4819f694b2e2d5688fdac7e850f1a6c592253d6b
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 2 15:11:25 2020 +0100

    libvirt: Increase incremental and max sleep time during device detach

    Bug #1894804 outlines how DEVICE_DELETED events were often missing from
    QEMU on Focal based OpenStack CI hosts as originally seen in bug
     #1882521. This has eventually been tracked down to some undefined QEMU
    behaviour when a new device_del QMP command is received while another is
    still being processed, causing the original attempt to be aborted.

    We hit this race in slower OpenStack CI envs as n-cpu rather crudely
    retries attempts to detach devices using the RetryDecorator from
    oslo.service. The default incremental sleep time currently being tight
    enough to ensure QEMU is still processing the first device_del request
    on these slower CI hosts when n-cpu asks libvirt to retry the detach,
    sending another device_del to QEMU hitting the above behaviour.

    Additionally we have also seen the following check being hit when
    testing with QEMU >= v5.0.0. This check now rejects overlapping
    device_del requests in QEMU rather than aborting the original:

    https://github.com/qemu/qemu/commit/cce8944cc9efab47d4bf29cfffb3470371c3541b

    This change aims to avoid this situation entirely by raising the default
    incremental sleep time between detach requests from 2 seconds to 10,
    leaving enough time for the first attempt to complete. The overall
    maximum sleep time is also increased from 30 to 60 seconds.

    Future work will aim to entirely remove this retry logic with a libvirt
    event driven approach, polling for the the
    VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED and
    VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED events before retrying.

    Finally, the cleanup of unused arguments in detach_device_with_retry is
    left for a follow up change in order to keep this initial change small
    enough to quickly backport.

    Closes-Bug: #1882521
    Related-Bug: #1894804
    Change-Id: Ib9ed7069cef5b73033351f7a78a3fb566753970d
    (cherry picked from commit dd1e6d4b0cee465fd89744e306fcd25228b3f7cc)

tags: added: in-stable-victoria
Lee Yarwood (lyarwood)
Changed in cinder:
status: New → Invalid
Revision history for this message
melanie witt (melwitt) wrote : Re: Failing device detachments on Focal

Looks like this bug is still occurring in the gate, I just investigated the latest failure on patch https://review.opendev.org/c/openstack/nova/+/765933 and found the following error in n-cpu [1]:

  Guest refused to detach volume 72149632-316d-4095-8876-a63e917ca514: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.

Here's a logstash query:

20 hits in the last 7 days, check and gate, all failures

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Guest%20refused%20to%20detach%20volume%5C%22%20AND%20message%3A%5C%22nova.exception.DeviceDetachFailed%5C%22%20AND%20message%3A%5C%22Unable%20to%20detach%20the%20device%20from%20the%20live%20config%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

[1] https://zuul.opendev.org/t/openstack/build/0e55ed7e1f1d4444a84e4e9acffdcc98/log/compute1/logs/screen-n-cpu.txt#10730

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Continuing to see this again after the break, I've had a look at some example libvirtd logs an unlike https://bugs.launchpad.net/qemu/+bug/1894804 we don't even see the first DEVICE_DELETED event being emitted from QEMU now.

This could still be the same issue *if* n-cpu is asking libvirtd to send the second device_del command *before* QEMU has had change to emit the first DEVICE_DELETED event. So we could again extend the time between requests from n-cpu to libvirtd but that's another hackaround.

I've highlighted again to gibi that the only real fix within Nova for this is to move to a libvirtd event based approach as outlined during the PTG and I had started to PoC below:

https://review.opendev.org/c/openstack/nova/+/749929

Revision history for this message
melanie witt (melwitt) wrote :

Updated the bug title ^ so that ctrl + F on http://status.openstack.org/elastic-recheck will find it when looking for a bug ID for "recheck bug <ID>"

summary: - Failing device detachments on Focal
+ Failing device detachments on Focal: "Unable to detach the device from
+ the live config"
Revision history for this message
melanie witt (melwitt) wrote :
Revision history for this message
melanie witt (melwitt) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/770246
Committed: https://opendev.org/openstack/nova/commit/e56cc4f439846558fc13298c2360d7cdd473cc89
Submitter: "Zuul (22348)"
Branch: master

commit e56cc4f439846558fc13298c2360d7cdd473cc89
Author: Balazs Gibizer <email address hidden>
Date: Tue Jan 12 08:23:17 2021 +0100

    Replace blind retry with libvirt event waiting in detach

    Nova so far applied a retry loop that tried to periodically detach the
    device from libvirt while the device was visible in the domain xml. This
    could lead to an issue where an already progressing detach on the
    libvirt side is interrupted by nova re-sending the detach request for
    the same device. See bug #1882521 for more information.

    Also if there was both a persistent and a live domain the nova tried the
    detach from both at the same call. This lead to confusion about the
    result when such call failed. Was the detach failed partially?

    We can do better, at least for the live detach case. Based on the
    libvirt developers detaching from the persistent domain always
    succeeds and it is a synchronous process. Detaching from the live
    domain can be both synchronous or asynchronous depending on the guest
    OS and the load on the hypervisor. But for live detach libvirt always
    sends an event [1] nova can wait for.

    So this patch does two things.

    1) Separates the detach from the persistent domain from the detach from
       the live domain to make the error cases clearer.

    2) Changes the retry mechanism.

       Detaching from the persistent domain is not retried. If libvirt
       reports device not found, while both persistent and live detach
       is needed, the error is ignored, and the process continues with
       the live detach. In any other case the error considered as fatal.

       Detaching from the live domain is changed to always wait for the
       libvirt event. In case of timeout, the live detach is retried.
       But a failure event from libvirt considered fatal, based on the
       information from the libvirt developers, so in this case the
       detach is not retried.

    Related-Bug: #1882521

    [1]https://libvirt.org/html/libvirt-libvirt-domain.html#virConnectDomainEventDeviceRemovedCallback

    Change-Id: I7f2b6330decb92e2838aa7cee47fb228f00f47da

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/777151
Committed: https://opendev.org/openstack/nova/commit/52d6cd941ce65baadb140551233406ad642e1e70
Submitter: "Zuul (22348)"
Branch: master

commit 52d6cd941ce65baadb140551233406ad642e1e70
Author: Balazs Gibizer <email address hidden>
Date: Tue Feb 23 16:30:34 2021 +0100

    Move the guest.get_disk test to test_guest

    As a follow up of I86153d31b02e6b74b42d53a6800297cbd0e5cbb4 the two
    get_disk test that was mistakenly added to test_driver is now moved to
    test_guest where they belong.

    Change-Id: I17bd591ffb96b9b296bea04c87e286a83d40570e
    Related-Bug: #1882521

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/777159
Committed: https://opendev.org/openstack/nova/commit/a7f2b65d17376710a311bd2f30a6524437bb3aa8
Submitter: "Zuul (22348)"
Branch: master

commit a7f2b65d17376710a311bd2f30a6524437bb3aa8
Author: Balazs Gibizer <email address hidden>
Date: Tue Feb 23 17:29:37 2021 +0100

    Follow up type hints for a634103

    It is a follow up patch for I86153d31b02e6b74b42d53a6800297cbd0e5cbb4
    to add type hints to the functions that was touched by the original
    patch.

    Change-Id: I332ea49184200fcaf8d1480da9658fcbb2f325c5
    Related-Bug: #1882521

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/757307
Committed: https://opendev.org/openstack/nova/commit/618103db9bff9341a90342b372d8a77419e085f4
Submitter: "Zuul (22348)"
Branch: stable/train

commit 618103db9bff9341a90342b372d8a77419e085f4
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 2 15:11:25 2020 +0100

    libvirt: Increase incremental and max sleep time during device detach

    Bug #1894804 outlines how DEVICE_DELETED events were often missing from
    QEMU on Focal based OpenStack CI hosts as originally seen in bug
     #1882521. This has eventually been tracked down to some undefined QEMU
    behaviour when a new device_del QMP command is received while another is
    still being processed, causing the original attempt to be aborted.

    We hit this race in slower OpenStack CI envs as n-cpu rather crudely
    retries attempts to detach devices using the RetryDecorator from
    oslo.service. The default incremental sleep time currently being tight
    enough to ensure QEMU is still processing the first device_del request
    on these slower CI hosts when n-cpu asks libvirt to retry the detach,
    sending another device_del to QEMU hitting the above behaviour.

    Additionally we have also seen the following check being hit when
    testing with QEMU >= v5.0.0. This check now rejects overlapping
    device_del requests in QEMU rather than aborting the original:

    https://github.com/qemu/qemu/commit/cce8944cc9efab47d4bf29cfffb3470371c3541b

    This change aims to avoid this situation entirely by raising the default
    incremental sleep time between detach requests from 2 seconds to 10,
    leaving enough time for the first attempt to complete. The overall
    maximum sleep time is also increased from 30 to 60 seconds.

    Future work will aim to entirely remove this retry logic with a libvirt
    event driven approach, polling for the the
    VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED and
    VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED events before retrying.

    Finally, the cleanup of unused arguments in detach_device_with_retry is
    left for a follow up change in order to keep this initial change small
    enough to quickly backport.

    Closes-Bug: #1882521
    Related-Bug: #1894804
    Change-Id: Ib9ed7069cef5b73033351f7a78a3fb566753970d
    (cherry picked from commit dd1e6d4b0cee465fd89744e306fcd25228b3f7cc)
    (cherry picked from commit 4819f694b2e2d5688fdac7e850f1a6c592253d6b)
    (cherry picked from commit f32286c62e11d43f6b01bc82caacf588447862e5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/788720

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/788721

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/788723

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 21.2.1

This issue was fixed in the openstack/nova 21.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.6.1

This issue was fixed in the openstack/nova 20.6.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/788720
Committed: https://opendev.org/openstack/nova/commit/30317e6b3f5ee5e7fd1e8c564a1d6f3e3a2fb91b
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 30317e6b3f5ee5e7fd1e8c564a1d6f3e3a2fb91b
Author: Balazs Gibizer <email address hidden>
Date: Tue Jan 12 08:23:17 2021 +0100

    Replace blind retry with libvirt event waiting in detach

    Nova so far applied a retry loop that tried to periodically detach the
    device from libvirt while the device was visible in the domain xml. This
    could lead to an issue where an already progressing detach on the
    libvirt side is interrupted by nova re-sending the detach request for
    the same device. See bug #1882521 for more information.

    Also if there was both a persistent and a live domain the nova tried the
    detach from both at the same call. This lead to confusion about the
    result when such call failed. Was the detach failed partially?

    We can do better, at least for the live detach case. Based on the
    libvirt developers detaching from the persistent domain always
    succeeds and it is a synchronous process. Detaching from the live
    domain can be both synchronous or asynchronous depending on the guest
    OS and the load on the hypervisor. But for live detach libvirt always
    sends an event [1] nova can wait for.

    So this patch does two things.

    1) Separates the detach from the persistent domain from the detach from
       the live domain to make the error cases clearer.

    2) Changes the retry mechanism.

       Detaching from the persistent domain is not retried. If libvirt
       reports device not found, while both persistent and live detach
       is needed, the error is ignored, and the process continues with
       the live detach. In any other case the error considered as fatal.

       Detaching from the live domain is changed to always wait for the
       libvirt event. In case of timeout, the live detach is retried.
       But a failure event from libvirt considered fatal, based on the
       information from the libvirt developers, so in this case the
       detach is not retried.

    Related-Bug: #1882521

    [1]https://libvirt.org/html/libvirt-libvirt-domain.html#virConnectDomainEventDeviceRemovedCallback

    Change-Id: I7f2b6330decb92e2838aa7cee47fb228f00f47da
    (cherry picked from commit e56cc4f439846558fc13298c2360d7cdd473cc89)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/788721
Committed: https://opendev.org/openstack/nova/commit/5f488d8cd1195ded0b868ce54eea1d13fa37c27c
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5f488d8cd1195ded0b868ce54eea1d13fa37c27c
Author: Balazs Gibizer <email address hidden>
Date: Tue Feb 23 16:30:34 2021 +0100

    Move the guest.get_disk test to test_guest

    As a follow up of I86153d31b02e6b74b42d53a6800297cbd0e5cbb4 the two
    get_disk test that was mistakenly added to test_driver is now moved to
    test_guest where they belong.

    Change-Id: I17bd591ffb96b9b296bea04c87e286a83d40570e
    Related-Bug: #1882521
    (cherry picked from commit 52d6cd941ce65baadb140551233406ad642e1e70)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/788723
Committed: https://opendev.org/openstack/nova/commit/9f90c7268cc24e6c8b93858563a3aefa50d0fb56
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 9f90c7268cc24e6c8b93858563a3aefa50d0fb56
Author: Balazs Gibizer <email address hidden>
Date: Tue Feb 23 17:29:37 2021 +0100

    Follow up type hints for a634103

    It is a follow up patch for I86153d31b02e6b74b42d53a6800297cbd0e5cbb4
    to add type hints to the functions that was touched by the original
    patch.

    Change-Id: I332ea49184200fcaf8d1480da9658fcbb2f325c5
    Related-Bug: #1882521
    (cherry picked from commit a7f2b65d17376710a311bd2f30a6524437bb3aa8)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/806943

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/806943
Committed: https://opendev.org/openstack/nova/commit/b67b928c381511f6101778ab50f281ecd689d283
Submitter: "Zuul (22348)"
Branch: master

commit b67b928c381511f6101778ab50f281ecd689d283
Author: Balazs Gibizer <email address hidden>
Date: Wed Sep 1 17:32:56 2021 +0200

    Parse alias from domain hostdev

    In I16e7df6932bb7dff243706ee49338ba6b3782085 we missed that
    LibvirtConfigGuestHostdevPCI is not a child class of
    LibvirtConfigGuestInterface and therefore we missed parsing out the
    alias field from the domain xml for hostdevs. The new libvirt driver
    device detach logic[1] uses the alias as the identifier towards libvirt
    so now hostdevs cannot be detached. This patch parses out the alias
    field to fix the issue.

    Closes-Bug: #1942345
    Related-Bug: #1882521

    [1] https://review.opendev.org/q/topic:bug/1882521

    Change-Id: I30d30a772475cb82d0fd088f14a54a35646bd1dc

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/807767

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers