Failing device detachments on Focal: "Unable to detach the device from the live config"
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Cinder | | Undecided | Unassigned | |
| OpenStack Compute (nova) | | High | Lee Yarwood | |
| Train | | Undecided | Lee Yarwood | |
| Ussuri | | Undecided | Lee Yarwood | |
| Victoria | | Undecided | Unassigned | |
Bug Description
The following tests are failing consistently when deploying devstack on Focal in the CI, see https:/
tempest.
tempest.
tempest.
tearDownClass (tempest.
Sample extract from nova-compute log:
[nova-compute log excerpt from Jun 08 08:48:24 truncated]
Dr. Jens Harbott (j-harbott) wrote : | #1 |
Ghanshyam Mann (ghanshyammann) wrote : | #2 |
I cannot spot a clear issue in the nova or cinder flow for the failed volume. The issue happens during test cleanup, when the volume attached to the rescued server is requested to be detached.
tempest log:
The test performed all operations up to unrescuing the server, and the server became ACTIVE with the same attachment it had before the rescue.
- https:/
Now cleanup started and the test requested the volume detach. It waited for the volume status to become 'detaching', and after a long time the volume status changed back to 'in-use'.
Tracing the flow of the same volume in the cinder log:
volume is created fine:
iSCSI target is set correctly
attachment is updated fine
- https:/
Detach request
n-api log: req-baec6727-
- https:/
detaching of volume is completed
- https:/
removed the iSCSI target
- https:/
In the n-cpu log I am seeing another test trying to detach a volume from mountpoint /dev/vdb and failing, but it is a different server, so I am not sure whether that is related.
Another test: req-4542a8aa-
Failing test: req-baec6727-
- https:/
Changed in nova: | |
importance: | Undecided → High |
Ghanshyam Mann (ghanshyammann) wrote : | #3 |
Adding cinder as well, in case something on the cinder side is involved during detachment.
sean mooney (sean-k-mooney) wrote : | #4 |
We have a similar bug downstream against Queens, in the other direction as well:
https:/
In the downstream bug the failure is on attach:
libvirtError: Requested operation is not valid: target vdf already exists
whereas upstream we have "nova.exception
In both cases it is as if the live domain view and nova's/cinder's view of the domain are getting out of sync with each other.
The upstream traceback for detach looks like this:
Traceback (most recent call last):
File "/opt/stack/
virt_
File "/opt/stack/
wait_
File "/opt/stack/
_try_
File "/opt/stack/
ctx.reraise = True
File "/usr/local/
self.
File "/usr/local/
six.
File "/usr/local/
raise value
File "/opt/stack/
self.
File "/opt/stack/
self.
File "/usr/local/
result = proxy_call(
File "/usr/local/
rv = execute(f, *args, **kwargs)
File "/usr/local/
six.reraise(c, e, tb)
File "/usr/local/
raise value
File "/usr/local/
rv = meth(*args, **kwargs)
File "/usr/local/
if ret == -1: raise libvirtError ('virDomainDeta
libvirt.
and the downstream traceback for attach looks like this:
Traceback (most recent call last):
File "/usr/lib/
do_
File "/usr/lib/
ret_val = method(obj, context, *args, **kwargs)
File "/usr/lib/
virt_driver, do_driver_attach)
File "/usr/lib/
Lee Yarwood (lyarwood) wrote : | #5 |
I've marked this as confirmed and bumped the importance given the impending move to Focal for all testing in the Victoria release.
Comment #4 is unrelated. I've continued to see detach issues on Bionic while using QEMU 4.0.0 and libvirt v5.0.0; however, I've been unable to reproduce this outside of upstream CI.
Each time this has been hit, however, it appears that the guest OS (CirrOS) isn't able to react to the ACPI request to detach the disk device. This could simply be another case of the instances needing to be given more resources so that these requests are served quickly enough to satisfy Nova, or of the timeout within Nova needing to be relaxed.
Changed in nova: | |
importance: | High → Critical |
assignee: | nobody → Lee Yarwood (lyarwood) |
status: | New → Confirmed |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #7 |
Related fix proposed to branch: master
Review: https:/
So I'm even more convinced this is a host resource issue (I'd assume vCPU?), as I can reproduce this consistently on a 20.04 virtual machine with the same resources as our CI instances when running `$ tox -e full` in tempest, while `$ tox -e full-serial` doesn't reproduce the issue at all. I'm going to try limiting things with --concurrency and also try a slightly larger guest to see if that helps.
Kashyap Chamarthy (kashyapc) wrote : | #9 |
Some notes based on my understanding (and on a brief chat with libvirt/QEMU developers):
- DEVICE_DELETED is the event that QEMU sends to libvirt *once* the device has been removed by the guest, so that libvirt can clean up. So if we see DEVICE_DELETED, that means the device was successfully detached from QEMU's point of view (and therefore from the guest's PoV, too).
- The presence of the '/sys/module/
So, if you _can_ see DEVICE_DELETED, then it sounds like the problem lies somewhere _other_ than the guest OS.
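For anyone wanting to observe this outside of Nova, here is a minimal sketch (not Nova code; it assumes the libvirt-python bindings and a locally reachable libvirtd) of how QEMU's DEVICE_DELETED surfaces in Python: libvirt re-emits it as the VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED domain event, which a registered callback can observe.

# Minimal sketch (not Nova's code): watch for device-removal events via the
# libvirt-python bindings. QEMU's DEVICE_DELETED is surfaced by libvirt as
# VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED once the guest has released the device.
import libvirt


def device_removed(conn, dom, dev, opaque):
    # 'dev' is the alias/target of the device the guest finished releasing,
    # e.g. 'virtio-disk1' / 'vdb' for a virtio-blk disk.
    print("device removed from %s: %s" % (dom.name(), dev))


libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(
    None,                                        # any domain
    libvirt.VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED,  # follows QEMU's DEVICE_DELETED
    device_removed,
    None)

while True:
    libvirt.virEventRunDefaultImpl()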
Kashyap Chamarthy (kashyapc) wrote : | #10 |
Aside:
Also, a libvirt dev suggested capturing the communication between libvirt and the QEMU monitor. We should have these logs in the CI by default.
(To manually enable: https:/
Kashyap Chamarthy (kashyapc) wrote : | #11 |
Combing through the libvirtd.log (https:/
(1) QEMU sent the 'device_del' command (to detach the device):
---
2020-09-03 20:01:52.711+0000: 65330: info : qemuMonitorSend:993 : QEMU_MONITOR_
---
(2) The reply was 'success'; good:
---
2020-09-03 20:01:52.714+0000: 65328: info : qemuMonitorJSON
---
(3) And QEMU even emits the event DEVICE_DELETED:
---
2020-09-03 20:01:53.019+0000: 65328: info : qemuMonitorJSON
"event": "DEVICE_DELETED", "data": {"path": "/machine/
---
So far so good ...
(4) ... but then, we see this "missing device in device deleted event"
---
2020-09-03 20:01:53.019+0000: 65328: debug : qemuMonitorJSON
2020-09-03 20:01:53.019+0000: 65328: debug : qemuMonitorJSON
---
I'm not entirely sure of the significance (or lack thereof) of the above.
Kashyap Chamarthy (kashyapc) wrote : | #12 |
Ignore the less interesting comment #11; the more "interesting" bit from the libvirtd.log is here:
libvirt asks QEMU to execute 'device_del' (i.e. to detach the device from the guest):
---
2020-09-03 19:58:35.441+0000: 65331: info : qemuMonitorSend:993 : QEMU_MONITOR_
fd=-1
---
But the reply from QEMU is failure to detach the device:
---
2020-09-03 19:58:35.443+0000: 65328: info : qemuMonitorJSON
esc": "Device 'virtio-disk1' not found"}}
[...]
93e20 name=instance-
2020-09-03 19:58:35.443+0000: 65331: debug : qemuDomainDelet
---
I've learned that the above "... no event arrived" means QEMU didn't tell libvirt which device was deleted.
I still don't have a robust answer on the root cause yet.
Lee Yarwood (lyarwood) wrote : | #13 |
After talking to libvirt/QEMU folks yesterday I've raised the following bug:
Second DEVICE_DELETED event missing during virtio-blk disk device detach
https:/
I'm trying to reproduce this on Fedora today to also raise this downstream in bugzilla.
tags: | added: gate-failure |
tags: | added: victoria-rc-potential |
Changed in nova: | |
importance: | Critical → High |
Related fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 57ac83d4d71c903
Author: Lee Yarwood <email address hidden>
Date: Fri Sep 18 10:45:21 2020 +0100
releasenote: Add known issue for bug #1894804
Related-Bug: #1882521
Change-Id: Ib9059dde41b0a0
Related fix proposed to branch: master
Review: https:/
Fix proposed to branch: master
Review: https:/
Changed in nova: | |
status: | Confirmed → In Progress |
Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit dd1e6d4b0cee465
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 2 15:11:25 2020 +0100
libvirt: Increase incremental and max sleep time during device detach
Bug #1894804 outlines how DEVICE_DELETED events were often missing from
QEMU on Focal based OpenStack CI hosts as originally seen in bug
#1882521. This has eventually been tracked down to some undefined QEMU
behaviour when a new device_del QMP command is received while another is
still being processed, causing the original attempt to be aborted.
We hit this race in slower OpenStack CI envs as n-cpu rather crudely
retries attempts to detach devices using the RetryDecorator from
oslo.service. The default incremental sleep time is currently tight
enough that QEMU is still processing the first device_del request on
these slower CI hosts when n-cpu asks libvirt to retry the detach,
sending another device_del to QEMU and hitting the above behaviour.
Additionally we have also seen the following check being hit when
testing with QEMU >= v5.0.0. This check now rejects overlapping
device_del requests in QEMU rather than aborting the original:
https:/
This change aims to avoid this situation entirely by raising the default
incremental sleep time between detach requests from 2 seconds to 10,
leaving enough time for the first attempt to complete. The overall
maximum sleep time is also increased from 30 to 60 seconds.
Future work will aim to entirely remove this retry logic with a libvirt
event driven approach, polling for the
VIR_
VIR_
Finally, the cleanup of unused arguments in detach_
left for a follow up change in order to keep this initial change small
enough to quickly backport.
Closes-Bug: #1882521
Related-Bug: #1894804
Change-Id: Ib9ed7069cef5b7
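For illustration, here is a minimal, hedged sketch of how oslo.service's RetryDecorator drives a detach-and-retry loop like the one the commit message above describes. The Guest class, the exception name and the retry budget are stand-ins invented for this sketch, not Nova's actual interface; only the inc_sleep_time/max_sleep_time values mirror the new defaults (10 and 60 seconds).

# Illustrative sketch only (not Nova's real code): oslo.service's
# RetryDecorator re-runs the decorated function with an incrementally
# growing sleep whenever one of the listed exceptions is raised.
from oslo_service import loopingcall


class DeviceDetachFailed(Exception):
    """Stand-in exception: the device is still present in the domain."""


class Guest(object):
    """Stand-in for a libvirt guest wrapper, used only for this sketch."""

    def __init__(self):
        self._attempts_needed = 2  # pretend the guest needs a couple of tries

    def detach_device(self, alias):
        self._attempts_needed -= 1

    def has_device(self, alias):
        return self._attempts_needed > 0


@loopingcall.RetryDecorator(
    max_retry_count=7,      # illustrative retry budget
    inc_sleep_time=10,      # sleep grows by 10s per attempt (new default)
    max_sleep_time=60,      # capped at 60s (new default)
    exceptions=(DeviceDetachFailed,))
def detach_disk(guest, alias):
    # Ask for the detach, then check whether the device is still attached;
    # raising makes the decorator sleep and retry this whole function.
    guest.detach_device(alias)
    if guest.has_device(alias):
        raise DeviceDetachFailed(alias)


detach_disk(Guest(), 'virtio-disk1')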
Changed in nova: | |
status: | In Progress → Fix Released |
Fix proposed to branch: stable/victoria
Review: https:/
Fix proposed to branch: stable/ussuri
Review: https:/
Fix proposed to branch: stable/train
Review: https:/
Related fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/victoria
commit 4819f694b2e2d56
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 2 15:11:25 2020 +0100
libvirt: Increase incremental and max sleep time during device detach
Bug #1894804 outlines how DEVICE_DELETED events were often missing from
QEMU on Focal based OpenStack CI hosts as originally seen in bug
#1882521. This has eventually been tracked down to some undefined QEMU
behaviour when a new device_del QMP command is received while another is
still being processed, causing the original attempt to be aborted.
We hit this race in slower OpenStack CI envs as n-cpu rather crudely
retries attempts to detach devices using the RetryDecorator from
oslo.service. The default incremental sleep time is currently tight
enough that QEMU is still processing the first device_del request on
these slower CI hosts when n-cpu asks libvirt to retry the detach,
sending another device_del to QEMU and hitting the above behaviour.
Additionally we have also seen the following check being hit when
testing with QEMU >= v5.0.0. This check now rejects overlapping
device_del requests in QEMU rather than aborting the original:
https:/
This change aims to avoid this situation entirely by raising the default
incremental sleep time between detach requests from 2 seconds to 10,
leaving enough time for the first attempt to complete. The overall
maximum sleep time is also increased from 30 to 60 seconds.
Future work will aim to entirely remove this retry logic with a libvirt
event driven approach, polling for the
VIR_
VIR_
Finally, the cleanup of unused arguments in detach_
left for a follow up change in order to keep this initial change small
enough to quickly backport.
Closes-Bug: #1882521
Related-Bug: #1894804
Change-Id: Ib9ed7069cef5b7
(cherry picked from commit dd1e6d4b0cee465
tags: | added: in-stable-victoria |
Changed in cinder: | |
status: | New → Invalid |
Looks like this bug is still occurring in the gate; I just investigated the latest failure on patch https:/
Guest refused to detach volume 72149632-
Here's a logstash query:
20 hits in the last 7 days, check and gate, all failures
Lee Yarwood (lyarwood) wrote : | #26 |
Continuing to see this again after the break, I've had a look at some example libvirtd logs and, unlike https:/
This could still be the same issue *if* n-cpu is asking libvirtd to send the second device_del command *before* QEMU has had a chance to emit the first DEVICE_DELETED event. So we could again extend the time between requests from n-cpu to libvirtd, but that's another hackaround.
I've highlighted again to gibi that the only real fix within Nova for this is to move to a libvirtd event based approach, as outlined during the PTG and as I had started to PoC below:
melanie witt (melwitt) wrote : | #27 |
Updated the bug title ^ so that ctrl + F on http://
summary:
- Failing device detachments on Focal
+ Failing device detachments on Focal: "Unable to detach the device from the live config"
melanie witt (melwitt) wrote : | #28 |
https:/
melanie witt (melwitt) wrote : | #29 |
https:/
This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit e56cc4f43984655
Author: Balazs Gibizer <email address hidden>
Date: Tue Jan 12 08:23:17 2021 +0100
Replace blind retry with libvirt event waiting in detach
Nova so far applied a retry loop that tried to periodically detach the
device from libvirt while the device was visible in the domain xml. This
could lead to an issue where an already progressing detach on the
libvirt side is interrupted by nova re-sending the detach request for
the same device. See bug #1882521 for more information.
Also, if there was both a persistent and a live domain, nova tried the
detach from both in the same call. This led to confusion about the
result when such a call failed. Did the detach fail partially?
We can do better, at least for the live detach case. Based on input from
the libvirt developers, detaching from the persistent domain always
succeeds and is a synchronous process. Detaching from the live domain
can be either synchronous or asynchronous depending on the guest OS and
the load on the hypervisor. But for live detach libvirt always sends an
event [1] that nova can wait for.
So this patch does two things.
1) Separates the detach from the persistent domain from the detach from
the live domain to make the error cases clearer.
2) Changes the retry mechanism.
Detaching from the persistent domain is not retried. If libvirt
reports the device is not found while both persistent and live detach
are needed, the error is ignored and the process continues with the
live detach. In any other case the error is considered fatal.
Detaching from the live domain is changed to always wait for the
libvirt event. In case of timeout, the live detach is retried.
But a failure event from libvirt is considered fatal, based on the
information from the libvirt developers, so in this case the detach is
not retried.
Related-Bug: #1882521
[1]https:/
Change-Id: I7f2b6330decb92
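To make the new flow concrete, here is a small self-contained sketch (my own illustration, not the actual Nova implementation) of the "wait for the libvirt event, retry only on timeout, treat a failure event as fatal" behaviour described in the commit message above. The queue-based event delivery and the send_detach callable are assumptions made purely for this sketch.

# Illustration only (not Nova's implementation): a live detach that waits
# for an event after each request, retries solely on timeout, and treats
# an explicit failure event as fatal.
import queue
import time


class DeviceDetachFailed(Exception):
    pass


def live_detach(send_detach, events, device, timeout=20.0, max_attempts=8):
    # send_detach(device) -- assumed callable issuing one detach request
    # events              -- queue.Queue delivering ('removed'|'failed', dev)
    for _ in range(max_attempts):
        send_detach(device)                       # issue one detach request
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                             # timeout: retry the detach
            try:
                kind, dev = events.get(timeout=remaining)
            except queue.Empty:
                break                             # timeout: retry the detach
            if dev != device:
                continue                          # event for another device
            if kind == 'removed':
                return                            # detach completed
            if kind == 'failed':
                raise DeviceDetachFailed(device)  # failure event is fatal
    raise DeviceDetachFailed(device)              # gave up after max_attempts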
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 52d6cd941ce65ba
Author: Balazs Gibizer <email address hidden>
Date: Tue Feb 23 16:30:34 2021 +0100
Move the guest.get_disk test to test_guest
As a follow up of I86153d31b02e6b
get_disk test that was mistakenly added to test_driver is now moved to
test_guest where it belongs.
Change-Id: I17bd591ffb96b9
Related-Bug: #1882521
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit a7f2b65d1737671
Author: Balazs Gibizer <email address hidden>
Date: Tue Feb 23 17:29:37 2021 +0100
Follow up type hints for a634103
It is a follow up patch for I86153d31b02e6b
to add type hints to the functions that were touched by the original
patch.
Change-Id: I332ea49184200f
Related-Bug: #1882521
There are also some warnings like this
Jun 08 08:37:13.280300 ubuntu-focal-rax-dfw-0017012548 nova-compute[82495]: DEBUG oslo_concurrency.processutils [None req-bc4635cd-81f9-435d-b60b-fdd64dffa958 None None] Running cmd (subprocess): /usr/bin/python3 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/001ba6d1-94e8-4108-9b2c-5ee92df425b9/disk --force-share --output=json {{(pid=82495) execute /usr/local/lib/python3.8/dist-packages/oslo_concurrency/processutils.py:371}}
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]: Exception ignored in: <function _after_fork at 0x7f0fb37e3b80>
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]: Traceback (most recent call last):
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]:   File "/usr/lib/python3.8/threading.py", line 1454, in _after_fork
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]:     assert len(_active) == 1
Jun 08 08:37:13.288934 ubuntu-focal-rax-dfw-0017012548 nova-compute[91401]: AssertionError:
but I'm not sure whether they might be related, something different, or just purely cosmetic. They also appear for other services like c-vol, so they are not nova-specific. See also https://bugs.launchpad.net/bugs/1863021