Reviewed: https://review.opendev.org/755799
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dd1e6d4b0cee465fd89744e306fcd25228b3f7cc
Submitter: Zuul
Branch: master
commit dd1e6d4b0cee465fd89744e306fcd25228b3f7cc
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 2 15:11:25 2020 +0100
libvirt: Increase incremental and max sleep time during device detach
Bug #1894804 outlines how DEVICE_DELETED events were often missing from
QEMU on Focal based OpenStack CI hosts as originally seen in bug
#1882521. This has eventually been tracked down to some undefined QEMU
behaviour when a new device_del QMP command is received while another is
still being processed, causing the original attempt to be aborted.
We hit this race in slower OpenStack CI envs as n-cpu rather crudely
retries attempts to detach devices using the RetryDecorator from
oslo.service. The default incremental sleep time is currently short
enough that QEMU is still processing the first device_del request
on these slower CI hosts when n-cpu asks libvirt to retry the detach,
sending another device_del to QEMU and hitting the above behaviour.
Additionally we have also seen the following check being hit when
testing with QEMU >= v5.0.0. This check now rejects overlapping
device_del requests in QEMU rather than aborting the original:
https://github.com/qemu/qemu/commit/cce8944cc9efab47d4bf29cfffb3470371c3541b
This change aims to avoid this situation entirely by raising the default
incremental sleep time between detach requests from 2 seconds to 10,
leaving enough time for the first attempt to complete. The overall
maximum sleep time is also increased from 30 to 60 seconds.
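To make the timing change concrete, the following is a minimal sketch of the delay between retries before and after this change. It assumes the sleep grows by the incremental sleep time on every retry and is capped at the maximum sleep time; `backoff_delays` is a hypothetical helper written for illustration and is not part of nova or oslo.service.

```python
def backoff_delays(inc_sleep_time, max_sleep_time, attempts):
    """Return the sleep in seconds before each retry.

    Assumption: the sleep grows linearly by inc_sleep_time per retry
    and is capped at max_sleep_time.
    """
    delays = []
    sleep_time = 0
    for _ in range(attempts):
        sleep_time += inc_sleep_time
        delays.append(min(sleep_time, max_sleep_time))
    return delays


# Old defaults: 2 second increment, 30 second cap.
old = backoff_delays(2, 30, 5)
# New defaults: 10 second increment, 60 second cap.
new = backoff_delays(10, 60, 5)

print(old)  # [2, 4, 6, 8, 10]
print(new)  # [10, 20, 30, 40, 50]
```

Under these assumptions the first retry is sent after 10 seconds rather than 2, which is the window the first device_del needs on the slower CI hosts.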
Future work will aim to entirely remove this retry logic with a libvirt
event driven approach, polling for the
VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED and
VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED events before retrying.
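The event driven idea can be sketched roughly as follows: send one detach request, then block until an outcome is reported or a timeout expires, and only then decide whether to retry. This is an illustration using only the Python standard library; `DetachWaiter`, `detach_once` and the outcome strings are hypothetical names, and the wiring of `on_event` to real libvirt event callbacks is assumed and not shown.

```python
import threading


class DetachWaiter:
    """Hypothetical helper that blocks until a detach outcome is reported."""

    def __init__(self):
        self._done = threading.Event()
        self.outcome = None  # "removed", "failed", or None on timeout

    def on_event(self, outcome):
        # Assumed to be called from an event callback for the
        # DEVICE_REMOVED or DEVICE_REMOVAL_FAILED events.
        self.outcome = outcome
        self._done.set()

    def wait(self, timeout):
        # Returns the reported outcome, or None if nothing arrived in time.
        self._done.wait(timeout)
        return self.outcome


def detach_once(send_detach, waiter, timeout):
    """Send a single detach request, then wait for its outcome."""
    send_detach()
    return waiter.wait(timeout)


if __name__ == "__main__":
    waiter = DetachWaiter()
    # Simulate the hypervisor reporting removal shortly after the request.
    timer = threading.Timer(0.05, waiter.on_event, args=("removed",))
    print(detach_once(timer.start, waiter, timeout=2.0))
```

A caller would retry only when the outcome is a failure or a timeout, so a second device_del is never sent while the first is still in flight.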
Finally, the cleanup of unused arguments in detach_device_with_retry is
left for a follow up change in order to keep this initial change small
enough to quickly backport.
Closes-Bug: #1882521
Related-Bug: #1894804
Change-Id: Ib9ed7069cef5b73033351f7a78a3fb566753970d