Comment 34 for bug 1882521

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/757307
Committed: https://opendev.org/openstack/nova/commit/618103db9bff9341a90342b372d8a77419e085f4
Submitter: "Zuul (22348)"
Branch: stable/train

commit 618103db9bff9341a90342b372d8a77419e085f4
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 2 15:11:25 2020 +0100

    libvirt: Increase incremental and max sleep time during device detach

    Bug #1894804 outlines how DEVICE_DELETED events were often missing from
    QEMU on Focal based OpenStack CI hosts, as originally seen in bug
    #1882521. This was eventually tracked down to some undefined QEMU
    behaviour when a new device_del QMP command is received while another is
    still being processed, causing the original attempt to be aborted.

    We hit this race in slower OpenStack CI envs as n-cpu rather crudely
    retries attempts to detach devices using the RetryDecorator from
    oslo.service. The default incremental sleep time is currently tight
    enough that QEMU is still processing the first device_del request
    on these slower CI hosts when n-cpu asks libvirt to retry the detach,
    sending another device_del to QEMU and hitting the above behaviour.

    Additionally, we have seen the following check being hit when
    testing with QEMU >= v5.0.0. This check now rejects overlapping
    device_del requests in QEMU rather than aborting the original:

    https://github.com/qemu/qemu/commit/cce8944cc9efab47d4bf29cfffb3470371c3541b

    This change aims to avoid this situation entirely by raising the default
    incremental sleep time between detach requests from 2 seconds to 10,
    leaving enough time for the first attempt to complete. The overall
    maximum sleep time is also increased from 30 to 60 seconds.

    Future work will aim to entirely replace this retry logic with a libvirt
    event driven approach, polling for the
    VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED and
    VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED events before retrying.

    Finally, the cleanup of unused arguments in detach_device_with_retry is
    left for a follow up change in order to keep this initial change small
    enough to quickly backport.

    Closes-Bug: #1882521
    Related-Bug: #1894804
    Change-Id: Ib9ed7069cef5b73033351f7a78a3fb566753970d
    (cherry picked from commit dd1e6d4b0cee465fd89744e306fcd25228b3f7cc)
    (cherry picked from commit 4819f694b2e2d5688fdac7e850f1a6c592253d6b)
    (cherry picked from commit f32286c62e11d43f6b01bc82caacf588447862e5)
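
To illustrate the timing change described in the commit message, here is a
minimal sketch of the delay before each retry. It assumes the delay grows by
the incremental sleep time after every failed attempt and is capped at the
maximum sleep time; the real RetryDecorator in oslo.service may differ in
detail. The sleep_schedule helper is hypothetical and is not part of nova or
oslo.service.

```python
def sleep_schedule(inc_sleep_time, max_sleep_time, retries):
    """Return the assumed delay, in seconds, before each retry.

    Hypothetical helper for illustration only: the delay is assumed to
    grow linearly by inc_sleep_time and to be capped at max_sleep_time.
    """
    sleeps = []
    sleep_time = 0
    for _ in range(retries):
        sleep_time += inc_sleep_time
        sleeps.append(min(sleep_time, max_sleep_time))
    return sleeps


if __name__ == "__main__":
    # Values before the change: 2 second increment, 30 second maximum.
    print("old:", sleep_schedule(2, 30, 5))
    # Values after the change: 10 second increment, 60 second maximum.
    print("new:", sleep_schedule(10, 60, 7))
```

Under these assumptions the first retry moves from roughly 2 seconds after
the initial detach request to roughly 10 seconds, which is the window the
change relies on for QEMU to finish processing the first device_del.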