OpenStack Compute (nova)

Bug #1882521
Comment #31

OpenStack Infra (hudson-openstack) wrote on 2021-04-21: Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/770246
Committed: https://opendev.org/openstack/nova/commit/e56cc4f439846558fc13298c2360d7cdd473cc89
Submitter: "Zuul (22348)"
Branch: master

commit e56cc4f439846558fc13298c2360d7cdd473cc89
Author: Balazs Gibizer <email address hidden>
Date: Tue Jan 12 08:23:17 2021 +0100

    Replace blind retry with libvirt event waiting in detach

    Nova so far applied a retry loop that tried to periodically detach the
    device from libvirt while the device was visible in the domain xml. This
    could lead to an issue where an already progressing detach on the
    libvirt side is interrupted by nova re-sending the detach request for
    the same device. See bug #1882521 for more information.

    Also, if there was both a persistent and a live domain, nova tried to
    detach from both in the same call. This led to confusion about the
    result when such a call failed: did the detach fail only partially?

    We can do better, at least for the live detach case. According to the
    libvirt developers, detaching from the persistent domain always
    succeeds and is a synchronous process. Detaching from the live
    domain can be either synchronous or asynchronous, depending on the guest
    OS and the load on the hypervisor. But for a live detach libvirt always
    sends an event [1] that nova can wait for.

    So this patch does two things.

    1) Separates the detach from the persistent domain from the detach from
       the live domain to make the error cases clearer.

    2) Changes the retry mechanism.

       Detaching from the persistent domain is not retried. If libvirt
       reports that the device was not found while both a persistent and
       a live detach are needed, the error is ignored and the process
       continues with the live detach. In any other case the error is
       considered fatal.

       Detaching from the live domain now always waits for the
       libvirt event. In case of a timeout, the live detach is retried.
       But a failure event from libvirt is considered fatal, based on
       information from the libvirt developers, so in that case the
       detach is not retried.

    Related-Bug: #1882521

    [1] https://libvirt.org/html/libvirt-libvirt-domain.html#virConnectDomainEventDeviceRemovedCallback

    Change-Id: I7f2b6330decb92e2838aa7cee47fb228f00f47da
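
The control flow described in the commit message can be sketched roughly as follows. This is an illustration only, not the code from the patch: every name here (DetachWaiter, detach_device, the two exception classes, the callables passed in) is hypothetical, and the timeout and attempt count are placeholder values. It assumes the persistent detach is a synchronous call and that the live detach only sends the request, with the result delivered later through an event.

```python
import threading


class DeviceNotFound(Exception):
    """Hypothetical error: the device is absent from the domain config."""


class DeviceDetachFailed(Exception):
    """Hypothetical error: the detach failed and is not retried further."""


class DetachWaiter:
    """Holds the outcome of one live detach request (hypothetical helper)."""

    def __init__(self):
        self._done = threading.Event()
        self.success = None

    def notify(self, success):
        # Called by the event handling code once the result is known.
        self.success = success
        self._done.set()

    def wait(self, timeout):
        # Returns True if an event arrived within the timeout.
        return self._done.wait(timeout)


def detach_device(detach_persistent, detach_live, persistent, live,
                  timeout=20.0, max_attempts=8):
    """Detach from the persistent config first, then from the live domain.

    detach_persistent() is assumed to be synchronous.
    detach_live(waiter) is assumed to send the request and return at once;
    the result arrives later through waiter.notify().
    """
    if persistent:
        try:
            detach_persistent()  # a single attempt, never retried
        except DeviceNotFound:
            if not live:
                raise  # nothing else to do, so the error is fatal
            # Ignored: carry on with the live detach.
    if not live:
        return
    for _ in range(max_attempts):
        waiter = DetachWaiter()
        detach_live(waiter)
        if not waiter.wait(timeout):
            continue  # timeout: send the live detach again
        if waiter.success:
            return
        # A failure event is fatal, so no retry here.
        raise DeviceDetachFailed("a failed detach was reported")
    raise DeviceDetachFailed("no event after %d attempts" % max_attempts)
```

The point of the split is that each failure now has exactly one meaning: an exception from the persistent step concerns only the persistent config, and a retry of the live step happens only when no event arrived at all.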
