Comment 4 for bug 1869244

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/714783
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e37722c0f5f0b746135200db6f654674dc0f6f12
Submitter: Zuul
Branch: master

commit e37722c0f5f0b746135200db6f654674dc0f6f12
Author: Nate Johnston <email address hidden>
Date: Tue Mar 24 18:05:16 2020 -0400

    Wait before deleting trunk bridges for DPDK vhu

    In DPDK vhostuser mode (DPDK/vhu), powering off an instance deletes
    its port and powering it on creates a port. A reboot is therefore
    functionally a very fast delete-then-create. Neutron trunking mode in
    combination with DPDK/vhu implements a trunk bridge for each tenant,
    and the ports for the instances are created as subports of that
    bridge. Normally, when all the subports of a trunk bridge are deleted,
    a thread is spawned to delete the trunk bridge, because that is an
    expensive and time-consuming operation. So if the port in question is
    the only port on the trunk on that compute node, this happens:

    1. The port is deleted
    2. A thread is spawned to delete the trunk
    3. The port is recreated

    If the trunk is deleted after step 3 completes, the instance has no
    networking and is inaccessible; a previous change [1] dealt with that
    scenario. But errors such as "RowNotFound: Cannot find Bridge with
    name=tbr-XXXXXXXX-X" continue to occur. In this case the trunk is
    deleted while step 3 is still executing: the bridge stops existing
    partway through the port creation logic, before the port is actually
    recreated.

    Since this is a timing issue between two different threads, it is
    difficult to eliminate entirely, but I think the best approach is to
    add a slight delay of a second or two to the trunk deletion thread.
    That gives the port time to come back online, so the trunk deletion
    is avoided entirely.

    [1] https://review.opendev.org/623275

    Related-Bug: #1869244
    Change-Id: I36a98fe5da85da1f3a0315dd1a470f062de6f38b
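
The delay described in the commit message can be illustrated with a small, self-contained sketch. This is not the Neutron code from the change: `FakeTrunkBridge`, `delete_bridge_after_delay`, and the bridge and port names are hypothetical, and the re-check for ports after the wait is an assumption about how a delayed deletion would skip a bridge whose port has come back.

```python
import time
from typing import Callable, Set


class FakeTrunkBridge:
    """In-memory stand-in for a trunk bridge (hypothetical, not a Neutron class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ports: Set[str] = set()
        self.exists = True

    def add_port(self, port: str) -> None:
        if not self.exists:
            # Loosely mirrors the "Cannot find Bridge" failure in the commit message.
            raise LookupError(f"Cannot find Bridge with name={self.name}")
        self.ports.add(port)

    def delete_port(self, port: str) -> None:
        self.ports.discard(port)

    def destroy(self) -> None:
        self.exists = False


def delete_bridge_after_delay(
    bridge: FakeTrunkBridge,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait, then destroy the bridge only if it still has no ports.

    Returns True if the bridge was destroyed, False if deletion was skipped.
    The sleep function is injectable so the behaviour can be exercised
    without real waiting.
    """
    sleep(delay_seconds)
    if bridge.ports:
        # A port came back during the wait (e.g. an instance reboot).
        return False
    bridge.destroy()
    return True
```

In this sketch, a port that is re-added during the wait causes the deletion to be skipped, while a bridge that stays empty is destroyed as before. The delay only narrows the window for the race; as the commit message notes, a timing issue between two threads is hard to remove entirely this way.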