concurrent interface detach/attach may accidently unbind port

Bug #1892870 reported by Alexandre arents
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Undecided
Alexandre arents
Ussuri
Undecided
Unassigned

Bug Description

Description
===========

When we run multiple concurrent detach interface on the same instance uuid/port-id,
a lot of those requests are accepted (HTTP 202) an processed because info_cache is updated
only when first request finish, so all request after the first one will fail BUT will send
an unbind that may unbind a port that have been just legitly attached to another instance.

Ex of use case: a kube operator that move all interface from one instance to another(rollout).

Steps to reproduce
==================

On devstack master:

INST=$(openstack server create --flavor m1.small --image cirros-0.4.0-x86_64-disk --nic net-id=private inst1 | awk '/ id/{print $4}')
HOST_IP=$(hostname -I | awk '{print $1}')
PORT=$(openstack port create --network private myport | awk '/ id/{print $4}')
openstack server add port $INST $PORT
TOKEN=$(openstack token issue|awk '/ id/{print $4}')

# create detach script with this content:
echo 'curl -g -i -X DELETE http://${HOST_IP}/compute/v2.1/servers/${INST}/os-interface/${PORT} -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: ${TOKEN}" -H "X-OpenStack-Nova-API-Version: 2.1"' > detach

# open in another terminal neutron logs filtering on PUT:
sudo journalctl -u <email address hidden> -f | grep -E PUT

# Now send plenty of detach:
for i in $(seq 1 20) ; do . detach ; sleep 0.2 ; done
HTTP/1.1 202 Accepted
...
HTTP/1.1 202 Accepted
...

In neutron logs you will see plenty of unbinding request:
 sudo journalctl -u <email address hidden> -f | grep -E PUT
Aug 25 12:15:01 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-9b454f98-a12c-49fb-8c39-e70094912029 req-914eaa3d-c139-416d-85b2-6d20e3e3c496 service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1098 time: 1.1390624

    ^^^ After the first PUT above, port is free to be attach.. if the app attach here the port risk to be accidently unbinded by other unbind PUT below:

Aug 25 12:15:01 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-a0f2c9ab-419c-46e4-9488-684c7e173241 req-b4d96937-27ee-4c3f-b114-f02e86979605 service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1098 time: 1.2870913
Aug 25 12:15:02 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-91a3b2e6-bf4f-47b5-b05b-707ac48cb1e4 req-2a4c305e-ece6-4c4a-a7af-4c3371e12ea7 service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1098 time: 1.7216117
Aug 25 12:15:02 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-e298313f-6b30-4815-ae52-4716e0df3e0f req-aed7f7f4-9dca-41a7-84b5-1e59ef6c50e9 service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1098 time: 1.4968698
Aug 25 12:15:03 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-a7bc3121-13b4-484c-9f2f-c9b3fd4b225b req-424edb50-c998-4fa3-b26f-8d0d01ea474e service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1099 time: 2.5676210
Aug 25 12:15:04 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-62ed888a-6f35-425e-a872-e4a5758b8a92 req-7ce9ce3e-503d-4845-8663-c1b6d02b89ca service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1099 time: 3.3949170
Aug 25 12:15:05 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-4a10d91b-7431-4255-b27d-a72343d2482e req-5614949b-b1e7-4862-ad95-f06c4ba7936e service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1099 time: 2.9760244
Aug 25 12:15:05 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-682d7521-af6b-4c9a-b90e-9a8ea6ed45bc req-76e2d9ec-f8a8-4882-b4d6-8ea3f1c7f818 service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1099 time: 4.5097113
Aug 25 12:15:06 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-ca379b27-7754-4125-8318-4e2ff7e97227 req-0e9ba555-6572-4388-a081-3b69cc92c2ec service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1099 time: 5.1839015
Aug 25 12:15:06 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-f6881052-baf4-4fb7-9767-739563331058 req-7a3194ff-0fad-4917-8d34-851a850cbae2 service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1099 time: 4.4715307
Aug 25 12:15:07 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-1e7106c5-6dfa-4d36-a889-35198479add8 req-6c513399-c0f5-4d0b-9315-e2e30f1e6370 service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1099 time: 5.4051881
Aug 25 12:15:10 alex-devstack neutron-server[1051]: INFO neutron.wsgi [req-909107a5-f03a-4ac4-8312-dcef83210908 req-38a6ef2b-b9f1-44cf-a823-ca67159115a4 service neutron] 54.39.183.136 "PUT /v2.0/ports/0466658a-02ef-46da-a4ac-9a335c6d3352 HTTP/1.1" status: 200 len: 1099 time: 8.5914369

if an attach pass, and is accidently unbind as previously explain,
apps can detect port is free and try to reattach an interface that is already
attached in libvirt and you GOT HTTP 500 with those error message on compute node:
oslo_db.exception.DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, "Duplicate entry 'fa:16:3e:d5:26:e1/0466658a-02ef-46da-a4ac-9a335c6d3352-0' for key 'uniq_virtual_interfaces0address
0deleted'").

Expected result
===============
Only one unbind should be sent.

Actual result
=============
Multiple unbind are sent and one of them can break a port just attached

summary: - concurent interface detach/attach may accidently unbind port
+ concurrent interface detach/attach may accidently unbind port
tags: added: api race-condition
Changed in nova:
assignee: nobody → Alexandre arents (aarents)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/747957

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/747957
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=39831c55999abe97e3bd26a1c0db2e4ceb467041
Submitter: Zuul
Branch: master

commit 39831c55999abe97e3bd26a1c0db2e4ceb467041
Author: Alexandre Arents <email address hidden>
Date: Mon Aug 24 14:05:27 2020 +0000

    Add a lock to prevent race during detach/attach of interface

    Only allow one detach/attach at a time with the same pattern instance-port_id
    in order to avoid race condition when multiple detach/attach are run
    concurrently.

    When multiple detach run concurrently on a specific instance-port_id,
    manager consider many of them as valid because info_cache still contains
    the port and info_cache is refreshed only once the first request complete.

    So during this gap of time, while the first request accomplishes the task,
    all subsequent requests are destined to fail and log a warning [1] in
    different location of code, depending on the outcome of the first request.
    The Issue is that all those caught requests finally run a
    deallocate_port_for_instance which will unbind the port.

    This may cause a race condition, because a successful attach can pass between
    those unbind, and be silently unbound, resulting in an infrastructure/DB
    inconsistency.

    [1] 'Detaching interface %(mac)s failed because the device is no longer found
         on the guest.'

    Closes-Bug: #1892870
    Change-Id: Iea5969d0bd16dc9a6f1ba950224b0115e466ce66

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/749033

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/749033
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1825fa997d93fce2bb397409bc954c810a8243a7
Submitter: Zuul
Branch: stable/ussuri

commit 1825fa997d93fce2bb397409bc954c810a8243a7
Author: Alexandre Arents <email address hidden>
Date: Mon Aug 24 14:05:27 2020 +0000

    Add a lock to prevent race during detach/attach of interface

    Only allow one detach/attach at a time with the same pattern instance-port_id
    in order to avoid race condition when multiple detach/attach are run
    concurrently.

    When multiple detach run concurrently on a specific instance-port_id,
    manager consider many of them as valid because info_cache still contains
    the port and info_cache is refreshed only once the first request complete.

    So during this gap of time, while the first request accomplishes the task,
    all subsequent requests are destined to fail and log a warning [1] in
    different location of code, depending on the outcome of the first request.
    The Issue is that all those caught requests finally run a
    deallocate_port_for_instance which will unbind the port.

    This may cause a race condition, because a successful attach can pass between
    those unbind, and be silently unbound, resulting in an infrastructure/DB
    inconsistency.

    [1] 'Detaching interface %(mac)s failed because the device is no longer found
         on the guest.'

    Closes-Bug: #1892870
    Change-Id: Iea5969d0bd16dc9a6f1ba950224b0115e466ce66
    (cherry picked from commit 39831c55999abe97e3bd26a1c0db2e4ceb467041)

tags: added: in-stable-ussuri
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers