Bug #1732456 “dhcp agent race condition between network_... : Bugs : neutron

Allain Legacy (alegacy) on 2017-11-15

description:	updated

Brian Haley (brian-haley) wrote on 2017-11-17:

#1

Hi - I've looked at the patch and think it's probably close; it would just need to account for the case where there is no network_id in the message.

Can you send this out for review so we can gather more feedback?

Changed in neutron:
status:	New → Confirmed
importance:	Undecided → Medium

Allain Legacy (alegacy) wrote on 2017-11-17:

#2

Sure, I can try to rework it a bit. The patch as attached is like that because in our system deployment we are guaranteed that the neutron-server node is always at the same version as (or newer than) the agent, so the network_id is always present. I realize that is not the case for other users/installations.

Allain Legacy (alegacy) on 2017-12-01

Changed in neutron:
assignee:	nobody → Allain Legacy (alegacy)
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2017-12-01: Fix proposed to neutron (master)

#3

Fix proposed to branch: master
Review: https://review.openstack.org/524711

OpenStack Infra (hudson-openstack) on 2018-05-09

Changed in neutron:
assignee:	Allain Legacy (alegacy) → Brian Haley (brian-haley)

OpenStack Infra (hudson-openstack) wrote on 2018-05-11: Fix merged to neutron (master)

#4

Reviewed: https://review.openstack.org/524711
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fa78b580105111b1238e5b18d415b08e8fb35d97
Submitter: Zuul
Branch: master

commit fa78b580105111b1238e5b18d415b08e8fb35d97
Author: Allain Legacy <email address hidden>
Date: Fri Nov 10 10:44:50 2017 -0600

dhcp: serializing port delete and network rpc calls

The port delete events are not synchronized with network rpc events. This
creates a condition which makes it possible for a port delete event to be
processed just before a previously started network query completes.

The problematic order of operations is as follows:

1) a network is scheduled to an agent; a network rpc is sent to the
   agent

2) the agent queries the network data from the server

3) while that query is in progress a port on that network is deleted; a
   port rpc is sent to the agent

4) that port delete rpc is received before the network query rpc
   completes

5) the port delete results in no action because the port was not present
   on the agent

6) the network query finishes and adds the port to the cache (even
   though the port has already been deleted)

7) some time passes and a new port is configured with the same IP
   address as the port that was deleted in (3)

8) the dhcp host file is corrupted with 2 entries for the same IP
   address.

9) dhcp queries for the newest port are rejected because of the duplicate
   entry in the dhcp host file.

The solution is to add the network_id to the port_delete_end rpc event
so that the _net_lock(network_id) synchronization point can be acquired
and the event is processed serially with other network related events.

To ensure backwards compatibility with newer agents running against older
servers, the network_id value to use in the lock is determined by a
utility that falls back to the previous mode of operation whenever the
network_id attribute is not present in the *_delete_end RPC events. That
utility can be removed in the future when it is guaranteed that the
network_id attribute will be present in RPC messages from the server.

Closes-Bug: #1732456

Change-Id: I735f8b1c9248b12e5feb6cbe970cf67f321e6ebc
Signed-off-by: Allain Legacy <email address hidden>
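
For readers following the thread, here is a minimal sketch of the idea described in the commit message above. It is not the merged patch: net_lock, PortCache, lock_id_for, sync_network, port_delete_end and run_demo are hypothetical names built on the standard library only, standing in for the agent's _net_lock(network_id) and its cache. It assumes a port delete event carries a port_id and, on newer servers, a network_id.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical stand-in for the agent's _net_lock(network_id): one lock per key.
_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


@contextmanager
def net_lock(network_id):
    """Hold the lock for one network while an event is processed."""
    with _locks_guard:  # protect the dict while the lock is looked up or created
        lock = _locks[network_id]
    with lock:
        yield


class PortCache:
    """Toy agent cache mapping port_id -> network_id (illustrative only)."""

    def __init__(self):
        self._network_by_port = {}

    def add_port(self, port_id, network_id):
        self._network_by_port[port_id] = network_id

    def remove_port(self, port_id):
        return self._network_by_port.pop(port_id, None)

    def network_of(self, port_id):
        return self._network_by_port.get(port_id)


def lock_id_for(payload, cache):
    """Choose the lock key for a delete event."""
    if "network_id" in payload:
        return payload["network_id"]  # newer server: key is sent in the event
    # Older server: fall back to what the agent has cached (may be None).
    return cache.network_of(payload["port_id"])


def sync_network(network_id, fetch_ports, cache):
    """Network event: query the server and cache the result under the lock."""
    with net_lock(network_id):
        for port_id in fetch_ports():
            cache.add_port(port_id, network_id)


def port_delete_end(payload, cache):
    """Port delete event: waits for any in-flight sync holding the same key."""
    with net_lock(lock_id_for(payload, cache)):
        cache.remove_port(payload["port_id"])


def run_demo(payload):
    """Replay steps 2-6: a delete arrives while the network query is in flight."""
    cache = PortCache()
    query_started = threading.Event()
    let_query_finish = threading.Event()

    def slow_fetch():
        query_started.set()
        let_query_finish.wait(timeout=5)
        return ["p1"]  # stale answer: still lists the port being deleted

    sync = threading.Thread(target=sync_network, args=("net-1", slow_fetch, cache))
    sync.start()
    query_started.wait(timeout=5)  # the sync now holds the net-1 lock
    delete = threading.Thread(target=port_delete_end, args=(payload, cache))
    delete.start()
    let_query_finish.set()
    sync.join()
    delete.join()
    return cache.network_of("p1")  # None means no stale port was left behind
```

Calling run_demo({"port_id": "p1", "network_id": "net-1"}) makes the delete wait on the same lock as the in-flight query, so the stale port is removed after the query adds it. When the payload has no network_id, the sketch keys the lock on whatever the cache knows at that moment, which may be nothing; in that case the two events are not serialized and the original race can still occur, which matches the "previous mode of operation" the commit message describes for older servers.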

Changed in neutron:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-06-07: Fix included in openstack/neutron 13.0.0.0b2

#5

This issue was fixed in the openstack/neutron 13.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote on 2018-09-27: Fix proposed to neutron (stable/pike)

#6

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/605562

OpenStack Infra (hudson-openstack) wrote on 2018-09-27: Fix proposed to neutron (stable/queens)

#7

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/605563

OpenStack Infra (hudson-openstack) wrote on 2018-10-29: Fix merged to neutron (stable/queens)

#8

Reviewed: https://review.openstack.org/605563
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4ca8baf6966657eff474bf88e1bd8c296e630a1e
Submitter: Zuul
Branch: stable/queens

commit 4ca8baf6966657eff474bf88e1bd8c296e630a1e
Author: Allain Legacy <email address hidden>
Date: Fri Nov 10 10:44:50 2017 -0600