[DHCP] Race condition during port processing events in DHCP agent

Bug #1913723 reported by Rodolfo Alonso
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Rodolfo Alonso

Bug Description

In the DHCP agent, network, subnet and port events are stored in a "ResourceProcessingQueue". When "neutron.agent.dhcp.agent._SYNC_STATE_LOCK" is free, the events are processes according to (1) a priority table and (2) the timestamp.

Since [1], the Neutron server sends a high priority cast to one random DHCP agent to speed up the port creation process. That means this event, regardless of the timestamp, will be processed first. That improves the VM creation process.

This bug exposes a problem detected in case that (1) only one single DHCP agent is attending to those events and (2) the queue processing is not fast enough to handle the events when arrive.

For example (seen in a customer deployment with thousands of ports assigned to the same DHCP agent):
- A port is created and processed. The port ID is stored in the DHCP agent cache.
- The port is deleted and the event arrives to the DHCP agent (*but* the event is not processed).
- Another port with the same IP address is created. The Neutron server will allow this process because the IP address is not used anymore. This port creation event arrives to the DHCP agent.
- The event queue is processed. At this point, this queue has both events: one deleting a port with IP address x.x.x.x and another event creating a port with the same IP address. Because the port creation event is processed before, a resync process occurs [2].

There is a very easy way to reproduce this error with a small trick: to add a time.sleep(10) at the beginning of [3]. Then we need to finish this processing thread sending a trivial operation, e.g. creating another port. Then process loop [4] will spawn a new thread that will stop in the sleep command; at this point we need to send both events, the deletion and the creation:
- Add the time.sleep(10) in the resource processing queue and restart the DHCP agent.
- Add a port with an IP. The DHCP agent will process this event correctly.

  $ openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`

- This command should be done at the same time:
a) Create another port to consume the waiting thread.
b) Wait a small amount of time to let the thread finish and allow the process_loop to start another one. This new thread will stop in the sleep command.
c) Delete the port.
d) Create a port with the same IP --> the thread will have both events in the queue. As commented, because the queue processes first the creation event, that will trigger the unwanted resync:

  $ openstack port create --network private port_trivial; sleep 3; openstack port delete $port; openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`

Log: http://paste.openstack.org/show/802126/

[1]https://review.opendev.org/c/openstack/neutron/+/626830
[2]https://github.com/openstack/neutron/blob/0fb63f729751d7e62ac107d3275d440632936978/neutron/agent/dhcp/agent.py#L664-L679
[3]https://github.com/openstack/neutron/blob/a9fc746249cd34cb7cc594c0b4d74d8ddf65bd46/neutron/agent/common/resource_processing_queue.py#L171
[4]https://github.com/openstack/neutron/blob/2bf558f6ec1dab70a81ba988f78d587433f43539/neutron/agent/dhcp/agent.py#L567

Changed in neutron:
importance: Undecided → Medium
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
description: updated
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.3.3

This issue was fixed in the openstack/neutron 15.3.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.3.1

This issue was fixed in the openstack/neutron 16.3.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.1.1

This issue was fixed in the openstack/neutron 17.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.0.0.0rc1

This issue was fixed in the openstack/neutron 18.0.0.0rc1 release candidate.

Changed in neutron:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.