dhcp agent race condition between network_create_end and port_delete_end RPC
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron | Fix Released | Medium | Brian Haley | | |
Bug Description
A race condition exists in the DHCP agent between the network_create_end and port_delete_end RPC callbacks. If a DHCP agent is fetching network info [1] at the same time as a port is being deleted [2], the port delete RPC can arrive after the network info RPC has completed on the server but before the agent has processed the result. The agent then holds a port in its network cache that has already been deleted. That stale port can later produce duplicate entries in the dnsmasq host file for one IP address (with two different MAC addresses).
Because this is a timing issue, I am not able to create a standalone/isolated test case to show this behavior. In one of our QA test labs this happens ~20% of the time when a VM is deleted at roughly the same time that a DHCP server is moved from one node to another.
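The ordering described above can be replayed step by step in a toy model. This is only a sketch to make the interleaving concrete: `FakeServer`, `FakeAgent`, and their methods are hypothetical stand-ins, not neutron classes, and the model does not reproduce the bug against a real deployment.

```python
class FakeServer:
    """Hypothetical stand-in for the neutron server side."""

    def __init__(self):
        # network_id -> {port_id: (mac, ip)}
        self.ports = {"net-1": {"port-a": ("fa:16:3e:00:00:01", "10.0.0.5")}}

    def get_network_info(self, network_id):
        # An RPC reply is a snapshot; later server changes do not alter it.
        return dict(self.ports[network_id])

    def delete_port(self, port_id):
        for ports in self.ports.values():
            ports.pop(port_id, None)


class FakeAgent:
    """Hypothetical stand-in for the DHCP agent's network cache."""

    def __init__(self):
        self.cache = {}  # network_id -> {port_id: (mac, ip)}

    def port_delete_end(self, port_id):
        # No network_id in the payload, so every cached network is searched.
        for ports in self.cache.values():
            ports.pop(port_id, None)

    def finish_network_create(self, network_id, snapshot):
        self.cache[network_id] = snapshot


def replay_race():
    server, agent = FakeServer(), FakeAgent()
    snapshot = server.get_network_info("net-1")  # 1. server builds the reply
    server.delete_port("port-a")                 # 2. port is deleted on the server
    agent.port_delete_end("port-a")              # 3. delete arrives; cache is still empty
    agent.finish_network_create("net-1", snapshot)  # 4. stale reply is processed
    return server, agent
```

After `replay_race()` the server no longer knows `port-a`, but the agent's cache still does. Step 3 is a no-op because the network has not been cached yet.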
This is happening because there is no synchronization between the port_delete_end and the network_create_end RPC event handlers. Since the port_delete_end RPC does not have any network_id information there is no way to synchronize the two operations in the agent. In our system we have addressed this by changing the *_delete_end RPC notifications coming from [3] to also include the network_id and then changing [2] to acquire _net_lock(
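A minimal sketch of the synchronization idea, assuming the delete notification carries a `network_id`. The class `NetworkCache`, the helper `lock_for`, and the payload keys are hypothetical names chosen for illustration and are not the agent's actual code; plain `threading` is used here only to keep the example self-contained.

```python
import threading
from collections import defaultdict


class NetworkCache:
    """Toy cache that serializes handlers per network (illustrative only)."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()
        self.ports = {}  # network_id -> {port_id: data}

    def lock_for(self, network_id):
        # Guard the dict so two threads never create two locks for one network.
        with self._locks_guard:
            return self._locks[network_id]

    def network_create_end(self, network_id, fetch_info):
        # Hold the lock across the fetch AND the cache update, so a delete
        # for this network cannot slip in between the two.
        with self.lock_for(network_id):
            self.ports[network_id] = dict(fetch_info(network_id))

    def port_delete_end(self, payload):
        # Assumes the notification includes network_id (the proposed change).
        network_id = payload["network_id"]
        with self.lock_for(network_id):
            self.ports.get(network_id, {}).pop(payload["port_id"], None)
```

With this arrangement a delete that arrives while the network is being fetched waits for the lock, then removes the port from the freshly populated cache instead of being lost.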
This is an example of the series of logs generated by dnsmasq when a duplicate entry is added because of the stale port described above. dnsmasq then refuses to serve the IP address because of the duplicate, and the VM never gets an IP address.
2017-11-
2017-11-
2017-11-
2017-11-
2017-11-
2017-11-
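The duplicate condition itself (one IP address bound to two different MAC addresses) is easy to check for once the host entries are available as pairs. The sketch below assumes the entries have already been parsed into `(mac, ip)` tuples; the function name `find_duplicate_ips` and that input shape are assumptions for illustration, not part of neutron or dnsmasq.

```python
from collections import defaultdict


def find_duplicate_ips(entries):
    """Return {ip: sorted MACs} for every IP bound to more than one MAC.

    `entries` is an iterable of (mac, ip) tuples; this shape is assumed here.
    """
    macs_by_ip = defaultdict(set)
    for mac, ip in entries:
        # Normalize case so the same MAC written two ways is not a duplicate.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

A check like this could be run against the agent's cached ports as a diagnostic when a VM fails to obtain an address.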
[1] neutron.
[2] neutron.
[3] neutron.
[4] http://
description: | updated |
Changed in neutron: | |
assignee: | nobody → Allain Legacy (alegacy) |
status: | Confirmed → In Progress |
Changed in neutron: | |
assignee: | Allain Legacy (alegacy) → Brian Haley (brian-haley) |
Hi - I've looked at the patch and think it's probably close; it would just need to account for the case where there is no network_id in the message.
Can you send this out for review so we can gather more feedback?