race betwen ip allocations and dhcp agent notifications
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron |
Expired
|
Medium
|
Unassigned |
Bug Description
I came across a race condition while doing some performance work. The symptom is messages in syslog like this:
Mar 5 17:26:06 ctrl dnsmasq-
Sure enough, there were duplicate entries in dnsmasq's host file for the same IP address. In this example, the MACs fa:16:3e:6d:15:03 and fa:16:3e:06:82:26 were both assigned 10.0.0.18.
I instrumented neutron-dhcp-agent to check the ports it emits into the host file for duplicate IP addresses:
2014-03-05 17:26:02.519 32309 ERROR neutron.
I took a look in neutron-server.log for these port IDs and MAC addresses:
2014-03-05 17:25:52.623 21630 DEBUG neutron.
2014-03-05 17:25:52.629 21630 DEBUG neutron.
2014-03-05 17:26:02.062 21718 DEBUG neutron.
2014-03-05 17:26:02.071 21681 DEBUG neutron.
2014-03-05 17:26:02.082 21718 DEBUG neutron.
2014-03-05 17:26:17.267 21676 DEBUG neutron.
Correctly, 10.0.0.18 was allocated, then deleted, then allocated, then deleted.
So why did neutron-dhcp-agent think 10.0.0.18 was allocated to two ports? The answer, I believe, is a race between the port allocation & de-allocation notifications that neutron-server sends to neutron-dhcp-agent and how neutron-dhcp-agent incrementally updates its cache of network state. The following schedule of events illustrates the problem:
1. neutron-server: allocates 10.0.0.18 for MAC 1
2. neutron-server: emits event "port_create_end for (10.0.0.18, MAC1)"
3. neutron-dhcp-agent: receives event "port_create_end for (10.0.0.18, MAC1)" and adds (10.0.0.18, MAC1) to host file
4. neutron-server: deallocates 10.0.0.18 for MAC1
5. neutron-server: allocates 10.0.0.18 for MAC2
6. neutron-server: emits event "port_create_end for (10.0.0.18, MAC2)"
7. neutron-server: emits event "port_delete_end for (10.0.0.18, MAC1)"
8. neutron-dhcp-agent: receives event "port_create_end for (10.0.0.18, MAC2)" and adds (10.0.0.18, MAC2) to host file
... uh oh, 10.0.0.18 is in the host file twice ...
9. neutron-dhcp-agent: receives event "port_delete_end for (10.0.0.18, MAC1)" and removes (10.0.0.18, MAC1) from the host file
...
The key thing to observe is that 8 happens before 9 (i.e., neutron-dhcp-agent finds out about the new port before it finds out about the old port being deleted). The database transaction semantics only ensured that (1, 4, 5) happened in order. There's no guarantee about the relative order of event emission (6 and 7) with respect to transaction commits because the events are emitted *after* the database transactions are committed.
The consequence of these duplicate host entries is that dnsmasq doesn't give MAC2 its IP address until after MAC1 is removed from the file. Given an inopportune timing for both MAC addresses to appear in the file and typical dhclient retry intervals of 30s, this race can delay the VM with MAC2 from getting its IP address for a substantial amount of time.
To fix the bug, I think neutron-dhcp-agent should be changed to accommodate the racy notifications. Whenever neutron-dhcp-agent is notified that an (IP=X, MAC=Y) pair is added to a network, neutron-dhcp-agent proactively drop existing ports from its cache and host entries with the ip address X instead of waiting for the pending notification.
tags: | added: neutron-core |
tags: | added: l3-ipam-dhcp |
Changed in neutron: | |
importance: | Undecided → Medium |
Changed in neutron: | |
assignee: | nobody → Veena (mveenasl) |
Changed in neutron: | |
assignee: | Veena (mveenasl) → nobody |
I was able to emulate the race by commenting port_delete_end handler in dhcp agent, saw messages in syslog about duplicate IP addresses, but no error messages in dhcp agent log. So no exceptions were raised in dhcp agent. Ideally we could catch such exceptions and refresh network cache entry in the agent. Checking for duplicate ips each time new port is added seems an overhead. Will look more into it.
Also wondering was it the latest neutron master (by the time the bug was reported) where error log appeared in dhcp agent?