race betwen ip allocations and dhcp agent notifications

Bug #1288493 reported by Peter Feiner
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
neutron
Expired
Medium
Unassigned

Bug Description

I came across a race condition while doing some performance work. The symptom is messages in syslog like this:

Mar 5 17:26:06 ctrl dnsmasq-dhcp[32356]: duplicate IP address 10.0.0.18 in .

Sure enough, there were duplicate entries in dnsmasq's host file for the same IP address. In this example, the MACs fa:16:3e:6d:15:03 and fa:16:3e:06:82:26 were both assigned 10.0.0.18.

I instrumented neutron-dhcp-agent to check the ports it emits into the host file for duplicate IP addresses:

2014-03-05 17:26:02.519 32309 ERROR neutron.agent.dhcp_agent [req-8f5d9c5d-4c20-4c75-95f5-f5f27e971a87 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] duplicates for ip 10.0.0.18: [(u'56f01383-92e5-43ff-948a-8549ce247b6c', u'fa:16:3e:6d:15:03'), (u'086b514f-c843-4f25-bad8-24f65c4932fa', u'fa:16:3e:06:82:26')]

I took a look in neutron-server.log for these port IDs and MAC addresses:

2014-03-05 17:25:52.623 21630 DEBUG neutron.db.db_base_plugin_v2 [req-6ff63c51-d3fc-4a59-bdc8-1f90f92a16f4 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Generated mac for network 11df69c5-89eb-4ba6-b99e-69087237f4dc is fa:16:3e:6d:15:03 _generate_mac /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:302
2014-03-05 17:25:52.629 21630 DEBUG neutron.db.db_base_plugin_v2 [req-6ff63c51-d3fc-4a59-bdc8-1f90f92a16f4 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Allocated IP - 10.0.0.18 from 10.0.0.18 to 10.0.255.254 _generate_ip /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:472
2014-03-05 17:26:02.062 21718 DEBUG neutron.db.db_base_plugin_v2 [req-f8a8fdf7-0f29-451e-b6dc-b174ead51b3a 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Generated mac for network 11df69c5-89eb-4ba6-b99e-69087237f4dc is fa:16:3e:06:82:26 _generate_mac /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:302
2014-03-05 17:26:02.071 21681 DEBUG neutron.db.db_base_plugin_v2 [req-5bdd9f64-2ef8-4733-8fe1-1ae05e69c34a 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84] Delete allocated IP 10.0.0.18 (11df69c5-89eb-4ba6-b99e-69087237f4dc/887baebc-9208-42e8-8243-9b8cea38d176) _delete_ip_allocation /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:444
2014-03-05 17:26:02.082 21718 DEBUG neutron.db.db_base_plugin_v2 [req-f8a8fdf7-0f29-451e-b6dc-b174ead51b3a 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Allocated IP - 10.0.0.18 from 10.0.0.18 to 10.0.0.18 _generate_ip /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:472
2014-03-05 17:26:17.267 21676 DEBUG neutron.db.db_base_plugin_v2 [req-7225f965-5e1f-4123-a28c-a688b3c40cce 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84] Delete allocated IP 10.0.0.18 (11df69c5-89eb-4ba6-b99e-69087237f4dc/887baebc-9208-42e8-8243-9b8cea38d176) _delete_ip_allocation /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:444

Correctly, 10.0.0.18 was allocated, then deleted, then allocated, then deleted.

So why did neutron-dhcp-agent think 10.0.0.18 was allocated to two ports? The answer, I believe, is a race between the port allocation & de-allocation notifications that neutron-server sends to neutron-dhcp-agent and how neutron-dhcp-agent incrementally updates its cache of network state. The following schedule of events illustrates the problem:

1. neutron-server: allocates 10.0.0.18 for MAC 1
2. neutron-server: emits event "port_create_end for (10.0.0.18, MAC1)"
3. neutron-dhcp-agent: receives event "port_create_end for (10.0.0.18, MAC1)" and adds (10.0.0.18, MAC1) to host file
4. neutron-server: deallocates 10.0.0.18 for MAC1
5. neutron-server: allocates 10.0.0.18 for MAC2
6. neutron-server: emits event "port_create_end for (10.0.0.18, MAC2)"
7. neutron-server: emits event "port_delete_end for (10.0.0.18, MAC1)"
8. neutron-dhcp-agent: receives event "port_create_end for (10.0.0.18, MAC2)" and adds (10.0.0.18, MAC2) to host file
... uh oh, 10.0.0.18 is in the host file twice ...
9. neutron-dhcp-agent: receives event "port_delete_end for (10.0.0.18, MAC1)" and removes (10.0.0.18, MAC1) from the host file
...

The key thing to observe is that 8 happens before 9 (i.e., neutron-dhcp-agent finds out about the new port before it finds out about the old port being deleted). The database transaction semantics only ensured that (1, 4, 5) happened in order. There's no guarantee about the relative order of event emission (6 and 7) with respect to transaction commits because the events are emitted *after* the database transactions are committed.

The consequence of these duplicate host entries is that dnsmasq doesn't give MAC2 its IP address until after MAC1 is removed from the file. Given an inopportune timing for both MAC addresses to appear in the file and typical dhclient retry intervals of 30s, this race can delay the VM with MAC2 from getting its IP address for a substantial amount of time.

To fix the bug, I think neutron-dhcp-agent should be changed to accommodate the racy notifications. Whenever neutron-dhcp-agent is notified that an (IP=X, MAC=Y) pair is added to a network, neutron-dhcp-agent proactively drop existing ports from its cache and host entries with the ip address X instead of waiting for the pending notification.

Bo Lin (linb)
tags: added: neutron-core
tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Medium
Veena (mveenasl)
Changed in neutron:
assignee: nobody → Veena (mveenasl)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

I was able to emulate the race by commenting port_delete_end handler in dhcp agent, saw messages in syslog about duplicate IP addresses, but no error messages in dhcp agent log. So no exceptions were raised in dhcp agent. Ideally we could catch such exceptions and refresh network cache entry in the agent. Checking for duplicate ips each time new port is added seems an overhead. Will look more into it.

Also wondering was it the latest neutron master (by the time the bug was reported) where error log appeared in dhcp agent?

Jian Wen (wenjianhn)
Changed in neutron:
assignee: Veena (mveenasl) → nobody
Revision history for this message
Jian Wen (wenjianhn) wrote :

Related bugs:
   bug 1295715
      [mos] dnsmasq (for neutron-dhcp-agent) is sometimes configured with duplicate leases
   bug 1189909
      dhcp-agent does always provide IP address for instances with re-cycled IP addresses.

Revision history for this message
Ryan Moats (rmoats) wrote :

marking incomplete to start the cleanup timer - if this is still valid, please update status and take ownership

Changed in neutron:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.