neutron

Bug #1271344
Comment #20

Comment 20 for bug 1271344

Revision history for this message

Peter Feiner (pete5) wrote on 2014-02-28:

#20

I've hit this problem as well. It seems that dnsmasq just isn't getting some DHCPRELEASE packets that dhcp_release is putting on the "wire" (i.e., the tap device). For example, I booted 100 VMs, deleted those 100, then booted 100 more. In the second round of booting, dnsmasq wouldn't assign 10.0.0.39 because it was owned by another lease. However, according to neutron-dhcp-agent's logs, dhcp_release was invoked but, according to syslog, dnsmasq didn't process that message.

Here are the relevant syslog and neutron-dhcp-agent log messages:

Feb 28 12:21:30 ctrl dnsmasq-dhcp[19654]: DHCPOFFER(tapc712f8e7-39) 10.0.0.39 fa:16:3e:cd:24:1e
Feb 28 12:21:30 ctrl dnsmasq-dhcp[19654]: DHCPREQUEST(tapc712f8e7-39) 10.0.0.39 fa:16:3e:cd:24:1e
Feb 28 12:21:30 ctrl dnsmasq-dhcp[19654]: DHCPACK(tapc712f8e7-39) 10.0.0.39 fa:16:3e:cd:24:1e host-10-0-0-39
Feb 28 12:25:50 ctrl dnsmasq-dhcp[19654]: not using configured address 10.0.0.39 because it is leased to fa:16:3e:cd:24:1e

2014-02-28 12:23:50.650 19633 DEBUG neutron.agent.linux.utils [req-4b0f9265-fa4c-4f81-baa3-ef2815592640 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-11df69c5-89eb-4ba6-b99e-69087237f4dc', 'dhcp_release', 'tapc712f8e7-39', '10.0.0.39', 'fa:16:3e:cd:24:1e'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:47
2014-02-28 12:23:50.746 19633 DEBUG neutron.agent.linux.utils [req-4b0f9265-fa4c-4f81-baa3-ef2815592640 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84]
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-11df69c5-89eb-4ba6-b99e-69087237f4dc', 'dhcp_release', 'tapc712f8e7-39', '10.0.0.39', 'fa:16:3e:cd:24:1e']
Exit code: 0
Stdout: ''
Stderr: '' execute /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:73

Note that dhcp_release is called with the correct IP and MAC addresses.

It's clear to me that dnsmasq either didn't get the message or it has a bug so didn't handle the message properly. I can imagine a few possibilities for the former: (1) dhcp_release's write the tap device fails but it doesn't check the return value; (2) writes to the tap device silently fail when it's in-kernel buffers are full; (3) a race in the kernel wherein writes to the tap device are lost; (4) a race in the kernel where packets are lost on read. These possibilities (1-4) could be confirmed or debunked with some effort.

Whatever the cause of the lost message may be, it seems to me that this approach of keeping neutron and dnsmasq in sync with incremental updates isn't robust. There are too many ways for dnsmasq's lease state and neutron's state to fall out of sync. One way is obviously this bug that we're all trying to diagnose. Another way is for a port_delete_end RPC to not be called, or not received, due to a crashed neutron process. Whatever the cause for dnsmasq and neutron falling out of sync, the only remedy at the moment is to kill the dnsmasq process then restart neutron-dhcp-agent.

Robert, in your bug report you suggested an interesting solution:

the lease change script is apparently called for all leases when SIGHUP
is received, so maybe we can terminate dead leases there? [2]

Is the lease change script capable of telling dnsmasq to drop all leases except for those allowed for by the hosts file?

Here are the relevant syslog and neutron-dhcp-agent log messages:

Note that dhcp_release is called with the correct IP and MAC addresses.

Robert, in your bug report you suggested an interesting solution:

the lease change script is apparently called for all leases when SIGHUP
    is received, so maybe we can terminate dead leases there? [2]

Is the lease change script capable of telling dnsmasq to drop all leases except for those allowed for by the hosts file?