neutron-dhcp-agent doesn't hand out leases for recently used addresses

Bug #1271344 reported by Robert Collins
84
This bug affects 15 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Aaron Rosen
tripleo
Fix Released
High
Unassigned

Bug Description

Hi, last two days on ci-overcloud.tripleo.org, the neutron-dhcp-agent has stopped answering properly for new VMs - new VMs don't get IP addresses until neutron-dhcp-agent is restarted (we see cloud-init waiting in nova console-log)

There is a bunch of rabbit disconnect messages in the log, but they are not correlated with the fault.

Existing VMs can DHCP successfully, and the overlay network seems fine.

I've checked the dhcp agent host file and it is being updated.

restarting the neutron-dhcp-agent 'fixes' the problem.

kill -HUPing the dnsmasqs doesn't seem to fix things

-> The issue is that a prior VM gets (say ) 192.168.1.25, then is shutdown.
-> we then issue 1.25 to a new VM, but the dnsmasq process still has 1.25 in its in-memory leases table as being for the other adress.

 - DHCPRELEASE *is* being issued, perhaps sporadic failures or some other condition/

# WORKAROUNDS:
 - restart the agent in a cronjob
 - change reload_allocations to return self.restart(), avoiding the cache in dnsmasq but introducing short interrupts in DHCP on each new VM deployed.

# POSSIBLE SOLUTIONS
 - change dnsmasq to discard leases that were statically allocated and no longer have that MAC address (e.g. to trust the config more)
 - change neutron to only recycle port addresses when the discarded MAC has been deleted more than (lease time).
 - use dnsmasqs multiple-MAC feature to tell dnsmasq that a handover is happening.[1]
 - the lease change script is apparently called for all leases when SIGHUP is received, so maybe we can terminate dead leases there? [2]

[1] " As a special case, in DHCPv4, it is possible to include more than one hardware address. eg:
              --dhcp-host=11:22:33:44:55:66,12:34:56:78:90:12,192.168.0.2 This allows an IP address to be associated with multiple hardware addresses, and gives dnsmasq permission to abandon a DHCP lease to one of the hardware addresses when another one asks for a lease. Beware that this is a dangerous thing to do, it will only work reliably if only one of the hardware addresses is active at any time and there is no way for dnsmasq to enforce this. It is, for instance, useful to allocate a stable IP address to a laptop which has both wired and wireless interfaces.
"

[2] "When it receives a SIGHUP, dnsmasq clears its cache and then re-loads /etc/hosts and /etc/ethers and any file given by --dhcp-hostsfile, --dhcp-optsfile or --addn-hosts. The dhcp lease change script is called for all existing DHCP leases."

Changed in tripleo:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Robert Collins (lifeless) wrote :
Download full text (5.7 KiB)

Ok, so poked at logs...

2014-01-22 01:58:58.029 14424 ERROR neutron.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common Traceback (most recent call last):
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 576, in ensure
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common return method(*args, **kwargs)
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 656, in _consume
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/kombu/connection.py", line 279, in drain_events
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 90, in drain_events
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common return connection.drain_events(**kwargs)
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/amqp/connection.py", line 299, in drain_events
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common chanmap, None, timeout=timeout,
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/amqp/connection.py", line 362, in _wait_multiple
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common channel, method_sig, args, content = read_timeout(timeout)
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/amqp/connection.py", line 326, in read_timeout
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common return self.method_reader.read_method()
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common raise m
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common IOError: Socket closed
2014-01-22 01:58:58.029 14424 TRACE neutron.openstack.common.rpc.common
2014-01-22 01:58:58.041 14424 ERROR neutron.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed
2014-01-22 01:58:58.041 14424 TRACE neutron.openstack.common.rpc.common Traceback (most recent call last):
2014-01-22 01:58:58.041 14424 TRACE neutron.openstack.common.rpc....

Read more...

Revision history for this message
Robert Collins (lifeless) wrote :
Download full text (5.7 KiB)

Ok, so I sat on it. This was emitted, but it appeared to keep working fine so far...

2014-01-22 02:45:43.652 29020 ERROR neutron.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common Traceback (most recent call last):
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 576, in ensure
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common return method(*args, **kwargs)
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 656, in _consume
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/kombu/connection.py", line 279, in drain_events
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 90, in drain_events
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common return connection.drain_events(**kwargs)
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/amqp/connection.py", line 299, in drain_events
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common chanmap, None, timeout=timeout,
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/amqp/connection.py", line 362, in _wait_multiple
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common channel, method_sig, args, content = read_timeout(timeout)
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/amqp/connection.py", line 326, in read_timeout
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common return self.method_reader.read_method()
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common raise m
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common IOError: Socket closed
2014-01-22 02:45:43.652 29020 TRACE neutron.openstack.common.rpc.common
2014-01-22 02:45:43.666 29020 ERROR neutron.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed
2014-01-22 02:45:43.666 29020 TRACE neutron.openstack.common.rpc.common Traceback (most recent call last):
2014...

Read more...

Revision history for this message
Robert Collins (lifeless) wrote :

Occured again, this time I had time to check the agent hosts files- they were updated so it's not dropped messages. Possibly it's not signalling dnsmasq (or possibly dnsmasq is buggy and not handling the signal correctly)

description: updated
description: updated
description: updated
Revision history for this message
Robert Collins (lifeless) wrote :
Download full text (15.4 KiB)

strace of a dhcp response from an existing VM (which works if it ifdown/ifups);

) = 1 (in [3])
recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(68), sin_addr=inet_addr("0.0.0.0")}, msg_iov(1)=[{"\1\1\6\0\307t?H\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0x\347\321#"..., 1444}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 300
recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(68), sin_addr=inet_addr("0.0.0.0")}, msg_iov(1)=[{"\1\1\6\0\307t?H\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0x\347\321#"..., 1444}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, 0) = 300
ioctl(3, SIOCGIFNAME, {ifr_index=10, ifr_name="tap4f798c52-88"}) = 0
ioctl(3, SIOCGIFADDR, {ifr_name="tap4f798c52-88", ifr_addr={AF_INET, inet_addr("192.168.1.2")}}) = 0
sendto(4, "\24\0\0\0\26\0\5\3\213\1\0\0\0\0\0\0\2^\22\1", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"D\0\0\0\24\0\2\0\213\1\0\0\337;\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 156}], msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 156
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"D\0\0\0\24\0\2\0\213\1\0\0\337;\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 156}], msg_controllen=0, msg_flags=0}, 0) = 156
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\213\1\0\0\337;\0\0\0\0\0\0", 156}], msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\213\1\0\0\337;\0\0\0\0\0\0", 156}], msg_controllen=0, msg_flags=0}, 0) = 20
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(11, "<30>Jan 29 03:26:19 dnsmasq-dhcp"..., 108) = 108
alarm(82487) = 82487
select(10, [3 4 5 6 7 8 9], [], [], NULL) = 1 (in [3])
recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(68), sin_addr=inet_addr("0.0.0.0")}, msg_iov(1)=[{"\1\1\6\0\307t?H\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0x\347\321#"..., 1444}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 300
recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(68), sin_addr=inet_addr("0.0.0.0")}, msg_iov(1)=[{"\1\1\6\0\307t?H\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0x\347\321#"..., 1444}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, 0) = 300
ioctl(3, SIOCGIFNAME, {ifr_index=10, ifr_name="tap4f798c52-88"}) = 0
ioctl(3, SIOCGIFADDR, {ifr_name="tap4f798c52-88", ifr_addr={AF_INET, inet_addr("192.168.1.2")}}) = 0
sendto(4, "\24\0\0\0\26\0\5\3\214\1\0\0\0\0\0\0\2^\22\1", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"D\0\0\0\24\0\2\0\214\1\0\0\337;\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 156}], msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 156
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, grou...

Revision history for this message
Robert Collins (lifeless) wrote :

And - I think - a bad one:
) = 1 (in [3])
recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(68), sin_addr=inet_addr("0.0.0.0")}, msg_iov(1)=[{"\1\1\6\0\211\333\31K\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\372\26>5"..., 1444}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 300
recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(68), sin_addr=inet_addr("0.0.0.0")}, msg_iov(1)=[{"\1\1\6\0\211\333\31K\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\372\26>5"..., 1444}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, 0) = 300
ioctl(3, SIOCGIFNAME, {ifr_index=10, ifr_name="tap4f798c52-88"}) = 0
ioctl(3, SIOCGIFADDR, {ifr_name="tap4f798c52-88", ifr_addr={AF_INET, inet_addr("192.168.1.2")}}) = 0
sendto(4, "\24\0\0\0\26\0\5\3\306\1\0\0\0\0\0\0\2\256A\335", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"D\0\0\0\24\0\2\0\306\1\0\0\337;\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 156}], msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 156
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"D\0\0\0\24\0\2\0\306\1\0\0\337;\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 156}], msg_controllen=0, msg_flags=0}, 0) = 156
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\306\1\0\0\337;\0\0\0\0\0\0", 156}], msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\306\1\0\0\337;\0\0\0\0\0\0", 156}], msg_controllen=0, msg_flags=0}, 0) = 20
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(11, "<28>Jan 29 03:29:53 dnsmasq-dhcp"..., 124) = 124
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(11, "<30>Jan 29 03:29:53 dnsmasq-dhcp"..., 108) = 108
alarm(82273) = 82273
select(10, [3 4 5 6 7 8 9], [], [], NULL

Revision history for this message
Robert Collins (lifeless) wrote :

So for a given VM:
mac of fa:16:3e:8b:ab:2f

Jan 29 03:34:29 ci-overcloud-notcompute-7sdhmmrtddid dhclient: DHCPDISCOVER on int-br-untagged to 255.255.255.255 port 67 interval 8 (xid=0x582d3e61)
Jan 29 03:34:31 ci-overcloud-notcompute-7sdhmmrtddid dnsmasq-dhcp[15331]: not using configured address 192.168.1.25 because it is leased to fa:16:3e:b4:e1:f9
Jan 29 03:34:31 ci-overcloud-notcompute-7sdhmmrtddid dnsmasq-dhcp[15331]: DHCPDISCOVER(tap4f798c52-88) fa:16:3e:8b:ab:2f no address available

but neutron-port-list | grep fa:16:3e:b4:e1:f9 is empty.

-> the issue is that the lease is still held in neutron

summary: - neutron-dhcp-agent not getting updates after ~24h running
+ neutron-dhcp-agent doesn't hand out leases for recently used addresses
description: updated
description: updated
description: updated
description: updated
Changed in neutron:
assignee: nobody → Aaron Rosen (arosen)
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Aaron,

Can you please triage this?
It seems related to your fix for bug 1202392

Revision history for this message
Robert Collins (lifeless) wrote :

Jan 28 06:43:09 ci-overcloud-notcompute-7sdhmmrtddid dnsmasq[22906]: cleared cache
Jan 28 06:43:09 ci-overcloud-notcompute-7sdhmmrtddid dnsmasq-dhcp[22906]: read /var/run/neutron/dhcp/b703205c-1340-4258-a654-c14963ccef62/host
Jan 28 06:43:09 ci-overcloud-notcompute-7sdhmmrtddid dnsmasq-dhcp[22906]: read /var/run/neutron/dhcp/b703205c-1340-4258-a654-c14963ccef62/opts
Jan 28 06:43:09 ci-overcloud-notcompute-7sdhmmrtddid dnsmasq-dhcp[22906]: DHCPRELEASE(tap54b1142d-fd) 10.0.0.5 fa:16:3e:33:5a:db

description: updated
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I think we are hitting this on our cd-undercloud as well. It is devastating to reliability when doing repeated deployments of baremetal servers.

Revision history for this message
Aaron Rosen (arosen) wrote :

This patch looks like it could be causing the issue: https://review.openstack.org/#/c/56263/8 I think i see a bug in it as well.

Revision history for this message
Aaron Rosen (arosen) wrote :

Actually this looks like the dhcp_release call is getting blocked by the security groups. Security groups used to not be working in the ml2 plugin on the gate and that is how this was able to slip in.

Revision history for this message
Robert Collins (lifeless) wrote :

http://paste.ubuntu.com/6911812/ iptables rules active on the machine which clint saw this on in comment #9.

Revision history for this message
Robert Collins (lifeless) wrote :

And for the dhcp agent netns:
ip netns exec qdhcp-e85b44c7-1136-4217-954e-cdf0acdddfe1 iptables-save | pastebinit
http://paste.ubuntu.com/6911818/

Revision history for this message
Robert Collins (lifeless) wrote :

ran dhcp_release by hand:
Feb 10 22:51:10 undercloud-notcompute-jws3awlsb2kh dnsmasq-dhcp[26029]: DHCPRELEASE(tap56efc4c6-55) 10.10.16.141 78:e7:d1:23:97:cd

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I thought that dhcp_release was just some sort of IPC to the dnsmasq process to release its lease internally. Is there a packet that is being block that is preventing that from happening?

Do the neutron DHCP agent logs have evidence of the dhcp_release call being executed?

Revision history for this message
Robert Collins (lifeless) wrote :

And a quick grep going back before clints issue.
Feb 10 18:22:48 undercloud-notcompute-jws3awlsb2kh dnsmasq-dhcp[9534]: DHCPRELEASE(tap56efc4c6-55) 10.10.16.147 78:e7:d1:23:99:35 unknown lease
Feb 10 19:03:02 undercloud-notcompute-jws3awlsb2kh dnsmasq-dhcp[9534]: DHCPRELEASE(tap56efc4c6-55) 10.10.16.147 78:e7:d1:23:94:95 unknown lease
Feb 10 19:51:08 undercloud-notcompute-jws3awlsb2kh dnsmasq-dhcp[9534]: DHCPRELEASE(tap56efc4c6-55) 10.10.16.147 78:e7:d1:23:99:35 unknown lease
Feb 10 20:31:22 undercloud-notcompute-jws3awlsb2kh dnsmasq-dhcp[26029]: DHCPRELEASE(tap56efc4c6-55) 10.10.16.147 78:e7:d1:23:94:95 unknown lease
Feb 10 22:51:10 undercloud-notcompute-jws3awlsb2kh dnsmasq-dhcp[26029]: DHCPRELEASE(tap56efc4c6-55) 10.10.16.141 78:e7:d1:23:97:cd

Revision history for this message
Aaron Rosen (arosen) wrote :

Hrm I lied this doesn't seem security group related :/ I had reproduced this once though I'm not seeing it now.

Right now I've created a /29 subnet and booted 4 vms then deleted them the booted them again so that their ips are recycled though each time they are all getting ips now. I'll keep banging on this.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I'm looking more for a message in the dhcp AGENT log that indicates that the agent is executing the dhcp_release command.

Revision history for this message
Robert Collins (lifeless) wrote :

@carl grep -i release /var/log/upstart/neutron-dhcp-agent.log | pastebinit
http://paste.ubuntu.com/6912327/

Revision history for this message
Peter Feiner (pete5) wrote :
Download full text (3.4 KiB)

I've hit this problem as well. It seems that dnsmasq just isn't getting some DHCPRELEASE packets that dhcp_release is putting on the "wire" (i.e., the tap device). For example, I booted 100 VMs, deleted those 100, then booted 100 more. In the second round of booting, dnsmasq wouldn't assign 10.0.0.39 because it was owned by another lease. However, according to neutron-dhcp-agent's logs, dhcp_release was invoked but, according to syslog, dnsmasq didn't process that message.

Here are the relevant syslog and neutron-dhcp-agent log messages:

Feb 28 12:21:30 ctrl dnsmasq-dhcp[19654]: DHCPOFFER(tapc712f8e7-39) 10.0.0.39 fa:16:3e:cd:24:1e
Feb 28 12:21:30 ctrl dnsmasq-dhcp[19654]: DHCPREQUEST(tapc712f8e7-39) 10.0.0.39 fa:16:3e:cd:24:1e
Feb 28 12:21:30 ctrl dnsmasq-dhcp[19654]: DHCPACK(tapc712f8e7-39) 10.0.0.39 fa:16:3e:cd:24:1e host-10-0-0-39
Feb 28 12:25:50 ctrl dnsmasq-dhcp[19654]: not using configured address 10.0.0.39 because it is leased to fa:16:3e:cd:24:1e

2014-02-28 12:23:50.650 19633 DEBUG neutron.agent.linux.utils [req-4b0f9265-fa4c-4f81-baa3-ef2815592640 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-11df69c5-89eb-4ba6-b99e-69087237f4dc', 'dhcp_release', 'tapc712f8e7-39', '10.0.0.39', 'fa:16:3e:cd:24:1e'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:47
2014-02-28 12:23:50.746 19633 DEBUG neutron.agent.linux.utils [req-4b0f9265-fa4c-4f81-baa3-ef2815592640 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84]
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-11df69c5-89eb-4ba6-b99e-69087237f4dc', 'dhcp_release', 'tapc712f8e7-39', '10.0.0.39', 'fa:16:3e:cd:24:1e']
Exit code: 0
Stdout: ''
Stderr: '' execute /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:73

Note that dhcp_release is called with the correct IP and MAC addresses.

It's clear to me that dnsmasq either didn't get the message or it has a bug so didn't handle the message properly. I can imagine a few possibilities for the former: (1) dhcp_release's write the tap device fails but it doesn't check the return value; (2) writes to the tap device silently fail when it's in-kernel buffers are full; (3) a race in the kernel wherein writes to the tap device are lost; (4) a race in the kernel where packets are lost on read. These possibilities (1-4) could be confirmed or debunked with some effort.

Whatever the cause of the lost message may be, it seems to me that this approach of keeping neutron and dnsmasq in sync with incremental updates isn't robust. There are too many ways for dnsmasq's lease state and neutron's state to fall out of sync. One way is obviously this bug that we're all trying to diagnose. Another way is for a port_delete_end RPC to not be called, or not received, due to a crashed neutron process. Whatever the cause for dnsmasq and neutron falling out of sync, the only remedy at the moment is to kill the dnsmasq process then restart neutron-dhcp-agent.

Robert, in your bug report you suggested an interesting solution:

    ...

Read more...

Revision history for this message
Robert Collins (lifeless) wrote :

@Peter yes, I believe it could. It would be called with all active leases and thus be able to check for skew and fix - and only do that on the working set (which would typically be all ports though, so roughly full-size).

Changed in tripleo:
importance: Critical → High
Revision history for this message
Robert Collins (lifeless) wrote :

Downgraded this bug in tripleo to 'high' because we're not having CI broken right now, but its still a significant concern for deployments.

Aaron Rosen (arosen)
Changed in neutron:
importance: Undecided → High
Revision history for this message
Duncan Idaho (ghola) wrote :

I've hit this as well and have made this observation: This only happens for us when multiple dnsmasq processes are serving different networks. The packet sent by dhcp_release hits the 0.0.0.0:67 bind on only one of the dnsmasq processes and is ignored if it's the wrong one.

Revision history for this message
Sudheendra Murthy (sudhi-vm) wrote :

Adding self.restart() in release_lease(...) fixed this issue for us, although as suggested by Robert, it may cause short hiccups in the DHCP service. Since reload_allocations() is also called when a new VM is provisioned, restarting in release_lease instead of reload_allocations results in 1 less restart per VM.

Revision history for this message
Duncan Idaho (ghola) wrote :

Updating from dnsmasq 2.59 to 2.66 solved this for us. We think this is the fix:
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=9380ba70d67db6b69f817d8e318de5ba1e990b12

Revision history for this message
Vinay Bannai (vbannai) wrote :

We tried 2.66 in our environment. Did not fix the problem.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Targeting to J-1 milestone, as it would make easier to track the bug

Changed in neutron:
milestone: none → juno-1
Revision history for this message
Byron McCollum (byron-mccollum) wrote :

I am able to reproduce this, but only if I run configured as an all-in-one. If I separate network and compute functions, it seems much more reliable at DHCP releasing.

Kyle Mestery (mestery)
Changed in neutron:
status: New → Confirmed
Revision history for this message
Sudheendra Murthy (sudhi-vm) wrote :

Updating dnsmasq to 2.66 solved this issue for us.

Kyle Mestery (mestery)
Changed in neutron:
milestone: juno-1 → juno-2
Revision history for this message
Kyle Mestery (mestery) wrote :

Moving this back to incomplete, as there are reports that dnsmasq 2.66 solves this issue.

Changed in neutron:
status: Confirmed → Incomplete
milestone: juno-2 → none
Revision history for this message
Elias De Coninck (elias-deconinck) wrote :

Updating to dnsmasq 2.66 solved it for me aswel.

Revision history for this message
Marios Andreou (marios-b) wrote :

this seems to be fixed upstream in dnsmasq - is this still being seen for tripleo ci? (will try and ping derekh to have a look here).

Revision history for this message
Marios Andreou (marios-b) wrote :

derek says se hasn't seen this happen... so I think we can just close...

Aaron Rosen (arosen)
Changed in neutron:
status: Incomplete → Invalid
status: Invalid → Fix Released
Changed in tripleo:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers