dnsmasq does not remove leases for deleted VMs - leases and host files point to different MACS

Bug #1783908 reported by Arjun Baindur
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Unassigned

Bug Description

We see this sporadically, sometimes it works, sometimes it doesn't. We delete VMs, then create new ones. The new fixed IP can't get an IP via DHCP, because the leases file still points to MAC of some old, deleted VMs. The host file is correctly updated.

For example we see in syslog/messages:

Jul 26 16:45:04 stan dnsmasq-dhcp[18540]: not using configured address 10.96.0.21 because it is leased to fa:16:3e:ae:c4:b0
Jul 26 16:45:04 stan dnsmasq-dhcp[18540]: DHCPDISCOVER(tap6d5fc8f6-28) fa:16:3e:2f:29:37 no address available

And tcpdump on tap interface of qdhcp indicates it is receiving the DHCP requests, but nothing is sent out.

On inspection of the leases and host file, we see the MAC address differs for the fixed IP:

[root@stan ~]# cat /var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/host | grep 10.96.0.21
fa:16:3e:2f:29:37,test-pf9-vmha-one-region-c7-702498-490-2.platform9.sys.,10.96.0.21
[root@stan ~]# cat /var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/leases | grep 10.96.0.21
1532649702 fa:16:3e:ae:c4:b0 10.96.0.21 test-pf9-vmha-one-region-c7-702498-490-2 *

On another dhcp-agent host, we saw the leases file pointed to some OTHER old third MAC.

We then deleted this VM. We observed that the lease was removed on one host, but it remained on the other host. So it appears to be random. Enabling DEBUG logs shows the dhcp_release command is being invoked, without error:

2018-07-26 16:52:06.214 10660 DEBUG neutron.agent.dhcp.agent [req-22449d0c-c0e3-4571-b4f9-a3150f8596c5 5412536c76f942d980e1834f48340f77 f175f441ebbb4c2b8fedf6469d6415fc - - -] Calling driver for network: 52f8298e-f867-4581-8ae7-fac0c9196ae4 action: reload_allocations call_driver /opt/pf9/pf9-neutron/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:133
2018-07-26 16:52:06.217 10660 DEBUG neutron.agent.linux.utils [req-22449d0c-c0e3-4571-b4f9-a3150f8596c5 5412536c76f942d980e1834f48340f77 f175f441ebbb4c2b8fedf6469d6415fc - - -] Running command: ['sudo', '/opt/pf9/pf9-neutron/bin/neutron-rootwrap', '/opt/pf9/etc/pf9-neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4', 'dhcp_release', 'tap4d6bb403-59', '10.96.0.21', 'fa:16:3e:2f:29:37'] create_process /opt/pf9/pf9-neutron/lib/python2.7/site-packages/neutron/agent/linux/utils.py:92
2018-07-26 16:52:06.498 10660 DEBUG neutron.agent.linux.dhcp [req-22449d0c-c0e3-4571-b4f9-a3150f8596c5 5412536c76f942d980e1834f48340f77 f175f441ebbb4c2b8fedf6469d6415fc - - -] Building host file: /var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/host _output_hosts_file /opt/pf9/pf9-neutron/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py:685
2018-07-26 16:52:06.505 10660 DEBUG neutron.agent.linux.dhcp [req-22449d0c-c0e3-4571-b4f9-a3150f8596c5 5412536c76f942d980e1834f48340f77 f175f441ebbb4c2b8fedf6469d6415fc - - -] Done building host file /var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/host _output_hosts_file /opt/pf9/pf9-neutron/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py:724

The entry was removed from hosts file, but remains in leases despite above dhcp_release remaining. If we spin up a new VM with this fixed IP, it can't get an IP due to old lease:

root@krusty:~# cat /var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/host | grep 10.96.0.21
root@krusty:~# cat /var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/leases | grep 10.96.0.21
1532735433 fa:16:3e:2f:29:37 10.96.0.21 test-pf9-vmha-one-region-c7-702498-490-2 *
root@krusty:~#

Revision history for this message
Arjun Baindur (abaindur) wrote :

We see this on both Ubuntu14 and Centos 7.2 and Centos 7.4. Kernel and OS versions:

[root@stan pf9]# uname -a
Linux stan.platform9.sys 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@stan pf9]# rpm -qa | grep dns
dnsmasq-2.66-14.el7_1.x86_64
ldns-1.6.16-7.el7.x86_64
dnsmasq-utils-2.66-14.el7_1.x86_64

root@krusty:~# uname -a
Linux krusty 3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
root@krusty:~# dpkg -l | grep dns
ii dnsmasq 2.68-1ubuntu0.2 all Small caching DNS proxy and DHCP/TFTP server
ii dnsmasq-base 2.68-1ubuntu0.2 amd64 Small caching DNS proxy and DHCP/TFTP server
ii dnsmasq-utils 2.68-1ubuntu0.2 amd64 Utilities for manipulating DHCP leases
ii dnsutils 1:9.9.5.dfsg-3ubuntu0.8 amd64 Clients provided with BIND
ii libdns100 1:9.9.5.dfsg-3ubuntu0.8 amd64 DNS Shared Library used by BIND

summary: - dnsmasq not removes leases for deleted VMs
+ dnsmasq does not remove leases for deleted VMs - leases and host files
+ point to different MACS
Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

looks similar with https://bugs.launchpad.net/neutron/+bug/1764481
does your version of neutron have the fix?

Changed in neutron:
status: New → Incomplete
Arjun Baindur (abaindur)
Changed in neutron:
status: Incomplete → New
Revision history for this message
Arjun Baindur (abaindur) wrote :

No... this is on Pike. That fix was only backported to Queens.

Still, seems like that just masks the issue by retrying. I don't understand why it happens in the first place, or why it started cropping up all of a sudden. Neutron DHCP and dnsmasq has been around since the beginning and for quite a while now.

Revision history for this message
Arjun Baindur (abaindur) wrote :

Also, I don't see dnsmasq reloading based on timestamp in output of ps, even when I deploy or delete new VMs:

[root@stan ~]# date
Fri Jul 27 11:58:25 PDT 2018
[root@stan ~]# ps -ef | grep 52f8298e
root 3375 567 0 11:58 pts/13 00:00:00 grep --color=auto 52f8298e
nobody 19924 1 0 Jul26 ? 00:00:01 dnsmasq --no-hosts --no-resolv --strict-order --except-interface=lo --pid-file=/var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/pid --dhcp-hostsfile=/var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/host --addn-hosts=/var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/addn_hosts --dhcp-optsfile=/var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/opts --dhcp-leasefile=/var/opt/pf9/neutron/dhcp/52f8298e-f867-4581-8ae7-fac0c9196ae4/leases --dhcp-match=set:ipxe,175 --bind-interfaces --interface=tap6d5fc8f6-28 --dhcp-range=set:tag0,10.96.0.0,static,255.255.0.0,86400s --dhcp-option-force=option:mtu,1500 --dhcp-lease-max=65536 --conf-file= --server=10.1.10.3 --server=10.1.10.19 --server=8.8.8.8 --domain=platform9.sys

When is the dnsmasq process reloaded, besides when restarting dhcp-agent? That bug indicates one cause is "2. dnsmasq is reloading, so there is noone to receive it when it arrives"

Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

"Confirmed" in sense it at least needs a backport for pike.

I think retrying here is ok as UDP is not reliable anyway.
as you said, there might be other causes for which retrying is not an ideal solution though.

Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/586980

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/586980
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6aada381f4695ccd904c4be7dbfdd95744d32e38
Submitter: Zuul
Branch: stable/pike

commit 6aada381f4695ccd904c4be7dbfdd95744d32e38
Author: Brian Haley <email address hidden>
Date: Wed Apr 11 21:22:23 2018 -0400

    Retry dhcp_release on failures

    Sometimes calls to dhcp_release(6) do not result in removal
    of a lease from the leases file, for example, when the release
    packet is not received by dnsmasq. Trying more than once is
    recommended in this case.

    Instead of blindly trying some number of times, we monitor
    the lease file contents, and retry the dhcp_release(6) call
    when an entry still remains. This is possible since
    dhcp_release(6) is being run from the DHCP server itself.
    We try three times and wait 0.3 seconds between tries.

    We also now check for any stale leases in the leases file
    that are unknown to neutron, also trying to remove them.

    Change-Id: Ic1864f7efbc94db1369ac7f3e2879fda86f95a11
    Closes-bug: #1764481
    Closes-bug: #1783908
    (cherry picked from commit fab032b426e1fa89dc473c61cbf15377fe4aaff3)

tags: added: in-stable-pike
Revision history for this message
Arjun Baindur (abaindur) wrote :

Yamamoto, we still hit this even with the "fix". I think it needs to be root caused... is it an issue with Linux (we hit it on both Centos7 and Ubuntu14 DHCP hosts)? issue with dnsmasq? We see repeatedly logs in dhcp-agent like:

2018-08-20 11:08:21.242 8027 WARNING neutron.agent.linux.dhcp [req-22bf544d-b47f-4d55-824a-cefbb214700a 20381bff60c24ac1a28e65b80d2185db f175f441ebbb4c2b8fedf6469d6415fc - - -] Could not release DHCP leases for these IP addresses after 3 tries: 10.96.0.19, 10.96.0.4, 10.96.0.18, 10.96.0.16, 10.96.0.8, 10.96.0.26, 10.96.0.32, 10.96.0.23, 10.96.0.21, 10.96.0.24, 10.96.0.12, 10.96.0.33, 10.96.0.7

We see it invoking the dhcp_release without error 3x for each of those IPs in logs just earlier

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

Revision history for this message
Brian Haley (brian-haley) wrote :

There have been a number of fixes in the past four years to the DHCP code. As I have not seen this issue lately it could have been fixed as part of another change. For that reason I'll close this but please open a new bug if the issue persists.

Changed in neutron:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.