Sometimes dhcp_releasepackets don't reach dnsmasq

Bug #1764481 reported by Brian Haley
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Brian Haley

Bug Description

We have seen issues downstream where calling dhcp_release didn't cause the lease to be removed from the leases files used by dnsmasq. There are a couple of scenarios where this could happen:

1. The packet is simply lost, as it is UDP, even though it's being looped-back
2. dnsmasq is reloading, so there is noone to receive it when it arrives

For that reason we should make this more robust. A couple of possible solutions are:

1. Send the release more than once in succession. It's probably OK to just send some small number of packets for each lease we want to release, it would easily increase the odds that one makes it through.
2. Check the leases file to make sure dnsmasq processed the release, and re-send for those addresses that were missed. This method is slightly more complicated, but it should also increase the odds, and should do it with fewer packets being generated.

Each option has some overhead, but since the option is not releasing the lease, it's worth it.

I have proposed option #2 already at https://review.openstack.org/#/c/560703/ but wanted to make sure to get feedback on other proposals that might also solve the issue.

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/560703
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fab032b426e1fa89dc473c61cbf15377fe4aaff3
Submitter: Zuul
Branch: master

commit fab032b426e1fa89dc473c61cbf15377fe4aaff3
Author: Brian Haley <email address hidden>
Date: Wed Apr 11 21:22:23 2018 -0400

    Retry dhcp_release on failures

    Sometimes calls to dhcp_release(6) do not result in removal
    of a lease from the leases file, for example, when the release
    packet is not received by dnsmasq. Trying more than once is
    recommended in this case.

    Instead of blindly trying some number of times, we monitor
    the lease file contents, and retry the dhcp_release(6) call
    when an entry still remains. This is possible since
    dhcp_release(6) is being run from the DHCP server itself.
    We try three times and wait 0.3 seconds between tries.

    We also now check for any stale leases in the leases file
    that are unknown to neutron, also trying to remove them.

    Change-Id: Ic1864f7efbc94db1369ac7f3e2879fda86f95a11
    Closes-bug: #1764481

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/565885

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/565885
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=56d54ccf820dd130e3fb582a96125461b6cede4a
Submitter: Zuul
Branch: stable/queens

commit 56d54ccf820dd130e3fb582a96125461b6cede4a
Author: Brian Haley <email address hidden>
Date: Wed Apr 11 21:22:23 2018 -0400

    Retry dhcp_release on failures

    Sometimes calls to dhcp_release(6) do not result in removal
    of a lease from the leases file, for example, when the release
    packet is not received by dnsmasq. Trying more than once is
    recommended in this case.

    Instead of blindly trying some number of times, we monitor
    the lease file contents, and retry the dhcp_release(6) call
    when an entry still remains. This is possible since
    dhcp_release(6) is being run from the DHCP server itself.
    We try three times and wait 0.3 seconds between tries.

    We also now check for any stale leases in the leases file
    that are unknown to neutron, also trying to remove them.

    Change-Id: Ic1864f7efbc94db1369ac7f3e2879fda86f95a11
    Closes-bug: #1764481
    (cherry picked from commit fab032b426e1fa89dc473c61cbf15377fe4aaff3)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b2

This issue was fixed in the openstack/neutron 13.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.3

This issue was fixed in the openstack/neutron 12.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/586980

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/586980
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6aada381f4695ccd904c4be7dbfdd95744d32e38
Submitter: Zuul
Branch: stable/pike

commit 6aada381f4695ccd904c4be7dbfdd95744d32e38
Author: Brian Haley <email address hidden>
Date: Wed Apr 11 21:22:23 2018 -0400

    Retry dhcp_release on failures

    Sometimes calls to dhcp_release(6) do not result in removal
    of a lease from the leases file, for example, when the release
    packet is not received by dnsmasq. Trying more than once is
    recommended in this case.

    Instead of blindly trying some number of times, we monitor
    the lease file contents, and retry the dhcp_release(6) call
    when an entry still remains. This is possible since
    dhcp_release(6) is being run from the DHCP server itself.
    We try three times and wait 0.3 seconds between tries.

    We also now check for any stale leases in the leases file
    that are unknown to neutron, also trying to remove them.

    Change-Id: Ic1864f7efbc94db1369ac7f3e2879fda86f95a11
    Closes-bug: #1764481
    Closes-bug: #1783908
    (cherry picked from commit fab032b426e1fa89dc473c61cbf15377fe4aaff3)

tags: added: in-stable-pike
Revision history for this message
Arjun Baindur (abaindur) wrote :
Download full text (3.8 KiB)

This fix does not work. There is some other underlying issue. To test this, it appears sometimes dhcp_release does NOT work, no matter how many times we even invoke the command manually. In a different network's namespace, it registers perfectly fine, and we see DHCPRELASE recorded in /var/log/messsages:

[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07
[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07
[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07
[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07
[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07
[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07
[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07
[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07
[root@stan ~]# ip netns exec qdhcp-05dd69f2-ecd9-4757-b2d3-ee440faf4e42 dhcp_release tap4b752a43-fa 10.8.127.165 fa:16:3e:c3:f7:07

Aug 22 00:37:30 stan dnsmasq-dhcp[21443]: DHCPRELEASE(tap4b752a43-fa) 10.8.127.165 fa:16:3e:c3:f7:07 unknown lease
Aug 22 00:37:30 stan dnsmasq-dhcp[21443]: DHCPRELEASE(tap4b752a43-fa) 10.8.127.165 fa:16:3e:c3:f7:07 unknown lease
Aug 22 00:37:31 stan dnsmasq-dhcp[21443]: DHCPRELEASE(tap4b752a43-fa) 10.8.127.165 fa:16:3e:c3:f7:07 unknown lease
Aug 22 00:37:31 stan dnsmasq-dhcp[21443]: DHCPRELEASE(tap4b752a43-fa) 10.8.127.165 fa:16:3e:c3:f7:07 unknown lease
Aug 22 00:37:32 stan dnsmasq-dhcp[21443]: DHCPRELEASE(tap4b752a43-fa) 10.8.127.165 fa:16:3e:c3:f7:07 unknown lease
Aug 22 00:37:32 stan dnsmasq-dhcp[21443]: DHCPRELEASE(tap4b752a43-fa) 10.8.127.165 fa:16:3e:c3:f7:07 unknown lease

But when we try the same with another network, we see absolutely zero DHCPRELEASE:

[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 dhcp_release tap6d5fc8f6-28 10.96.0.5 fa:16:3e:af:e5:d1
[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 dhcp_release tap6d5fc8f6-28 10.96.0.5 fa:16:3e:af:e5:d1
[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 dhcp_release tap6d5fc8f6-28 10.96.0.5 fa:16:3e:af:e5:d1
[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 dhcp_release tap6d5fc8f6-28 10.96.0.5 fa:16:3e:af:e5:d1
[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 dhcp_release tap6d5fc8f6-28 10.96.0.5 fa:16:3e:af:e5:d1
[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 dhcp_release tap6d5fc8f6-28 10.96.0.5 fa:16:3e:af:e5:d1
[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 dhcp_release tap6d5fc8f6-28 10.96.0.5 fa:16:3e:af:e5:d1
[root@stan ~]# ip netns exec qdhcp-52f82...

Read more...

Revision history for this message
Arjun Baindur (abaindur) wrote :

Note that I DO see the DHCP Release packets while tcpdump'ing in the qdhcp namespace. But dnsmasq logs for DHCPRELEASE in /var/log/messages show nothing for one network, even with several dozen attempts:

[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
6489: tap6d5fc8f6-28: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether fa:16:3e:d6:fc:52 brd ff:ff:ff:ff:ff:ff
    inet 10.96.0.3/16 brd 10.96.255.255 scope global tap6d5fc8f6-28
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tap6d5fc8f6-28
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fed6:fc52/64 scope link
       valid_lft forever preferred_lft forever

[root@stan ~]# ip netns exec qdhcp-52f8298e-f867-4581-8ae7-fac0c9196ae4 tcpdump -l -evvvnn -i any

00:50:22.737477 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 592: (tos 0xc0, ttl 64, id 17834, offset 0, flags [none], proto ICMP (1), length 576)
    10.96.0.3 > 10.96.0.3: ICMP 10.96.0.3 udp port 67 unreachable, length 556
 (tos 0x0, ttl 64, id 62218, offset 0, flags [DF], proto UDP (17), length 576)
    10.96.0.3.41630 > 10.96.0.3.67: BOOTP/DHCP, Request from fa:16:3e:af:e5:d1, length 548, Flags [none] (0x0000)
   Client-IP 10.96.0.5
   Client-Ethernet-Address fa:16:3e:af:e5:d1
   Vendor-rfc1048 Extensions
     Magic Cookie 0x63825363
     DHCP-Message Option 53, length 1: Release
     Server-ID Option 54, length 4: 10.96.0.3
     END Option 255, length 0
     PAD Option 0, length 0, occurs 270
00:50:22.918503 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 592: (tos 0x0, ttl 64, id 62254, offset 0, flags [DF], proto UDP (17), length 576)
    10.96.0.3.60661 > 10.96.0.3.67: [bad udp cksum 0x1703 -> 0x6e73!] BOOTP/DHCP, Request from fa:16:3e:af:e5:d1, length 548, Flags [none] (0x0000)
   Client-IP 10.96.0.5
   Client-Ethernet-Address fa:16:3e:af:e5:d1
   Vendor-rfc1048 Extensions
     Magic Cookie 0x63825363
     DHCP-Message Option 53, length 1: Release
     Server-ID Option 54, length 4: 10.96.0.3
     END Option 255, length 0
     PAD Option 0, length 0, occurs 298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

Revision history for this message
Gustavo Randich (gustavo-randich) wrote :

This fix does not seem to work. We are seeing lots of "Could not release DHCP leases for these IP addresses after", and then "no address available"; we are restarting dhcp agent periodically to workaround this.

Revision history for this message
Brian Haley (brian-haley) wrote :

Gustavo - could you open a new bug for the issue as this one is almost three years old and upstream is at version 18.0.0 now (and I'm not dismissing Arjun's comment above I just never saw it even as the bug owner).

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.