dnsmasq replies are incorrect after multiple simultaneous reloads

Bug #1598078 reported by Andrey Shestakov on 2016-07-01
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Vasyl Saienko

Bug Description

When booting a lot of instances with a single request (nova boot --min-count 90), some instances do not receive a DHCP reply.
After investigation, I found that the DHCP server answers some requests to the correct address and others to broadcast.

Packets captured by tcpdump:
13:37:44.298533 3c:fd:fe:9c:62:c4 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 20, id 0, offset 0, flags [none], proto UDP (17), length 576) > [udp sum ok] BOOTP/DHCP, Request from 3c:fd:fe:9c:62:c4, length 548, xid 0xfe9c62c4, Flags [Broadcast] (0x8000)
      Client-Ethernet-Address 3c:fd:fe:9c:62:c4
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Discover
        Parameter-Request Option 55, length 36:
          Subnet-Mask, Time-Zone, Default-Gateway, Time-Server
          IEN-Name-Server, Domain-Name-Server, RL, Hostname
          BS, Domain-Name, SS, RP
          EP, RSZ, TTL, BR
          YD, YS, NTP, Vendor-Option
          Requested-IP, Lease-Time, Server-ID, RN
          RB, Vendor-Class, TFTP, BF
          Option 128, Option 129, Option 130, Option 131
          Option 132, Option 133, Option 134, Option 135
        MSZ Option 57, length 2: 1260
        GUID Option 97, length 17:
        ARCH Option 93, length 2: 0
        NDI Option 94, length 3: 1.2.1
        Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
        END Option 255, length 0
        PAD Option 0, length 0, occurs 200
13:37:44.298819 fa:16:3e:f0:b1:23 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 401: (tos 0xc0, ttl 64, id 30122, offset 0, flags [none], proto UDP (17), length 387) > [udp sum ok] BOOTP/DHCP, Reply, length 359, xid 0xfe9c62c4, Flags [Broadcast] (0x8000)
      Client-Ethernet-Address 3c:fd:fe:9c:62:c4
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Offer
        Server-ID Option 54, length 4:
        Lease-Time Option 51, length 4: 600
        RN Option 58, length 4: 300
        RB Option 59, length 4: 525
        Subnet-Mask Option 1, length 4:
        BR Option 28, length 4:
        Domain-Name Option 15, length 14: "openstacklocal"
        Hostname Option 12, length 16: "host-10-51-5-125"
        TFTP Option 66, length 10: "^@"
        BF Option 67, length 11: "pxelinux.0^@"
        Default-Gateway Option 3, length

After restarting neutron-dhcp-agent, the issue is gone.
It looks like, during a burst of port-create and port-update operations, neutron-dhcp-agent sends the HUP signal to dnsmasq too frequently in order to reload it.
dnsmasq clears its cache and re-reads its files asynchronously on each signal event, which causes errors in the loaded data.
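
If overly frequent HUP signals are indeed the cause, one possible mitigation is to coalesce a burst of reload requests into a single signal. The following is only a minimal sketch of that idea, not the change proposed in the review and not existing neutron code: the names ReloadCoalescer, send_reload, min_interval and hup_dnsmasq are hypothetical, and the caller is assumed to already know the dnsmasq PID.

```python
import os
import signal
import threading


class ReloadCoalescer:
    """Collapse a burst of reload requests into one delayed reload.

    Hypothetical helper for illustration; not part of neutron or dnsmasq.
    """

    def __init__(self, send_reload, min_interval=1.0):
        self._send_reload = send_reload    # callable that performs the reload
        self._min_interval = min_interval  # seconds to wait before reloading
        self._lock = threading.Lock()
        self._timer = None

    def request_reload(self):
        """Schedule a reload unless one is already pending."""
        with self._lock:
            if self._timer is not None:
                return  # a pending reload will pick up this change too
            self._timer = threading.Timer(self._min_interval, self._fire)
            self._timer.daemon = True
            self._timer.start()

    def _fire(self):
        # Clear the pending marker first, so that a request arriving while
        # the reload runs schedules a fresh one instead of being lost.
        with self._lock:
            self._timer = None
        self._send_reload()


def hup_dnsmasq(pid):
    """Send SIGHUP to the process with the given PID (assumed to be dnsmasq)."""
    os.kill(pid, signal.SIGHUP)
```

With this sketch, an agent would call request_reload() after each port-create or port-update instead of signalling directly, for example with ReloadCoalescer(lambda: hup_dnsmasq(pid), min_interval=1.0). The trade-off is that new ports become visible to dnsmasq up to min_interval seconds later.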

Fix proposed to branch: master
Review: https://review.openstack.org/336462

Changed in neutron:
assignee: nobody → Andrey Shestakov (ashestakov)
status: New → In Progress
Changed in neutron:
assignee: Andrey Shestakov (ashestakov) → nobody

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/336462
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
status: In Progress → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Vasyl Saienko (vsaienko) wrote :

I faced the same issue during a concurrent deployment of ironic nodes (100 simultaneous requests): some nodes failed to receive IP/PXE options, while restarting dnsmasq/neutron-dhcp-agent or sending HUP fixes the issue.

Changed in neutron:
status: Expired → New
Vasyl Saienko (vsaienko) on 2017-10-06
Changed in neutron:
assignee: nobody → Vasyl Saienko (vsaienko)
Farhad Sunavala (fsbiz) wrote :

Did you manage to fix your issue?
We are facing an identical issue. We have about 250 baremetal Ironic nodes. Ironic sends 3 DHCP messages (as opposed to 1 sent by VMs), and we can reproduce this issue by simply reloading 7-8 baremetal nodes at a time.

I see the proposed fix did not make it. Does anyone have a good solution to this?
