quantum-dhcp-agent + quantum-dhcp-agent-dnsmasq-lease-update deadlock

Bug #1227237 reported by Peter Feiner
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Invalid
Undecided
Unassigned

Bug Description

NB This isn't an issue in Havana since neutron doesn't use a DHCP lease update script (i.e., neutron runs dnsmasq without the --dhcp-script=xxx option).

Every week or so on our OpenStack cluster of modest size (about a dozen nodes, dozens to hundreds of VMs being booted per day), instances stop receiving replies to their DHCP requests. When we observe the instances failing to communicate with the DHCP server, traces of network traffic show that the DHCP requests are making it to the tap device that dnsmasq is listening on but that dnsmasq isn't sending any replies. Furthermore, dnsmasq seems to be configured to send the replies (i.e., the appropriate entries are in its DHCP hosts file and dnsmasq has been sent SIGHUP). Once the dnsmasq process stops sending replies, it seems to have stopped indefinitely. Killing the dnsmasq process and restarting quantum-dhcp-agent, which starts a new dnsmasq process, gets replies going again.

The dnsmasq process seems to have stopped sending replies because of its DHCP script quantum-dhcp-agent-dnsmasq-lease-update (i.e., from the option --dhcp-script=quantum-...). According to man dnsmasq(8), dnsmasq runs one copy of the lease update script at a time (i.e., executes it serially) and some request processing blocks on execution of the lease update script. Assuming the blocked request processing includes replying to DHCP requests, then a single deadlocked instance of quantum-dhcp-agent-dnsmasq-lease-update explains why no DHCP responses are being made.

Indeed, the script has deadlocked. Here's a snippet of output from 'ps auxf' showing the relevant dnsmasq and quantum-dhcp-agent-dnsmasq-lease-update and quantum-dhcp-agent processes:

quantum 16347 0.6 0.0 198756 56084 ? Ss Sep10 71:50 python /usr/bin/quantum-dhcp-agent --config-file=/etc/quantum/quantum.conf --config-file=/etc/quantum/dhcp_agent.ini --log-file=/var/log/quantum/dhcp-agent.log
...
nobody 19945 0.0 0.0 28820 1156 ? S Sep10 6:47 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,172.16.0.0,static,120s --conf-file= --domain=openstacklocal
root 19948 0.0 0.0 28792 472 ? S Sep10 3:02 \_ dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,172.16.0.0,static,120s --conf-file= --domain=openstacklocal
root 24722 0.0 0.0 4292 352 ? S Sep17 0:00 \_ quantum-dhcp-agent-dnsmasq-lease-update old fa:16:3e:17:5f:46 172.16.0.129 172-16-0-129

Strace shows that 19945 is continuing to do stuff, but 19948 is blocked in a wait4 system call waiting for its child, 24722, to exit:

peter@gremlin ~/ssst % sudo strace -p 19948
Process 19948 attached - interrupt to quit
wait4(-1, ^C <unfinished ...>

But 24722 is blocked waiting to connect to the UDS that uses to communicate with quantum-dhcp-agent:

peter@gremlin ~/ssst % sudo strace -p 24722
Process 24722 attached - interrupt to quit
connect(3, {sa_family=AF_FILE, path="/var/lib/quantum/dhcp/lease_relay"}, 110

But quantum-dhcp-agent, pid 16347, is spinning epoll-ing on some epoll instance:

peter@gremlin ~/ssst % sudo strace -p 16347
Process 16347 attached - interrupt to quit
epoll_wait(5, {}, 1023, 1549) = 0
read(4, "2+\214~\346\235,\6\264\320\27\"\\\17?\354", 16) = 16
gettid() = 16347
epoll_wait(5, {}, 1023, 826) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5, {}, 1023, 0) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5,

Clearly that epoll instance doesn't include the UDS that the lease update script is trying to connect to in spite of having the file open:

peter@gremlin ~/ssst % sudo lsof | grep 16347 | grep lease_relay
python 16347 quantum 8u unix 0xffff880070c86e80 0t0 224377723 /var/lib/quantum/dhcp/lease_relay

I haven't looked into why quantum-dhcp-agent isn't epoll-ing on the UDS (fd 8). Presumably the epoll is the internal eventlet epoll.

We've observed this on Canonical's 2013.1-0ubuntu2~cloud0 package (which is based on 2013.1 AFAIK) on Ubuntu 12.04 and 2013.1.2 on CentOS 6.4 from a recent RDO installation. Although I haven't tested 2013.1.3 or the latest stable/grizzly code, I suspect the bug still exists because no similar bugs have been filed.

Revision history for this message
Peter Feiner (pete5) wrote :

We haven't actually seen the problem since I reported the bug. Since we
haven't changed the configuration of the cluster I witnessed it on, nor have
we changed our workload in drastically, I'm not sure why the bug hasn't
cropped up again.

I'm afraid that all of the dhcp-agent.log files I have are only from the past
4 days and they just repeat these DEBUG messages:

        Making asynchronous cast on q-plugin...
        UNIQUE_ID is e954ad8eb64140ad875521fa43834062.

If I observe the problem again, I'll be sure to do a detailed diagnosis again.

Changed in neutron:
status: New → Invalid
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.