quantum-dhcp-agent + quantum-dhcp-agent-dnsmasq-lease-update deadlock
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron |
Invalid
|
Undecided
|
Unassigned |
Bug Description
NB This isn't an issue in Havana since neutron doesn't use a DHCP lease update script (i.e., neutron runs dnsmasq without the --dhcp-script=xxx option).
Every week or so on our OpenStack cluster of modest size (about a dozen nodes, dozens to hundreds of VMs being booted per day), instances stop receiving replies to their DHCP requests. When we observe the instances failing to communicate with the DHCP server, traces of network traffic show that the DHCP requests are making it to the tap device that dnsmasq is listening on but that dnsmasq isn't sending any replies. Furthermore, dnsmasq seems to be configured to send the replies (i.e., the appropriate entries are in its DHCP hosts file and dnsmasq has been sent SIGHUP). Once the dnsmasq process stops sending replies, it seems to have stopped indefinitely. Killing the dnsmasq process and restarting quantum-dhcp-agent, which starts a new dnsmasq process, gets replies going again.
The dnsmasq process seems to have stopped sending replies because of its DHCP script quantum-
Indeed, the script has deadlocked. Here's a snippet of output from 'ps auxf' showing the relevant dnsmasq and quantum-
quantum 16347 0.6 0.0 198756 56084 ? Ss Sep10 71:50 python /usr/bin/
...
nobody 19945 0.0 0.0 28820 1156 ? S Sep10 6:47 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=
root 19948 0.0 0.0 28792 472 ? S Sep10 3:02 \_ dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=
root 24722 0.0 0.0 4292 352 ? S Sep17 0:00 \_ quantum-
Strace shows that 19945 is continuing to do stuff, but 19948 is blocked in a wait4 system call waiting for its child, 24722, to exit:
peter@gremlin ~/ssst % sudo strace -p 19948
Process 19948 attached - interrupt to quit
wait4(-1, ^C <unfinished ...>
But 24722 is blocked waiting to connect to the UDS that uses to communicate with quantum-dhcp-agent:
peter@gremlin ~/ssst % sudo strace -p 24722
Process 24722 attached - interrupt to quit
connect(3, {sa_family=AF_FILE, path="/
But quantum-dhcp-agent, pid 16347, is spinning epoll-ing on some epoll instance:
peter@gremlin ~/ssst % sudo strace -p 16347
Process 16347 attached - interrupt to quit
epoll_wait(5, {}, 1023, 1549) = 0
read(4, "2+\214~
gettid() = 16347
epoll_wait(5, {}, 1023, 826) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5, {}, 1023, 0) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5,
Clearly that epoll instance doesn't include the UDS that the lease update script is trying to connect to in spite of having the file open:
peter@gremlin ~/ssst % sudo lsof | grep 16347 | grep lease_relay
python 16347 quantum 8u unix 0xffff880070c86e80 0t0 224377723 /var/lib/
I haven't looked into why quantum-dhcp-agent isn't epoll-ing on the UDS (fd 8). Presumably the epoll is the internal eventlet epoll.
We've observed this on Canonical's 2013.1-
Changed in neutron: | |
status: | New → Invalid |
tags: | removed: grizzly-backport-potential |
We haven't actually seen the problem since I reported the bug. Since we
haven't changed the configuration of the cluster I witnessed it on, nor have
we changed our workload in drastically, I'm not sure why the bug hasn't
cropped up again.
I'm afraid that all of the dhcp-agent.log files I have are only from the past
4 days and they just repeat these DEBUG messages:
Making asynchronous cast on q-plugin... d875521fa438340 62.
UNIQUE_ID is e954ad8eb64140a
If I observe the problem again, I'll be sure to do a detailed diagnosis again.