Heavy Network IO Crashed dnsmasq with no recovery

Bug #1037065 reported by Thomas Vachon on 2012-08-15
This bug affects 2 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Nova: 2012.1+stable~20120612-3ee026e-0ubuntu1.2

I was running a load test against a 4-node Cassandra cluster in OpenStack. Each node is in a separate tenancy to rule out any contention between them. I ran the test 3 times and got the same result each time.

About 1/3 of the way through the test, the dnsmasq process crashes (with no warning or error in any log). The instance continues "working", but only from inside the VNC console, as all outside connectivity is now unroutable.

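Here is a log from the dnsmasq process. The first two lines show dnsmasq working (a DHCPREQUEST answered by a DHCPACK). After that the instance keeps sending DHCPDISCOVER and dnsmasq keeps answering with DHCPOFFER, but the exchange never completes and nothing routes back to the instance.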

<snip>
Aug 14 15:34:17 compute4 dnsmasq-dhcp[4146]: DHCPREQUEST(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:17 compute4 dnsmasq-dhcp[4146]: DHCPACK(br300) 10.0.0.11 fa:16:3e:03:89:9d cass1
Aug 14 15:34:28 compute4 dnsmasq-dhcp[4146]: DHCPDISCOVER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:28 compute4 dnsmasq-dhcp[4146]: DHCPOFFER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:31 compute4 dnsmasq-dhcp[4146]: DHCPDISCOVER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:31 compute4 dnsmasq-dhcp[4146]: DHCPOFFER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:36 compute4 dnsmasq-dhcp[4146]: DHCPDISCOVER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:36 compute4 dnsmasq-dhcp[4146]: DHCPOFFER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:47 compute4 dnsmasq-dhcp[4146]: DHCPDISCOVER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:47 compute4 dnsmasq-dhcp[4146]: DHCPOFFER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:59 compute4 dnsmasq-dhcp[4146]: DHCPDISCOVER(br300) 10.0.0.11 fa:16:3e:03:89:9d
Aug 14 15:34:59 compute4 dnsmasq-dhcp[4146]: DHCPOFFER(br300) 10.0.0.11 fa:16:3e:03:89:9d
</snip>
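The pattern in the log above (offers that are never followed by an acknowledgement) can be checked for mechanically. The following is only a minimal sketch, assuming log lines in the same format as the excerpt; the function name `unanswered_offers` and the regular expression are illustrative and not part of nova or dnsmasq.

```python
import re

# Matches dnsmasq-dhcp syslog lines in the format quoted in this report, e.g.
# "... dnsmasq-dhcp[4146]: DHCPACK(br300) 10.0.0.11 fa:16:3e:03:89:9d cass1"
LINE_RE = re.compile(
    r"dnsmasq-dhcp\[\d+\]: (?P<msg>DHCP[A-Z]+)\((?P<iface>[^)]+)\) "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+) (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
)


def unanswered_offers(lines):
    """Return {mac: number of DHCPOFFERs logged since the last DHCPACK}.

    Hypothetical helper: a MAC with a growing count is a client that keeps
    being offered a lease but never completes the exchange.
    """
    pending = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a dnsmasq-dhcp line in the expected format
        mac, msg = m.group("mac"), m.group("msg")
        if msg == "DHCPACK":
            pending[mac] = 0  # lease acknowledged, reset the counter
        elif msg == "DHCPOFFER":
            pending[mac] = pending.get(mac, 0) + 1
    # Only report clients that currently have unacknowledged offers
    return {mac: n for mac, n in pending.items() if n > 0}
```

Fed the excerpt above, this would report fa:16:3e:03:89:9d with every offer after 15:34:17 still unacknowledged.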

Here are the running dnsmasq processes after the meltdown. Of note: 10.0.0.3 is NOT assigned to an instance on this host; in fact a different server has that IP for a different instance.

nobody 4146 1 0 Aug07 ? 00:00:02 /usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file= --domain=ewr.domain.com --pid-file=/var/lib/nova/networks/nova-br300.pid --listen-address=10.0.0.5 --except-interface=lo --dhcp-range=10.0.0.3,static,120s --dhcp-lease-max=256 --dhcp-hostsfile=/var/lib/nova/networks/nova-br300.conf --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro
root 4147 4146 0 Aug07 ? 00:00:01 /usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file= --domain=ewr.domain.com --pid-file=/var/lib/nova/networks/nova-br300.pid --listen-address=10.0.0.5 --except-interface=lo --dhcp-range=10.0.0.3,static,120s --dhcp-lease-max=256 --dhcp-hostsfile=/var/lib/nova/networks/nova-br300.conf --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro
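To compare the options dnsmasq was started with across compute hosts, the long command line above can be split into a dictionary. This is only a sketch: `parse_dnsmasq_args` is a hypothetical helper, and it assumes every option uses the `--name` or `--name=value` form seen in the `ps` output. As far as the dnsmasq man page describes it, the `static` mode in `--dhcp-range` means addresses are only handed out to hosts listed in the hosts file, so the first field of the range need not be an address held by an instance on this host.

```python
import shlex


def parse_dnsmasq_args(cmdline):
    """Split a dnsmasq command line into {option: value}.

    Hypothetical helper. Flags without '=' map to True; '--conf-file='
    maps to an empty string.
    """
    opts = {}
    for token in shlex.split(cmdline)[1:]:  # skip the executable path
        if not token.startswith("--"):
            continue
        name, sep, value = token[2:].partition("=")
        opts[name] = value if sep else True
    return opts


cmdline = (
    "/usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file= "
    "--listen-address=10.0.0.5 --dhcp-range=10.0.0.3,static,120s "
    "--dhcp-lease-max=256 --leasefile-ro"
)
opts = parse_dnsmasq_args(cmdline)
# Fields of --dhcp-range as they appear in this report
start_addr, mode, lease_time = opts["dhcp-range"].split(",")
```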

Mark McLoughlin (markmc) wrote :

Thanks for finding the root cause. Marking this bug as a duplicate of the libvirt bug.
