DNSMasq loses entry for instances

Bug #1649963 reported by Matthijs Grünbauer
This bug affects 3 people
Affects: neutron
Status: New
Importance: Undecided
Assigned to: sudhakar kumar srivastava
Milestone: (none listed)

Bug Description

After booting a large number of new instances, some instances randomly lose their IP address after running for a while. The instances were all reachable before. When I log in via the web console and run `ip -4 a`, the VM does not have an IPv4 address.
The host file kept by dnsmasq in /var/lib/neutron/dhcp/<network-id>/host shows that the entry for that VM is missing.
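The host-file check described above can be scripted. The following is a minimal sketch, assuming each IPv4 line in the host file has the form `mac,hostname,ip` (optionally followed by extra fields). The function names, the sample hostname, and the second MAC address are hypothetical and only serve as illustration.

```python
from typing import Dict, Iterable, List


def parse_host_file(text: str) -> Dict[str, str]:
    """Map MAC address -> IP address from dnsmasq host-file text.

    Assumption: IPv4 entries look like 'mac,hostname,ip[,extra...]'.
    Lines that do not have at least three comma-separated fields are skipped.
    """
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        if len(fields) < 3:
            continue
        entries[fields[0].lower()] = fields[2]
    return entries


def missing_macs(host_text: str, expected_macs: Iterable[str]) -> List[str]:
    """Return the expected MAC addresses that have no entry in the host file."""
    present = parse_host_file(host_text)
    return sorted(m.lower() for m in expected_macs if m.lower() not in present)


# Hypothetical host-file content: the MAC and IP come from the console log
# in this report, the hostname is made up.
sample = "fa:16:3e:26:3b:2b,host-10-10-9-44.openstacklocal,10.10.9.44\n"
# The second MAC is a made-up address standing in for an affected VM.
print(missing_macs(sample, ["fa:16:3e:26:3b:2b", "fa:16:3e:00:00:01"]))
```

In practice the text would be read from /var/lib/neutron/dhcp/<network-id>/host and the expected MAC addresses taken from the ports on that network.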

There is an (ugly) workaround: stop neutron-dhcp-agent, kill the dnsmasq process for the corresponding network-id, and start neutron-dhcp-agent again. The entry is then added back to the host file and the instance becomes reachable again.
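As a rough illustration of the workaround's order of operations, here is a small sketch that only builds the command sequence and does not execute anything. Using `service` to stop and start the agent is an assumption about the host's service manager, and the PID is a placeholder that has to be looked up for the dnsmasq process serving the affected network.

```python
from typing import List


def workaround_commands(dnsmasq_pid: int) -> List[List[str]]:
    """Return the workaround steps as argv lists, in the order described.

    Assumption: the DHCP agent is managed through `service`.
    Nothing is executed here; the caller decides whether to run the commands.
    """
    return [
        ["service", "neutron-dhcp-agent", "stop"],
        ["kill", str(dnsmasq_pid)],
        ["service", "neutron-dhcp-agent", "start"],
    ]


# 12345 is a placeholder PID, not a value from this report.
for argv in workaround_commands(12345):
    print(" ".join(argv))
```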

It looks a lot like this bug: https://bugs.launchpad.net/neutron/+bug/1645509
Except in this bug, the reporter sees entries getting added to the file, and we're seeing entries getting removed from the file.

* Pre-conditions:
This is a production environment which hosts about 350-400 VMs. We create and delete more than 1000 VMs per week. The issue can affect anyone who uses the system, but it seems to happen more often in networks with a lot of activity (many newly created VMs).

* Step-by-step reproduction steps:
- Boot a large number of VMs (> 10) at the same time.
- SSH into the VMs and do your work.
- After a random amount of time, a VM becomes unreachable.

* Expected output:
VMs keep their ip address and stay reachable after booting.

* Actual output:
VMs are available at first, but eventually lose their ip address. The dnsmasq host file is missing the entry for that ip address.

* Version:
** OpenStack Mitaka 9.0, deployed with Fuel.
** Ubuntu 14.04.5 LTS, running kernel 3.13.0-92-generic
** Neutron version 2:8.0.0-2~u14.04+mos48
** DNSMasq version 2.68-1ubuntu0.1

Changed in neutron:
assignee: nobody → sudhakar kumar srivastava (sudhakar.srivastava)
Revision history for this message
sudhakar kumar srivastava (sudhakar.srivastava) wrote :

Hi Matthijs Grünbauer,

I launched around 45 VMs in the same network and checked the reachability of each VM via ssh and ping from my network namespace. I also verified that the host file and the lease file listed all the VMs.

Behaviour described in the bug:

1) After logging in via the web console and running `ip -4 a`, the VM does not have an IPv4 address.
2) The host file kept by dnsmasq in /var/lib/neutron/dhcp/<network-id>/host shows that the entry for that VM is missing.

Behaviour I observed:

1) After logging in via the web console and running `ip -4 a`, the VM's IPv4 address is listed.
2) The host file still lists all the VMs that were created, even when checked a day later.

Please refer to the attachments.

Matthijs Grünbauer (mgrunbauer) wrote :

Hi Sudhakar,

Thank you for looking at this. I've tried to collect more information.

How are you creating these VMs? We are able to reproduce it fairly consistently:
- Use `nova boot` to launch a VM
- Get the list of free floating IPs.
- Attach a free floating IP to the VM.

If we run this 30 times in a row to create a set of VMs, we see it happening to about 50% of the newly created VMs.

We spawn these VMs back to back, so after the first VM comes up and has a floating address assigned, we boot the next one, and so on.
After booting these VMs, we install some software on the VMs and reboot them. Some of these VMs don't come up with a local address after the reboot.
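The back-to-back creation loop described above could be sketched as follows. This is a dry run that only builds the command lines. The exact flags (`--image`, `--flavor`, `--nic net-id=`, `--poll`) and the `floating-ip-list` / `floating-ip-associate` subcommands are assumptions about the nova client in use, not taken from this report, and the image, flavor, network ID, VM names, and floating IPs are placeholders.

```python
from typing import List


def repro_commands(count: int, image: str, flavor: str, net_id: str,
                   free_ips: List[str]) -> List[List[str]]:
    """Build the boot / list / associate sequence for `count` VMs.

    Assumption: one free floating IP is available per VM. Nothing is
    executed; the commands are returned as argv lists for inspection.
    """
    if len(free_ips) < count:
        raise ValueError("need one free floating IP per VM")
    plan: List[List[str]] = []
    for i in range(count):
        name = f"repro-vm-{i:02d}"  # hypothetical naming scheme
        # --poll makes the client wait until the boot completes, which
        # matches "after the first VM comes up ... we boot the next one".
        plan.append(["nova", "boot", "--image", image, "--flavor", flavor,
                     "--nic", f"net-id={net_id}", "--poll", name])
        plan.append(["nova", "floating-ip-list"])
        plan.append(["nova", "floating-ip-associate", name, free_ips[i]])
    return plan


# All arguments below are placeholders.
for argv in repro_commands(2, "IMAGE", "FLAVOR", "NETWORK-ID",
                           ["203.0.113.10", "203.0.113.11"]):
    print(" ".join(argv))
```

The report uses a count of 30 and sees roughly half of the new VMs affected.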

If you check the console log for the VM, you can see that the VM did have an ip at first:

<<< console.log >>>
[ 8.654587] cloud-init[780]: Cloud-init v. 0.7.5 running 'init' at Thu, 12 Jan 2017 10:51:56 +0000. Up 8.56 seconds.
[ 8.704614] cloud-init[780]: ci-info: ++++++++++++++++++++++++Net device info+++++++++++++++++++++++++
[ 8.705831] cloud-init[780]: ci-info: +--------+------+------------+-------------+-------------------+
[ 8.710052] cloud-init[780]: ci-info: | Device | Up | Address | Mask | Hw-Address |
[ 8.713265] cloud-init[780]: ci-info: +--------+------+------------+-------------+-------------------+
[ 8.717381] cloud-init[780]: ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . |
[ 8.721663] cloud-init[780]: ci-info: | eth0: | True | 10.10.9.44 | 255.255.0.0 | fa:16:3e:26:3b:2b |
[ 8.725039] cloud-init[780]: ci-info: +--------+------+------------+-------------+-------------------+
[ 8.728280] cloud-init[780]: ci-info: +++++++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++++++++
[ 8.735376] cloud-init[780]: ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
[ 8.739221] cloud-init[780]: ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
[ 8.744581] cloud-init[780]: ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
[ 8.752302] cloud-init[780]: ci-info: | 0 | 0.0.0.0 | 10.10.0.1 | 0.0.0.0 | eth0 | UG |
[ 8.757191] cloud-init[780]: ci-info: | 1 | 10.10.0.0 | 0.0.0.0 | 255.255.0.0 | eth0 | U |
[ 8.761186] cloud-init[780]: ci-info: | 2 | 169.254.169.254 | 10.10.0.1 | 255.255.255.255 | eth0 | UGH |
[ 8.763882] cloud-init[780]: ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+

This VM came up normally at first, so we booted a few more VMs after this one. Somewhere in the process of booting the new VMs, the "old" VM broke. The 10.10.9.44 address cannot be reached anymore.
You can see this more clearly after rebooting the VM. When it comes up, the VM still sends out DHCPDISCOVERs to get its IP address back, but this fails: the VM never gets a response from the DHCP server. If you check the console.log again, you see that the eth0 address is missing:

<<< console.log >>>
[ 304.872625] cloud-init[670]: ci-info: +++++++++++++++++++++++Net device info+++++++...
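Comparing the two console logs can be automated. The sketch below extracts the address column of the cloud-init "Net device info" table and reports whether a device has a dotted-quad IPv4 address. It assumes the row layout shown in the first log. How a missing address is rendered in the truncated second log is not visible here, so the check only tests whether the cell looks like an IPv4 address. The function names are hypothetical.

```python
import re
from typing import Dict

# Matches table rows such as "... ci-info: | eth0: | True | 10.10.9.44 | ... |"
ROW = re.compile(r"ci-info: \|(.+)\|\s*$")
IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def device_addresses(console_log: str) -> Dict[str, str]:
    """Return {device: address cell} from the 'Net device info' table."""
    result: Dict[str, str] = {}
    in_devices = False
    for line in console_log.splitlines():
        if "Net device info" in line:
            in_devices = True
            continue
        if "Route info" in line:
            in_devices = False
            continue
        match = ROW.search(line)
        if not (in_devices and match):
            continue
        cells = [cell.strip() for cell in match.group(1).split("|")]
        if len(cells) < 5 or cells[0] == "Device":
            continue  # skip the header row and anything malformed
        result[cells[0].rstrip(":")] = cells[2]
    return result


def has_ipv4(console_log: str, device: str = "eth0") -> bool:
    """True if the device's address cell looks like a dotted-quad IPv4 address."""
    address = device_addresses(console_log).get(device, "")
    return IPV4.fullmatch(address) is not None


# Rows copied from the first console log in this report.
SAMPLE = """\
[ 8.704614] cloud-init[780]: ci-info: ++++++++++++++++++++++++Net device info+++++++++++++++++++++++++
[ 8.710052] cloud-init[780]: ci-info: | Device | Up | Address | Mask | Hw-Address |
[ 8.717381] cloud-init[780]: ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . |
[ 8.721663] cloud-init[780]: ci-info: | eth0: | True | 10.10.9.44 | 255.255.0.0 | fa:16:3e:26:3b:2b |
[ 8.728280] cloud-init[780]: ci-info: +++++++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++++++++
[ 8.752302] cloud-init[780]: ci-info: | 0 | 0.0.0.0 | 10.10.0.1 | 0.0.0.0 | eth0 | UG |
"""
print(device_addresses(SAMPLE))
print(has_ipv4(SAMPLE))
```

The text passed in would typically be the output of the instance's console log after each boot.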


sudhakar kumar srivastava (sudhakar.srivastava) wrote :

Hi Matthijs,

Following your reply [2017-01-12], we tried a few more times and still could not hit the issue where hosts become unreachable.
Here are the details of the network topology, the scenario tested, the dnsmasq file entries (collected along with the host entries), and the host entries collected at intervals after launching.

Details of the topology tested:
1. We are currently using a cloud-init image (we installed the cloud-init package using virt-manager and took the resulting image from libvirt).

2. For external connectivity we made the required changes, such as:
>> overriding br-ex and adding the port eth0 to br-ex, so that br-ex is the interface through which traffic is routed for external connectivity,

>> made the required changes in /etc/network/interfaces (for Ubuntu):
auto eth0
iface eth0 inet manual
 up ip address add 0/0 dev $IFACE
 up ip link set $IFACE up
 up ifconfig $IFACE promisc
 up ifconfig $IFACE multicast
 down ip link set $IFACE down

auto br-ex
iface br-ex inet static
 address 10.125.155.41
 netmask 255.255.255.0
 gateway 10.125.155.1
 dns-nameservers 8.8.8.8

>> Followed by:
ifconfig br-ex promisc up
ifconfig eth0 0.0.0.0
ifconfig eth0 promisc
ifconfig br-ex 10.125.155.41 netmask 255.255.255.0
ovs-vsctl add-port br-ex eth0

3. Using arp-scan we listed the IPs in use and located the unassigned IPs of our local network.
4. Created a public network with these unallocated IPs and assigned the floating IP range / subnet pool with DHCP disabled.
5. The router namespace is attached to the interfaces of the private network and to the gateway IP (from the floating IP range) of the public network.
6. We launched the VMs both with the CLI and with Horizon.

We collected the data a few hours after launching the VMs. Please find the attachments for your reference.

Let us know if anything additional needs to be validated.

Our setup versions are as follows:

neutron 8.3.1.dev83
Dnsmasq version 2.68
Ubuntu 14.04.5 LTS, running kernel 3.13.0-92-generic
