multiple dnsmasq instances on multiple nodes for the same network are not aware of each other's leases

Bug #1242712 reported by Yves-Gwenael Bourhis
This bug affects 5 people
Affects   Status         Importance   Assigned to             Milestone
neutron   Fix Released   High         Yves-Gwenael Bourhis    2014.1

Bug Description

Configuration:
===========
In a multi-node configuration with:

On one node:
# Quantum
disable_service n-net
enable_service q-svc
enable_service q-agt
enable_service q-dhcp
enable_service q-l3
enable_service q-meta
enable_service neutron

On the other node:
ENABLED_SERVICES=n-cpu,rabbit,neutron,q-agt,q-dhcp

- Set up a private network and add a DHCP agent on each node (for HA); see the command sketch below.
- Spawn a VM on each node and check that each VM received its lease from a different DHCP server.
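
For reference, a minimal sketch of this setup (agent IDs and node names are placeholders; the real agent IDs come from neutron agent-list, as shown later in this thread):

# Attach the private network to the DHCP agent on the second node (HA):
$ neutron agent-list
$ neutron dhcp-agent-network-add <dhcp-agent-id-on-second-node> private
# Boot one VM on each node:
$ nova boot --flavor 1 --image cirros-0.3.1-x86_64-uec --availability-zone nova:<node1> vm1
$ nova boot --flavor 1 --image cirros-0.3.1-x86_64-uec --availability-zone nova:<node2> vm2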

Problem:
=======

Each DHCP server is unaware of the other DHCP server's leases, so VMs on the same network cannot always resolve each other's names.

Changed in neutron:
assignee: nobody → Yves-Gwenael Bourhis (yves-gwenael-bourhis)
Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/52930

Changed in neutron:
importance: Undecided → Medium
milestone: none → icehouse-1
Revision history for this message
Mark McClain (markmcclain) wrote :

All of the dnsmasq instances are working from the same static host allocation file for a network. What is the exact issue?

Changed in neutron:
status: In Progress → Incomplete
Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

As detailed here: https://review.openstack.org/#/c/52930/10//COMMIT_MSG

dnsmasq instances give leases to the machines defined in the --dhcp-hostsfile allocation file; however, a dnsmasq instance is unable to resolve a host defined in that same file if it did not give the lease for it (with two dnsmasq instances on the same network, one gives the lease and the other does not).

dnsmasq resolves only (see the illustrative invocation below):
- machines defined in the --dhcp-hostsfile file to which it gave a lease (it DOES NOT resolve machines in that same file whose lease was given by another dnsmasq instance);
- machines defined in the --addn-hosts file;
- machines defined in /etc/hosts, if we do not pass the --no-hosts parameter to dnsmasq (we do pass this parameter because we do not want to resolve hosts that are on all networks served by the same node; --addn-hosts defines only the hosts of one virtual network, and --no-hosts prevents reading /etc/hosts on the node dnsmasq runs on).
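
For illustration, a dnsmasq invocation using the three options discussed above might look like this (the paths are hypothetical; the DHCP agent writes these files under its per-network state directory):

# --no-hosts:        do not read the node's /etc/hosts
# --dhcp-hostsfile:  static MAC/IP/name allocations for the network
# --addn-hosts:      additional names this instance can resolve
$ dnsmasq --no-hosts \
          --dhcp-hostsfile=/path/to/<network-id>/host \
          --addn-hosts=/path/to/<network-id>/addn_hosts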

Changed in neutron:
status: Incomplete → In Progress
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-1 → icehouse-2
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I believe this patch is necessary for multiple dnsmasq instances to resolve all names on a network. The old hosts file is not sufficient.

I petition for the removal of the +2 from the proposed patch. A good case has been made to show the need for the patch.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

That should read -2 instead of +2.

Revision history for this message
Jian Wen (wenjianhn) wrote :

Why not set both of the dhcp servers as your dns servers?

Revision history for this message
Jian Wen (wenjianhn) wrote :

Ignore comment #6
Because one of them may fail.

> (with two dnsmasq instances on the same network, one gives the lease and the other does not)

I think this happens only if one of the dnsmasq instances didn't update its lease in time while
the other updated its lease before the VM booted.

In the end, the lease info of the two dnsmasq instances is supposed to be the same.

Changed in neutron:
status: In Progress → Incomplete
Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

> I think this happens only if one of the dnsmasq instances didn't update its lease in time while
> the other updated its lease before the VM booted.

> In the end, the lease info of the two dnsmasq instances is supposed to be the same.

No, that's not the case: the VM's DHCP response is addressed to one DHCP server (dnsmasq instance) and only that one, so the second dnsmasq instance never gives the final lease.

I have two physical devstacks, one running all services (hostname = devstack) and one running neutron and nova (hostname = neutron), each with a dnsmasq instance on the same virtual network:

vagrant@devstack:~/devstack$ neutron agent-list
+--------------------------------------+--------------------+----------+-------+----------------+
| id                                   | agent_type         | host     | alive | admin_state_up |
+--------------------------------------+--------------------+----------+-------+----------------+
| 1971b15b-cc30-4f87-8cda-96b42c08dc8d | Loadbalancer agent | devstack | :-)   | True           |
| 3217cfe9-363d-404e-bfac-f2cf01b55c58 | Metadata agent     | devstack | :-)   | True           |
| 52f80d2c-edb0-4afa-a5e6-000ab8fde326 | Open vSwitch agent | neutron  | :-)   | True           |
| 6c2e946f-8ce3-41fa-809e-4f8611c1c9ab | L3 agent           | devstack | :-)   | True           |
| 8c8dac60-21f4-416a-97e8-d28e156f745b | Open vSwitch agent | devstack | :-)   | True           |
| b59c91d7-3797-4adc-a22b-ab44ec71aecd | DHCP agent         | neutron  | :-)   | True           |
| d63fd996-5e9e-4c6d-8c81-2439c466617f | DHCP agent         | devstack | :-)   | True           |
+--------------------------------------+--------------------+----------+-------+----------------+
vagrant@devstack:~/devstack$ neutron dhcp-agent-network-add b59c91d7-3797-4adc-a22b-ab44ec71aecd private
Added network private to DHCP agent

I spawn two VMs, one on each node:

$ nova boot --flavor 1 --image cirros-0.3.1-x86_64-uec --availability-zone nova:devstack cirros1 && nova boot --flavor 1 --image cirros-0.3.1-x86_64-uec --availability-zone nova:neutron cirros2

From the first dnsmasq's namespace:
vagrant@devstack:~/devstack$ sudo ip netns exec qdhcp-a56e31af-7176-4c53-b4a9-87bb74184d24 bash
root@devstack:~/devstack# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
12: tapd3735dd4-23: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:9b:e3:8b brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/24 brd 172.17.0.255 scope global tapd3735dd4-23
    inet6 fe80::f816:3eff:fe9b:e38b/64 scope link
       valid_lft forever preferred_lft forever
root@devstack:~/devstack# tcpdump -vvv -i tapd3735dd4-23 udp port bootps or udp port bootpc
tcpdump: listening on tapd3735dd4-23, link-type EN10MB (Ethernet), capture size 65535 bytes
10:59:21.503183 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 308)
    0.0.0.0.bootpc > 255.255.255.255.bootps: [udp sum ok] BOOTP/DHCP, Request from fa:16:3e:64:c9:b5 (oui Unkn...

Revision history for this message
Édouard Thuleau (ethuleau) wrote :

Yes:
- When a VM boots, it broadcasts a DHCP DISCOVER.
- All DHCP servers respond to the VM with a DHCP OFFER.
- The VM's DHCP client may receive DHCP OFFERs from multiple servers, but it accepts only one offer and then broadcasts a DHCP REQUEST.
- All DHCP servers receive that DHCP REQUEST, but only the server selected by the VM's DHCP client responds with a DHCP ACK and populates its lease cache. The other DHCP servers withdraw any offers they might have made to the client and return the offered address to the pool of available addresses.

In the case of dnsmasq, the DNS cache is populated only when it sends the DHCP ACK response.
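
This exchange is easy to observe with the tcpdump command from comment #8, run inside the DHCP namespace (the namespace and interface names here are taken from that comment and will differ per deployment):

$ sudo ip netns exec qdhcp-a56e31af-7176-4c53-b4a9-87bb74184d24 \
    tcpdump -vvv -i tapd3735dd4-23 udp port bootps or udp port bootpc
# Expected while a VM boots: DISCOVER (broadcast) -> OFFER (from each dnsmasq)
# -> REQUEST (broadcast, naming the chosen server) -> ACK (chosen server only)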

Revision history for this message
Jian Wen (wenjianhn) wrote :

The above comment makes sense.
Thanks.

Changed in neutron:
status: Incomplete → In Progress
Revision history for this message
Sylvain Afchain (sylvain-afchain) wrote :

Test made with two DHCP agents, without Yves' patch:

$ sudo ip netns exec qdhcp-59b09d4e-be08-4978-a77e-f3befc13d675 dig host-10-0-1-20.openstacklocal @10.0.1.18

; <<>> DiG 9.8.1-P1 <<>> host-10-0-1-20.openstacklocal @10.0.1.18
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5671
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;host-10-0-1-20.openstacklocal. IN A

;; ANSWER SECTION:
host-10-0-1-20.openstacklocal. 0 IN A 10.0.1.20

;; Query time: 7 msec
;; SERVER: 10.0.1.18#53(10.0.1.18)
;; WHEN: Thu Dec 5 03:18:36 2013
;; MSG SIZE rcvd: 63

$ sudo ip netns exec qdhcp-59b09d4e-be08-4978-a77e-f3befc13d675 dig host-10-0-1-20.openstacklocal @10.0.1.19

; <<>> DiG 9.8.1-P1 <<>> host-10-0-1-20.openstacklocal @10.0.1.19
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 36275
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;host-10-0-1-20.openstacklocal. IN A

;; Query time: 5 msec
;; SERVER: 10.0.4.2#53(10.0.4.2)
;; WHEN: Sun Dec 1 23:31:53 2013
;; MSG SIZE rcvd: 24

With Yves' patch:

$ sudo ip netns exec qdhcp-59b09d4e-be08-4978-a77e-f3befc13d675 dig host-10-0-1-20.openstacklocal @10.0.1.19

; <<>> DiG 9.8.1-P1 <<>> host-10-0-1-20.openstacklocal @10.0.1.19
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61586
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;host-10-0-1-20.openstacklocal. IN A

;; ANSWER SECTION:
host-10-0-1-20.openstacklocal. 0 IN A 10.0.1.20

;; Query time: 14 msec
;; SERVER: 10.0.1.19#53(10.0.1.19)
;; WHEN: Thu Dec 5 04:17:31 2013
;; MSG SIZE rcvd: 63

I agree with the last comment and think that Yves' patch fixes this issue.

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-2 → icehouse-3
tags: added: havana-backport-potential
Revision history for this message
Kevin Bringard (kbringard) wrote :

FWIW, I was able to cherry-pick this patch and have it apply cleanly to dhcp.py in Havana. The tests patch doesn't apply as cleanly, so I'm trying to find time to work on those; but if someone else has a spare moment it shouldn't be too arduous to backport.

I can also confirm that this solves the issue in my Havana environments.
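
For anyone attempting the same backport, a rough sketch (the stable/havana branch name and conflict locations are assumptions; the commit id is the one from the merge message below, and as noted above the test changes need manual work):

$ git checkout -b bug-1242712-havana origin/stable/havana
$ git cherry-pick 97a529ad8eaee80e196eb362c4e45901a96ae23c
# dhcp.py should apply cleanly; resolve conflicts in the unit tests by hand.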

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-3 → icehouse-rc1
Changed in neutron:
importance: Medium → High
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/52930
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=97a529ad8eaee80e196eb362c4e45901a96ae23c
Submitter: Jenkins
Branch: master

commit 97a529ad8eaee80e196eb362c4e45901a96ae23c
Author: Yves-Gwenael Bourhis <email address hidden>
Date: Mon Oct 21 16:14:06 2013 +0200

    Make dnsmasq aware of all names

    Each dnsmasq instance on a network is not aware of the other dnsmasq
    instances' leases.

    When dnsmasq is launched with --no-hosts and is not provided an --addn-hosts
    file, it can resolve only the hosts to which it gave a DHCP lease and no
    more. I.e.: if dnsmasq service n°1 gives a lease to instance n°1, and
    dnsmasq service n°2 gives a lease to instance n°2, both VM instances and
    dnsmasq services being on the same network, then instance n°1 cannot resolve
    instance n°2, because instance n°1 queries dnsmasq n°1, which did not give
    the lease to instance n°2 and is not aware of its existence. The same issue
    occurs if instance n°2 tries to resolve instance n°1.

    The solution is to provide dnsmasq with an --addn-hosts file listing all
    hosts on the network. With such a file, each dnsmasq instance is aware of
    all the hosts on the network even when it did not give out a host's lease,
    and can therefore resolve any host on its network.

    Change-Id: Ic6d4f7854d250889dded5491e4693fcdce32ed00
    Fixes: bug #1242712
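
For illustration, an --addn-hosts file uses standard hosts-file format (IP address, then names); the path and entries below are hypothetical examples, reusing the host names from the dig tests above. With such a file, every dnsmasq instance on the network can resolve every host, whether or not it gave out the lease:

$ cat /path/to/<network-id>/addn_hosts
10.0.1.19  host-10-0-1-19.openstacklocal  host-10-0-1-19
10.0.1.20  host-10-0-1-20.openstacklocal  host-10-0-1-20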

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-rc1 → 2014.1