Enable neutron to support distributed DHCP agents

Bug #1468236 reported by shihanzhang
This bug affects 6 people
Affects: neutron
Status: Won't Fix
Importance: Wishlist
Assigned to: shihanzhang
Milestone: (none)

Bug Description

The current DHCP service in Neutron is centralized, and it suffers from several ailments in large-scale scenarios:
1. VMs can't get an IP at boot time; most seriously, this means the metadata service can't work.
2. A DHCP agent needs a long time to restart if it has been serving a large number of VMs.
3. The network node hosts a large number of namespaces, especially in a public cloud with many tenants and private networks.

I think we can run the dhcp-agent on all compute nodes, but not exactly like the current DVR; the main differences are as below:
1. It simplifies the dhcp-agent scheduler in neutron-server: when we create a VM, neutron-server just sends the RPC message to the agent on the port's host (see the sketch after this list).
2. The dhcp-agent running on a compute node serves only the VMs on that compute node; if this dhcp-agent goes down, it only affects the VMs running on that node.
3. Move the network-to-dhcp-agent binding from neutron-server to the dhcp-agent; this removes the race that happens between neutron-server's multiple workers.
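
A minimal sketch of what point 1 could look like (the notifier call, topic layout and payload keys here are illustrative assumptions, not Neutron's actual RPC API):

    def notify_dhcp_agent(rpc_client, port):
        # Target only the dhcp-agent on the compute node the port is bound
        # to, instead of fanning out to every agent scheduled to the network.
        host = port.get('binding:host_id')
        if not host:
            return  # port not bound to a host yet, nothing to target
        rpc_client.cast('dhcp_agent.%s' % host, 'port_create_end',
                        payload={'port': port})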

Tags: rfe
tags: added: rfe
Revision history for this message
Kevin Benton (kevinbenton) wrote :

Why not run a dhcp agent on every compute node?

Revision history for this message
Assaf Muller (amuller) wrote :

You do; this spec is about optimizing that. You can read the spec for more details.

Revision history for this message
shihanzhang (shihanzhang) wrote :

Hi Assaf Muller, thanks for your explanation. You are right, this proposal will optimize that; if we just run a dhcp agent on every compute node without the changes in this proposal, it has some problems, as this spec describes.

Revision history for this message
Kyle Mestery (mestery) wrote :

Seems quite reasonable.

Changed in neutron:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-specs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/205429

Revision history for this message
Assaf Muller (amuller) wrote :

@shihanzhang, can you update the bug's description with a problem statement, and the high level architectural choices made? For example, why did you decide to make changes to the OVS agent? What are the pieces involved and why, etc.

Revision history for this message
Nell Jerram (neil-jerram) wrote :

FWIW, this is also what we do with networking-calico, i.e. run a DHCP agent on every compute node.

Changed in neutron:
importance: Undecided → Wishlist
Changed in neutron:
assignee: nobody → shihanzhang (shihanzhang)
Revision history for this message
Akihiro Motoki (amotoki) wrote :

Agree with Assaf. Could you update the bug description with a problem statement and a higher-level approach,
particularly the problems with the current multiple-DHCP-agents approach?

Running the dhcp-agent on each compute node itself sounds reasonable to me. It is similar to the nova-network multi-host approach.

On the other hand, we need to assess the complexity and our code stability when it is introduced.

Revision history for this message
shihanzhang (shihanzhang) wrote :

amotoki, thanks very much for your suggestion. I have updated the description; I hope this can be done in mitaka-1.

description: updated
Revision history for this message
Nell Jerram (neil-jerram) wrote :

On the other hand, anyone working on this bug should be aware that there are also scaling issues with running a DHCP agent on every compute node.

(networking-calico does this, not because of the reasons stated in the bug description above, but because in the calico scenario there is no bridging of networks between compute nodes. So we have gained a little experience of it already.)

Broadly we have seen two issues with running a reference Neutron DHCP agent on every compute node.

1. Each DHCP agent still receives MAC/IP/hostname mapping updates for every VM on the relevant networks - rather than for just the VMs on its own compute node - and so can get behind in its updating of the Dnsmasq config; and that eventually results in DHCP not being ready when the VM is booting, and the VM not getting its IP addresses. This is the problem covered in more detail at https://bugs.launchpad.net/neutron/+bug/1453350. When running a DHCP agent on every compute node, however, the problem could also be mitigated substantially if there was a way for the Neutron server to send only the mapping updates that are relevant to each DHCP agent.

2. As there are more DHCP agents overall - 1 per compute node, instead of 1 or 2 per network - we see problems with load on the Neutron servers, when the number of compute nodes exceeds 250. Specifically, with 10 Neutron servers and >250 compute nodes, we see nova-compute->Neutron requests timing out (after 30s timeout), and hence VM launch failures. We have never exactly pinned down the nature of that loading, but we suspect a combination of (i) handling DHCP agent status reporting, (ii) handling DHCP agent resync requests, and (iii) fanning out port updates to such a large number of agents. We do know for sure that it's caused in some sense by the overall number of DHCP agents, because:

- 10 servers + 240 nodes with DHCP agents + 260 other compute nodes but with no DHCP agent => no VM boot problems

- 10 servers + 500 nodes with DHCP agents => lots of VM launch failures.

In addition there is a specific resync storm issue, which can be easily seen if lots of DHCP agents are started at about the same time. When that happens, the Neutron servers are overloaded and so can't answer many of the resync requests within 30s - so all of the DHCP agents whose requests were failed ask for a resync again...

I can dig out more detail of our exact observations, if that would be helpful - just ask. In the next comment I'll write more about how we've so far tried to address these problems for networking-calico.

Revision history for this message
Nell Jerram (neil-jerram) wrote :

In networking-calico we've tried to solve most of these problems by creating a Calico-specific replacement for the reference Neutron DHCP agent, and you can see the code for this (work in progress) at https://review.openstack.org/#/c/241310/8/networking_calico/agent/dhcp_agent.py. The Calico DHCP agent shares most of the architecture of the reference Neutron DHCP agent, so we are retaining most of the existing value there; the difference is just replacing the top-level script (neutron-dhcp-agent -> calico-dhcp-agent) and class (DhcpAgentWithStateReport -> CalicoDhcpAgent), with CalicoDhcpAgent getting its information about MAC/IP/hostname mappings from Calico's etcd database instead of by RPC from the Neutron server.
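
Very roughly, the shape of such an agent is as in the sketch below; the etcd client interface, key layout and file paths are assumed purely for illustration, and the real code is in the review linked above:

    import os
    import signal

    HOSTS_FILE = '/var/run/calico-dhcp/hosts'       # illustrative path
    PID_FILE = '/var/run/calico-dhcp/dnsmasq.pid'   # illustrative path

    def rewrite_hosts(endpoints):
        # endpoints: (mac, ip, hostname) tuples for the local compute node only.
        with open(HOSTS_FILE, 'w') as f:
            f.writelines('%s,%s,%s\n' % (mac, ip, name)
                         for mac, ip, name in endpoints)

    def hup_dnsmasq():
        # dnsmasq re-reads its --dhcp-hostsfile on SIGHUP.
        with open(PID_FILE) as f:
            os.kill(int(f.read().strip()), signal.SIGHUP)

    def run(local_endpoint_watch):
        # local_endpoint_watch is assumed to yield the full endpoint set for
        # this host each time the relevant etcd subtree changes.
        for endpoints in local_endpoint_watch:
            rewrite_hosts(endpoints)
            hup_dnsmasq()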

...

Revision history for this message
Nell Jerram (neil-jerram) wrote :

We then have a system where there is a Calico DHCP agent on each compute node, but the Neutron server is not aware of those because they don't report any agent state; hence the server does not have to handle DHCP agent state reports, or resync requests, or to fan out port updates to many agents. Instead the Calico plugin/mech driver writes information into a distributed etcd database, and the DHCP agents get the information that they need from that database.

(We do still need occasional resyncs between the Neutron DB and the etcd DB - but that is a problem that we already handle.)

Revision history for this message
shihanzhang (shihanzhang) wrote :

Hi Neil, thanks very much for your good suggestions! For your questions:
1. This proposal will change the notification that neutron-server sends to the dhcp-agent. When Nova creates a VM, it creates a port with the compute host id; when neutron-server receives this API request, it will send the notification only to the compute host where that VM runs, and other dhcp-agents will not receive it, so your first problem will not exist in this proposal.

2. This proposal will change the RPC request from the dhcp-agent to neutron-server: when a dhcp-agent restarts, it will only fetch the info for the ports on its own compute node (see the sketch after this list).

3. neutron-server now has separate agent state reporting queues, so I think your second problem will not exist.
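
To make point 2 concrete, the resync on agent restart could look roughly like the sketch below; the RPC method name and filter are assumptions for illustration, not the existing dhcp-agent RPC API:

    import socket

    def sync_state(plugin_rpc):
        host = socket.gethostname()
        # Hypothetical server-side call: return only the ports bound to this
        # compute node, instead of every port on every scheduled network.
        ports = plugin_rpc.get_ports_on_host(host=host)
        networks = {p['network_id'] for p in ports}
        return networks, ports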

If you have other concerns, feel free to comment on this bug.

Revision history for this message
Akihiro Motoki (amotoki) wrote :

First I would like to clarify the problem statement.

Looking at the problem description, I see two problems:
* The number of ports used by dhcp-agent
* dnsmasq reloading time depending on the number of tenant IPs

I am still not sure how much these problems affect large-scale deployments.
If we introduce a distributed dhcp-agent, it increases the complexity.
I think it is important to assess the balance between the complexity and
the benefit we can get.

I think Neil has more experience with this because networking-calico tackles a similar problem.

Revision history for this message
Akihiro Motoki (amotoki) wrote :

These are some follow-up comments on the current problem statement.

Almost all of my previous comments have been answered by Neil's comments above.
I think the problem statement can be simpler.

(If the description is changed, we cannot track the discussion
context, so I would like to quote the current description here)

> The existing DHCP agent suffers from several ailments in the
> large-scale scenarios:
> 1. Centralized DHCP agent can't serve well for VMs
> * VM can't get IP at first booting time(For a new port,
> neutron-server will send notification to dhcp-agent, then
> dhcp-agent receives this notification and re-load dnsmasq, during
> dnsmasq re-loading, it can't handle new request)
> * DHCP agent need much time to reboot if it has served for a large
> VMs

Is the second item a cause of the problem?
If so, there are several solutions. One is to reduce the number of VMs
which one dnsmasq instance serves. Another is to use a different DHCP
server (we have a proposal to use the ISC DHCP server).

> 2. When not running HA it has the problem of single point of failure.

Isn't it a deployer choice whether the deployer needs HA or not?
The proposed 'distributed' way is one of the ways to improve HA,
so it is not a problem.

> 3. When running HA you get redundancy but at a price:
> * More IPs consume(every DHCP agent will consume a IP in this
> network)

If we have more agents for a network, they need more neutron ports.
It is a problem if dhcp_agents_per_network is large, but
I wonder how many DHCP agents per network you expect?

> * Wasted resources (like CPU and Memory, these DHCP agents are in
> ACTIVE/ACTIVE mode, they have all IP/MAC info in this network).
> * More network load (Neutron notifications, Guest traffic).

Theoretically true, but I am not sure this is a real problem.
DHCP traffic is intermittent. How much does it affect a deployment?

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Personally, I'm opposed to this idea.
An implementation similar to DVR, involving the ovs-agent, would severely complicate the code and IMO is not worth the benefit.

Revision history for this message
Akihiro Motoki (amotoki) wrote :

Regarding the approach, we can discuss it in a spec review once we have a consensus on this kind of demand.
In general, it sounds like there is a problem and some action is required.

Revision history for this message
shihanzhang (shihanzhang) wrote :

@Eugene Nikanorov, 'similar to DVR' just means running the dhcp-agent on every compute node, but it has the benefits below:
1. It simplifies the dhcp-agent scheduling in neutron-server: when we create a VM, neutron-server just sends the RPC message to the agent on the port's host.
2. The dhcp-agent running on a compute node serves only the VMs on that compute node; if this dhcp-agent goes down, it only affects the VMs running on that node.
3. At large scale, even if there is a large number of VMs in the cloud, each compute node still hosts only a bounded number of VMs.
4. It will not complicate the ovs-agent code; it may just need to add some flows.

Revision history for this message
shihanzhang (shihanzhang) wrote :

@Eugene Nikanorov, @amotoki, with the current dhcp-agent, what could we do for this scenario: in our public cloud we have 2000 compute nodes, 5 controller nodes and 5 network nodes, with 10000 VMs running on it, and we have faced the same problems as Neil:
1. When a tenant boots a VM, sometimes the VM can't get its fixed IP.
2. When a network node reboots, the dhcp-agent needs a long time to recover its service.

We have implemented this in our public cloud, and it solves our problems.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

shihanzhang, I can understand the scalability problems. By the way, we've successfully tested a cloud with just 3 controllers running 12k simple VMs on 200+ nodes. DHCP was not an issue.

For your case you could just run DHCP agents on some compute nodes, or even on every compute node.
That would be fine. However, 'distributing' DHCP ports as in the DVR approach will be just as complex.
I don't think accessing the DHCP server locally is worth such effort.

Revision history for this message
shihanzhang (shihanzhang) wrote :

Hi enikanorov, thanks for your comments!
You can read Neil's comments in #10; if we just run the dhcp-agent on each compute node without changing the dhcp-agent scheduling and RPC methods, there are many problems.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Yes, I have read it. The failures Neil is seeing are not connected to the fact that each DHCP agent receives the full mapping.
So I'll reply to his points:

1. https://bugs.launchpad.net/neutron/+bug/1453350 has nothing to do with the number of VMs or the number of DHCP agents. Distributed DHCP will not solve this problem.

2. Load on servers.
Distributed DHCP will make this problem worse, especially coupled with DVR, which puts 2 additional agents per compute node.
In fact, it is a much more complicated query to fetch the ports on a particular compute node than to fetch the ports belonging to a network.

The problems with server load should have been fixed by moving to a separate state-reports queue.

Revision history for this message
shihanzhang (shihanzhang) wrote :

Hi enikanorov, regarding bug #1453350 that you mentioned: you are right that distributed DHCP will not fully solve this problem, but it can greatly reduce the race, because the number of VMs on one compute node is bounded and will not be very large.

2. Why do you think "it is much more complicated query to fetch ports on particular compute"? To query the ports we just need to filter by host_id and network_id; do you think that is much more complicated than filtering by network_id alone? (A sketch of the two queries is below.)
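
To illustrate the question in point 2: the per-host query is roughly the per-network query plus one extra join and equality predicate. The model and column names below are assumptions that only loosely mirror Neutron's port and port-binding tables:

    def ports_for_network(session, Port, network_id):
        return session.query(Port).filter(Port.network_id == network_id).all()

    def ports_for_host_on_network(session, Port, PortBinding, network_id, host):
        return (session.query(Port)
                .join(PortBinding, PortBinding.port_id == Port.id)
                .filter(Port.network_id == network_id,
                        PortBinding.host == host)
                .all())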

description: updated
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I am not convinced that going fully distributed is the answer to the scaling issues being reported here, but if we really wanted to go fully distributed we shouldn't use an agent-based approach.

Let's start the discussion at the drivers meeting. I am sure this will take a few rounds.

Changed in neutron:
status: Confirmed → Triaged
Revision history for this message
Nell Jerram (neil-jerram) wrote :

Armando's plan sounds good to me. I'd like to clarify a few things, though, as I'm not sure I fully explained _why_ I wrote various comments above.

- I wrote about my experience of scaling problems with a distributed DHCP in order to say 'Please be careful, don't assume that distribution will magically fix everything, because actually it can _create_ other scaling problems.' Then I wrote some detail about what I think those problems are, and how we are addressing them in a Calico-specific DHCP agent; but I did not mean to imply that exactly similar approaches would work for the Neutron DHCP agent.

- There is one scaling issue that is, currently, exactly the same for distributed and non-distributed DHCP: When there is a large number of VMs, it takes a long time for neutron.agent.linux.Dnsmasq to rewrite dnsmasq's config files and then send a HUP signal to dnsmasq. Therefore if there is a steady stream of port_updates coming to the DHCP agent (for example, if VMs are constantly being created and terminated), it is possible for a queue of unprocessed port_updates to build up, with an increasing time lag between when a port_update was initiated by the server, and when DHCP for that port is actually ready. This is the problem described by https://bugs.launchpad.net/neutron/+bug/1453350, and I believe that it _is_ related to the total number of VMs, but _not_ (as the code currently stands) to the number of DHCP agents. https://review.openstack.org/#/c/220758/ is one approach that helps with this, but there are broadly three other options as well: (1) use distributed DHCP and only send a subset of port_updates to each DHCP agent, for the ports that that DHCP agent should be responsible for; (2) use a different DHCP server that provides a more dynamic update interface, or a better Dnsmasq interface if there is one; (3) arrange somehow that Nova will not start booting the VM until DHCP is known to be ready for that VM's port(s).
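
One way to picture why the backlog forms and why coalescing helps: each port_update costs a full hosts-file rewrite plus a dnsmasq reload, so draining the queue in batches bounds the number of reloads. This is purely an illustration of the idea, not the code in the review above:

    import queue

    def process_updates(update_queue, rewrite_and_hup):
        while True:
            batch = [update_queue.get()]      # block until at least one update arrives
            while True:
                try:
                    batch.append(update_queue.get_nowait())  # drain the backlog
                except queue.Empty:
                    break
            rewrite_and_hup(batch)            # one config rewrite + HUP per batch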

- Looking again at _this_ bug, I think it needs more clarification of what the specific problems are. Firstly, it is a bit of a problem that there are 2 competing overall statements, in the Bug Description and in comment #15. It would be good for shihanzhang and amotoki to agree on an overall statement, and then put that in the Bug Description. Secondly, there are parts of both current statements that I don't understand, for example:

  "network node has a large number of namespaces, especially in public cloud, there are so many tenants and private networks." Why is this a problem?

  "The number of ports used by dhcp-agent" What is the problem here?

Thanks; I hope this helps and doesn't just confuse further.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

@Neil, the way you are deploying the agent is not the same as running it on every compute node with the normal setup, though. Your topology will have scaling issues because you want every compute node to serve the VMs it hosts locally.

In a normal setup, if you run on every compute node, any given number of agents will have a small subset of the networks scheduled to it.

Say you have 100 VMs per network, 100 networks, and 100 compute nodes.

With a normal deployment of the dhcp agent on every node and a setting of 2 dhcp agents per network, that makes 2 networks per dhcp agent, so 200 dnsmasq entries that each agent has to manage. This is an easy amount for each agent to handle, and that's assuming a packing efficiency of 100 VMs per node.
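
Working through those numbers (dhcp_agents_per_network is the existing Neutron option; the figures are the ones above):

    vms_per_network = 100
    networks = 100
    compute_nodes = 100            # one reference DHCP agent per node
    dhcp_agents_per_network = 2

    network_agent_bindings = networks * dhcp_agents_per_network    # 200
    networks_per_agent = network_agent_bindings // compute_nodes   # 2
    entries_per_agent = networks_per_agent * vms_per_network       # 200 dnsmasq entries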

So the Calico deployment, where each node responds to all the VMs it hosts, has completely different scaling considerations from each agent just having responsibility for the networks scheduled to it.

Revision history for this message
Nell Jerram (neil-jerram) wrote :

Thanks, @Kevin; I already realized that the calico approach was different from normal Neutron DHCP scheduling and HA, but I hadn't fully understood how, so your explanation here is much appreciated.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Some discussion happened during today's drivers meeting [1].

The neutron dhcp architecture already allows a deployer to increase the number of dhcp agents to deal with scale and HA. In some extreme cases, a dhcp agent can run on each compute node, but it's clear that even though this topology is possible, it is not to be promoted, because the extra control plane load may have a negative impact on the end-to-end system. Using an agent-based model to support distributed dhcp as advised here (one dhcp agent per compute host) can lead to code and deployment complexity that I don't think we should encourage.

On the other hand, there are distributed strategies that do not rely on agents to serve dhcp traffic at all (see dragonflow or ovn) that the author should be aware of.

[1] http://eavesdrop.openstack.org/meetings/neutron_drivers/2015/neutron_drivers.2015-12-08-15.03.log.html

Changed in neutron:
status: Triaged → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron-specs (master)

Change abandoned by shihanzhang (<email address hidden>) on branch: master
Review: https://review.openstack.org/205429
Reason: won't fix in neutron

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by shihanzhang (<email address hidden>) on branch: master
Review: https://review.openstack.org/184423
Reason: won't fix in neutron
