[RFE] admin-state-down doesn't evacuate bindings in the dhcp_agent_id column

Bug #1825345 reported by Thomas Goirand on 2019-04-18
This bug affects 2 people
Affects: neutron
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Hi,

This is a real report from the production front, from a deployment that caused us a lot of head-scratching because of some broken hardware.

If, for some reason, a node running the neutron-dhcp-agent has some hardware issue, then an admin will probably want to disable the agent there. This is done with, for example:

neutron agent-update --admin-state-down e865d619-b122-4234-aebb-3f5c24df1c8e

or something like this too:

openstack network agent set --disable e865d619-b122-4234-aebb-3f5c24df1c8e

This works, and no new network will be assigned to this agent in the future. However, if some networks were already assigned to this agent, they won't be evacuated.

What needs to be done is:

1/ Perform an update of the networkdhcpagentbindings table, and remove all instances of e865d619-b122-4234-aebb-3f5c24df1c8e that we see in dhcp_agent_id. The networks should be reassigned to another agent. Best would be to spread the load over many agents, if possible; otherwise reassigning all networks to a single agent would be ok-ish.
2/ Restart the neutron-dhcp-agent process on the node where the networks have been moved, so that a new dnsmasq process starts for each of these networks.
3/ Attempt to get the disabled agent to restart as well, knowing that reaching it may fail (since it has been disabled, that's probably because it's broken somehow...).
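
The "spread the load" part of step 1 can be illustrated with a minimal sketch. It only plans a round-robin reassignment and touches neither the database nor the API. The function name and the IDs are hypothetical, chosen for illustration, and this is not how Neutron's scheduler works.

```python
from collections import defaultdict
from itertools import cycle


def plan_evacuation(network_ids, target_agent_ids):
    """Spread the networks of a disabled agent over the target agents.

    Hypothetical helper: returns {agent_id: [network_id, ...]} using a
    plain round-robin, so each target agent gets a similar share.
    """
    if not target_agent_ids:
        raise ValueError("no target DHCP agent available")
    plan = defaultdict(list)
    # cycle() repeats the target agents for as long as networks remain.
    for network_id, agent_id in zip(network_ids, cycle(target_agent_ids)):
        plan[agent_id].append(network_id)
    return dict(plan)


if __name__ == "__main__":
    # Illustrative IDs only; real ones would be UUIDs.
    print(plan_evacuation(["net-a", "net-b", "net-c"], ["agent-1", "agent-2"]))
```

With a single target agent, the same function degrades to the "ok-ish" case where every network lands on that one agent.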

Currently, one needs to do all of this by hand. I've done that, and restored connectivity to a working DHCP server, as our user expected. This is kind of painful and boring to do, plus that's not really what an openstack user is expecting.

In fact, if we could also provide something like this, it'd be super nice:

openstack network agent evacuate e865d619-b122-4234-aebb-3f5c24df1c8e

then we'd be using it during the "set --disable" process.

Cheers,

Thomas Goirand (zigo)

Hongbin Lu (hongbin.lu) wrote :

This looks like an enhancement. I tagged it as RFE

summary: - admin-state-down doesn't evacuate bindings in the dhcp_agent_id column
+ [RFE] admin-state-down doesn't evacuate bindings in the dhcp_agent_id
+ column
tags: added: rfe
Changed in neutron:
importance: Undecided → Wishlist
status: New → Confirmed
Miguel Lavalle (minsel) wrote :

Hi Zigo,

If I do $ neutron help | grep dhcp-agent, this is the output I get:

  dhcp-agent-list-hosting-net List DHCP agents hosting a network.
  dhcp-agent-network-add Add a network to a DHCP agent.
  dhcp-agent-network-remove Remove a network from a DHCP agent.
  net-list-on-dhcp-agent List the networks on a DHCP agent.

My questions are:

1) Why do you have to manipulate the networkdhcpagentbindings? Doesn't dhcp-agent-network-remove do the same thing (along with a corresponding dhcp-agent-network-add to another agent)?

2) Doesn't dhcp-agent-network-add cover your point 2?

3) Using the commands in the above list, couldn't the user write some scripts to achieve what you propose?
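
As a rough idea of what such a script could look like, the sketch below only builds the command lines for moving one network between agents, using the two neutron client commands listed above. The helper names are hypothetical, and the positional order (agent first, then network) is an assumption to check against `neutron help dhcp-agent-network-add` on your release.

```python
import shlex


def neutron_move_commands(network_id, old_agent_id, new_agent_id):
    """Build the neutron client commands that move one network.

    Hypothetical helper: the network is removed from the old DHCP agent
    first, then added to the new one. Nothing is executed here.
    """
    return [
        ["neutron", "dhcp-agent-network-remove", old_agent_id, network_id],
        ["neutron", "dhcp-agent-network-add", new_agent_id, network_id],
    ]


def render(commands):
    """Turn argument lists into shell-safe strings for review or logging."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]


if __name__ == "__main__":
    # Illustrative IDs only; real ones would be UUIDs.
    for line in render(neutron_move_commands("net-a", "agent-old", "agent-new")):
        print(line)
```

Printing the commands instead of running them keeps the sketch safe to try; an operator could pass each argument list to `subprocess.run` once the output looks right.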

Thomas Goirand (thomas-goirand) wrote :

Hi Miguel,

I didn't know about these. In fact, I didn't search enough in the neutron command itself, I searched mostly on the openstack client command.

Yes, the above can be done "by hand" (through a script). Though it'd be really nice if there was automation on the Neutron side to do all of this automatically. For example, an option like this:

openstack network agent set --disable --automatic-migration e865d619-b122-4234-aebb-3f5c24df1c8e

Also, it'd be nice if we could do this with the openstack client, rather than the neutron client, as users probably expect, like I did, that everything has been ported there.

Your thoughts?
Cheers,

Thomas Goirand (zigo)

Slawek Kaplonski (slaweq) wrote :

@Thomas:

I think that doing what Miguel said is possible with openstack client too. Please check commands "network agent remove network" and "network agent add network".

But I also think that maybe when an agent's admin_state_up is set to False, it should evacuate all networks/routers to the other agents. IMHO this would be a good change.

Thomas Goirand (thomas-goirand) wrote :

I just tried in Stein, and to me, network agent remove network / network agent add network doesn't work (or at least, doesn't produce the same as neutron dhcp-agent-network-add/remove). At least, this needs investigation.

Miguel Lavalle (minsel) wrote :

@Thomas,

1) What you are saying is that with the openstack client you cannot do the same thing as with the corresponding commands in the neutron client?

2) Are the neutron client commands enough to accomplish what you propose? If yes, it means that we have the correct API calls, we just need to use them correctly in the openstack client

Thomas Goirand (thomas-goirand) wrote :

Miguel,

Your understanding is correct.

Miguel Lavalle (minsel) wrote :

Thanks for your responses. We will discuss it in the drivers meeting

tags: added: rfe-triaged
removed: rfe
Slawek Kaplonski (slaweq) wrote :

So IIUC the last comments from Miguel and Thomas, it seems that there are only some feature gaps in the OpenStack client compared to python-neutronclient. Is that true? If so, I think this should be reported as a bug for the OpenStack client instead of an RFE for Neutron.

Slawek Kaplonski (slaweq) wrote :

According to my discussion with Zigo on IRC: http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2019-10-11.log.html#t2019-10-11T12:17:45 zigo's proposal is to add "evacuate" option for agents. Something like e.g.:

openstack network agent --evacuate --set disable <ID>

or

openstack network agent evacuate <ID>

As all the needed API is already in Neutron, this can also be implemented on the client side, e.g. in openstackclient.
Pros and cons for both solutions:

- server side implementation of new API:
  * pros: much faster, especially if there are many networks hosted on one agent,
  * cons: if there are many networks on one agent and all of them are quickly evacuated to another agent, this new agent may be overloaded with the configuration of new networks, and this may cause problems, e.g. with the configuration of new ports at the same time,

- client side implementation (without new API):
  * pros: new agents shouldn't be overloaded, as evacuation would be slower, so the new agent would have more time to configure everything,
  * cons: slower evacuation :)
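
The trade-off above can be made concrete with a minimal sketch of a throttled, client-side evacuation: networks are moved in small batches with a pause in between, so the receiving agents get time to configure themselves. The function names, batch size and pause are hypothetical, and `move` stands for whatever call actually reassigns one network.

```python
import time


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def evacuate_in_batches(network_ids, move, batch_size=5,
                        pause_seconds=10.0, sleep=time.sleep):
    """Call move(network_id) for every network, pausing between batches.

    Hypothetical helper: `move` is supplied by the caller, and `sleep`
    is injectable so the throttling can be exercised without waiting.
    """
    moved = 0
    for index, batch in enumerate(batches(list(network_ids), batch_size)):
        if index:
            # Give the target agents time to settle before the next batch.
            sleep(pause_seconds)
        for network_id in batch:
            move(network_id)
            moved += 1
    return moved
```

A server-side implementation could apply the same idea internally; the sketch only shows that "slower" can be a deliberate, tunable choice rather than a side effect.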

Let's discuss it in the drivers meeting to see what others think about it and which solution we should choose.

Slawek Kaplonski (slaweq) wrote :

@Zigo: according to our discussion in the last drivers meeting (http://eavesdrop.openstack.org/meetings/neutron_drivers/2019/neutron_drivers.2019-10-18-14.01.log.html#l-44) we would like to clarify some things about the potential new API:

1. Do you want a new API that would be something like "evacuate all networks from agent X", or rather something like what nova has, e.g. "evacuate network X from agent Y"?

2. What would the --evacuate option look like at the REST API level? Can you specify it in some detail?

3. Are you willing to implement this feature if it is accepted?

YAMAMOTO Takashi (yamamoto) wrote :

I feel it's natural to have it in the server, as we already have similar things like allow_automatic_dhcp_failover/allow_automatic_l3agent_failover

Thomas Goirand (thomas-goirand) wrote :

Hi,

The idea is to be able to empty a node for maintenance. So, IMO, this should be implemented in both DHCP and L3 agents. So, to me, it should be:

"evacuate all networks from agent X"

So, something like:

openstack network agent evacuate <AGENT-ID>

and the DHCP / L3 agent gets emptied (with the networks rescheduled "somewhere else") and can even be deleted, if needed. IMO, we could even evacuate all networks when we do an "openstack network agent delete" by default. And that's where I had the idea of --evacuate. Let's say an agent has some networks: the delete would be forbidden unless the user adds --evacuate.
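
The proposed delete rule can be sketched as a small guard: deleting an agent that still hosts networks is refused unless evacuation was requested. This is only an illustration of the proposal; the exception and function names are hypothetical, and no such check is known to exist in Neutron or the client.

```python
class AgentNotEmpty(Exception):
    """Raised when an agent still hosts networks and may not be deleted."""


def networks_to_evacuate_before_delete(hosted_network_ids, evacuate=False):
    """Return the networks to reschedule before an agent can be deleted.

    Hypothetical helper mirroring the proposed --evacuate flag: an empty
    agent can always be deleted, a non-empty one only with evacuate=True.
    """
    hosted = list(hosted_network_ids)
    if hosted and not evacuate:
        raise AgentNotEmpty(
            "agent still hosts %d network(s); pass evacuate=True" % len(hosted)
        )
    return hosted
```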

IMO, this should be done at the server API level, not on the client. If evacuate overloads agents, then we may make it artificially slower as a first approach, or even better, have a mechanism to wait until the agent ACKs that it's in good shape (i.e. ports configured, etc.).
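
The "wait until the agent ACKs" idea can be sketched as follows: one network is moved at a time, and the next move only starts once the target agent reports that it is ready. Both `move` and `is_ready` are hypothetical callbacks supplied by the caller; how an agent would actually report readiness is left open here, as it is in the proposal.

```python
import time


def evacuate_with_ack(moves, move, is_ready, poll_seconds=2.0,
                      max_polls=30, sleep=time.sleep):
    """Move networks one by one, waiting for the target agent each time.

    `moves` is a list of (network_id, agent_id) pairs. After each move,
    is_ready(agent_id) is polled up to max_polls times. Hypothetical
    sketch: returns the moved network IDs, raises TimeoutError otherwise.
    """
    done = []
    for network_id, agent_id in moves:
        move(network_id, agent_id)
        for _ in range(max_polls):
            if is_ready(agent_id):
                break
            sleep(poll_seconds)
        else:
            # The loop ended without the agent ever reporting readiness.
            raise TimeoutError("agent %s did not become ready" % agent_id)
        done.append(network_id)
    return done
```

Compared with a fixed artificial delay, this only waits as long as the target agent actually needs, at the cost of requiring some readiness signal from the agent.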

Unfortunately, I don't think I have the required skills, or the time to acquire them, to implement this myself. Though I'm convinced these would be super nice options for operators and would simplify maintenance a lot. I just hope my feedback as an operator is useful here. If nobody has the time, I can probably have a look, but then I will need a lot of guidance, as I have never added code to Neutron. In fact, I have only sent bug fixes to Gerrit over the years (a lot of them...). :)

Slawek Kaplonski (slaweq) wrote :

Thx for the clarification, Thomas.
The more I think about it, the more I'm convinced that this should be done on the server side.
Let's talk about it one more time in the drivers meeting.

If you want, I would be happy to help with the development of this in the U cycle if it is approved.

Slawek Kaplonski (slaweq) wrote :

According to the decision in the drivers meeting on 25.10.2019, this RFE is now approved. Thx for proposing it :)

tags: added: rfe-approved
removed: rfe-triaged
Changed in neutron:
milestone: none → ussuri-1
Changed in neutron:
milestone: ussuri-1 → none