neutron dhcp agent state not consistent with real status

Bug #1988281 reported by norman shen
Affects: neutron
Status: Won't Fix
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

We have a deployment with 4 servers, each of which acts as both a network node and a compute node. The hosts run in the same rack and, to make things worse, the power supply is not very stable, which means that occasionally all physical servers lose power at the same time. After a reboot, we found that virtual machines (especially the CentOS series) could lose their IP addresses, because a rebooting virtual machine may not wait for the DHCP agents to be ready.

We are observing that the neutron-dhcp-agent's reported state deviates from its "real state"; by real state, I mean that all hosted dnsmasq processes are running and configured.

For example, say agent A is hosting 1,000 networks. If I reboot agent A, all dnsmasq processes are gone and the DHCP agent will try to restart every one of them, which introduces a long delay between the agent starting and the agent handling new RabbitMQ messages. But strangely, "openstack network agent list" will show that the agent is up and running, which IMO is inconsistent. I think that in this situation, "openstack network agent list" should report the corresponding agent as down.

norman shen (jshen28)
description: updated
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

The agent is active at this point, but it is in the initial transient period taken to resync, reading from the Neutron API and executing the required actions for each network.

During this transient period the agent won't attend to new updates, but it is still active. As seen in other agents too (OVS, L3, OVN metadata), the resync process can take time. This could be an opportunity to improve the agent API by adding this information: whether or not the agent is still in its initial sync period.
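
As a rough illustration of that idea, here is a minimal sketch in plain Python, not Neutron code; AgentHeartbeat, initial_sync_done, and the reported payload are all hypothetical names. The heartbeat would carry a fine-grained status field alongside the plain liveness flag:

import threading
import time


class AgentHeartbeat:
    """Hypothetical heartbeat that also reports initial-sync progress."""

    def __init__(self, report_interval=30):
        self.report_interval = report_interval
        self.initial_sync_done = threading.Event()

    def report_state(self):
        # Today the heartbeat only proves liveness; the suggestion is to
        # also ship whether the initial resync has finished.
        syncing = not self.initial_sync_done.is_set()
        print({"alive": True, "initial_sync_in_progress": syncing})

    def heartbeat_loop(self):
        while True:
            self.report_state()
            time.sleep(self.report_interval)

    def finish_initial_sync(self):
        # Called once every hosted network has been resynced.
        self.initial_sync_done.set()

A real agent runs the reporting loop in its own thread, which is exactly why the agent shows UP during the resync today; the extra field would let "openstack network agent list" distinguish "up" from "up but still syncing".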

Revision history for this message
Brian Haley (brian-haley) wrote :

I would agree with Rodolfo that this is more of an RFE, as there isn't any fine-grained status info; in this case UP just indicates the agent is running.

As an FYI, the agent is consuming messages off the queue as it's doing a full sync, and it should also be receiving other messages as instances are created/destroyed. Also, these "new" messages have a priority value such that they should be processed sooner than some of the full-sync ones, based on the code and comments in the notifier code.

neutron/api/rpc/agentnotifiers/dhcp_rpc_agent_api.py

# In order to improve port dhcp provisioning when nova concurrently create
# multiple vms, I classify the port_create_end message to two levels, the
# high-level message only cast to one agent, the low-level message cast to all
# other agent. In this way, When there are a large number of ports that need to
# be processed, we can dispatch the high priority message of port to different
# agent, so that the processed port will not block other port's processing in
# other dhcp agents.
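
Read as a sketch, the two-level dispatch that comment describes looks roughly like the following plain Python; cast, PRIORITY_HIGH, and PRIORITY_LOW are illustrative stand-ins, not the notifier's real API:

PRIORITY_HIGH = 0  # drained first from an agent's message queue
PRIORITY_LOW = 1   # may wait behind full-sync work


def cast(agent, event, payload, priority):
    # Stand-in for the RPC cast onto one agent's queue.
    print(f"{agent}: {event} priority={priority}")


def notify_port_create(agents, port):
    # One agent gets the high-priority copy and is expected to provision
    # the port promptly; the rest get a low-priority copy, so a busy agent
    # doing a full sync does not block the port's processing elsewhere.
    first, rest = agents[0], agents[1:]
    cast(first, "port_create_end", port, PRIORITY_HIGH)
    for agent in rest:
        cast(agent, "port_create_end", port, PRIORITY_LOW)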

It can take a long time for any agent to complete a full-sync operation on a restart, but we have tried to speed it up as best we can, and there's probably always room for improvement. The other option is to go to an OVN backend, which removes these agents completely...

Changed in neutron:
importance: Undecided → Wishlist
status: New → Opinion
Revision history for this message
norman shen (jshen28) wrote :

Thank you for the feedback. The issue with the OVN backend is that it does not support VXLAN yet (?).

At least for the DHCP agent, it does not respond to RPC calls until the first resync finishes, whether it succeeds or not. Not until https://opendev.org/openstack/neutron/src/commit/4b83bf462d14824caf5f07569377de4276ed99c0/neutron/service.py#L362 is called is the RPC server ready to handle RPC calls, and the resync process happens in manager.init_host. But the report_state function is called periodically in a coroutine and is thus not affected by init_host.

In my opinion, the state reporting could perhaps be started only after init_host successfully completes.
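
A minimal sketch of that proposed ordering, assuming simplified placeholders for the real entry points (init_host, start_rpc_listeners, and report_state below are stand-ins, not Neutron's actual code):

import time


def init_host():
    # Stand-in for the resync: respawn and reconfigure dnsmasq per network.
    time.sleep(1)


def start_rpc_listeners():
    print("RPC server started; new messages can now be handled")


def report_state(alive):
    print(f"report_state: alive={alive}")


def start_service():
    init_host()                # block until every dnsmasq is back
    start_rpc_listeners()      # only now can RPC calls actually be served
    report_state(alive=True)   # the agent flips to UP after the resync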

norman shen (jshen28)
description: updated
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

It was intentionally done like that in the past: the heartbeat was moved to a separate thread so that it can send messages to the Neutron server even when the agent is busy. This was done to fix the problem of "flapping" agents and of rescheduling networks from "dead" agents to other ones which are alive.
I agree with Rodolfo and Brian that we could maybe propose more fine-grained statuses for agents, but I think that should be opened as a new RFE, so I'm going to close this bug now.
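
For reference, the decoupling Slawek describes can be sketched like this (illustrative Python using a plain thread; the real agent has its own reporting loop): the heartbeat keeps firing even while the main thread is blocked in a long resync, so the server never declares the agent dead and never reschedules its networks.

import threading
import time


def heartbeat_loop(stop, interval=1.0):
    # Runs in its own thread, so it reports liveness no matter how long
    # the main thread spends resyncing.
    while not stop.wait(interval):
        print("report_state: alive")


stop = threading.Event()
threading.Thread(target=heartbeat_loop, args=(stop,), daemon=True).start()
time.sleep(3)  # stand-in for a long, blocking full resync
stop.set()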

Changed in neutron:
status: Opinion → Won't Fix