neutron dhcp agent state not consistent with real status
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Won't Fix | Wishlist | Unassigned |
Bug Description
We have four servers, each acting as both a network node and a compute node. The hosts all run in the same rack and, to make things worse, the power supply is not very stable, so occasionally all of the physical servers lose power at the same time. After reboot, we found that virtual machines (especially the CentOS series) can lose their IP addresses, because a rebooting virtual machine may not wait for the DHCP agents to be ready.
We are observing that neutron-
For example, suppose agent A is hosting 1,000 networks. If I reboot agent A, all of its dnsmasq processes are gone, and the DHCP agent will try to restart every dnsmasq process. This introduces a long delay between the agent starting and the agent handling new RabbitMQ messages. Oddly, openstack network agent list still shows the agent as up and running, which IMO is inconsistent. I think that in this situation openstack network agent list should report the corresponding agent as down.
The agent is active at this point, but it is in the initial transient period needed to resync: it reads from the Neutron API and executes the required actions for each network.
During this transient period the agent won't process new updates, but it is still active. As seen in other agents too (OVS, L3, OVN metadata), the resync process can take time. This could be an opportunity to improve the agent API by adding this information: whether or not the agent is in its initial sync period.
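
To illustrate the distinction drawn above, here is a minimal toy model, not Neutron code. It assumes that liveness is derived only from periodic heartbeats, which keep flowing while the resync runs, and it adds a hypothetical initial_sync_done flag of the kind suggested. The class name, the timing constants and the per-network cost are all illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative values for this sketch only (assumptions, not read from any config).
REPORT_INTERVAL = 30.0   # seconds between heartbeats
AGENT_DOWN_TIME = 75.0   # seconds without a heartbeat before "down"


@dataclass
class ToyDhcpAgent:
    """Hypothetical stand-in for a DHCP agent that has just restarted."""
    networks: int
    seconds_per_network: float  # assumed cost of restarting one dnsmasq
    started_at: float = 0.0

    def resync_finished_at(self) -> float:
        # The initial resync walks every hosted network once.
        return self.started_at + self.networks * self.seconds_per_network

    def last_heartbeat(self, now: float) -> float:
        # Heartbeats are sent on a timer, independently of the resync.
        beats = int((now - self.started_at) // REPORT_INTERVAL)
        return self.started_at + beats * REPORT_INTERVAL

    def is_alive(self, now: float) -> bool:
        # Liveness looks only at heartbeat age.
        return now - self.last_heartbeat(now) < AGENT_DOWN_TIME

    def initial_sync_done(self, now: float) -> bool:
        # Hypothetical extra field: has the first full resync completed?
        return now >= self.resync_finished_at()


def describe(agent: ToyDhcpAgent, now: float) -> str:
    """Status string that separates 'alive' from 'ready for new updates'."""
    if not agent.is_alive(now):
        return "down"
    if not agent.initial_sync_done(now):
        return "up (initial sync in progress)"
    return "up"


if __name__ == "__main__":
    agent = ToyDhcpAgent(networks=1000, seconds_per_network=0.5)
    for t in (60.0, 300.0, 600.0):
        print(t, describe(agent, t))
```

In this model the agent counts as alive for the whole resync window, which matches the behaviour reported in the bug. Only the extra flag lets a caller tell that new updates are not yet being handled.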