HA router state may be set incorrectly in the Neutron DB in some cases
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Low | Unassigned |
Bug Description
When the Neutron L3 agent reports the state of an HA router to the Neutron server, it logs a message like:
2023-07-30 23:06:49.216 3139 DEBUG neutron.agent.l3.ha [-] Updating server with HA routers states {'f5d52aaf-
During a controller reboot or some other fault in the cloud (we hit this in the faults Tobiko test in our downstream CI), this message may never be delivered to the Neutron server. The router state in the Neutron DB is then incorrect, and it is reported incorrectly through the Neutron API.
I see 2 potential ways to solve that issue:
* The L3 agent with HA routers could include the state of all its routers in the heartbeat, and the Neutron server could then update the router state in the DB while processing heartbeat messages from the L3 agent, or
* The Neutron server could periodically ask each L3 agent with HA routers about the state of its routers and update the DB accordingly.
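The first option can be illustrated with a minimal sketch. This is not Neutron code: `HaStateTable`, `apply_heartbeat`, and the shape of the heartbeat payload are hypothetical names chosen for illustration, and the "DB" is a plain dictionary keyed by router ID and agent host. The point is only that reconciliation becomes idempotent, so a lost state update is corrected by the next heartbeat.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory stand-in for the HA state rows in the Neutron DB,
# keyed by (router_id, agent_host). Not the real Neutron schema.
HaStateTable = Dict[Tuple[str, str], str]


def apply_heartbeat(
    db: HaStateTable,
    agent_host: str,
    reported_states: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], str]]:
    """Reconcile stored HA states with what one agent reports.

    Returns a mapping router_id -> (old_state, new_state) for every row
    that actually changed, so repeated heartbeats are no-ops.
    """
    changes: Dict[str, Tuple[Optional[str], str]] = {}
    for router_id, state in reported_states.items():
        key = (router_id, agent_host)
        old_state = db.get(key)
        if old_state != state:
            db[key] = state
            changes[router_id] = (old_state, state)
    return changes
```

With such a scheme, an agent that came back as 'backup' after a fault would overwrite its stale 'active' row on its first heartbeat, without relying on the one-off state change message having been delivered.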
tags: added: low-hanging-fruit
We see the same issue. For example, if a network node that was active for some tenant networks fails, the state is not updated and multiple 'active' nodes are listed in the database, even though only one is actually active.
Unfortunately, even when the failed network node comes back online and assumes the 'backup' state, this does not necessarily get updated in the database, so even with all systems operational again, HA routers appear to have multiple 'active' nodes.
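As a rough way to spot the symptom described here, the following sketch lists routers that have more than one 'active' entry. It assumes the per-agent HA states have already been collected into a dictionary keyed by router ID and host; that structure and the function name `routers_with_multiple_active` are hypothetical and not part of Neutron.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def routers_with_multiple_active(
    states: Dict[Tuple[str, str], str],
) -> Dict[str, List[str]]:
    """Return router_id -> sorted hosts for routers with more than one 'active' row.

    `states` maps (router_id, agent_host) to the recorded HA state.
    """
    active_hosts: Dict[str, List[str]] = defaultdict(list)
    for (router_id, host), state in states.items():
        if state == "active":
            active_hosts[router_id].append(host)
    return {
        router_id: sorted(hosts)
        for router_id, hosts in active_hosts.items()
        if len(hosts) > 1
    }
```

Any router returned by this check has stale state recorded for at least one host, since only one node can actually be active.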