HA router state may be set incorrectly in the Neutron DB in some cases

Bug #2030735 reported by Slawek Kaplonski
Affects: neutron
Status: New
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

When the neutron-l3-agent reports the state of an HA router to the neutron server, for example:

2023-07-30 23:06:49.216 3139 DEBUG neutron.agent.l3.ha [-] Updating server with HA routers states {'f5d52aaf-30e1-4396-bde4-8acb7506c301': 'active'} notify_server /usr/lib/python3.9/site-packages/neutron/agent/l3/ha.py:243

It may happen, e.g. during a controller reboot or some other fault in the cloud (we hit this in the faults Tobiko test in our downstream CI), that this message is never delivered to the neutron server. The router's state then stays incorrect in the Neutron DB and is therefore reported incorrectly through the Neutron API.
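
For illustration, here is a minimal model of the pattern described above (the names and the LossyTransport class are invented for this sketch, this is not neutron code): the state report is a one-way, fire-and-forget message, so if it is dropped nothing ever retries it and the server's view stays stale:

import random


class LossyTransport:
    """Hypothetical one-way RPC transport that can silently drop a
    message, e.g. while the controller is rebooting."""

    def __init__(self, server_db, drop_rate):
        self.server_db = server_db
        self.drop_rate = drop_rate

    def cast(self, states):
        if random.random() < self.drop_rate:
            return  # dropped: a cast raises no error and nothing retries it
        self.server_db.update(states)


# The server's view of HA router states (stands in for the Neutron DB).
server_db = {'f5d52aaf-30e1-4396-bde4-8acb7506c301': 'standby'}
transport = LossyTransport(server_db, drop_rate=1.0)  # simulate the fault

# Agent side: the router failed over, so the agent reports the new state once.
transport.cast({'f5d52aaf-30e1-4396-bde4-8acb7506c301': 'active'})

# The router is really active, but the DB still says 'standby', and the
# API will keep reporting that until some other transition happens.
print(server_db)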

I see two potential ways to solve this issue:
* the L3 agent with HA routers could include the state of all of its routers in its heartbeat, and the Neutron server could then update their state in the DB while processing heartbeat messages from the L3 agent (a rough sketch of this option follows below), or
* the Neutron server could periodically ask each L3 agent with HA routers for the state of its routers and update the DB accordingly.
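
A rough sketch of the first option, with invented names (the heartbeat payload and the server-side hook shown here are assumptions for the sketch, not existing neutron interfaces):

from typing import Dict


class ReconcilingServer:
    """Hypothetical server side: on every heartbeat, overwrite this
    agent's rows in the DB with the states the heartbeat carries."""

    def __init__(self):
        self.db: Dict[tuple, str] = {}  # (agent_host, router_id) -> state

    def process_heartbeat(self, agent_host: str,
                          ha_states: Dict[str, str]) -> None:
        for router_id, state in ha_states.items():
            self.db[(agent_host, router_id)] = state


class HaStateReportingAgent:
    """Hypothetical L3 agent that piggybacks its HA router states on
    the periodic heartbeat instead of only sending one-off updates."""

    def __init__(self, host: str, server: ReconcilingServer):
        self.host = host
        self.server = server
        self.ha_states: Dict[str, str] = {}  # router_id -> 'active'/'standby'

    def on_transition(self, router_id: str, state: str) -> None:
        self.ha_states[router_id] = state

    def heartbeat(self) -> None:
        # The full current state rides along with every heartbeat, so a
        # lost transition message is repaired by the next heartbeat.
        self.server.process_heartbeat(self.host, dict(self.ha_states))


server = ReconcilingServer()
agent = HaStateReportingAgent('net-node-1', server)
agent.on_transition('f5d52aaf-30e1-4396-bde4-8acb7506c301', 'active')
agent.heartbeat()
print(server.db)

Since the full state rides along with every heartbeat, a single lost message is corrected by the next heartbeat, without adding a new periodic RPC; the cost is a slightly larger heartbeat payload.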

tags: added: low-hanging-fruit
Andrew Bonney (andrewbonney) wrote:

We see the same issue. For example, if a network node which was active for some tenant networks fails, the state will not be updated, and multiple 'active' nodes will be listed in the database even though only one is actually active.

Unfortunately, even when the failed network node comes back online and assumes the 'backup' state, this doesn't necessarily get updated in the database either, so even with all systems operational again HA routers appear to have multiple 'active' nodes.

Andrew Bonney (andrewbonney) wrote:

We've just experienced a similar issue as a result of multiple HA instances assuming the primary role when one went down. We have four network nodes, each of which is a candidate to take over a given tenant router.

Having shut down one network node for maintenance, all three other network nodes assumed the primary role for this tenant router, with log messages like:

"Router 226ff7f1-8b69-4ff1-a72e-6fca8252e8f0 transitioned to primary on agent ..."

Two of these routers then followed up with a transition to backup:

"Router 226ff7f1-8b69-4ff1-a72e-6fca8252e8f0 transitioned to backup on agent ..."

Despite this, the 'active' state remains against each of them in the database, and it shows no sign of changing after many minutes.
