L3HARouterAgentPortBinding is added via a single method: add_ha_port (https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L312). That method is used in two places: while creating an HA router, in _create_ha_interfaces during router creation (https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L398), and in the L3 agent scheduler, in auto_schedule_router / _schedule_ha_routers_to_additional_agent, here: https://github.com/openstack/neutron/blob/master/neutron/scheduler/l3_agent_scheduler.py#L150.
The race cannot happen between two create_router calls for the same router, and it is unlikely to happen between two auto_schedule_router calls for the same router (auto_schedule_router is invoked by sync_routers, which is an RPC method invoked by the L3 agent). So that leaves a race between create_router and an agent invoking sync_routers on the server.
Looking at create_router in the HA routers mixin: https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L378. It's clearly not atomic at all. I think that after the base DB object is created in line 386, if an RPC call arrives from an agent (say, it just started/restarted, or an error occurred and it's resyncing), sync_routers will see a router object in the DB and try to bind it to the agent. Basically, I think that an HA router can be bound after the super(L3_HA_NAT_db_mixin, self).create_router(context, router) call in line 378 but before the self._create_ha_interfaces(context, router_db, ha_network) call in line 398. I verified this by putting a breakpoint right after the super create_router call, restarting an L3 agent, and hitting continue in pdb. After that, trying to list the router bindings for that router produced the trace described in the bug report.
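The window can be sketched with a toy model (not Neutron code; the dicts stand in for the router and binding tables, and all names are illustrative):

```python
# Toy model of the non-atomic create_router: the router row is committed
# before any HA bindings exist, leaving a window that a concurrent
# sync_routers call can observe.

routers = {}    # stand-in for the routers table
bindings = {}   # stand-in for the L3HARouterAgentPortBinding table,
                # keyed by (router_id, agent_id)

def create_router(router_id):
    # Step 1: base DB object is created (l3_hamode_db.py, line 386).
    routers[router_id] = {'id': router_id}
    # Race window: at this point the router is visible to sync_routers,
    # but no HA bindings exist yet.
    visible_without_bindings = (
        router_id in routers and
        not any(rid == router_id for rid, _agent in bindings))
    # Step 2: HA interfaces and bindings are created (line 398).
    bindings[(router_id, 'agent-1')] = {'router_id': router_id}
    return visible_without_bindings

# During the window the router exists with zero bindings, which is
# exactly the state an agent's sync_routers races against.
```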
Ann, Eugene - thoughts on how to solve this issue? One way is to modify the proposed patch (keeping the new unique constraint) so that _create_ha_port_binding catches the unique constraint violation and returns the existing binding instead of raising an exception (i.e. changing _create_ha_port_binding to _create_or_get_ha_port_binding).