Comment 2 for bug 1499647

Assaf Muller (amuller) wrote : Re: L3 HA: extra L3HARouterAgentPortBinding created for routers

L3HARouterAgentPortBinding is added via a single method: add_ha_port (https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L312). That method is called from two places: from _create_ha_interfaces while creating an HA router (https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L398), and from the L3 agent scheduler's auto_schedule_router path, in _schedule_ha_routers_to_additional_agent (https://github.com/openstack/neutron/blob/master/neutron/scheduler/l3_agent_scheduler.py#L150).

The race cannot happen between two create_router calls for the same router, and it is unlikely to be happening between two auto_schedule_router calls for the same router (auto_schedule_router is invoked by sync_routers, an RPC method called by the L3 agent). That leaves a race between create_router and an agent invoking sync_routers on the server.

Looking at create_router in the HA routers mixin: https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L378. It is clearly not atomic. I think that once the base DB object is created in line 386, an RPC call from an agent (say, one that just started or restarted, or that hit an error and is resyncing) will reach sync_routers, which will see the router object in the DB and try to bind it to the agent. In other words, an HA router can be bound after the super(L3_HA_NAT_db_mixin, self).create_router(context, router) call in line 378 but before the self._create_ha_interfaces(context, router_db, ha_network) call in line 398. I verified this by putting a breakpoint right after the super create_router call, restarting an L3 agent, and then continuing in pdb. After that, listing the router bindings for that router produced the trace described in the bug report.
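
To make the interleaving concrete, here is a toy in-memory model of it. This is only a sketch and not Neutron code: the function names mirror the ones discussed above, but the bodies, the `between_steps` hook, and the `routers`/`bindings` containers are made up for illustration.

```python
# Toy in-memory model of the interleaving; none of this is Neutron code.
routers = set()   # router IDs visible in the "DB"
bindings = []     # (router_id, agent) rows, standing in for L3HARouterAgentPortBinding


def add_ha_port(router_id, agent):
    # No uniqueness check, so a second call for the same pair adds a second row.
    bindings.append((router_id, agent))


def sync_routers(agent):
    # Agent-triggered path: bind every visible router this agent is not bound to yet.
    for router_id in routers:
        if (router_id, agent) not in bindings:
            add_ha_port(router_id, agent)


def create_router(router_id, agents, between_steps=None):
    routers.add(router_id)        # step 1: base DB object becomes visible
    if between_steps:
        between_steps()           # hypothetical hook: window where an agent may resync
    for agent in agents:          # step 2: stands in for _create_ha_interfaces
        add_ha_port(router_id, agent)


# An agent resyncs between the two steps of create_router.
create_router("r1", ["agent-1", "agent-2"],
              between_steps=lambda: sync_routers("agent-1"))
duplicates = [b for b in set(bindings) if bindings.count(b) > 1]
print(duplicates)  # [('r1', 'agent-1')]
```

Without the resync in the middle, the same call leaves exactly one binding per agent, which is why the problem only shows up when an agent starts or resyncs at the wrong moment.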

Ann, Eugene - thoughts on how to solve this issue? One way is to modify the proposed patch, keeping the new unique constraint, but have _create_ha_port_binding catch the unique constraint violation and return the existing binding instead of raising an exception (i.e. change _create_ha_port_binding to _create_or_get_ha_port_binding).
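
A rough sketch of that create-or-get idea, using the standard library's sqlite3 rather than Neutron's DB layer so that it runs on its own. Everything here is an assumption for illustration: the table name `ha_port_bindings`, its columns, the choice of (router_id, l3_agent_id) as the unique pair, and the function signature are not taken from the proposed patch.

```python
import sqlite3


def create_or_get_ha_port_binding(conn, router_id, l3_agent_id, port_id):
    """Insert a binding, or return the existing one if the pair is already bound."""
    try:
        with conn:  # commits on success, rolls back if the INSERT fails
            conn.execute(
                "INSERT INTO ha_port_bindings (router_id, l3_agent_id, port_id) "
                "VALUES (?, ?, ?)",
                (router_id, l3_agent_id, port_id),
            )
    except sqlite3.IntegrityError:
        # Unique constraint hit: another caller got there first, so reuse its row.
        pass
    return conn.execute(
        "SELECT router_id, l3_agent_id, port_id FROM ha_port_bindings "
        "WHERE router_id = ? AND l3_agent_id = ?",
        (router_id, l3_agent_id),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ha_port_bindings ("
    " router_id TEXT NOT NULL,"
    " l3_agent_id TEXT NOT NULL,"
    " port_id TEXT NOT NULL,"
    " UNIQUE (router_id, l3_agent_id))"  # assumed shape of the new constraint
)

# Two callers try to bind the same router to the same agent.
first = create_or_get_ha_port_binding(conn, "r1", "agent-1", "port-a")
second = create_or_get_ha_port_binding(conn, "r1", "agent-1", "port-b")
print(first == second)  # True
```

In this sketch the second caller's port ID is simply discarded. Whether a real fix would also have to clean up a port that was created for the losing caller is a separate question that the sketch does not answer.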