When creating an HA router, after the server creates all the DB objects (including the HA network and ports if it's the first one), the server continues on the schedule the router to (some of) the available agents.
The race is achieved when an L3 agent router issues a sync_router request, which later down the line ends up in an auto_schedule_routers() call. If this happens before the above schedule (of the create_router()) is complete, the server will refuse to schedule the router to the other intended L3 agents, resulting is less agents being scheduled.
The only way to fix this is either restarting one of the L3 agents which didn't get scheduled, or recreating the router. Either is a bad option.
An example of the state:
$ neutron l3-agent-list-hosting-router router2
+--------------------------------------+-------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+-------------------------+----------------+-------+----------+
| d05da32b-34e7-4c7f-b0dd-938328a0c0ed | vpn-6-12 | True | :-) | active |
+--------------------------------------+-------------------------+----------------+-------+----------+
(only 1 of the agent got scheduled with the router, even though there are 3 suitable agents that normally get scheduled without the race.)
Maybe the following trace is related to this bug: paste.openstack .org/show/ 488732/
http://