Comment 2 for bug 1667055

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

What I see in the logs is:

http://logs.openstack.org/94/487194/2/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/e868687/testr_results.html.gz

When router is created, it's first committed to db, then it's scheduled to agents, and finally the result of the create operation is returned.

If we look at request-id of the create operation, we see the following in neutron-server log:

2017-08-28 23:49:25.722 32562 DEBUG neutron.scheduler.l3_agent_scheduler [req-e20d202e-025a-4bb5-9c64-7f4182037d44 - - - - -] Router 12c8a31d-6077-4096-89cb-d64b98bb88c1 is scheduled to L3 agent 2c2ed21e-e788-482d-844d-d11eb6145788 with binding_index 1 bind_router /opt/stack/new/neutron/neutron/scheduler/l3_agent_scheduler.py:225

We don't see this request scheduling the router to another agent.

Though around same time we see another log message that talks about scheduling the router to that other agent:

2017-08-28 23:49:27.274 32565 DEBUG neutron.scheduler.l3_agent_scheduler [req-eb2181a8-2d02-4fec-847d-e7e870d411c3 - - - - -] Router 12c8a31d-6077-4096-89cb-d64b98bb88c1 is scheduled to L3 agent d8edd5b8-ec1e-4cdf-9e35-93f258e65230 with binding_index 2 bind_router /opt/stack/new/neutron/neutron/scheduler/l3_agent_scheduler.py:225

You can notice that the request-id is different. This request came from the affected l3 agent:

2017-08-28 23:49:25.073 677 DEBUG oslo_service.periodic_task [req-eb2181a8-2d02-4fec-847d-e7e870d411c3 - - - - -] Running periodic task L3NATAgentForTest.periodic_sync_routers_task run_periodic_tasks /opt/stack/new/neutron/.tox/dsvm-fullstack/local/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2017-08-28 23:49:25.073 677 DEBUG neutron.agent.l3.agent [req-eb2181a8-2d02-4fec-847d-e7e870d411c3 - - - - -] Starting fullsync periodic_sync_routers_task periodic_sync_routers_task /opt/stack/new/neutron/neutron/agent/l3/agent.py:569

I suspect that there is a race condition between those two requests (one from api user, another from l3 agent rpc) that results in a binding missing/not seen by the next api request that fetches l3 agents for the router.

It would be nice if someone from HA folks look at the logs.