Comment 7 for bug 1818614

Slawek Kaplonski (slaweq) wrote : Re: Various L3HA functional tests fails often

Today I analyzed one more such failure, from the test_ha_router module (no DVR).

It looks like this issue is caused by a race condition between spawning keepalived and spawning the ip monitor by the neutron-keepalived-state-change process.

Let's check the logs from the failed test: http://logs.openstack.org/17/641117/6/gate/neutron-functional/379d405/logs/dsvm-functional-logs/neutron.tests.functional.agent.l3.test_ha_router.LinuxBridgeL3HATestCase.test_ipv6_router_advts_and_fwd_after_router_state_change_backup.txt.gz

This test creates 2 routers, one by one: https://github.com/openstack/neutron/blob/b847cd02c56dc8fe654f4731306dc2b5493a62eb/neutron/tests/functional/agent/l3/test_ha_router.py#L142

In our example, the first router was e20c5656-7e6f-4a29-8413-3aaad80daca1, which was properly transitioned first to backup at 2019-03-08 10:34:07.072 and then to master at 2019-03-08 10:34:19.899.

The second router has ID b357d56c-4f76-4f5d-9767-289d8cde726e. It was first transitioned to backup at 2019-03-08 10:34:26.061 but was never transitioned to master, and that's why the test failed.

So let's now check in journal.log what happened with the keepalived and neutron-keepalived-state-change processes for both routers.
The first router, which worked fine:
- neutron-keepalived-state-change spawned the ip monitor process at Mar 08 10:34:17:
Mar 08 10:34:17 ubuntu-xenial-ovh-gra1-0003584991 neutron-keepalived-state-change[31497]: 2019-03-08 10:34:17.894 31497 DEBUG neutron.agent.linux.utils [-] Running command: ['ip', 'netns', 'exec', 'qroute

- keepalived switched to MASTER STATE at Mar 08 10:34:17:
ubuntu-xenial-ovh-gra1-0003584991 Keepalived_vrrp[32243]: VRRP_Instance(VR_1) Transition to MASTER STATE

- neutron-keepalived-state-change noticed the event on the ip monitor stdout and thus notified the L3 agent that the router is now master:
Mar 08 10:34:19 ubuntu-xenial-ovh-gra1-0003584991 neutron-keepalived-state-change[31497]: 2019-03-08 10:34:19.893 31497 DEBUG neutron.agent.l3.keepalived_state_change [-] Wrote router e20c5656-7e6f-4a29-8

Now let's see how it went for the second router, which failed:

- keepalived switched to MASTER STATE at Mar 08 10:34:36:
ubuntu-xenial-ovh-gra1-0003584991 Keepalived_vrrp[4024]: VRRP_Instance(VR_1) Transition to MASTER STATE

- neutron-keepalived-state-change spawned the ip monitor process at Mar 08 10:34:52:
ubuntu-xenial-ovh-gra1-0003584991 neutron-keepalived-state-change[3531]: 2019-03-08 10:34:52.313 3531 DEBUG neutron.agent.common.async_process [-] Launching async process [ip netns exec qr

- as keepalived had already changed state to master before the ip monitor was spawned, the ip monitor never reported any event, so the L3 agent wasn't informed about the current state of the router. After one minute, the test fails because of that.
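
To make the ordering problem easier to see, here is a minimal toy sketch of the race in Python. It is not neutron code: the EventSource class and the run() helper are hypothetical names made up for this illustration. The only assumption it models is the one described above, that a listener sees only the transitions that happen after it has started listening.

```python
class EventSource:
    """Toy stand-in for keepalived (hypothetical, not neutron code).

    State changes are delivered only to listeners that are already
    subscribed at the moment of the transition.
    """

    def __init__(self):
        self.state = "backup"
        self._listeners = []

    def subscribe(self, callback):
        # Models spawning the ip monitor: it starts watching from now on.
        self._listeners.append(callback)

    def transition(self, new_state):
        # Models keepalived switching state: only current listeners see it.
        self.state = new_state
        for callback in self._listeners:
            callback(new_state)


def run(monitor_first):
    """Return the list of events the monitor observed."""
    source = EventSource()
    seen = []
    if monitor_first:
        # Order seen for the first router: monitor first, then transition.
        source.subscribe(seen.append)
        source.transition("master")
    else:
        # Order seen for the second router: transition first, then monitor.
        source.transition("master")
        source.subscribe(seen.append)
    return seen


print(run(monitor_first=True))   # ['master']
print(run(monitor_first=False))  # []
```

In the second call the source is in state master, but the listener has observed nothing, which corresponds to the L3 agent never being told about the transition.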