In our example, the first router was e20c5656-7e6f-4a29-8413-3aaad80daca1, which transitioned correctly: first to backup at 2019-03-08 10:34:07.072 and then to master at 2019-03-08 10:34:19.899.
The second router had ID b357d56c-4f76-4f5d-9767-289d8cde726e and transitioned to backup at 2019-03-08 10:34:26.061, but it was never transitioned to master, which is why the test failed.
So let's now check in journal.log what happened with the keepalived and neutron-keepalived-state-change processes for each router.
The first router, which worked fine:
- neutron-keepalived-state-change spawned the ip monitor process at Mar 08 10:34:17:
Mar 08 10:34:17 ubuntu-xenial-ovh-gra1-0003584991 neutron-keepalived-state-change[31497]: 2019-03-08 10:34:17.894 31497 DEBUG neutron.agent.linux.utils [-] Running command: ['ip', 'netns', 'exec', 'qroute
- keepalived switched to MASTER STATE at Mar 08 10:34:17:
ubuntu-xenial-ovh-gra1-0003584991 Keepalived_vrrp[32243]: VRRP_Instance(VR_1) Transition to MASTER STATE
- neutron-keepalived-state-change noticed the event on the ip monitor's stdout and thus notified the L3 agent that the router is now master:
Mar 08 10:34:19 ubuntu-xenial-ovh-gra1-0003584991 neutron-keepalived-state-change[31497]: 2019-03-08 10:34:19.893 31497 DEBUG neutron.agent.l3.keepalived_state_change [-] Wrote router e20c5656-7e6f-4a29-8
Now let's see what happened with the second router, which failed:
- keepalived switched to MASTER STATE at Mar 08 10:34:36:
ubuntu-xenial-ovh-gra1-0003584991 Keepalived_vrrp[4024]: VRRP_Instance(VR_1) Transition to MASTER STATE
- neutron-keepalived-state-change spawned the ip monitor process at Mar 08 10:34:52:
ubuntu-xenial-ovh-gra1-0003584991 neutron-keepalived-state-change[3531]: 2019-03-08 10:34:52.313 3531 DEBUG neutron.agent.common.async_process [-] Launching async process [ip netns exec qr
- because keepalived had already changed state to master before the monitor started, no event ever shows up on the ip monitor, so the L3 agent is never informed about the router's current state. After one minute, the test fails because of that.
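This failure mode can be illustrated with a minimal sketch (all class and method names below are hypothetical, not Neutron's actual code): a listener that only receives future events misses a transition that fired before it attached, which is exactly the losing ordering in the log above.

```python
class Router:
    """Toy stand-in for a keepalived-managed HA router (hypothetical)."""

    def __init__(self):
        self.state = "backup"
        self.listeners = []            # callbacks registered by monitors

    def set_master(self):
        # keepalived flips the state; only already-attached listeners hear it
        self.state = "master"
        for notify in self.listeners:
            notify(self.state)

    def attach_monitor(self, notify):
        # like 'ip monitor': reports only events that happen after it starts
        self.listeners.append(notify)


seen = []                              # transitions the "L3 agent" hears about
router = Router()
router.set_master()                    # keepalived wins the race...
router.attach_monitor(seen.append)     # ...the ip monitor attaches too late
print(seen)                            # [] -> the master transition was missed
```

With the ordering reversed (monitor attached first), `seen` would contain the transition, which is what happened for the first router.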
Today I analyzed one more such failure from the test_ha_router module (non-DVR).
Logs from the failed test: http://logs.openstack.org/17/641117/6/gate/neutron-functional/379d405/logs/dsvm-functional-logs/neutron.tests.functional.agent.l3.test_ha_router.LinuxBridgeL3HATestCase.test_ipv6_router_advts_and_fwd_after_router_state_change_backup.txt.gz
The test creates two routers, one after the other: https://github.com/openstack/neutron/blob/b847cd02c56dc8fe654f4731306dc2b5493a62eb/neutron/tests/functional/agent/l3/test_ha_router.py#L142
It looks like this issue is caused by a race condition between spawning keepalived and spawning the ip monitor in the neutron-keepalived-state-change process.
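A common way to close this kind of race window, sketched here with hypothetical names rather than Neutron's actual code, is to read the current state explicitly right after the monitor attaches, instead of relying only on future events:

```python
class Router:
    """Toy stand-in for a keepalived-managed HA router (hypothetical names)."""

    def __init__(self):
        self.state = "backup"
        self.listeners = []

    def set_master(self):
        # keepalived flips the state and notifies attached listeners
        self.state = "master"
        for notify in self.listeners:
            notify(self.state)

    def attach_monitor(self, notify):
        self.listeners.append(notify)
        notify(self.state)             # report the *current* state on attach,
                                       # so a transition that already happened
                                       # is not silently lost


seen = []
router = Router()
router.set_master()                    # transition happens before the monitor...
router.attach_monitor(seen.append)     # ...but the initial read still sees it
print(seen)                            # ['master']
```

With the initial-state read in place, the ordering of the two spawns no longer matters: whichever process starts first, the agent ends up with the router's actual state.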