Comment 3 for bug 1945512

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Bence:

The problem is not the test timeout but how we check the state (and this should also affect a production environment). We read the HA router state from the "keepalived-state-change" file once this process has written the current state of this instance [1]. The initial state does not imply a state transition (because the router had no previously defined state). That means that in [1] we read "primary" while [2] is still waiting to apply this state.

When we trigger the failover in [3], the state changes immediately: the "keepalived-state-change" process writes the new state to the file and sends the HTTP request to the L3 agent, which handles this request BEFORE the [2] timeout has expired.

So when we check the current transition state in [4], it is now "backup", while this thread was processing "primary". That triggers a premature exit from this method without any processing.
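
To make the sequence easier to follow, here is a minimal sketch of the race. It is not the Neutron code: `TransitionTracker`, `enqueue_state_change`, `simulate_race` and `primary_delay` are hypothetical names, plain threads stand in for the agent's green threads, and the delay stands in for the wait in [2]. It only assumes what is described above: "primary" is applied after a delay, "backup" is applied immediately, and the waiting thread gives up if the recorded transition state has changed in the meantime.

```python
import threading
import time


class TransitionTracker:
    """Hypothetical stand-in for the agent's per-router transition state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._transition = {}                # router_id -> last requested state
        self.applied = []                    # (router_id, state) actually processed
        self.recorded = threading.Event()    # set once a request has been recorded

    def enqueue_state_change(self, router_id, state, primary_delay):
        # Record the requested state as the current transition state.
        with self._lock:
            self._transition[router_id] = state
        self.recorded.set()

        # Assumption: only the "primary" transition waits before being applied.
        if state == "primary":
            time.sleep(primary_delay)

        # Re-check after the wait: if another request replaced the transition
        # state, exit without processing (the premature exit described above).
        with self._lock:
            if self._transition[router_id] != state:
                return False
            self.applied.append((router_id, state))
        return True


def simulate_race(primary_delay=0.2):
    tracker = TransitionTracker()
    results = {}

    def run(state):
        results[state] = tracker.enqueue_state_change(
            "router-1", state, primary_delay)

    t_primary = threading.Thread(target=run, args=("primary",))
    t_primary.start()
    tracker.recorded.wait()   # "primary" is recorded but not applied yet

    # The failover arrives while the "primary" thread is still waiting.
    t_backup = threading.Thread(target=run, args=("backup",))
    t_backup.start()

    t_primary.join()
    t_backup.join()
    return results, tracker.applied


if __name__ == "__main__":
    print(simulate_race())
```

In this sketch the "primary" request is never applied: by the time its wait ends, the transition state is already "backup".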

Regards.

[1] https://github.com/openstack/neutron/blob/7cdc4de11baebf7e7f7ebbab5932408e2cc7fcd4/neutron/tests/functional/agent/l3/test_ha_router.py#L115
[2] https://github.com/openstack/neutron/blob/e6ee06f818d3f1e83ef9788ddb23a33d44754e19/neutron/agent/l3/ha.py#L152
[3] https://github.com/openstack/neutron/blob/7cdc4de11baebf7e7f7ebbab5932408e2cc7fcd4/neutron/tests/functional/agent/l3/test_ha_router.py#L117-L118
[4] https://github.com/openstack/neutron/blob/e6ee06f818d3f1e83ef9788ddb23a33d44754e19/neutron/agent/l3/ha.py#L153