[HA] HA router first transition to master should not wait

Bug #1945512 reported by Rodolfo Alonso
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Rodolfo Alonso

Bug Description

In an HA router, the first state change should be immediate. [1] introduced a delay when transitioning from "backup" to "primary". This delay is also applied when setting the first HA router state. In this case, the delay should be skipped.

Logs: https://450622a53297d10bce29-d3fc18d05d3e89166397d52e51106961.ssl.cf2.rackcdn.com/805637/8/check/neutron-functional-with-uwsgi/517e0d3/testr_results.html

Snippet: https://paste.opendev.org/show/809678/

[1] https://review.opendev.org/q/I70037da9cdd0f8448e0af8dd96b4e3f5de5728ad

Tags: l3-ha
Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/811751

Changed in neutron:
status: New → In Progress
tags: added: l3-ha
Bence Romsics (bence-romsics) wrote :

IMHO both the problem reported and the proposed optimization are correct. However, the state transition delay is 2s while the test timeout was 60s, so I don't think there will be a big change in test failure rates.

Changed in neutron:
status: In Progress → Triaged
importance: Undecided → Medium
Changed in neutron:
status: Triaged → In Progress
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Bence:

The problem is not the test timeout but how we check the state (and this should also affect a production environment). We read the HA router state from the "keepalived-state-change" file when that process prints the current state of this instance [1]. The initial state does not involve a state transition (because the router had no previous state defined). That means in [1] we read "primary" while [2] is still waiting to apply this state.

When we do the failover in [3], the state changes immediately: the "keepalived-state-change" process writes the new state to the file and sends the HTTP request to the L3 agent, which handles this request BEFORE the [2] timeout has finished.

So when we check the current transition state in [4], it is now "backup" while this thread was processing "primary". That triggers a premature exit from this method without any processing.

Regards.

[1] https://github.com/openstack/neutron/blob/7cdc4de11baebf7e7f7ebbab5932408e2cc7fcd4/neutron/tests/functional/agent/l3/test_ha_router.py#L115
[2] https://github.com/openstack/neutron/blob/e6ee06f818d3f1e83ef9788ddb23a33d44754e19/neutron/agent/l3/ha.py#L152
[3] https://github.com/openstack/neutron/blob/7cdc4de11baebf7e7f7ebbab5932408e2cc7fcd4/neutron/tests/functional/agent/l3/test_ha_router.py#L117-L118
[4] https://github.com/openstack/neutron/blob/e6ee06f818d3f1e83ef9788ddb23a33d44754e19/neutron/agent/l3/ha.py#L153
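
The sequence described in this comment can be modelled with a small, deterministic sketch. It is an illustration only, not the neutron code: the class `DelayedStateHandler`, its `notify` method and the `while_waiting` callback are hypothetical names, and the callback stands in for whatever arrives while the real handler is sleeping.

```python
class DelayedStateHandler:
    """Hypothetical model of the agent-side handler; not neutron code."""

    def __init__(self):
        self.transition_state = {}  # router_id -> most recently notified state
        self.processed = []         # (router_id, state) pairs actually applied

    def notify(self, router_id, state, while_waiting=None):
        self.transition_state[router_id] = state
        if state == "primary":
            # Stand-in for the transition delay: anything that arrives
            # during the wait runs here, before this call resumes.
            if while_waiting is not None:
                while_waiting()
        if self.transition_state[router_id] != state:
            # A newer notification replaced ours during the wait: give up.
            return False
        self.processed.append((router_id, state))
        return True


handler = DelayedStateHandler()
# The initial "primary" arrives; the failover to "backup" lands during its delay.
first = handler.notify(
    "router-1", "primary",
    while_waiting=lambda: handler.notify("router-1", "backup"))
print(first)              # False
print(handler.processed)  # [('router-1', 'backup')]
```

In this model the initial "primary" notification is never processed, which matches the premature exit described above.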

Bence Romsics (bence-romsics) wrote :

Thanks for the explanation, Rodolfo!

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/811751
Committed: https://opendev.org/openstack/neutron/commit/c20f2e5136fd241f4be5c37403ab1ed54cdaefb5
Submitter: "Zuul (22348)"
Branch: master

commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Sep 29 16:28:28 2021 +0000

    [HA] Do not add initial state change delay in HA router

    The initial state ("primary", "backup") should be set immediately.
    In [1], a transition delay to "primary" was introduced. This delay
    is unnecessary when the first state happens.

    Closes-Bug: #1945512

    [1]https://review.opendev.org/q/I70037da9cdd0f8448e0af8dd96b4e3f5de5728ad

    Change-Id: Ibe9178c4126977f1321e414676d67f28e5ec9b57
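
The rule stated in the commit message can be sketched as a single predicate. This is a minimal illustration under assumptions, not the merged change: the function name `needs_transition_delay` is hypothetical, and a previous state of `None` is assumed to mean that the router has not reported any state yet.

```python
def needs_transition_delay(previous_state, new_state):
    """Return True when a move to "primary" should wait.

    previous_state is None when no state has been recorded for the
    router yet (assumption made for this sketch).
    """
    return new_state == "primary" and previous_state is not None


print(needs_transition_delay(None, "primary"))      # False
print(needs_transition_delay("backup", "primary"))  # True
print(needs_transition_delay("primary", "backup"))  # False
```

Only a "backup" to "primary" transition waits; the first state of a router, whichever it is, is applied immediately.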

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.0.0.0rc1

This issue was fixed in the openstack/neutron 20.0.0.0rc1 release candidate.
