[HA] HA router first transition to master should not wait

Bug #1945512 reported by Rodolfo Alonso
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
neutron | Fix Released | Medium | Rodolfo Alonso |

Bug Description

In an HA router, the first state change should be immediate. [1] introduced a delay when transitioning from "backup" to "primary". This delay is also applied when the first HA router state is set; in that case, it should be skipped.

Logs: https://450622a53297d10bce29-d3fc18d05d3e89166397d52e51106961.ssl.cf2.rackcdn.com/805637/8/check/neutron-functional-with-uwsgi/517e0d3/testr_results.html

Snippet: https://paste.opendev.org/show/809678/

[1]https://review.opendev.org/q/I70037da9cdd0f8448e0af8dd96b4e3f5de5728ad
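
The behaviour requested above can be sketched roughly as follows. This is only an illustration, not the neutron code: `StateTracker`, `delay_before_applying` and the 2-second `TRANSITION_DELAY` are hypothetical names and values chosen for the example.

```python
from typing import Dict, Optional

PRIMARY = "primary"
BACKUP = "backup"
# Illustrative value only; not read from any neutron configuration option.
TRANSITION_DELAY = 2.0  # seconds


class StateTracker:
    """Hypothetical helper that remembers the last state seen per router."""

    def __init__(self) -> None:
        self._last_state: Dict[str, str] = {}

    def delay_before_applying(self, router_id: str, new_state: str) -> float:
        """Return how long to wait before applying ``new_state``."""
        previous: Optional[str] = self._last_state.get(router_id)
        self._last_state[router_id] = new_state
        if previous is None:
            # First state ever reported for this router: no transition
            # happened, so the request is to apply it immediately.
            return 0.0
        if previous == BACKUP and new_state == PRIMARY:
            # A real "backup" -> "primary" transition keeps the delay.
            return TRANSITION_DELAY
        return 0.0
```

In this sketch only a genuine "backup" to "primary" transition is delayed; the very first state of a router, whatever it is, is applied without waiting.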

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/811751

Changed in neutron:
status: New → In Progress
tags: added: l3-ha
Revision history for this message
Bence Romsics (bence-romsics) wrote :

IMHO both the problem reported and the proposed optimization are correct. However, the state transition delay is 2s while the test timeout was 60s, so I don't think there will be a big change in test failure rates.

Changed in neutron:
status: In Progress → Triaged
importance: Undecided → Medium
Changed in neutron:
status: Triaged → In Progress
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Bence:

The problem is not the test timeout but how we check the state (and that should also affect a production environment). We read the HA router state from the "keepalived-state-change" file when this process prints the current state of this instance [1]. The initial state does not imply a state transition (because no previous state was defined in the router). That means in [1] we read "primary" but [2] is still waiting to apply this state.

When we do the failover in [3], the state changes immediately: the "keepalived-state-change" process writes the new state to the file and sends the HTTP request to the L3 agent, which handles this request BEFORE the [2] timeout has finished.

So when we check the current transition state in [4], it is now "backup" while this thread was processing "primary". That triggers a premature exit from this method without any processing.

Regards.

[1]https://github.com/openstack/neutron/blob/7cdc4de11baebf7e7f7ebbab5932408e2cc7fcd4/neutron/tests/functional/agent/l3/test_ha_router.py#L115
[2]https://github.com/openstack/neutron/blob/e6ee06f818d3f1e83ef9788ddb23a33d44754e19/neutron/agent/l3/ha.py#L152
[3]https://github.com/openstack/neutron/blob/7cdc4de11baebf7e7f7ebbab5932408e2cc7fcd4/neutron/tests/functional/agent/l3/test_ha_router.py#L117-L118
[4]https://github.com/openstack/neutron/blob/e6ee06f818d3f1e83ef9788ddb23a33d44754e19/neutron/agent/l3/ha.py#L153
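
The sequence described in this comment can be reproduced with a small deterministic model. This is a sketch under assumptions, not the code in ha.py: `FakeAgent`, its attributes and the `wait` callback are hypothetical, and the callback simply stands in for the delay at [2] so that the failover from [3] can be injected while "primary" is still waiting.

```python
from typing import Callable, Dict, List


class FakeAgent:
    """Toy stand-in for the L3 agent; all names here are hypothetical."""

    def __init__(self) -> None:
        self.transition_state: Dict[str, str] = {}  # latest state per router
        self.processed: List[str] = []              # states actually applied

    def enqueue_state_change(self, router_id: str, state: str,
                             wait: Callable[[], None]) -> None:
        self.transition_state[router_id] = state
        if state == "primary":
            # Stands in for the delay at [2]; other events may arrive here.
            wait()
        if self.transition_state[router_id] != state:
            # Stands in for the check at [4]: a newer state replaced ours,
            # so this call returns without processing anything.
            return
        self.processed.append(state)


def simulate_failover_during_initial_delay() -> List[str]:
    agent = FakeAgent()

    def failover() -> None:
        # The failover at [3] arrives while "primary" is still waiting.
        agent.enqueue_state_change("router-1", "backup", wait=lambda: None)

    agent.enqueue_state_change("router-1", "primary", wait=failover)
    return agent.processed
```

In this model the initial "primary" state is never processed: by the time its wait finishes, the transition state is already "backup", so the call exits early.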

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Thanks for the explanation, Rodolfo!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/811751
Committed: https://opendev.org/openstack/neutron/commit/c20f2e5136fd241f4be5c37403ab1ed54cdaefb5
Submitter: "Zuul (22348)"
Branch: master

commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Sep 29 16:28:28 2021 +0000

    [HA] Do not add initial state change delay in HA router

    The initial state ("primary", "backup") should be set immediately.
    In [1], a transition delay to "primary" was introduced. This delay
    is unnecessary when the first state happens.

    Closes-Bug: #1945512

    [1]https://review.opendev.org/q/I70037da9cdd0f8448e0af8dd96b4e3f5de5728ad

    Change-Id: Ibe9178c4126977f1321e414676d67f28e5ec9b57

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.0.0.0rc1

This issue was fixed in the openstack/neutron 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/937758

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/937758
Committed: https://opendev.org/openstack/neutron/commit/a3689956dde80b9639a3e805257ac02e1044a4c2
Submitter: "Zuul (22348)"
Branch: master

commit a3689956dde80b9639a3e805257ac02e1044a4c2
Author: yatinkarel <email address hidden>
Date: Mon Dec 16 09:59:52 2024 +0530

    Revert "[HA] Do not add initial state change delay in HA router"

    This reverts commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5.

    The fix of bug #1945512 reintroduced bug #1837635 as after
    the initial backup state ha router can transition to
    'primary' state on multiple hosts and due to this
    delay multiple routers get into 'active' ha_state even
    if one of the host quickly transition to backup after
    the primary state.
    The issue got visible since ha router fullstack tests
    are added as part of [1].

    [1] https://review.opendev.org/c/openstack/neutron/+/917429

    Related-Bug: #1837635
    Related-Bug: #1945512
    Related-Bug: #2083609
    Change-Id: I83b53a07362861da98b8361dafd95e94e5048322
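
One reading of the commit message above is that the delay before reporting "primary" filters out a short-lived "primary" state on a host that falls back to "backup" almost immediately. The sketch below models that reading only; the `reported_states` function, the event tuples and the timings are hypothetical and are not taken from neutron.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, keepalived state)


def reported_states(events: List[Event], primary_delay: float) -> List[str]:
    """Return the states a host would report, in order.

    A "primary" event is reported only if no later event arrives less
    than ``primary_delay`` seconds after it; other states are reported
    immediately.
    """
    reported: List[str] = []
    for index, (timestamp, state) in enumerate(events):
        if state == "primary":
            superseded = any(
                later_ts - timestamp < primary_delay
                for later_ts, _ in events[index + 1:]
            )
            if superseded:
                continue
        reported.append(state)
    return reported


# Host A stays primary; host B turns primary briefly, then goes back.
host_a = [(0.0, "backup"), (1.0, "primary")]
host_b = [(0.0, "backup"), (1.0, "primary"), (1.5, "backup")]
```

With `primary_delay=0.0` both hosts report "primary" at some point, which matches the "multiple routers get into 'active'" symptom; with `primary_delay=2.0` only host A does.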

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/2024.2)

Related fix proposed to branch: stable/2024.2
Review: https://review.opendev.org/c/openstack/neutron/+/937878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/2024.1)

Related fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/neutron/+/937879

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/2023.2)

Related fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/neutron/+/937880

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/937879
Committed: https://opendev.org/openstack/neutron/commit/549af9e07cb8a973aab15f914e3d43a8ea311cb2
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit 549af9e07cb8a973aab15f914e3d43a8ea311cb2
Author: yatinkarel <email address hidden>
Date: Mon Dec 16 09:59:52 2024 +0530

    Revert "[HA] Do not add initial state change delay in HA router"

    This reverts commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5.

    The fix of bug #1945512 reintroduced bug #1837635 as after
    the initial backup state ha router can transition to
    'primary' state on multiple hosts and due to this
    delay multiple routers get into 'active' ha_state even
    if one of the host quickly transition to backup after
    the primary state.
    The issue got visible since ha router fullstack tests
    are added as part of [1].

    [1] https://review.opendev.org/c/openstack/neutron/+/917429

    Related-Bug: #1837635
    Related-Bug: #1945512
    Related-Bug: #2083609
    Change-Id: I83b53a07362861da98b8361dafd95e94e5048322
    (cherry picked from commit a3689956dde80b9639a3e805257ac02e1044a4c2)
    Conflicts:
            neutron/agent/l3/ha.py
            neutron/tests/unit/agent/l3/test_agent.py

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/2024.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/937878
Committed: https://opendev.org/openstack/neutron/commit/23b57a6dae715d71b02f0f6bf528294e7bdadeea
Submitter: "Zuul (22348)"
Branch: stable/2024.2

commit 23b57a6dae715d71b02f0f6bf528294e7bdadeea
Author: yatinkarel <email address hidden>
Date: Mon Dec 16 09:59:52 2024 +0530

    Revert "[HA] Do not add initial state change delay in HA router"

    This reverts commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5.

    The fix of bug #1945512 reintroduced bug #1837635 as after
    the initial backup state ha router can transition to
    'primary' state on multiple hosts and due to this
    delay multiple routers get into 'active' ha_state even
    if one of the host quickly transition to backup after
    the primary state.
    The issue got visible since ha router fullstack tests
    are added as part of [1].

    [1] https://review.opendev.org/c/openstack/neutron/+/917429

    Related-Bug: #1837635
    Related-Bug: #1945512
    Related-Bug: #2083609
    Change-Id: I83b53a07362861da98b8361dafd95e94e5048322
    (cherry picked from commit a3689956dde80b9639a3e805257ac02e1044a4c2)
    Conflicts:
            neutron/agent/l3/ha.py
            neutron/tests/unit/agent/l3/test_agent.py

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/937880
Committed: https://opendev.org/openstack/neutron/commit/f30d7c7093f7900ddb788891c4c3fbbc4037ca1b
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit f30d7c7093f7900ddb788891c4c3fbbc4037ca1b
Author: yatinkarel <email address hidden>
Date: Mon Dec 16 09:59:52 2024 +0530

    Revert "[HA] Do not add initial state change delay in HA router"

    This reverts commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5.

    The fix of bug #1945512 reintroduced bug #1837635 as after
    the initial backup state ha router can transition to
    'primary' state on multiple hosts and due to this
    delay multiple routers get into 'active' ha_state even
    if one of the host quickly transition to backup after
    the primary state.
    The issue got visible since ha router fullstack tests
    are added as part of [1].

    [1] https://review.opendev.org/c/openstack/neutron/+/917429

    Related-Bug: #1837635
    Related-Bug: #1945512
    Related-Bug: #2083609
    Change-Id: I83b53a07362861da98b8361dafd95e94e5048322
    (cherry picked from commit a3689956dde80b9639a3e805257ac02e1044a4c2)
    Conflicts:
            neutron/agent/l3/ha.py
            neutron/tests/unit/agent/l3/test_agent.py

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (unmaintained/2023.1)

Related fix proposed to branch: unmaintained/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/943630

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (unmaintained/zed)

Related fix proposed to branch: unmaintained/zed
Review: https://review.opendev.org/c/openstack/neutron/+/943631

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (unmaintained/yoga)

Related fix proposed to branch: unmaintained/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/943632

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (unmaintained/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/943630
Committed: https://opendev.org/openstack/neutron/commit/e9e40a9c4b1d6e1b78f7140f0453cde970b9508c
Submitter: "Zuul (22348)"
Branch: unmaintained/2023.1

commit e9e40a9c4b1d6e1b78f7140f0453cde970b9508c
Author: yatinkarel <email address hidden>
Date: Mon Dec 16 09:59:52 2024 +0530

    Revert "[HA] Do not add initial state change delay in HA router"

    This reverts commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5.

    The fix of bug #1945512 reintroduced bug #1837635 as after
    the initial backup state ha router can transition to
    'primary' state on multiple hosts and due to this
    delay multiple routers get into 'active' ha_state even
    if one of the host quickly transition to backup after
    the primary state.
    The issue got visible since ha router fullstack tests
    are added as part of [1].

    [1] https://review.opendev.org/c/openstack/neutron/+/917429

    Related-Bug: #1837635
    Related-Bug: #1945512
    Related-Bug: #2083609
    Change-Id: I83b53a07362861da98b8361dafd95e94e5048322
    (cherry picked from commit a3689956dde80b9639a3e805257ac02e1044a4c2)
    Conflicts:
            neutron/agent/l3/ha.py
            neutron/tests/unit/agent/l3/test_agent.py
    (cherry picked from commit f30d7c7093f7900ddb788891c4c3fbbc4037ca1b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (unmaintained/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/943631
Committed: https://opendev.org/openstack/neutron/commit/62131095e4e1aae7744e5de0cdae95aaefb09f30
Submitter: "Zuul (22348)"
Branch: unmaintained/zed

commit 62131095e4e1aae7744e5de0cdae95aaefb09f30
Author: yatinkarel <email address hidden>
Date: Mon Dec 16 09:59:52 2024 +0530

    Revert "[HA] Do not add initial state change delay in HA router"

    This reverts commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5.

    The fix of bug #1945512 reintroduced bug #1837635 as after
    the initial backup state ha router can transition to
    'primary' state on multiple hosts and due to this
    delay multiple routers get into 'active' ha_state even
    if one of the host quickly transition to backup after
    the primary state.
    The issue got visible since ha router fullstack tests
    are added as part of [1].

    [1] https://review.opendev.org/c/openstack/neutron/+/917429

    Related-Bug: #1837635
    Related-Bug: #1945512
    Related-Bug: #2083609
    Change-Id: I83b53a07362861da98b8361dafd95e94e5048322
    (cherry picked from commit a3689956dde80b9639a3e805257ac02e1044a4c2)
    Conflicts:
            neutron/agent/l3/ha.py
            neutron/tests/unit/agent/l3/test_agent.py
    (cherry picked from commit f30d7c7093f7900ddb788891c4c3fbbc4037ca1b)

tags: added: in-unmaintained-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (unmaintained/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/943632
Committed: https://opendev.org/openstack/neutron/commit/20694907593ad0fd0f28a963d606d31b76670f93
Submitter: "Zuul (22348)"
Branch: unmaintained/yoga

commit 20694907593ad0fd0f28a963d606d31b76670f93
Author: yatinkarel <email address hidden>
Date: Mon Dec 16 09:59:52 2024 +0530

    Revert "[HA] Do not add initial state change delay in HA router"

    This reverts commit c20f2e5136fd241f4be5c37403ab1ed54cdaefb5.

    The fix of bug #1945512 reintroduced bug #1837635 as after
    the initial backup state ha router can transition to
    'primary' state on multiple hosts and due to this
    delay multiple routers get into 'active' ha_state even
    if one of the host quickly transition to backup after
    the primary state.
    The issue got visible since ha router fullstack tests
    are added as part of [1].

    [1] https://review.opendev.org/c/openstack/neutron/+/917429

    Related-Bug: #1837635
    Related-Bug: #1945512
    Related-Bug: #2083609
    Change-Id: I83b53a07362861da98b8361dafd95e94e5048322
    (cherry picked from commit a3689956dde80b9639a3e805257ac02e1044a4c2)
    Conflicts:
            neutron/agent/l3/ha.py
            neutron/tests/unit/agent/l3/test_agent.py
    (cherry picked from commit f30d7c7093f7900ddb788891c4c3fbbc4037ca1b)

tags: added: in-unmaintained-yoga