Load balancers may be stuck in PENDING_UPDATE in case of DB outage

Bug #2036952 reported by Gregory Thiemonge
Affects: octavia
Status: Fix Released
Importance: Medium
Assigned to: Gregory Thiemonge

Bug Description

When a DB outage occurs during the update/creation/deletion of a load balancer or one of its resources, the LB may be stuck in PENDING_UPDATE.

When a flow fails, taskflow reverts it, and the revert is expected to handle every error. In many flows, the last revert task sets the provisioning status of the LB to ERROR (in some flows it instead sets the resource to ERROR and the LB to ACTIVE).

https://opendev.org/openstack/octavia/src/commit/33fed53043091e5016a9114dfd803092a09a718b/octavia/controller/worker/v2/flows/load_balancer_flows.py#L392
https://opendev.org/openstack/octavia/src/commit/33fed53043091e5016a9114dfd803092a09a718b/octavia/controller/worker/task_utils.py#L173-L176

But if the DB is down, this status update may itself fail, and the LB is left stuck in a PENDING_* state.
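
The failure mode can be reproduced with a small simulation. This is only a sketch: `FakeRepo`, `DatabaseDown` and `revert_to_error` are hypothetical names made up for illustration, not Octavia's real classes or functions.

```python
class DatabaseDown(Exception):
    """Stand-in for whatever error the DB layer raises during an outage."""


class FakeRepo:
    """Hypothetical in-memory repository, not Octavia's real one."""

    def __init__(self):
        self.available = True
        self.status = {}

    def update(self, lb_id, provisioning_status):
        if not self.available:
            raise DatabaseDown("connection refused")
        self.status[lb_id] = provisioning_status


def revert_to_error(repo, lb_id):
    """Last revert step: try to unlock the LB, report failures and move on."""
    try:
        repo.update(lb_id, "ERROR")
        return True
    except Exception as exc:
        print(f"Failed to update load balancer {lb_id} "
              f"provisioning status to ERROR due to: {exc}")
        return False


repo = FakeRepo()
repo.update("lb-1", "PENDING_UPDATE")  # the API call locks the LB
repo.available = False                 # the outage starts while the flow runs
unlocked = revert_to_error(repo, "lb-1")
repo.available = True                  # the outage ends, but nothing retries
final_status = repo.status["lb-1"]
```

Because the only attempt to unlock the LB happened while the DB was unreachable, `final_status` is still the PENDING_* value once the DB comes back.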

In those cases, some ERROR-level log messages already indicate that the resource status may be incorrect:
- "Failed to update load balancer %(lb)s provisioning status to ERROR due to"
- "Failed to update amphora %(amp)s status to ERROR due to"
We could also add a warning log message that explicitly states that a load balancer's status is incorrect and that the load balancer may be locked (currently, not all error messages include the ID of the LB).

This would help admins find the locked LBs.
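
One possible shape for such a warning is sketched below. The logger name, the function `warn_possibly_locked` and the message wording are all assumptions for illustration. They are not the text that was eventually merged.

```python
import logging

LOG = logging.getLogger("lb_status_sketch")  # hypothetical logger name


def warn_possibly_locked(lb_id, current_status, exc):
    """Emit one warning that names the LB and says it may be locked."""
    message = (
        "Load balancer %(lb)s may be locked: its provisioning status is "
        "still %(status)s because the update failed (%(exc)s)"
    )
    params = {"lb": lb_id, "status": current_status, "exc": exc}
    LOG.warning(message, params)
    # Return the rendered text so callers (and tests) can inspect it.
    return message % params


text = warn_possibly_locked(
    "lb-1", "PENDING_UPDATE", ConnectionError("DB unreachable"))
```

Putting the LB ID in every message lets an admin search the logs for "may be locked" and collect the affected IDs in one pass.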

One way to mitigate this issue would be to keep retrying the DB update over a long period (possibly a few hours, until the DB outage is resolved). Using tenacity in the TaskUtils methods could be a solution.
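
The retry idea can be sketched in plain Python. Note that this uses a hand-written decorator as a stand-in for tenacity, so it runs without third-party packages. `retry_incrementing`, `FlakyDB` and `mark_lb_error` are hypothetical names, and the delay parameters are illustrative.

```python
import functools
import time


def retry_incrementing(attempts, initial_delay, increment, max_delay,
                       sleep=time.sleep):
    """Retry a function on any exception, with a growing, capped delay."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # budget exhausted, let the caller log it
                    sleep(min(delay, max_delay))
                    delay += increment
        return wrapper
    return decorator


class FlakyDB:
    """Hypothetical DB that is unreachable for the first N calls."""

    def __init__(self, failures):
        self.failures = failures
        self.status = {"lb-1": "PENDING_UPDATE"}

    def update(self, lb_id, provisioning_status):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("DB unreachable")
        self.status[lb_id] = provisioning_status


db = FlakyDB(failures=7)
slept = []  # record the delays instead of really sleeping


@retry_incrementing(attempts=2000, initial_delay=1, increment=1,
                    max_delay=5, sleep=slept.append)
def mark_lb_error(lb_id):
    db.update(lb_id, "ERROR")


mark_lb_error("lb-1")
```

Injecting `sleep` keeps the example fast and makes the schedule observable. In real use the default `time.sleep` would apply, and the LB would be unlocked as soon as the DB answers again.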

Changed in octavia:
assignee: nobody → Gregory Thiemonge (gthiemonge)
importance: Undecided → Medium
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/octavia/+/896383

Changed in octavia:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to octavia (master)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/896383
Committed: https://opendev.org/openstack/octavia/commit/be91493332786365b8e997fcf88779a12d1ae130
Submitter: "Zuul (22348)"
Branch: master

commit be91493332786365b8e997fcf88779a12d1ae130
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 25 07:48:01 2023 -0400

    Retry to set loadbalancer prov status on failures

    In case of DB outages when a flow is running, an exception is caught and
    the flow is reverted. In most of the flows, the revert function of the
    first task's (the last to be reverted) unlocks the load balancer by
    setting its provisioning status (to ERROR or ACTIVE, depending on the
    flow), but it fails if the DB is not reachable, leaving the LB in
    a PENDING_* state.
    This commit adds tenacity.retry to those functions, Octavia retries to
    set the status during ~2h45 (2000 attempts, 1 sec initial delay, 5 sec
    max delay).

    Closes-Bug: #2036952
    Change-Id: I458dd6d6f5383edc24116ea0fa27e3a593044146
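
The "~2h45" figure in the commit message can be checked with a short calculation. This sketch assumes the delay grows by 1 second per attempt until it reaches the 5 second cap. That increment is not stated in the commit message. It also counts only the time spent sleeping between attempts, not the time each failed DB call takes.

```python
def total_wait_seconds(attempts, initial_delay, increment, max_delay):
    """Sum the sleeps between attempts (there is none after the last one)."""
    total = 0
    delay = initial_delay
    for _ in range(attempts - 1):
        total += min(delay, max_delay)
        delay += increment
    return total


seconds = total_wait_seconds(attempts=2000, initial_delay=1,
                             increment=1, max_delay=5)
hours, remainder = divmod(seconds, 3600)
minutes = remainder // 60
print(f"{seconds} s of waiting, about {hours}h{minutes:02d}")
```

Under these assumptions the waits are 1 + 2 + 3 + 4 seconds followed by 1995 waits of 5 seconds, which is consistent with the order of magnitude quoted in the commit message.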

Changed in octavia:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/octavia/+/897772

OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/octavia/+/897777

OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/octavia/+/897778

OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/octavia/+/897779

OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/octavia/+/897780

OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/octavia/+/897781

OpenStack Infra (hudson-openstack) wrote : Fix merged to octavia (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/897772
Committed: https://opendev.org/openstack/octavia/commit/ef4b4d5007c51e13b26f243b94c3075dc092794b
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit ef4b4d5007c51e13b26f243b94c3075dc092794b
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 25 07:48:01 2023 -0400

    Retry to set loadbalancer prov status on failures

    In case of DB outages when a flow is running, an exception is caught and
    the flow is reverted. In most of the flows, the revert function of the
    first task's (the last to be reverted) unlocks the load balancer by
    setting its provisioning status (to ERROR or ACTIVE, depending on the
    flow), but it fails if the DB is not reachable, leaving the LB in
    a PENDING_* state.
    This commit adds tenacity.retry to those functions, Octavia retries to
    set the status during ~2h45 (2000 attempts, 1 sec initial delay, 5 sec
    max delay).

    Closes-Bug: #2036952
    Change-Id: I458dd6d6f5383edc24116ea0fa27e3a593044146
    (cherry picked from commit be91493332786365b8e997fcf88779a12d1ae130)

OpenStack Infra (hudson-openstack) wrote : Fix merged to octavia (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/897781
Committed: https://opendev.org/openstack/octavia/commit/09b249d0848d315c330c3b5d78c75821ba8c4fe8
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 09b249d0848d315c330c3b5d78c75821ba8c4fe8
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 25 07:48:01 2023 -0400

    Retry to set loadbalancer prov status on failures

    In case of DB outages when a flow is running, an exception is caught and
    the flow is reverted. In most of the flows, the revert function of the
    first task's (the last to be reverted) unlocks the load balancer by
    setting its provisioning status (to ERROR or ACTIVE, depending on the
    flow), but it fails if the DB is not reachable, leaving the LB in
    a PENDING_* state.
    This commit adds tenacity.retry to those functions, Octavia retries to
    set the status during ~2h45 (2000 attempts, 1 sec initial delay, 5 sec
    max delay).

    Note: stable/2023.1 and older, the patch also includes modifications for
    v1/tasks/lifecycle_tasks.py

    Conflicts:
            octavia/common/config.py
            octavia/tests/unit/controller/worker/test_task_utils.py
            octavia/controller/worker/v1/tasks/lifecycle_tasks.py

    Closes-Bug: #2036952
    Change-Id: I458dd6d6f5383edc24116ea0fa27e3a593044146
    (cherry picked from commit be91493332786365b8e997fcf88779a12d1ae130)
    (cherry picked from commit 96782e2c543bfb488c7629e9ba2cf009b2a6b033)
    (cherry picked from commit 57833dbdad964f8a7861a0ab1d2847159f483577)
    (cherry picked from commit 27060603db697c64dd9d3a30910d67fed7b5906e)
    (cherry picked from commit 5a411a68558b8e81552678d79b876e7143025f80)
    (cherry picked from commit c8a2cb4fbdbf357129bf52c98fbe2d8517dd557c)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix merged to octavia (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/897777
Committed: https://opendev.org/openstack/octavia/commit/57833dbdad964f8a7861a0ab1d2847159f483577
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 57833dbdad964f8a7861a0ab1d2847159f483577
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 25 07:48:01 2023 -0400

    Retry to set loadbalancer prov status on failures

    In case of DB outages when a flow is running, an exception is caught and
    the flow is reverted. In most of the flows, the revert function of the
    first task's (the last to be reverted) unlocks the load balancer by
    setting its provisioning status (to ERROR or ACTIVE, depending on the
    flow), but it fails if the DB is not reachable, leaving the LB in
    a PENDING_* state.
    This commit adds tenacity.retry to those functions, Octavia retries to
    set the status during ~2h45 (2000 attempts, 1 sec initial delay, 5 sec
    max delay).

    Note: stable/2023.1 and older, the patch also includes modifications for
    v1/tasks/lifecycle_tasks.py

    Conflicts:
            octavia/tests/unit/controller/worker/test_task_utils.py
            octavia/controller/worker/v1/tasks/lifecycle_tasks.py

    Closes-Bug: #2036952
    Change-Id: I458dd6d6f5383edc24116ea0fa27e3a593044146
    (cherry picked from commit be91493332786365b8e997fcf88779a12d1ae130)
    (cherry picked from commit 96782e2c543bfb488c7629e9ba2cf009b2a6b033)

OpenStack Infra (hudson-openstack) wrote : Fix merged to octavia (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/897780
Committed: https://opendev.org/openstack/octavia/commit/c8a2cb4fbdbf357129bf52c98fbe2d8517dd557c
Submitter: "Zuul (22348)"
Branch: stable/xena

commit c8a2cb4fbdbf357129bf52c98fbe2d8517dd557c
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 25 07:48:01 2023 -0400

    Retry to set loadbalancer prov status on failures

    In case of DB outages when a flow is running, an exception is caught and
    the flow is reverted. In most of the flows, the revert function of the
    first task's (the last to be reverted) unlocks the load balancer by
    setting its provisioning status (to ERROR or ACTIVE, depending on the
    flow), but it fails if the DB is not reachable, leaving the LB in
    a PENDING_* state.
    This commit adds tenacity.retry to those functions, Octavia retries to
    set the status during ~2h45 (2000 attempts, 1 sec initial delay, 5 sec
    max delay).

    Note: stable/2023.1 and older, the patch also includes modifications for
    v1/tasks/lifecycle_tasks.py

    Conflicts:
            octavia/common/config.py
            octavia/tests/unit/controller/worker/test_task_utils.py
            octavia/controller/worker/v1/tasks/lifecycle_tasks.py

    Closes-Bug: #2036952
    Change-Id: I458dd6d6f5383edc24116ea0fa27e3a593044146
    (cherry picked from commit be91493332786365b8e997fcf88779a12d1ae130)
    (cherry picked from commit 96782e2c543bfb488c7629e9ba2cf009b2a6b033)
    (cherry picked from commit 57833dbdad964f8a7861a0ab1d2847159f483577)
    (cherry picked from commit 27060603db697c64dd9d3a30910d67fed7b5906e)
    (cherry picked from commit 5a411a68558b8e81552678d79b876e7143025f80)

tags: added: in-stable-xena
tags: added: in-stable-zed
OpenStack Infra (hudson-openstack) wrote : Fix merged to octavia (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/897778
Committed: https://opendev.org/openstack/octavia/commit/27060603db697c64dd9d3a30910d67fed7b5906e
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 27060603db697c64dd9d3a30910d67fed7b5906e
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 25 07:48:01 2023 -0400

    Retry to set loadbalancer prov status on failures

    In case of DB outages when a flow is running, an exception is caught and
    the flow is reverted. In most of the flows, the revert function of the
    first task's (the last to be reverted) unlocks the load balancer by
    setting its provisioning status (to ERROR or ACTIVE, depending on the
    flow), but it fails if the DB is not reachable, leaving the LB in
    a PENDING_* state.
    This commit adds tenacity.retry to those functions, Octavia retries to
    set the status during ~2h45 (2000 attempts, 1 sec initial delay, 5 sec
    max delay).

    Note: stable/2023.1 and older, the patch also includes modifications for
    v1/tasks/lifecycle_tasks.py

    Conflicts:
            octavia/common/config.py
            octavia/tests/unit/controller/worker/test_task_utils.py
            octavia/controller/worker/v1/tasks/lifecycle_tasks.py

    Closes-Bug: #2036952
    Change-Id: I458dd6d6f5383edc24116ea0fa27e3a593044146
    (cherry picked from commit be91493332786365b8e997fcf88779a12d1ae130)
    (cherry picked from commit 96782e2c543bfb488c7629e9ba2cf009b2a6b033)
    (cherry picked from commit 57833dbdad964f8a7861a0ab1d2847159f483577)

tags: added: in-stable-yoga
OpenStack Infra (hudson-openstack) wrote : Fix merged to octavia (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/897779
Committed: https://opendev.org/openstack/octavia/commit/5a411a68558b8e81552678d79b876e7143025f80
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 5a411a68558b8e81552678d79b876e7143025f80
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 25 07:48:01 2023 -0400

    Retry to set loadbalancer prov status on failures

    In case of DB outages when a flow is running, an exception is caught and
    the flow is reverted. In most of the flows, the revert function of the
    first task's (the last to be reverted) unlocks the load balancer by
    setting its provisioning status (to ERROR or ACTIVE, depending on the
    flow), but it fails if the DB is not reachable, leaving the LB in
    a PENDING_* state.
    This commit adds tenacity.retry to those functions, Octavia retries to
    set the status during ~2h45 (2000 attempts, 1 sec initial delay, 5 sec
    max delay).

    Note: stable/2023.1 and older, the patch also includes modifications for
    v1/tasks/lifecycle_tasks.py

    Conflicts:
            octavia/common/config.py
            octavia/tests/unit/controller/worker/test_task_utils.py
            octavia/controller/worker/v1/tasks/lifecycle_tasks.py

    Closes-Bug: #2036952
    Change-Id: I458dd6d6f5383edc24116ea0fa27e3a593044146
    (cherry picked from commit be91493332786365b8e997fcf88779a12d1ae130)
    (cherry picked from commit 96782e2c543bfb488c7629e9ba2cf009b2a6b033)
    (cherry picked from commit 57833dbdad964f8a7861a0ab1d2847159f483577)
    (cherry picked from commit 27060603db697c64dd9d3a30910d67fed7b5906e)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/octavia 10.1.1

This issue was fixed in the openstack/octavia 10.1.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/octavia 11.0.2

This issue was fixed in the openstack/octavia 11.0.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/octavia wallaby-eom

This issue was fixed in the openstack/octavia wallaby-eom release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/octavia xena-eom

This issue was fixed in the openstack/octavia xena-eom release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/octavia 14.0.0.0rc1

This issue was fixed in the openstack/octavia 14.0.0.0rc1 release candidate.
