octavia

A failover of an ACTIVE_STANDBY LB recreates only one amphora when both amps are failing

Bug #2033734 reported by Gregory Thiemonge on 2023-09-01

This bug affects 2 people

Affects    Status          Importance    Assigned to          Milestone
octavia    Fix Released    Medium        Gregory Thiemonge

Bug Description

When both amphorae of an ACTIVE_STANDBY LB are failing, a failover is triggered but only one of the amphorae is recreated by the Octavia HM.

Consider an ACTIVE_STANDBY LB (with 2 amphorae):

Delete/stop both amphorae at the same time.

$ openstack server list --all -c ID -c Name
+--------------------------------------+----------------------------------------------+
| ID | Name |
+--------------------------------------+----------------------------------------------+
| a7748cea-417c-49ca-97e2-5548b67d4468 | amphora-0498849c-3d6b-463d-92cd-7c097b06aa37 |
| e9222b24-2704-49b7-8c8e-35297b6fef7c | amphora-6fc69002-9f02-498e-8c1e-4e5f8bdb798a |
+--------------------------------------+----------------------------------------------+

$ openstack server stop a7748cea-417c-49ca-97e2-5548b67d4468; openstack server stop e9222b24-2704-49b7-8c8e-35297b6fef7c

After one minute, the Octavia HM detects that one of the amphorae is missing and triggers a failover.

After a few minutes, only one new amphora was recreated and the other one (an "old" amp) is marked in ERROR.

The logs show that when recreating the first amp, the failover flow updates all the amphorae (to enable VRRP or sync the list of peers in haproxy), but the tasks cannot reach the other failing amphora, and they mark it in ERROR:

Sep 01 02:19:34 gthiemon-devstack octavia-health-manager[1802183]: INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 0498849c-3d6b-463d-92cd-7c097b06aa37
Sep 01 02:19:35 gthiemon-devstack octavia-health-manager[1802183]: INFO octavia.controller.healthmanager.health_manager [-] Waiting for 1 failovers to finish
Sep 01 02:19:35 gthiemon-devstack octavia-health-manager[1802183]: DEBUG octavia.common.base_taskflow [-] Waiting for job get_failover_amphora_flow-4067afce-2398-4529-bfb5-4c7e50c92695 to finish {{(pid=1802183) _wait_for_job /opt/stack/octavia/octavia/common/base_taskflow.py:203}}
Sep 01 02:19:35 gthiemon-devstack octavia-worker[1674294]: INFO octavia.controller.worker.v2.flows.amphora_flows [-] Performing failover for amphora: {'id': '0498849c-3d6b-463d-92cd-7c097b06aa37', 'load_balancer_id': '582ed564-bb30-41a4-84a7-7ccdbc5ac7c0', 'lb_network_ip': '192.168.0.4', 'compute_id': 'a7748cea-417c-49ca-97e2-5548b67d4468', 'role': 'master_or_backup'}
Sep 01 02:19:35 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Flow 'get_failover_amphora_flow-4067afce-2398-4529-bfb5-4c7e50c92695' (4067afce-2398-4529-bfb5-4c7e50c92695) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=1674294) _flow_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:141}}
[..]

Sep 01 02:20:27 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Task 'octavia-get-vrrp-subflow-octavia-get-vrrp-subflow-1-octavia-amphora-vrrp-start' (6d80642e-2607-41bb-a669-a2c75ba365cc) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' {{(pid=1674294) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:178}}
Sep 01 02:20:35 gthiemon-devstack octavia-worker[1674294]: WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f3321f61940>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
Sep 01 02:20:44 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Extend expiry for job get_failover_amphora_flow-4067afce-2398-4529-bfb5-4c7e50c92695 {{(pid=1674294) _extend_jobs /opt/stack/octavia/octavia/common/base_taskflow.py:250}}
Sep 01 02:20:47 gthiemon-devstack octavia-worker[1674294]: WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f332280bc10>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
[..]
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: ERROR octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connection retries (currently set to 15) exhausted. The amphora is unavailable. Reason: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f331b50ca00>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Task 'octavia-get-vrrp-subflow-octavia-get-vrrp-subflow-0-octavia-amphora-update-vrrp-intf' (1847c722-3a55-433d-bbda-7691aae8ae22) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' {{(pid=1674294) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:178}}
[..]

Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Task 'octavia-get-vrrp-subflow-octavia-get-vrrp-subflow-0-octavia-amphora-vrrp-update' (f0fda9fa-0845-4d74-a2bd-ebdcbad4d869) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=1674294) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.amphorae.drivers.keepalived.vrrp_rest_driver [-] Update amphora 6fc69002-9f02-498e-8c1e-4e5f8bdb798a VRRP configuration. {{(pid=1674294) update_vrrp_conf /opt/stack/octavia/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py:50}}
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] request url / {{(pid=1674294) request /opt/stack/octavia/octavia/amphorae/drivers/haproxy/rest_api_driver.py:652}}
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] request url https://192.168.0.106:9443// {{(pid=1674294) request /opt/stack/octavia/octavia/amphorae/drivers/haproxy/rest_api_driver.py:655}}
Sep 01 02:23:35 gthiemon-devstack octavia-worker[1674294]: WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f332179e0d0>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
Sep 01 02:23:47 gthiemon-devstack octavia-worker[1674294]: WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f3320bc15e0>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
[..]
Sep 01 02:26:26 gthiemon-devstack octavia-worker[1674294]: ERROR octavia.controller.worker.v2.tasks.amphora_driver_tasks [-] Failed to update VRRP configuration amphora 6fc69002-9f02-498e-8c1e-4e5f8bdb798a. Skipping this amphora as it is failing to update due to: contacting the amphora timed out: octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out
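
The time each of these tasks spends blocked on the stopped amphora can be estimated from values visible in the log above: 15 connection retries and a 10 s connect timeout. The 2 s pause between attempts is an assumption, inferred from two consecutive warnings (02:20:35 and 02:20:47 are 12 s apart); it is not stated in the log. A rough check against the last task (started at 02:23:25, failed at 02:26:26):

```python
from datetime import datetime

CONNECT_TIMEOUT = 10.0  # from the log: "connect timeout=10.0"
MAX_RETRIES = 15        # from the log: "currently set to 15"
RETRY_INTERVAL = 2.0    # assumption: inferred from the 12 s spacing of the warnings

# Estimated time one task blocks before giving up on the unreachable amphora.
estimated = MAX_RETRIES * (CONNECT_TIMEOUT + RETRY_INTERVAL)

# Observed duration of the vrrp-update task, taken from the log timestamps.
fmt = "%H:%M:%S"
started = datetime.strptime("02:23:25", fmt)
failed = datetime.strptime("02:26:26", fmt)
observed = (failed - started).total_seconds()

print(f"estimated: {estimated:.0f} s, observed: {observed:.0f} s")
```

Under these assumptions, every task that has to contact the stopped amphora delays the failover flow by roughly three minutes.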

Each of these tasks can set the status of the 2nd amphora to ERROR [1].
As the new failover mechanism doesn't allow automatic failovers on amps with an ERROR status [2], the 2nd amp is never recreated. Users could mitigate this issue by manually performing a failover on the LB or the amphora.
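
The interaction between these two behaviours can be illustrated with a small, self-contained model. This is not Octavia code: the `Amphora` class, the status strings, `update_sibling` and `select_for_auto_failover` are hypothetical stand-ins for the task in [1] and the repository query in [2].

```python
from dataclasses import dataclass

# Hypothetical status values, used for this illustration only.
ALLOCATED = "ALLOCATED"
ERROR = "ERROR"


@dataclass
class Amphora:
    id: str
    status: str = ALLOCATED
    reachable: bool = True


def update_sibling(amp: Amphora) -> None:
    """Model of a failover task that must contact every amphora of the LB.

    An unreachable amphora is marked ERROR, as described for [1].
    """
    if not amp.reachable:
        amp.status = ERROR


def select_for_auto_failover(amps):
    """Model of the health-manager selection.

    Failing amphorae already in ERROR are skipped, as described for [2].
    """
    return [a for a in amps if not a.reachable and a.status != ERROR]


# Both amphorae of the ACTIVE_STANDBY LB are down.
first = Amphora("amp-1", reachable=False)
second = Amphora("amp-2", reachable=False)

# The health manager fails over the first amphora. While the replacement
# is being configured, the flow also tries to update the sibling.
update_sibling(second)
replacement = Amphora("amp-1-new")

# On the next health-manager pass, nothing is eligible for failover.
remaining = select_for_auto_failover([replacement, second])
print(second.status, remaining)
```

In this model the second amphora ends up in ERROR and the selection returns an empty list, which matches the observed behaviour: no second automatic failover is triggered.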

One solution would be to not set the ERROR status on the sibling amphorae during the failover of an amphora (we could pass an extra argument to the Task to skip it).
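
A minimal sketch of that idea, assuming a hypothetical `mark_error_on_failure` argument. The function, exception and parameter names below are illustrative only; they are not taken from the Octavia code base or from the patch that was eventually merged.

```python
import logging

LOG = logging.getLogger(__name__)


class AmphoraUnreachable(Exception):
    """Hypothetical stand-in for a driver timeout."""


def update_vrrp_conf(amphora, push_config, mark_error_on_failure=True):
    """Push a configuration to one amphora and report whether it worked.

    amphora: dict with 'id' and 'status' keys (illustrative data model).
    push_config: callable that raises AmphoraUnreachable on timeout.
    mark_error_on_failure: when False, an unreachable amphora keeps its
        current status so that the health manager can still fail it over.
    """
    try:
        push_config(amphora)
        return True
    except AmphoraUnreachable as exc:
        LOG.error("Failed to update amphora %s: %s", amphora["id"], exc)
        if mark_error_on_failure:
            amphora["status"] = "ERROR"
        return False


def unreachable(_amphora):
    # Simulates the stopped sibling amphora.
    raise AmphoraUnreachable("contacting the amphora timed out")


# During the failover of one amphora, the sibling is updated without
# touching its status.
sibling = {"id": "amp-2", "status": "ALLOCATED"}
ok = update_vrrp_conf(sibling, unreachable, mark_error_on_failure=False)
print(ok, sibling["status"])
```

With the flag disabled, the call still reports the failure through its return value and the log, but the sibling stays eligible for a later automatic failover.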

[1] https://opendev.org/openstack/octavia/src/branch/master/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py#L567-L569
[2] https://opendev.org/openstack/octavia/src/branch/master/octavia/db/repositories.py#L1574-L1575


OpenStack Infra (hudson-openstack) on 2023-09-04

Changed in octavia:
status:	New → In Progress

Gregory Thiemonge (gthiemonge) on 2023-09-18

Changed in octavia:
assignee:	nobody → Gregory Thiemonge (gthiemonge)
importance:	Undecided → Medium


OpenStack Infra (hudson-openstack) wrote on 2023-10-09: Fix merged to octavia (master)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/893612
Committed: https://opendev.org/openstack/octavia/commit/248cf2893ea913a74d870a763802a51586e0f63b
Submitter: "Zuul (22348)"
Branch: master

commit 248cf2893ea913a74d870a763802a51586e0f63b
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 4 05:56:25 2023 -0400

Fix amphorae in ERROR during the failover

    When 2 amps were down, the failover flow created the first one and
    needed to update both amps to configure VRRP, but as the 2nd was missing,
    it was set to ERROR. Then the health-manager could not trigger a
    failover because amphorae in ERROR are excluded from the automated
    failover process.

    This commit changes the tasks that must be run on both amphorae during a
    failover of one amphora, it doesn't mark the secondary amphora in ERROR
    if it is not reachable.

Closes-Bug: #2033734

Change-Id: I4bd027346c61b93b537ab53810c2ecb6160b6be2

Changed in octavia:
status:	In Progress → Fix Released


OpenStack Infra (hudson-openstack) wrote on 2023-10-10: Fix proposed to octavia (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/octavia/+/897785


OpenStack Infra (hudson-openstack) wrote on 2023-10-10: Fix proposed to octavia (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/octavia/+/897788


OpenStack Infra (hudson-openstack) wrote on 2023-10-10: Fix proposed to octavia (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/octavia/+/897791


OpenStack Infra (hudson-openstack) wrote on 2023-10-12: Fix proposed to octavia (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/octavia/+/898103


OpenStack Infra (hudson-openstack) wrote on 2023-10-12: Fix proposed to octavia (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/octavia/+/898106


OpenStack Infra (hudson-openstack) wrote on 2023-10-12: Fix proposed to octavia (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/octavia/+/898114


OpenStack Infra (hudson-openstack) wrote on 2023-10-16: Fix merged to octavia (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/897785
Committed: https://opendev.org/openstack/octavia/commit/5622431534d3bf9b8b71ad8fdb2d744f94625ec4
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 5622431534d3bf9b8b71ad8fdb2d744f94625ec4
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 4 05:56:25 2023 -0400

Fix amphorae in ERROR during the failover

    This commit changes the tasks that must be run on both amphorae during a
    failover of one amphora, it doesn't mark the secondary amphora in ERROR
    if it is not reachable.

Closes-Bug: #2033734

Change-Id: I4bd027346c61b93b537ab53810c2ecb6160b6be2
(cherry picked from commit 248cf2893ea913a74d870a763802a51586e0f63b)


OpenStack Infra (hudson-openstack) wrote on 2023-10-16: Fix merged to octavia (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/octavia/+/897791
Committed: https://opendev.org/openstack/octavia/commit/10cfd8c7605e7789ec1ae3de511b3a84b8324e04
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 10cfd8c7605e7789ec1ae3de511b3a84b8324e04
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 4 05:56:25 2023 -0400

Fix amphorae in ERROR during the failover

    This commit changes the tasks that must be run on both amphorae during a
    failover of one amphora, it doesn't mark the secondary amphora in ERROR
    if it is not reachable.

Closes-Bug: #2033734

Note: stable/2023.1 and older, the patch also includes modifications in
octavia/controller/worker/v1/

Conflicts:
octavia/controller/worker/v2/tasks/amphora_driver_tasks.py

    Change-Id: I4bd027346c61b93b537ab53810c2ecb6160b6be2
    (cherry picked from commit 248cf2893ea913a74d870a763802a51586e0f63b)
    (cherry picked from commit 5622431534d3bf9b8b71ad8fdb2d744f94625ec4)

tags:

added: in-stable-zed


OpenStack Infra (hudson-openstack) wrote on 2023-10-16: Fix merged to octavia (stable/zed)


Reviewed: https://review.opendev.org/c/openstack/octavia/+/897788
Committed: https://opendev.org/openstack/octavia/commit/d09065c4f6c08b07930fa2f8ef6794de5417c2c0
Submitter: "Zuul (22348)"
Branch: stable/zed

commit d09065c4f6c08b07930fa2f8ef6794de5417c2c0
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 4 05:56:25 2023 -0400

Fix amphorae in ERROR during the failover

    This commit changes the tasks that must be run on both amphorae during a
    failover of one amphora, it doesn't mark the secondary amphora in ERROR
    if it is not reachable.

Closes-Bug: #2033734

Note: stable/2023.1 and older, the patch also includes modifications in
octavia/controller/worker/v1/

Conflicts:
octavia/controller/worker/v2/tasks/amphora_driver_tasks.py

tags:

added: in-stable-yoga


OpenStack Infra (hudson-openstack) wrote on 2023-10-16: Fix merged to octavia (stable/yoga)


Reviewed: https://review.opendev.org/c/openstack/octavia/+/898103
Committed: https://opendev.org/openstack/octavia/commit/488adb29b68074d345f6ac92abb27ae4282847d3
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 488adb29b68074d345f6ac92abb27ae4282847d3
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 4 05:56:25 2023 -0400

Fix amphorae in ERROR during the failover

    This commit changes the tasks that must be run on both amphorae during a
    failover of one amphora, it doesn't mark the secondary amphora in ERROR
    if it is not reachable.

Closes-Bug: #2033734

Note: stable/2023.1 and older, the patch also includes modifications in
octavia/controller/worker/v1/

Conflicts:
octavia/controller/worker/v2/tasks/amphora_driver_tasks.py

    Change-Id: I4bd027346c61b93b537ab53810c2ecb6160b6be2
    (cherry picked from commit 248cf2893ea913a74d870a763802a51586e0f63b)
    (cherry picked from commit 5622431534d3bf9b8b71ad8fdb2d744f94625ec4)
    (cherry picked from commit 10cfd8c7605e7789ec1ae3de511b3a84b8324e04)
    (cherry picked from commit d09065c4f6c08b07930fa2f8ef6794de5417c2c0)

tags:

added: in-stable-xena


OpenStack Infra (hudson-openstack) wrote on 2023-10-16: Fix merged to octavia (stable/xena)


Reviewed: https://review.opendev.org/c/openstack/octavia/+/898106
Committed: https://opendev.org/openstack/octavia/commit/2a71565107657a02a15ddb4a4b8ac1dd5b5a3a98
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 2a71565107657a02a15ddb4a4b8ac1dd5b5a3a98
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 4 05:56:25 2023 -0400

Fix amphorae in ERROR during the failover

    This commit changes the tasks that must be run on both amphorae during a
    failover of one amphora, it doesn't mark the secondary amphora in ERROR
    if it is not reachable.

Closes-Bug: #2033734

Note: stable/2023.1 and older, the patch also includes modifications in
octavia/controller/worker/v1/

Conflicts:
octavia/controller/worker/v2/tasks/amphora_driver_tasks.py

tags:

added: in-stable-wallaby


OpenStack Infra (hudson-openstack) wrote on 2023-10-16: Fix merged to octavia (stable/wallaby)


Reviewed: https://review.opendev.org/c/openstack/octavia/+/898114
Committed: https://opendev.org/openstack/octavia/commit/96b17f6108453c6aa0e4776283493ac517fa43c5
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 96b17f6108453c6aa0e4776283493ac517fa43c5
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 4 05:56:25 2023 -0400

Fix amphorae in ERROR during the failover

    This commit changes the tasks that must be run on both amphorae during a
    failover of one amphora, it doesn't mark the secondary amphora in ERROR
    if it is not reachable.

Closes-Bug: #2033734

Note: stable/2023.1 and older, the patch also includes modifications in
octavia/controller/worker/v1/

Conflicts:
octavia/controller/worker/v2/tasks/amphora_driver_tasks.py