When both amphorae of an ACTIVE_STANDBY LB are failing, a failover is triggered but only one of the amphorae is recreated by the Octavia HM.
Consider an ACTIVE_STANDBY LB (with 2 amphorae):
$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip       |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
| 6fc69002-9f02-498e-8c1e-4e5f8bdb798a | 582ed564-bb30-41a4-84a7-7ccdbc5ac7c0 | ALLOCATED | BACKUP | 192.168.0.106 | 172.24.4.14 |
| 0498849c-3d6b-463d-92cd-7c097b06aa37 | 582ed564-bb30-41a4-84a7-7ccdbc5ac7c0 | ALLOCATED | MASTER | 192.168.0.4   | 172.24.4.14 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
Delete/stop both amphorae at the same time.
$ openstack server list --all -c ID -c Name
+--------------------------------------+----------------------------------------------+
| ID                                   | Name                                         |
+--------------------------------------+----------------------------------------------+
| a7748cea-417c-49ca-97e2-5548b67d4468 | amphora-0498849c-3d6b-463d-92cd-7c097b06aa37 |
| e9222b24-2704-49b7-8c8e-35297b6fef7c | amphora-6fc69002-9f02-498e-8c1e-4e5f8bdb798a |
+--------------------------------------+----------------------------------------------+
$ openstack server stop a7748cea-417c-49ca-97e2-5548b67d4468; openstack server stop e9222b24-2704-49b7-8c8e-35297b6fef7c
After one minute, the Octavia HM detects that one of the amphorae is missing and triggers a failover.
$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+----------------+--------+---------------+-------------+
| id                                   | loadbalancer_id                      | status         | role   | lb_network_ip | ha_ip       |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+-------------+
| 6fc69002-9f02-498e-8c1e-4e5f8bdb798a | 582ed564-bb30-41a4-84a7-7ccdbc5ac7c0 | ALLOCATED      | BACKUP | 192.168.0.106 | 172.24.4.14 |
| 0498849c-3d6b-463d-92cd-7c097b06aa37 | 582ed564-bb30-41a4-84a7-7ccdbc5ac7c0 | PENDING_DELETE | MASTER | 192.168.0.4   | 172.24.4.14 |
| a6f3bf10-9a10-4a81-b941-41bce8cf7eb3 | 582ed564-bb30-41a4-84a7-7ccdbc5ac7c0 | BOOTING        | None   | 192.168.0.165 | None        |
+--------------------------------------+--------------------------------------+----------------+--------+---------------+-------------+
After a few minutes, only one new amphora has been created, and the other one (an "old" amp) is marked as ERROR.
$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip       |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
| 6fc69002-9f02-498e-8c1e-4e5f8bdb798a | 582ed564-bb30-41a4-84a7-7ccdbc5ac7c0 | ERROR     | BACKUP | 192.168.0.106 | 172.24.4.14 |
| a6f3bf10-9a10-4a81-b941-41bce8cf7eb3 | 582ed564-bb30-41a4-84a7-7ccdbc5ac7c0 | ALLOCATED | MASTER | 192.168.0.165 | 172.24.4.14 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
The logs show that when recreating the first amp, the failover flow updates all the amphorae (to enable VRRP or sync the list of peers in haproxy), but the task cannot reach the other failing amphora and marks it as ERROR:
Sep 01 02:19:34 gthiemon-devstack octavia-health-manager[1802183]: INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 0498849c-3d6b-463d-92cd-7c097b06aa37
Sep 01 02:19:35 gthiemon-devstack octavia-health-manager[1802183]: INFO octavia.controller.healthmanager.health_manager [-] Waiting for 1 failovers to finish
Sep 01 02:19:35 gthiemon-devstack octavia-health-manager[1802183]: DEBUG octavia.common.base_taskflow [-] Waiting for job get_failover_amphora_flow-4067afce-2398-4529-bfb5-4c7e50c92695 to finish {{(pid=1802183) _wait_for_job /opt/stack/octavia/octavia/common/base_taskflow.py:203}}
Sep 01 02:19:35 gthiemon-devstack octavia-worker[1674294]: INFO octavia.controller.worker.v2.flows.amphora_flows [-] Performing failover for amphora: {'id': '0498849c-3d6b-463d-92cd-7c097b06aa37', 'load_balancer_id': '582ed564-bb30-41a4-84a7-7ccdbc5ac7c0', 'lb_network_ip': '192.168.0.4', 'compute_id': 'a7748cea-417c-49ca-97e2-5548b67d4468', 'role': 'master_or_backup'}
Sep 01 02:19:35 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Flow 'get_failover_amphora_flow-4067afce-2398-4529-bfb5-4c7e50c92695' (4067afce-2398-4529-bfb5-4c7e50c92695) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=1674294) _flow_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:141}}
[..]
Sep 01 02:20:27 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Task 'octavia-get-vrrp-subflow-octavia-get-vrrp-subflow-1-octavia-amphora-vrrp-start' (6d80642e-2607-41bb-a669-a2c75ba365cc) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' {{(pid=1674294) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:178}}
Sep 01 02:20:35 gthiemon-devstack octavia-worker[1674294]: WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f3321f61940>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
Sep 01 02:20:44 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Extend expiry for job get_failover_amphora_flow-4067afce-2398-4529-bfb5-4c7e50c92695 {{(pid=1674294) _extend_jobs /opt/stack/octavia/octavia/common/base_taskflow.py:250}}
Sep 01 02:20:47 gthiemon-devstack octavia-worker[1674294]: WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f332280bc10>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
[..]
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: ERROR octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connection retries (currently set to 15) exhausted. The amphora is unavailable. Reason: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f331b50ca00>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Task 'octavia-get-vrrp-subflow-octavia-get-vrrp-subflow-0-octavia-amphora-update-vrrp-intf' (1847c722-3a55-433d-bbda-7691aae8ae22) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' {{(pid=1674294) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:178}}
[..]
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.common.base_taskflow [-] Task 'octavia-get-vrrp-subflow-octavia-get-vrrp-subflow-0-octavia-amphora-vrrp-update' (f0fda9fa-0845-4d74-a2bd-ebdcbad4d869) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=1674294) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.amphorae.drivers.keepalived.vrrp_rest_driver [-] Update amphora 6fc69002-9f02-498e-8c1e-4e5f8bdb798a VRRP configuration. {{(pid=1674294) update_vrrp_conf /opt/stack/octavia/octavia/amphorae/drivers/keepalived/vrrp_rest_driver.py:50}}
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] request url / {{(pid=1674294) request /opt/stack/octavia/octavia/amphorae/drivers/haproxy/rest_api_driver.py:652}}
Sep 01 02:23:25 gthiemon-devstack octavia-worker[1674294]: DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] request url https://192.168.0.106:9443// {{(pid=1674294) request /opt/stack/octavia/octavia/amphorae/drivers/haproxy/rest_api_driver.py:655}}
Sep 01 02:23:35 gthiemon-devstack octavia-worker[1674294]: WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f332179e0d0>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
Sep 01 02:23:47 gthiemon-devstack octavia-worker[1674294]: WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='192.168.0.106', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f3320bc15e0>, 'Connection to 192.168.0.106 timed out. (connect timeout=10.0)'))
[..]
Sep 01 02:26:26 gthiemon-devstack octavia-worker[1674294]: ERROR octavia.controller.worker.v2.tasks.amphora_driver_tasks [-] Failed to update VRRP configuration amphora 6fc69002-9f02-498e-8c1e-4e5f8bdb798a. Skipping this amphora as it is failing to update due to: contacting the amphora timed out: octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out
Each of these tasks can set the status of the 2nd amphora to ERROR [1].
As the new failover mechanism doesn't allow automatic failovers on amps with an ERROR status [2], the 2nd amp is never recreated. Users can mitigate this issue by manually performing a failover on the LB or the amphora.
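To illustrate why the 2nd amp stays stuck, here is a minimal Python sketch of the selection rule described above. It is not Octavia's code: the Amphora record, the automatic_failover_candidates helper, and the 60-second timeout are hypothetical names and values chosen for this example, and the rule itself (stale heartbeat and status other than ERROR) is an assumption based on the behaviour described in this report.

```python
from dataclasses import dataclass

ERROR = "ERROR"
ALLOCATED = "ALLOCATED"


@dataclass
class Amphora:
    """Hypothetical, simplified amphora record (not Octavia's model)."""
    id: str
    status: str
    seconds_since_heartbeat: float


def automatic_failover_candidates(amphorae, heartbeat_timeout=60.0):
    """Select amphorae eligible for an automatic failover.

    Assumption: an amphora qualifies when its heartbeat is stale and
    its status is not ERROR, as described in this report.
    """
    return [
        amp for amp in amphorae
        if amp.seconds_since_heartbeat > heartbeat_timeout
        and amp.status != ERROR
    ]


# State after the first failover: the old BACKUP amphora is down and in ERROR,
# the new amphora is healthy.
amphorae = [
    Amphora("6fc69002-9f02-498e-8c1e-4e5f8bdb798a", ERROR, 400.0),
    Amphora("a6f3bf10-9a10-4a81-b941-41bce8cf7eb3", ALLOCATED, 5.0),
]
candidates = automatic_failover_candidates(amphorae)
print([amp.id for amp in candidates])  # the ERROR amphora is not selected
```

With this rule, the stopped amphora would have been picked up if its status had been left unchanged, which is the point of the solution proposed in this report.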
One solution would be to not set the ERROR status on the sibling amphorae during the failover of an amphora (we could pass an extra argument to the Task to skip it).
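A rough Python sketch of that idea follows. It is only an illustration under stated assumptions, not the actual Octavia task: AmphoraUnreachable, push_vrrp_config, update_vrrp_on_amphora, and the mark_error_on_failure argument are hypothetical names invented for this example.

```python
ERROR = "ERROR"
ALLOCATED = "ALLOCATED"


class AmphoraUnreachable(Exception):
    """Stand-in for the driver timeout raised when an amphora is down."""


def push_vrrp_config(amphora_id, reachable_ids):
    """Hypothetical driver call; raises if the amphora cannot be contacted."""
    if amphora_id not in reachable_ids:
        raise AmphoraUnreachable(amphora_id)


def update_vrrp_on_amphora(amphora_id, statuses, reachable_ids,
                           mark_error_on_failure=True):
    """Update one amphora; optionally leave its status alone on failure.

    Returns True when the update succeeded, False when it was skipped.
    """
    try:
        push_vrrp_config(amphora_id, reachable_ids)
    except AmphoraUnreachable:
        if mark_error_on_failure:
            statuses[amphora_id] = ERROR
        return False
    return True


statuses = {"new-amp": ALLOCATED, "old-amp": ALLOCATED}
reachable = {"new-amp"}

# During a failover, siblings are updated without touching their status.
for amp_id in statuses:
    update_vrrp_on_amphora(amp_id, statuses, reachable,
                           mark_error_on_failure=False)
print(statuses["old-amp"])  # still ALLOCATED
```

Keeping the default at True preserves the existing behaviour for other callers in this sketch; only the failover path would opt out, so the unreachable sibling stays eligible for its own automatic failover.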
[1] https://opendev.org/openstack/octavia/src/branch/master/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py#L567-L569
[2] https://opendev.org/openstack/octavia/src/branch/master/octavia/db/repositories.py#L1574-L1575
Reviewed: https://review.opendev.org/c/openstack/octavia/+/893612
Committed: https://opendev.org/openstack/octavia/commit/248cf2893ea913a74d870a763802a51586e0f63b
Submitter: "Zuul (22348)"
Branch: master
commit 248cf2893ea913a74d870a763802a51586e0f63b
Author: Gregory Thiemonge <email address hidden>
Date: Mon Sep 4 05:56:25 2023 -0400
Fix amphorae in ERROR during the failover
When 2 amps were down, the failover flow created the first one and
needed to update both amps to configure VRRP, but as the 2nd was missing,
it was set to ERROR. Then the health-manager could not trigger a
failover because amphorae in ERROR are excluded from the automated
failover process.
This commit changes the tasks that must be run on both amphorae during a
failover of one amphora: they no longer mark the secondary amphora in
ERROR if it is not reachable.
Closes-Bug: #2033734
Change-Id: I4bd027346c61b93b537ab53810c2ecb6160b6be2