The revert of an LB creation flow may be incomplete

Bug #2043360 reported by Gregory Thiemonge
This bug affects 1 person
Affects: octavia
Status: In Progress
Importance: High
Assigned to: Gregory Thiemonge

Bug Description

Under some circumstances, the revert of a taskflow flow may be incomplete. The bug occurs with amphorav2 (with or without jobboard), requires the ACTIVE_STANDBY topology, and is not 100% reproducible.

For instance, when injecting a failure into a Task with a Retry (AmphoraComputeConnectivityWait):

diff --git a/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py b/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py
index 62d68051..84609805 100644
--- a/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py
+++ b/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py
@@ -675,6 +675,7 @@ class AmphoraComputeConnectivityWait(BaseAmphoraTask):
     def execute(self, amphora, raise_retry_exception=False):
         """Execute get_info routine for an amphora until it responds."""
         try:
+            raise driver_except.AmpConnectionRetry(exception="foo")
             session = db_apis.get_session()
             with session.begin():
                 db_amphora = self.amphora_repo.get(

and creating a load balancer, the subflow is reverted after the retries, but:
* the 2nd subflow (there are 2 subflows running concurrently when creating an A/S LB) is not reverted
* the "main" flow is not reverted

Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'PENDING' from state 'REVERTED' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTING' from state 'FAILURE' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'PENDING' from state 'REVERTED' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTING' from state 'FAILURE' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-amphora-info' (b86b6c05-39d0-4e01-b28d-104f26963f14) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-amphora-info' (b86b6c05-39d0-4e01-b28d-104f26963f14) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (c61e6fab-c79d-4745-8467-73a16a062a3a) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (c61e6fab-c79d-4745-8467-73a16a062a3a) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-mark-amphora-booting-indb' (56e61bf1-0c58-4685-87a2-52abb660b6b1) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.tasks.database_tasks [-] Reverting mark amphora booting in DB for amp id 5b377faa-01e0-4e02-b702-c6a12ab51a59 and compute id 5ba04bb8-a632-4c44-a03e-234ac87b5b16
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-mark-amphora-booting-indb' (56e61bf1-0c58-4685-87a2-52abb660b6b1) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-amphora-computeid' (95cd38b3-cea4-4ffe-a1db-677d471b1e8f) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-amphora-computeid' (95cd38b3-cea4-4ffe-a1db-677d471b1e8f) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-cert-compute-create' (776b4bfb-cb88-403c-bbbe-1404989799a8) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.tasks.compute_tasks [-] Reverting compute create for amphora with id 5b377faa-01e0-4e02-b702-c6a12ab51a59 and compute id: 5ba04bb8-a632-4c44-a03e-234ac87b5b16
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG novaclient.v2.client [-] REQ: curl -g -i --cacert "/opt/stack/data/ca-bundle.pem" -X DELETE http://192.168.1.101/compute/v2.1/servers/5ba04bb8-a632-4c44-a03e-234ac87b5b16 -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA256}df1b68dcf137685f869322dd195bd5e33031be72615bbf639ff8328b0debebc1" -H "X-OpenStack-Nova-API-Version: 2.15" {{(pid=3952342) _http_log_request /usr/local/lib/python3.9/site-packages/keystoneauth1/session.py:511}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG novaclient.v2.client [-] RESP: [204] Connection: close Content-Type: application/json Date: Fri, 10 Nov 2023 12:52:04 GMT OpenStack-API-Version: compute 2.15 Server: Apache/2.4.57 (CentOS Stream) OpenSSL/3.0.7 mod_wsgi/4.7.1 Python/3.9 Vary: OpenStack-API-Version,X-OpenStack-Nova-API-Version X-OpenStack-Nova-API-Version: 2.15 x-compute-request-id: req-44eb94ed-cbd3-461a-8cce-a7afb43fcbf3 x-openstack-request-id: req-44eb94ed-cbd3-461a-8cce-a7afb43fcbf3 {{(pid=3952342) _http_log_response /usr/local/lib/python3.9/site-packages/keystoneauth1/session.py:542}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG novaclient.v2.client [-] DELETE call to compute for http://192.168.1.101/compute/v2.1/servers/5ba04bb8-a632-4c44-a03e-234ac87b5b16 used request id req-44eb94ed-cbd3-461a-8cce-a7afb43fcbf3 {{(pid=3952342) request /usr/local/lib/python3.9/site-packages/keystoneauth1/session.py:946}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-cert-compute-create' (776b4bfb-cb88-403c-bbbe-1404989799a8) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-cert-expiration' (fb9de337-37f5-4789-a6a4-ac2898805542) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-cert-expiration' (fb9de337-37f5-4789-a6a4-ac2898805542) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-generate-serverpem' (d4adb1f9-e4ef-4ac6-b9b2-7fa03401468e) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-generate-serverpem' (d4adb1f9-e4ef-4ac6-b9b2-7fa03401468e) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-create-amphora-indb' (38d938ff-133c-4637-9b8b-61d14771f766) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.tasks.database_tasks [-] Reverting create amphora in DB for amp id 5b377faa-01e0-4e02-b702-c6a12ab51a59
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-create-amphora-indb' (38d938ff-133c-4637-9b8b-61d14771f766) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'BACKUP-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (38a8143e-4816-4a16-ae05-5037003245ba) transitioned into state 'REVERTING' from state 'FAILURE' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'BACKUP-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (38a8143e-4816-4a16-ae05-5037003245ba) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'BACKUP-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (38a8143e-4816-4a16-ae05-5037003245ba) transitioned into state 'PENDING' from state 'REVERTED' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Flow 'octavia-create-loadbalancer-flow' (9c221fbb-d7c5-413b-9b17-3a6ca92e8d7a) transitioned into state 'REVERTED' from state 'RUNNING'
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: ERROR oslo_messaging.rpc.server [-] Exception during message handling: taskflow.exceptions.WrappedFailure: WrappedFailure: [Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id 5ba04bb8-a632-4c44-a03e-234ac87b5b16 to go active timeout., Failure: octavia.amphorae.driver_exceptions.exceptions.AmpConnectionRetry: Could not connect to amphora, exception caught: foo, Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c to go active timeout., Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c to go active timeout.]

Only the subflow for the MASTER amphora is reverted. The subflow for the BACKUP amphora is interrupted (the amphora is not deleted), and none of the tasks of the main flow are reverted (the LB is stuck in PENDING_CREATE).

After some investigation in taskflow, it appears that using a Retry in 2 unordered flows is not safe. When the Retry of the 1st flow fails, it marks all the tasks of all the flows as to be reverted, but the 2nd Retry, running in parallel, may override this state and reschedule the execution of a task.
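The override described above can be pictured with a small toy model. This is only an illustrative sketch, not taskflow's code: the `intentions` dictionary, the `fail_retry` and `retry_again` helpers, and the task names are all hypothetical, and the model does not try to explain how the engine then ends up skipping the revert of the main flow.

```python
# Toy model only: NOT taskflow's implementation. Every name here is
# hypothetical and exists purely to illustrate the lost-update pattern.
EXECUTE = "EXECUTE"
REVERT = "REVERT"


def fail_retry(intentions):
    """The 1st Retry gives up: mark every task of every flow for revert."""
    for name in intentions:
        intentions[name] = REVERT


def retry_again(intentions, subflow_tasks):
    """The 2nd Retry decides to retry and reschedules its own tasks."""
    for name in subflow_tasks:
        intentions[name] = EXECUTE


def simulate():
    intentions = {
        "main-flow-task": EXECUTE,
        "MASTER-connectivity-wait": EXECUTE,
        "BACKUP-compute-wait": EXECUTE,
    }
    # The MASTER subflow exhausts its retries first.
    fail_retry(intentions)
    # The BACKUP Retry, running in parallel, then overrides that decision.
    retry_again(intentions, ["BACKUP-compute-wait"])
    return intentions


if __name__ == "__main__":
    for task_name, intention in simulate().items():
        print(task_name, intention)
```

In this model the BACKUP task ends up scheduled for execution again even though a global revert was requested, which is the kind of inconsistent state the logs above suggest.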

2 mitigations are possible:
- stop using Retries in octavia: using long-duration tasks is fine with amphorav2 and jobboard (it was a problem in old releases, but the issue is fixed)
- fix taskflow
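
As a rough illustration of the first mitigation, a task can loop internally instead of relying on a taskflow Retry controller. The sketch below uses only the standard library; `wait_for_amphora`, `probe`, and `ConnectivityTimeout` are hypothetical names and do not come from Octavia.

```python
import time


class ConnectivityTimeout(Exception):
    """Hypothetical error raised when the amphora never answers."""


def wait_for_amphora(probe, max_attempts=5, interval=1.0, sleep=time.sleep):
    """Call probe() until it succeeds, retrying inside the task itself.

    probe is a hypothetical callable that raises ConnectionError while
    the amphora is unreachable and returns a value once it responds.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return probe()
        except ConnectionError as exc:
            if attempt == max_attempts:
                # Out of attempts: fail the task once, so the engine sees
                # a single failure instead of a Retry-driven re-execution.
                raise ConnectivityTimeout(
                    "amphora unreachable after %d attempts" % max_attempts
                ) from exc
            sleep(interval)
```

With this shape the engine sees a single long-running task that either succeeds or fails once, so no Retry decision can race with a revert triggered by a parallel subflow.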

Full logs are provided in the attachment.

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Reproducer + potential fix in taskflow: https://review.opendev.org/c/openstack/taskflow/+/900746

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Reverting https://opendev.org/openstack/octavia/commit/a9ee09a676074eeb619a4d7c3d9912114de50e88 would partially fix it; we would also need to apply the same treatment to the AmphoraComputeConnectivityWait task.

These are the only 2 tasks that use retries:
https://opendev.org/openstack/octavia/src/commit/d3f6c99f095592d185f635a6eefa453daf48b74f/octavia/controller/worker/v2/flows/amphora_flows.py#L60-L77

Note: amphorav1 is not impacted, as retries were not used there.

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

I think we need to remove the Retries from the Flows:
- it fixes this issue
- using active tasks (i.e. a loop) instead would not change the behavior of the amphora driver

Changed in octavia:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/octavia/+/914962

Changed in octavia:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/octavia/+/914963

Changed in octavia:
assignee: nobody → Gregory Thiemonge (gthiemonge)
Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

| I think we need to remove the Retries from the Flows:
| - it fixes this issue
| - using some active tasks (aka a loop) would not impact (no behavioral changes) the amphora driver

One thing to add on this topic:

Removing retries would make the graceful shutdown of octavia longer.

Once https://bugs.launchpad.net/octavia/+bug/2064101 is fixed:

- with retries: the flow is suspended almost immediately and the worker is stopped
- without retries (active loop): the flow is suspended only when the task with the active loop completes (this sometimes takes a few minutes in case of errors with the amphorae)

IMHO, we should keep the retries in the code.
