The revert of a LB creation flow may not be complete

Bug #2043360 reported by Gregory Thiemonge
This bug affects 1 person
Affects   Status        Importance   Assigned to         Milestone
octavia   In Progress   High         Gregory Thiemonge

Bug Description

Under some circumstances (amphorav2 - with or without jobboard, ACTIVE_STANDBY required, the bug is not 100% reproducible), the revert of a taskflow flow may be incomplete.

For instance, when adding a failure in a Task with a Retry (AmphoraComputeConnectivityWait)

diff --git a/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py b/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py
index 62d68051..84609805 100644
--- a/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py
+++ b/octavia/controller/worker/v2/tasks/amphora_driver_tasks.py
@@ -675,6 +675,7 @@ class AmphoraComputeConnectivityWait(BaseAmphoraTask):
     def execute(self, amphora, raise_retry_exception=False):
         """Execute get_info routine for an amphora until it responds."""
         try:
+            raise driver_except.AmpConnectionRetry(exception="foo")
             session = db_apis.get_session()
             with session.begin():
                 db_amphora = self.amphora_repo.get(

and creating a load balancer, the subflow is reverted after the retries, but:
* the 2nd subflow (there are 2 subflows running concurrently when creating an A/S LB) is not reverted
* the "main" flow is not reverted

Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'PENDING' from state 'REVERTED' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTING' from state 'FAILURE' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'PENDING' from state 'REVERTED' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTING' from state 'FAILURE' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (5b98ef7b-8005-4ab2-8a77-7d8bda2729be) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-amphora-info' (b86b6c05-39d0-4e01-b28d-104f26963f14) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-amphora-info' (b86b6c05-39d0-4e01-b28d-104f26963f14) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (c61e6fab-c79d-4745-8467-73a16a062a3a) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (c61e6fab-c79d-4745-8467-73a16a062a3a) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-mark-amphora-booting-indb' (56e61bf1-0c58-4685-87a2-52abb660b6b1) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.tasks.database_tasks [-] Reverting mark amphora booting in DB for amp id 5b377faa-01e0-4e02-b702-c6a12ab51a59 and compute id 5ba04bb8-a632-4c44-a03e-234ac87b5b16
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-mark-amphora-booting-indb' (56e61bf1-0c58-4685-87a2-52abb660b6b1) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-amphora-computeid' (95cd38b3-cea4-4ffe-a1db-677d471b1e8f) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-amphora-computeid' (95cd38b3-cea4-4ffe-a1db-677d471b1e8f) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-cert-compute-create' (776b4bfb-cb88-403c-bbbe-1404989799a8) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.tasks.compute_tasks [-] Reverting compute create for amphora with id 5b377faa-01e0-4e02-b702-c6a12ab51a59 and compute id: 5ba04bb8-a632-4c44-a03e-234ac87b5b16
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG novaclient.v2.client [-] REQ: curl -g -i --cacert "/opt/stack/data/ca-bundle.pem" -X DELETE http://192.168.1.101/compute/v2.1/servers/5ba04bb8-a632-4c44-a03e-234ac87b5b16 -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA256}df1b68dcf137685f869322dd195bd5e33031be72615bbf639ff8328b0debebc1" -H "X-OpenStack-Nova-API-Version: 2.15" {{(pid=3952342) _http_log_request /usr/local/lib/python3.9/site-packages/keystoneauth1/session.py:511}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG novaclient.v2.client [-] RESP: [204] Connection: close Content-Type: application/json Date: Fri, 10 Nov 2023 12:52:04 GMT OpenStack-API-Version: compute 2.15 Server: Apache/2.4.57 (CentOS Stream) OpenSSL/3.0.7 mod_wsgi/4.7.1 Python/3.9 Vary: OpenStack-API-Version,X-OpenStack-Nova-API-Version X-OpenStack-Nova-API-Version: 2.15 x-compute-request-id: req-44eb94ed-cbd3-461a-8cce-a7afb43fcbf3 x-openstack-request-id: req-44eb94ed-cbd3-461a-8cce-a7afb43fcbf3 {{(pid=3952342) _http_log_response /usr/local/lib/python3.9/site-packages/keystoneauth1/session.py:542}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG novaclient.v2.client [-] DELETE call to compute for http://192.168.1.101/compute/v2.1/servers/5ba04bb8-a632-4c44-a03e-234ac87b5b16 used request id req-44eb94ed-cbd3-461a-8cce-a7afb43fcbf3 {{(pid=3952342) request /usr/local/lib/python3.9/site-packages/keystoneauth1/session.py:946}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-cert-compute-create' (776b4bfb-cb88-403c-bbbe-1404989799a8) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-cert-expiration' (fb9de337-37f5-4789-a6a4-ac2898805542) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-update-cert-expiration' (fb9de337-37f5-4789-a6a4-ac2898805542) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-generate-serverpem' (d4adb1f9-e4ef-4ac6-b9b2-7fa03401468e) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-generate-serverpem' (d4adb1f9-e4ef-4ac6-b9b2-7fa03401468e) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-create-amphora-indb' (38d938ff-133c-4637-9b8b-61d14771f766) transitioned into state 'REVERTING' from state 'SUCCESS' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.tasks.database_tasks [-] Reverting create amphora in DB for amp id 5b377faa-01e0-4e02-b702-c6a12ab51a59
Nov 10 07:52:04 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-create-amphora-indb' (38d938ff-133c-4637-9b8b-61d14771f766) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'BACKUP-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (38a8143e-4816-4a16-ae05-5037003245ba) transitioned into state 'REVERTING' from state 'FAILURE' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'BACKUP-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (38a8143e-4816-4a16-ae05-5037003245ba) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: DEBUG octavia.controller.worker.v2.controller_worker [-] Task 'BACKUP-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (38a8143e-4816-4a16-ae05-5037003245ba) transitioned into state 'PENDING' from state 'REVERTED' {{(pid=3952342) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: WARNING octavia.controller.worker.v2.controller_worker [-] Flow 'octavia-create-loadbalancer-flow' (9c221fbb-d7c5-413b-9b17-3a6ca92e8d7a) transitioned into state 'REVERTED' from state 'RUNNING'
Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: ERROR oslo_messaging.rpc.server [-] Exception during message handling: taskflow.exceptions.WrappedFailure: WrappedFailure: [Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id 5ba04bb8-a632-4c44-a03e-234ac87b5b16 to go active timeout., Failure: octavia.amphorae.driver_exceptions.exceptions.AmpConnectionRetry: Could not connect to amphora, exception caught: foo, Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c to go active timeout., Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c to go active timeout.]

Only the subflow for the MASTER amp is reverted; the subflow for BACKUP is interrupted (its amphora is not deleted), and none of the tasks of the main flow are reverted (the LB is stuck in PENDING_CREATE).
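A quick way to spot incomplete reverts in worker logs like the ones above is to track the last logged state of each task. The following is a minimal sketch, assuming the log lines follow the "transitioned into state" format shown in this report; the helper names (final_states, not_reverted) are made up for illustration and are not part of octavia or taskflow.

```python
import re

# Matches the transition lines shown in the logs above, e.g.
# Task 'NAME' (UUID) transitioned into state 'REVERTED' from state 'REVERTING'
TRANSITION_RE = re.compile(
    r"(?P<kind>Task|Flow) '(?P<name>[^']+)' \((?P<uuid>[0-9a-f-]+)\) "
    r"transitioned into state '(?P<new>[A-Z_]+)' from state '(?P<old>[A-Z_]+)'"
)


def final_states(lines):
    """Return {(kind, name): last state seen} for every task/flow in the log."""
    states = {}
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            states[(match["kind"], match["name"])] = match["new"]
    return states


def not_reverted(lines):
    """List the tasks whose last logged state is not REVERTED."""
    return sorted(
        name
        for (kind, name), state in final_states(lines).items()
        if kind == "Task" and state != "REVERTED"
    )


# Three lines taken from the excerpt above (syslog prefix dropped).
sample = [
    "Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-create-amphora-indb' "
    "(38d938ff-133c-4637-9b8b-61d14771f766) transitioned into state 'REVERTED' "
    "from state 'REVERTING' with result 'None'",
    "Task 'BACKUP-octavia-create-amp-for-lb-subflow-octavia-compute-wait' "
    "(38a8143e-4816-4a16-ae05-5037003245ba) transitioned into state 'PENDING' "
    "from state 'REVERTED'",
    "Flow 'octavia-create-loadbalancer-flow' "
    "(9c221fbb-d7c5-413b-9b17-3a6ca92e8d7a) transitioned into state 'REVERTED' "
    "from state 'RUNNING'",
]
print(not_reverted(sample))
```

On a full worker log, any task reported by not_reverted() while the flow itself ended in REVERTED is a candidate for this bug. Tasks that never logged a transition are not listed, so the full logs are needed, not an excerpt.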

After some investigation in taskflow, it appears that using a Retry in 2 unordered flows is not safe: when the Retry of the 1st flow fails, it marks all the tasks of all the flows as to be reverted, but the 2nd Retry, running in parallel, may override this state and reschedule the execution of a task.
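The override described above can be pictured with a toy model. This is not taskflow code: the class, its methods and the shortened atom names are all hypothetical, and the real engine is more involved. It only illustrates how a later "retry" decision can undo part of an earlier "revert everything" decision when both act on shared state.

```python
# Toy model only; names and structure are assumptions, not taskflow internals.
EXECUTE, REVERT = "EXECUTE", "REVERT"


class ToyRuntime:
    def __init__(self, atoms):
        # Every atom starts with the intention to execute.
        self.intention = {atom: EXECUTE for atom in atoms}

    def revert_all(self):
        """A Retry giving up: mark every atom of every flow for revert."""
        for atom in self.intention:
            self.intention[atom] = REVERT

    def retry_subflow(self, atoms):
        """A Retry deciding to retry: reschedule the atoms of its own subflow."""
        for atom in atoms:
            self.intention[atom] = EXECUTE


master = ["MASTER-compute-wait", "MASTER-connectivity-wait"]
backup = ["BACKUP-compute-wait", "BACKUP-connectivity-wait"]
runtime = ToyRuntime(["lb-to-error-on-revert"] + master + backup)

runtime.revert_all()           # the MASTER Retry has exhausted its attempts
runtime.retry_subflow(backup)  # the BACKUP Retry, running in parallel, retries

# Atoms that are no longer marked for revert after the override.
stale = [atom for atom, what in runtime.intention.items() if what == EXECUTE]
print(stale)
```

In this model the BACKUP atoms end up scheduled for execution again even though the flow as a whole was supposed to revert, which matches the symptom in the logs (BACKUP compute-wait going back to PENDING).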

2 mitigations are possible:
- stop using Retries in octavia: using long-duration tasks is fine with amphorav2 and jobboard (it was a problem in old releases, but the issue is fixed)
- fix taskflow

Full logs are provided in the attachment.

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Reproducer + potential fix in taskflow: https://review.opendev.org/c/openstack/taskflow/+/900746

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Reverting https://opendev.org/openstack/octavia/commit/a9ee09a676074eeb619a4d7c3d9912114de50e88 would partially fix it; we would also need to apply the same treatment to the AmphoraComputeConnectivityWait task.

These are the only 2 tasks that use retries:
https://opendev.org/openstack/octavia/src/commit/d3f6c99f095592d185f635a6eefa453daf48b74f/octavia/controller/worker/v2/flows/amphora_flows.py#L60-L77

Note: amphorav1 is not impacted as retry was not used there

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

I think we need to remove the Retries from the Flows:
- it fixes this issue
- using some active tasks (aka a loop) would not impact (no behavioral changes) the amphora driver
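As an illustration of the "active task" idea, here is a minimal sketch of a loop that retries inside a single task execution instead of relying on a taskflow Retry. ConnectionRetry is a hypothetical stand-in for AmpConnectionRetry, and wait_for_connectivity, probe, max_retries and interval are names invented for this example; the actual octavia change may look different.

```python
import time


class ConnectionRetry(Exception):
    """Hypothetical stand-in for AmpConnectionRetry."""


def wait_for_connectivity(probe, max_retries, interval, sleep=time.sleep):
    """Call probe() until it succeeds, retrying inside one task execution.

    The last ConnectionRetry is re-raised once max_retries attempts have
    failed, so the engine sees a plain task failure and reverts normally.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return probe()
        except ConnectionRetry:
            if attempt == max_retries:
                raise
            # Wait before the next attempt; the task stays RUNNING meanwhile.
            sleep(interval)
```

With this shape there is no Retry controller in the flow, so a failure goes through the ordinary revert path. The cost is that the task blocks for up to roughly max_retries * interval seconds.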

Changed in octavia:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/octavia/+/914962

Changed in octavia:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/octavia/+/914963

Changed in octavia:
assignee: nobody → Gregory Thiemonge (gthiemonge)
Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

| I think we need to remove the Retries from the Flows:
| - it fixes this issue
| - using some active tasks (aka a loop) would not impact (no behavioral changes) the amphora driver

One thing to add on this topic:

Removing retries would make the graceful shutdown of octavia longer.

Once https://bugs.launchpad.net/octavia/+bug/2064101 is fixed:

with retries: the flow is suspended almost immediately and the worker is stopped
without retries (active loop): the flow is suspended only when the task with the active loop completes (this sometimes takes a few minutes when there are errors with the amphorae)

IMHO, we should keep the retries in the code

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @gthiemonge ,

We may also be hitting this bug (I am not 100% sure it is the same one, but the logs [1] look very similar).
In my case, two amphorae were created for the problematic LB; the backup amphora was reverted after the failure, but the master one was not.
In your case, only the subflow for the MASTER amp is reverted, the subflow for BACKUP is interrupted (amphora not deleted), and none of the tasks of the main flow are reverted (LB stuck in PENDING_CREATE).

Our env runs the Yoga release, which does not include the retry feature [2]. So theoretically, it should not have the problem described by this bug [3] (Fix REVERT_ALL with Retries in unordered Flows).

More importantly, the flow mentioned in the logs [1] seems to be octavia-create-amphora-flow, which has no unordered flows, so it should not have the unordered-flow problem either.

octavia-create-amphora-flow - linear_flow
  - task - CreateAmphoraInDB
  - ... other tasks ...
  - ComputeActiveWait - may throw ComputeWaitTimeoutException
  - octavia-create-amphora-retry-subflow - linear_flow
       - task AmphoraComputeConnectivityWait - catch TimeOutException then run raise
       - retry - AmpRetry - it returns retry.REVERT_ALL after connection_max_retries. it returns retry.RETRY before connection_max_retries and AmpConnectionRetry

It seems that only get_failover_amphora_flow includes unordered flows. It is used to perform amphora failover, but the logs [1] show that the failure did not happen at that stage; it happened while creating the amphora VM.

main -> hm_health_check -> health_check -> failover_amphora -> get_failover_amphora_flow
$ grep -r 'unordered_flow' ./octavia/controller/worker/v2/flows/amphora_flows.py |grep -v import
        update_amps_subflow = unordered_flow.Flow('VRRP-update-subflow')
        update_amps_subflow = unordered_flow.Flow(
        reload_listener_subflow = unordered_flow.Flow(

octavia-failover-amphora-flow - linear_flow
  - task AmphoraToErrorOnRevertTask - used to revert LB to provisioning_status ERROR if this flow goes wrong

Given the above, do you think the following log of yours (mine shows the same) is really related to the revert operation, or is it failing because of an issue with the amphorae themselves? Thanks.

Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: ERROR oslo_messaging.rpc.server [-] Exception during message handling: taskflow.exceptions.WrappedFailure: WrappedFailure: [Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id 5ba04bb8-a632-4c44-a03e-234ac87b5b16 to go active timeout., Failure: octavia.amphorae.driver_exceptions.exceptions.AmpConnectionRetry: Could not connect to amphora, exception caught: foo, Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c to go active timeout., Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c to go active timeout.]

[1] https://paste.ubuntu.com/p/C3WsRqgWJr/
[2] https://opendev.org/openstack/octavia/commit/a9ee09a676074eeb619a4d7c3d9912114de50e88
[3] https://review.opendev.o...


Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Hi @zhhuabj

Thanks for the detailed report.
The logs in [1] look truncated (maybe paste.ubuntu is limited to 100 lines?), do you have the logs of the end of the flow?

Still about [1]: I don't see the "octavia-create-amphora-flow" flow that you mention. The flow is "octavia-create-loadbalancer-flow", which should use Taskflow unordered flows.

You're right that the Retry on the ComputeWait doesn't exist in Yoga, but there is still the Retry of the AmphoraComputeConnectivityWait task, which can trigger the taskflow bug.

I think the full logs of the Flow would help to understand if it's the same bug.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @gthiemonge , sorry, it should be this one - https://paste.ubuntu.com/p/C3WsRqgWJr/

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Looking at the logs, I really think it's the taskflow bug:

1. the first task executed by the flow is "LoadBalancerIDToErrorOnRevertTask", which means that in case of REVERT, it will be the last task to be reverted. I don't see the revert of this task before the flow is marked as REVERTED

2. the last reverted task in the flow is "MASTER-octavia-create-amp-for-lb-subflow-octavia-create-amphora-indb" (the first task of the MASTER subflow), it means that the MASTER subflow has been correctly reverted but the revert stopped at this point (it should have reverted the tasks executed before the creation of the subflow), which is the behavior we observed with this taskflow bug.

3. the last task executed by the BACKUP subflow is BACKUP-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait, this subflow hasn't been reverted when the exception in MASTER-octavia-plug-net-subflow-octavia-amp-post-vip-plug has triggered the revert.
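The reasoning in point 1 relies on reverts running in the reverse order of execution within a linear flow. The toy runner below sketches that ordering; it is not taskflow code, and run_linear, ok and fail are names invented for this illustration. Only the task names come from the flow discussed here.

```python
def run_linear(tasks):
    """Run (name, fn) pairs in order; on failure return the revert order."""
    done = []
    for name, fn in tasks:
        try:
            fn()
        except Exception:
            # The failed task is reverted first, then the completed tasks,
            # most recent first.
            return [name] + list(reversed(done))
        done.append(name)
    return []  # nothing failed, nothing reverted


def ok():
    pass


def fail():
    raise RuntimeError("injected failure")


order = run_linear([
    ("LoadBalancerIDToErrorOnRevertTask", ok),
    ("AllocateVIP", ok),
    ("CreateAmphoraInDB", ok),
    ("AmphoraComputeConnectivityWait", fail),
])
print(order[-1])
```

In a complete revert, the first task executed is the last entry of the revert order. Not seeing LoadBalancerIDToErrorOnRevertTask reverted at all is therefore a sign that the revert stopped early.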

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @gthiemonge

Yes, it seems possible, because octavia-create-loadbalancer-flow contains an unordered_flow, as you mentioned, and octavia-create-amphora-flow does not. Below is my code analysis of octavia-create-loadbalancer-flow.

class ComputeActiveWait(BaseComputeTask):
    def execute(self, compute_id, amphora_id, availability_zone):
        ...
        for i in range(CONF.controller_worker.amp_active_retries):
            amp, fault = self.compute.get_amphora(compute_id, amp_network)
            if amp.status == constants.ACTIVE:
                if CONF.haproxy_amphora.build_rate_limit != -1:
                    self.rate_limit.remove_from_build_req_queue(amphora_id)
                return amp.to_dict()
            if amp.status == constants.ERROR:
                raise exceptions.ComputeBuildException(fault=fault)
            time.sleep(CONF.controller_worker.amp_active_wait_sec)
        raise exceptions.ComputeWaitTimeoutException(id=compute_id)

class AmpRetry(retry.Times):
    def on_failure(self, history, *args, **kwargs):
        last_errors = history[-1][1]
        max_retry_attempt = CONF.haproxy_amphora.connection_max_retries
        for task_name, ex_info in last_errors.items():
            if len(history) <= max_retry_attempt:
                if ex_info is None or ex_info._exc_info is None:
                    return retry.RETRY
                excp = ex_info._exc_info[1]
                if isinstance(excp, driver_except.AmpConnectionRetry):
                    return retry.RETRY
        return retry.REVERT_ALL
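To make the decision made by AmpRetry.on_failure easier to follow, here is a simplified, standalone sketch of the same logic. It is an approximation for illustration: the history is reduced to a list of {task_name: exception} dicts, the RETRY and REVERT_ALL constants are plain strings, and AmpConnectionRetry is a local stand-in class, none of which match the real taskflow or octavia objects.

```python
RETRY, REVERT_ALL = "RETRY", "REVERT_ALL"


class AmpConnectionRetry(Exception):
    """Local stand-in for octavia's exception of the same name."""


def decide(history, max_retry_attempt):
    """history holds one {task_name: exception or None} dict per failed attempt."""
    last_errors = history[-1]
    for exc in last_errors.values():
        # Retry only while attempts remain and the failure is retryable.
        if len(history) <= max_retry_attempt:
            if exc is None or isinstance(exc, AmpConnectionRetry):
                return RETRY
    return REVERT_ALL


one_failure = [{"connectivity-wait": AmpConnectionRetry("foo")}]
print(decide(one_failure, max_retry_attempt=2))      # attempts remain
print(decide(one_failure * 3, max_retry_attempt=2))  # attempts exhausted
```

The REVERT_ALL outcome is the one that matters for this bug: it asks the engine to revert the whole flow, not just the subflow that owns the Retry.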

octavia-create-amp-for-failover-subflow
octavia-create-loadbalancer-flow - linear_flow
  - LoadBalancerIDToErrorOnRevertTask
  - ReloadLoadBalancer -> AllocateVIP -> UpdateVIPAfterAllocation -> UpdateVIPSecurityGroup -> GetSubnetFromVIP #vip tasks
  - octavia-create-loadbalancer-flow - unordered_flow
      - MASTER-octavia-plug-net-subflow
          - MASTER-octavia-create-amp-for-lb-subflow
              - CreateAmphoraInDB -> GenerateServerPEMTask -> ...
              - ComputeActiveWait #may throw ComputeWaitTimeoutException, and the related options is amp_active_retries and amp_active_wait_sec
              - UpdateAmphoraInfo
              - MASTER-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait
                  - AmphoraComputeConnectivityWait #it may run raise when catching TimeOutExc...


Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @gthiemonge

LoadBalancerIDToErrorOnRevertTask is a task that sets the LB to ERROR on revert. The final status of the LB is PENDING_CREATE, which suggests that
the revert of LoadBalancerIDToErrorOnRevertTask was not called. In other words, I am probably hitting this bug as well: the revert of a LB creation flow may not be complete, leading to the retry mechanism not functioning properly. I was considering the following options:

1. I am not sure whether increasing amp_active_wait_sec, to avoid relying on the retry mechanism, could serve as a workaround.

2. I am not sure whether using a single amphora instead of active/standby could serve as a workaround.

3. Do you think this fix alone (https://review.opendev.org/c/openstack/taskflow/+/900746) might be sufficient to address this bug? I ask because above you first proposed removing the retries and then said they should be kept.
