octavia

Bug #2043360
Comment #9

Hua Zhang (zhhuabj) wrote on 2024-08-21 (last edit on 2024-08-21):

Hi @gthiemonge,

We may have hit this bug too (it may be the same one, but I'm not 100% sure; the logs [1] look very similar).
In my case, it seems two amphorae were created for the problematic LB. The BACKUP amphora was reverted after the failure, but the MASTER one was not.
In your case, only the subflow for the MASTER amp was reverted, the subflow for BACKUP was interrupted (amphora not deleted), and none of the tasks of the main flow were reverted (LB stuck in PENDING_CREATE).

Our env is on the Yoga release, which does not include the retry feature [2]. So theoretically it should not have the problem addressed by this fix [3] (Fix REVERT_ALL with Retries in unordered Flows).

More importantly, the flow mentioned in the logs [1] seems to be octavia-create-amphora-flow, which has no unordered flows, so it should not have the unordered-flow problem either.

octavia-create-amphora-flow - linear_flow
  - task - CreateAmphoraInDB
  - ... other tasks ...
  - ComputeActiveWait - may throw ComputeWaitTimeoutException
  - octavia-create-amphora-retry-subflow - linear_flow
       - task AmphoraComputeConnectivityWait - catches TimeOutException, then raises
       - retry - AmpRetry - returns retry.RETRY on AmpConnectionRetry while connection_max_retries has not been reached; returns retry.REVERT_ALL after connection_max_retries
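
To make the retry decision in the tree above concrete, here is a minimal standalone sketch of that logic as described. It is a simplified model, not the Octavia or taskflow code: Decision, on_failure and run_with_retry are hypothetical names, AmpConnectionRetry is a local stand-in for the Octavia exception, and the exact boundary (retry while the attempt count is below connection_max_retries) is an assumption.

```python
from enum import Enum


class Decision(Enum):
    # Stand-ins for taskflow's retry.RETRY / retry.REVERT_ALL decisions.
    RETRY = "RETRY"
    REVERT_ALL = "REVERT_ALL"


class AmpConnectionRetry(Exception):
    """Local stand-in for the Octavia driver exception of the same name."""


def on_failure(attempts_so_far, last_error, connection_max_retries):
    """Hypothetical model of the decision described above.

    Retry only while attempts remain AND the failure is a connection
    retry; any other failure, or running out of attempts, reverts.
    """
    if (attempts_so_far < connection_max_retries
            and isinstance(last_error, AmpConnectionRetry)):
        return Decision.RETRY
    return Decision.REVERT_ALL


def run_with_retry(task, connection_max_retries):
    """Call task() until it succeeds or the decision is REVERT_ALL."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task()
        except Exception as exc:
            decision = on_failure(attempts, exc, connection_max_retries)
            if decision is Decision.REVERT_ALL:
                # In taskflow this is where the revert would start.
                raise
```

In this model, a failure that is not a connection retry (for example a compute wait timeout raised by an earlier task) is never retried and leads straight to a revert.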

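It seems only get_failover_amphora_flow includes unordered flows. It is used to perform failover for an amphora, but the logs [1] show the failure happened while creating the amphora VM, not at that stage.

main -> hm_health_check -> health_check -> failover_amphora -> get_failover_amphora_flow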
$ grep -r 'unordered_flow' ./octavia/controller/worker/v2/flows/amphora_flows.py |grep -v import
        update_amps_subflow = unordered_flow.Flow('VRRP-update-subflow')
        update_amps_subflow = unordered_flow.Flow(
        reload_listener_subflow = unordered_flow.Flow(

octavia-failover-amphora-flow - linear_flow
  - task AmphoraToErrorOnRevertTask - used to revert LB to provisioning_status ERROR if this flow goes wrong

Given the above, do you think the following log of yours (my log looks the same) is really related to the revert operation? Or is it failing because of an issue with the amphorae themselves? Thanks.

Nov 10 07:52:05 gthiemon-devstack octavia-worker[3952342]: ERROR oslo_messaging.rpc.server [-] Exception during message handling: taskflow.exceptions.WrappedFailure: WrappedFailure: [Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id 5ba04bb8-a632-4c44-a03e-234ac87b5b16 to go active timeout., Failure: octavia.amphorae.driver_exceptions.exceptions.AmpConnectionRetry: Could not connect to amphora, exception caught: foo, Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c to go active timeout., Failure: octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c to go active timeout.]
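
The WrappedFailure line above is easier to read once split into its individual failures. A small sketch, assuming the bracketed "Failure: <type>: <message>" layout seen in this log; parse_wrapped_failure and LOG_TAIL are hypothetical names, and this is plain string handling, not a taskflow API.

```python
import re
from collections import Counter

# Tail of the log line quoted above, copied verbatim.
LOG_TAIL = (
    "WrappedFailure: [Failure: octavia.common.exceptions."
    "ComputeWaitTimeoutException: Waiting for compute id "
    "5ba04bb8-a632-4c44-a03e-234ac87b5b16 to go active timeout., "
    "Failure: octavia.amphorae.driver_exceptions.exceptions."
    "AmpConnectionRetry: Could not connect to amphora, "
    "exception caught: foo, "
    "Failure: octavia.common.exceptions.ComputeWaitTimeoutException: "
    "Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c "
    "to go active timeout., "
    "Failure: octavia.common.exceptions.ComputeWaitTimeoutException: "
    "Waiting for compute id b64f2a56-020f-4c73-99ca-5e427ea78f1c "
    "to go active timeout.]"
)

COMPUTE_ID_RE = re.compile(r"compute id ([0-9a-f-]{36})")


def parse_wrapped_failure(line):
    """Return a list of (exception_type, message) pairs."""
    marker = "[Failure: "
    start = line.index(marker) + len(marker)
    inner = line[start:line.rindex("]")]
    failures = []
    # Messages may contain commas, so split on the full separator.
    for chunk in inner.split(", Failure: "):
        exc_type, _, message = chunk.partition(": ")
        failures.append((exc_type, message))
    return failures


failures = parse_wrapped_failure(LOG_TAIL)
# Count failures by short exception name.
by_type = Counter(exc.rsplit(".", 1)[-1] for exc, _ in failures)
# Count how often each compute id appears in the messages.
by_compute = Counter(
    m.group(1) for _, msg in failures
    for m in COMPUTE_ID_RE.finditer(msg)
)
print(by_type)
print(by_compute)
```

For this line it gives three ComputeWaitTimeoutException entries (one compute id once, the other twice) and a single AmpConnectionRetry.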

[1] https://paste.ubuntu.com/p/C3WsRqgWJr/
[2] https://opendev.org/openstack/octavia/commit/a9ee09a676074eeb619a4d7c3d9912114de50e88
[3] https://review.opendev.org/c/openstack/taskflow/+/900746
