amphora active wait retries not working

Bug #2064600 reported by Maximilian Stinsky
This bug affects 1 person
Affects: octavia
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Greetings,

I was debugging an issue in our test environment where we sometimes see random tempest errors against our octavia setup.

What I found is that the retry logic that waits for an active amphora does not appear to work.
We have amp_active_wait_sec set to 15 and I can see the following in our logs.
Octavia creates a new amphora at 13:29:32. The new VM is scheduled by nova and by chance lands on the same hypervisor as the other VM of the same loadbalancer, which results in an anti-affinity violation. The VM is therefore pushed back to the scheduler and rescheduled to a different hypervisor. This delay causes octavia to fail with:
raise exceptions.ComputeWaitTimeoutException(id=compute_id), octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id b24fa313-2e73-4120-92fd-e159538a7352 to go active timeout

In the amphora logs I can see that the VM would have started correctly if octavia had retried, but it didn't, which confuses me since amp_active_retries defaults to 30. In the end the flow just transitioned into FAILURE and did not retry anything. The lb is also stuck in PENDING_CREATE and the amphora VM is stuck in BOOTING state.
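
For context, the two settings in play live in the [controller_worker] section of octavia.conf; roughly (illustrative excerpt, only the options relevant here):

[controller_worker]
# seconds to wait between checks of the amphora VM status
amp_active_wait_sec = 15
# number of checks before giving up (left at the default)
amp_active_retries = 30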

Full log timeline:
May 2, 2024 @ 13:29:32.000 Created Amphora 650e5073-52a1-4de4-9fac-5f777689e2ef in DB for load balancer 7144e491-0a54-43d7-98cf-472f7d65cd60
May 2, 2024 @ 13:29:33.000 Server created with id: b24fa313-2e73-4120-92fd-e159538a7352 for amphora id: 650e5073-52a1-4de4-9fac-5f777689e2ef
May 2, 2024 @ 13:29:36.000 Failed to build and run instance (anti affinity violation)
May 2, 2024 @ 13:29:45.000 Instance spawned successfully.
May 2, 2024 @ 13:29:45.000 Took 5.98 seconds to build instance.
May 2, 2024 @ 13:29:48.000 Task 'MASTER-octavia-create-amp-for-lb-subflow-octavia-compute-wait' (1f242021-36ac-4c56-9fb6-faa60fefbea7) transitioned into state 'FAILURE' from state 'RUNNING'

Amphora log entry showing that the VM would have come up correctly some time later:
May 2, 2024 @ 13:30:09.000 [610] Listening at: http://[::]:9443 (610)

So in the end octavia does not seem to retry the active check after the 15 seconds. Any idea why this might be the case is appreciated.

What I am not sure about is whether this is somehow also related to
https://bugs.launchpad.net/octavia/+bug/2043360
https://bugs.launchpad.net/octavia/+bug/2060099

Best Regards
Max

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Hey, yes, that really looks like https://bugs.launchpad.net/octavia/+bug/2043360

In case of a failure while creating a load balancer, the final task before the flow fails should set the load balancer provisioning_status to ERROR.
In your case I believe that the final task is *-octavia-create-amp-for-lb-subflow-octavia-compute-wait; this is exactly the taskflow bug (https://bugs.launchpad.net/taskflow/+bug/2043808).
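
Roughly, the pattern looks like this (a simplified sketch, not the actual Octavia tasks; mark_lb_provisioning_status() is a placeholder for the real repository update). When the revert never runs because of that taskflow bug, the LB stays in PENDING_CREATE:

from taskflow import task


def mark_lb_provisioning_status(loadbalancer_id, status):
    # Placeholder for the real repository/DB update.
    print(f"load balancer {loadbalancer_id} -> {status}")


class MarkLBErrorOnRevert(task.Task):
    def execute(self, loadbalancer_id):
        # Nothing to do while the flow is healthy.
        pass

    def revert(self, loadbalancer_id, *args, **kwargs):
        # On flow failure the revert path should flip the LB out of
        # PENDING_CREATE; if it never runs, the LB stays stuck there.
        mark_lb_provisioning_status(loadbalancer_id, 'ERROR')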

I proposed a fix/workaround this morning: https://review.opendev.org/c/openstack/taskflow/+/900746

What release are you using?

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

Hi!

Thanks for the quick response.

We are running yoga (10.1.0).

What I don't fully get in my case is that octavia does not even seem to retry the active check.
The task transitions into 'FAILURE' after 15 seconds without any retry after that. In the case I described, one of the lb's amphorae hits an anti-affinity violation by chance, so it gets rescheduled, which pushes the time for the VM to spawn above the 15 seconds. My understanding is that octavia should retry 30 times (the default) before it transitions into 'FAILURE'?
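
To spell out my expectation (a simplified sketch of the behaviour I assumed, not the real implementation; get_compute_status() is a placeholder for the nova lookup):

import time

AMP_ACTIVE_WAIT_SEC = 15   # our amp_active_wait_sec
AMP_ACTIVE_RETRIES = 30    # amp_active_retries default


def get_compute_status(compute_id):
    # Placeholder for the real nova status query.
    return 'BUILD'


def wait_for_active(compute_id):
    # Check every amp_active_wait_sec seconds, up to amp_active_retries
    # attempts, before giving up.
    for _attempt in range(AMP_ACTIVE_RETRIES):
        time.sleep(AMP_ACTIVE_WAIT_SEC)
        status = get_compute_status(compute_id)
        if status == 'ACTIVE':
            return
        if status == 'ERROR':
            raise RuntimeError(f"compute {compute_id} went to ERROR")
    raise RuntimeError(
        f"Waiting for compute id {compute_id} to go active timeout")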

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Ok maybe we need to get the logs of the flow (at least the DEBUG-level messages with the transitions of the tasks/flows - like "transitioned into state 'XXX' from state 'XXX'") and the exceptions that are thrown by the task.

Indeed the retry should work for the MASTER compute wait task, but in the case of Active-Standby, an issue in the BACKUP flow may break the retry.
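
To illustrate, the retry pattern is roughly as follows (a rough sketch, not the exact Octavia code; ComputeWaitTimeoutException here is a local stand-in for octavia.common.exceptions.ComputeWaitTimeoutException). A retry controller on the subflow decides whether to retry or revert, and a failure that does not match the wait timeout (for example one coming from the BACKUP subflow) can turn into a full revert instead of a retry:

from taskflow import retry


class ComputeWaitTimeoutException(Exception):
    # Local stand-in for octavia.common.exceptions.ComputeWaitTimeoutException.
    pass


class ComputeWaitRetry(retry.Times):
    """Retry controller attached to the compute-wait subflow (sketch)."""

    def __init__(self, attempts, **kwargs):
        super().__init__(attempts=attempts, **kwargs)
        self._max_attempts = attempts

    def on_failure(self, history, *args, **kwargs):
        # history[-1] is (result, {task_name: Failure}) for the last attempt.
        _result, failures = history[-1]
        for _task_name, failure in failures.items():
            # Only the wait timeout is worth retrying, and only while the
            # retry budget (amp_active_retries) is not exhausted.
            if (len(history) <= self._max_attempts
                    and failure.check(ComputeWaitTimeoutException)):
                return retry.RETRY
        # Anything else (e.g. an exception raised by a task in the BACKUP
        # subflow) reverts the whole flow, so the MASTER wait never retries.
        return retry.REVERT_ALL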
