amphora active wait retries not working
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
octavia | New | Undecided | Unassigned |
Bug Description
Greetings,
I was debugging an issue in our test environment where we sometimes see random tempest errors against our octavia setup.
What I found is that the retry logic that waits for an amphora to become active does not seem to work.
We have amp_active_wait_sec set to 15 and I can see the following in our logs.
Octavia creates a new amphora at 13:29:32. This new vm gets scheduled by nova and by chance lands on the same hv as the other vm of the same loadbalancer, which results in an anti-affinity violation. So the vm gets pushed back to the scheduler and is scheduled to a different hv. This delay causes octavia to fail with:
raise exceptions.
In the amphora logs I can see that the vm would have started correctly if octavia had retried, but it didn't, which confuses me as amp_active_retries is at its default of 30. In the end the flow just transitioned into FAILURE and did not retry anything. The lb is also stuck in PENDING_CREATE and the amphora vm is stuck in BOOTING state.
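To make the expectation concrete, here is a minimal Python sketch of the retry behaviour I would expect from these two settings. It is not Octavia's actual code: `wait_for_active`, `get_status`, `WaitTimeout` and the `sleep` hook are hypothetical names, and the "ACTIVE" and "ERROR" status strings are assumptions. Only the option names and the BOOTING state come from this report.

```python
import time
from typing import Callable


class WaitTimeout(Exception):
    """Raised when the instance never reports ACTIVE within the retry budget."""


def wait_for_active(
    get_status: Callable[[], str],
    retries: int = 30,      # mirrors amp_active_retries (default 30 per this report)
    wait_sec: float = 15,   # mirrors amp_active_wait_sec as set in our environment
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_status() until it returns "ACTIVE".

    Returns the number of polls that were needed. Raises RuntimeError if the
    instance reports "ERROR" and WaitTimeout once all retries are used up.
    """
    for attempt in range(1, retries + 1):
        status = get_status()
        if status == "ACTIVE":
            return attempt
        if status == "ERROR":
            raise RuntimeError("instance went into ERROR state")
        # Not ready yet (e.g. still BOOTING after a reschedule): wait, then poll again.
        sleep(wait_sec)
    raise WaitTimeout(f"not ACTIVE after {retries} polls")


# Simulated instance that is rescheduled once and becomes ACTIVE on the third poll.
# The sleep hook is replaced with a no-op so the example runs instantly.
statuses = iter(["BOOTING", "BOOTING", "ACTIVE"])
polls = wait_for_active(lambda: next(statuses), sleep=lambda _s: None)
print(polls)  # 3
```

With a loop like this, a vm that needs one extra scheduling round would simply be picked up on a later poll instead of failing the flow after the first wait.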
Full log timeline:
May 2, 2024 @ 13:29:32.000 Created Amphora 650e5073-
May 2, 2024 @ 13:29:33.000 Server created with id: b24fa313-
May 2, 2024 @ 13:29:36.000 Failed to build and run instance (anti affinity violation)
May 2, 2024 @ 13:29:45.000 Instance spawned successfully.
May 2, 2024 @ 13:29:45.000 Took 5.98 seconds to build instance.
May 2, 2024 @ 13:29:48.000 Task 'MASTER-
Amphora log, which shows that it would have come up correctly some time later:
May 2, 2024 @ 13:30:09.000 [610] Listening at: http://[::]:9443 (610)
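Putting numbers on the timeline above, here is a small Python calculation that uses only the timestamps from the logs and our settings. Treating retries times wait as the total wait budget is a simplifying assumption on my side; it ignores the time each status check itself takes.

```python
from datetime import datetime

FMT = "%H:%M:%S"
created = datetime.strptime("13:29:32", FMT)     # "Created Amphora" log line
task_entry = datetime.strptime("13:29:48", FMT)  # "Task 'MASTER-" log line
listening = datetime.strptime("13:30:09", FMT)   # amphora agent listening on 9443

amp_active_wait_sec = 15  # our setting
amp_active_retries = 30   # default

until_task_entry = (task_entry - created).total_seconds()
until_listening = (listening - created).total_seconds()
retry_budget = amp_active_wait_sec * amp_active_retries  # assumed total wait budget

print(until_task_entry)  # 16.0
print(until_listening)   # 37.0
print(retry_budget)      # 450
```

So the task log entry appears roughly one wait interval (16s) after the amphora was created, while the amphora was listening after 37s, far inside the 450s that 30 retries of 15s should allow.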
So in the end octavia seems not to retry the activity check after the 15s. Any idea why this would be the case is appreciated.
What I am not sure about is whether this is somehow also related to
https:/
https:/
Best Regards
Max
Hey, yes, that really looks like https://bugs.launchpad.net/octavia/+bug/2043360
In case of failure when creating a load balancer, the final task before the failure of the flow should set the load balancer provisioning_status to ERROR. create-amp-for-lb-subflow-octavia-compute-wait, it's exactly the taskflow bug (https://bugs.launchpad.net/taskflow/+bug/2043808)
In your case I believe that the final task is *-octavia-
I proposed a fix/workaround this morning: https://review.opendev.org/c/openstack/taskflow/+/900746
What release are you using?