Amphorae stuck in ERROR

Bug #2028041 reported by Justin Lamp
Affects: octavia
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

When the connection to the amphora is lost for some time, the amphorae are marked with provisioning_status ERROR. Even after the connection is back up, they stay stuck in that state.

That's surprising, as the health checks are coming in and the external traffic was never disrupted. If I issue `amphora configure --wait`, I can even see the successful completion of the request in the worker log. Unfortunately, the CLI still returns an error (`The resource did not successfully reach ACTIVE status`) and the state of the amphora does not change. I also tried getting the amphora stats, and those update just fine.

The only way to get the LB back is to initiate a failover, which is not desirable in some cases and has to be done as admin.
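
For reference, a rough sketch of the checks described above; the UUIDs are placeholders and the column names assume a reasonably recent python-octaviaclient:

# Both statuses can be inspected directly (UUIDs are placeholders):
openstack loadbalancer show <loadbalancer-uuid> -c provisioning_status -c operating_status
openstack loadbalancer amphora show <amphora-uuid> -c status

# Re-pushing the agent configuration completes in the worker log,
# but the CLI errors out and the status never returns to ACTIVE:
openstack loadbalancer amphora configure --wait <amphora-uuid>

# Stats still update fine:
openstack loadbalancer amphora stats show <amphora-uuid>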

Revision history for this message
Justin Lamp (modzilla) wrote :

> Hi,
>
> FYI the octavia project no longer uses storyboard and moved back to launchpad: https://bugs.launchpad.net/octavia/
>
> 1. If an amphora is in provisioning_status ERROR, it means that an attempt to update the load balancer failed. So even if the connection is back, the amphora is probably incorrectly configured. Only a failover would fix it.
>
> 2. "amphora configure" is used to propagate changes from the controller's octavia.conf file to the configuration file in the amphora. It doesn't update the load balancer's resources. I think there's a bug with the --wait flag: because the configure API doesn't update the provisioning_status of the amphora/LB, it cannot wait for a specific status (a --wait option in this CLI doesn't make sense).
>
> I don't see any other alternative than a failover here.
> <footer>Gregory Thiemonge on 2023-07-17 at 14:51:41</footer>
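
To make that distinction concrete, a minimal sketch of the two operations (the amphora/LB UUIDs are placeholders):

# Pushes the controller's octavia.conf settings to the amphora agent;
# it does not touch the LB resources or the provisioning status.
openstack loadbalancer amphora configure <amphora-uuid>

# Rebuilds the amphora VM(s); this is what actually clears an ERROR state.
openstack loadbalancer amphora failover <amphora-uuid>
# or, at the load balancer level:
openstack loadbalancer failover <loadbalancer-uuid>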

Revision history for this message
Justin Lamp (modzilla) wrote :

Hi Gregory,

thanks for your answer!

1. Even if an attempt did not work, why doesn't it retry after a successful stats update? I know that the amphora is in fact correctly configured: I haven't made any changes, and I checked the configuration as well.

2. Okay, that totally makes sense. But wouldn't it also make sense to update the status back to ACTIVE if that configure call works?

Revision history for this message
Noel Ashford (nashford77) wrote :

I have the same error. I had a perfectly working bunch of amphorae (Magnum LBs and an ingress amphora) that worked great until I rebooted; now they're toast. It seems to be buggy: I see nothing useful in the logs except that Octavia is waiting on the DB to come up during the boot process, doesn't see it, then eventually starts OK, but the amphorae are all in ERROR.

root@slurm-primary-controller:~/slurm# openstack loadbalancer list
+--------------------------------------+------------------------------------------------------------------------------------------+----------------------------------+-------------+---------------------+------------------+----------+
| id | name | project_id | vip_address | provisioning_status | operating_status | provider |
+--------------------------------------+------------------------------------------------------------------------------------------+----------------------------------+-------------+---------------------+------------------+----------+
| 2fe463fd-38fb-4907-9eb6-7f17bcd7e1a1 | k8s-5net-lhvu6wmfb3ey-api_lb-z2t53ouwxs4k-loadbalancer-rs6m5vqvrfy3 | e5b9296fbd9e4d9ea5e925780c64690f | 10.5.1.193 | ERROR | ONLINE | amphora |
| 402c1e76-150c-4a24-8cf7-c30c1df1c7c8 | k8s-5net-lhvu6wmfb3ey-etcd_lb-uoeerdx6kmju-loadbalancer-ky3rjmpz53ea | e5b9296fbd9e4d9ea5e925780c64690f | 10.5.1.123 | ERROR | ONLINE | amphora |
| 15dcb97a-f24b-4830-be45-a36f55189dd4 | kube_service_98d5f131-377f-401c-bbca-dcc9a4fe42a6_ingress-nginx_ingress-nginx-controller | e5b9296fbd9e4d9ea5e925780c64690f | 10.5.1.169 | ERROR | ERROR | amphora |
+--------------------------------------+------------------------------------------------------------------------------------------+----------------------------------+-------------+---------------------+------------------+----------+

What gives here? I tried rebooting the compute instances; no luck. They do reboot fine, but the status remains. It seems to be a bug where the LB is marked dead. I am scared to ever reboot this box again.

root@slurm-primary-controller:~/slurm# openstack loadbalancer set 2fe463fd-38fb-4907-9eb6-7f17bcd7e1a1 --enable
Invalid state ERROR of loadbalancer resource 2fe463fd-38fb-4907-9eb6-7f17bcd7e1a1 (HTTP 409) (Request-ID: req-35dda6b0-210c-4117-b1c0-2189b506fd43)
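
As a side note, one way to see which amphorae back an ERROR'd load balancer and what state they are in (the UUID is taken from the listing above; the --loadbalancer filter assumes a reasonably recent python-octaviaclient):

openstack loadbalancer amphora list --loadbalancer 2fe463fd-38fb-4907-9eb6-7f17bcd7e1a1
openstack loadbalancer amphora show <amphora-uuid>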

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Hi Noel,

First, you may take a look at the Octavia worker and health-manager logs (and sometimes the housekeeping logs); you should see a backtrace or an error for any load balancer in provisioning_status ERROR.

Those logs would help us understand whether there's a bug in Octavia or anything that can be improved on our side for such cases.

Rebooting a compute node usually impacts the amphora VMs: if the VMs are rebooted, the load balancer restarts correctly unless one of its resources uses TLS (TLS certificates are stored in a tmpfs in the amphora and are not preserved across reboots). If something goes wrong, the Octavia health-manager should detect it and respawn new amphora VMs (via a failover); something should appear in the health-manager logs.
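
For example, the logs can be scanned like this; log locations and service names vary by deployment, so the paths and unit names below are only assumptions:

# Package-based installs often log under /var/log/octavia/ (path is an assumption):
grep -iE 'error|traceback' /var/log/octavia/*.log

# On systemd-based installs the unit names are typically similar to these (also an assumption):
journalctl -u octavia-worker -u octavia-health-manager --since "1 hour ago" | grep -iE 'error|traceback'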

> Invalid state ERROR of loadbalancer resource 2fe463fd-38fb-4907-9eb6-7f17bcd7e1a1 (HTTP 409) (Request-ID: req-35dda6b0-210c-4117-b1c0-2189b506fd43)

When an LB is in ERROR, only two actions are possible:
- deleting the LB
- failing over the LB: this recreates the amphora VMs and deletes the old ones, and is usually the right way to fix LBs in ERROR

So you can try:
- openstack loadbalancer failover 2fe463fd-38fb-4907-9eb6-7f17bcd7e1a1

and check the logs.
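
A quick way to confirm the failover took effect once it completes (the UUID is from the listing above; column selection with -c is standard for the openstack CLI):

openstack loadbalancer show 2fe463fd-38fb-4907-9eb6-7f17bcd7e1a1 -c provisioning_status -c operating_status
# provisioning_status should end up back at ACTIVE; if it lands in ERROR again,
# the worker/health-manager logs should show the reason.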
