Load balancers in PENDING_UPDATE state become immutable

Bug #2006965 reported by Connor Chamberlain
This bug affects 3 people
Affects: octavia
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We have observed that when an LB loses all of its members, it drops into a PENDING_UPDATE state and becomes immutable. The issue is that the LB remains immutable until it returns to an ACTIVE state, which usually happens when one or more members come back online. However, there are cases where LB members are completely removed (rather than simply being offline), and Octavia incorrectly labels them as OFFLINE. This state persists indefinitely. To recover from it, operators have to perform database surgery to force Octavia to delete the LB in question.

A more desirable behavior would be to keep the LB mutable, but perhaps warn the user about modifying an LB that's in PENDING_UPDATE. This would allow operators and users to rapidly correct any LB that gets into this bad state without accidentally modifying an LB that should remain untouched.
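
For illustration, this is what the immutability looks like from a client's perspective: a minimal openstacksdk sketch, where the cloud name and load balancer ID are placeholders.

import openstack
from openstack.exceptions import ConflictException

conn = openstack.connect(cloud='mycloud')  # placeholder cloud name

# Any mutating call against a PENDING_* load balancer is rejected by the API.
try:
    conn.load_balancer.update_load_balancer('LB_ID', name='new-name')
except ConflictException:
    # Octavia answers with a 409 while the object is in an immutable state.
    print('load balancer is immutable')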

Revision history for this message
Michael Johnson (johnsom) wrote :

Members going offline should not cause a transition to PENDING_UPDATE. That is not normal.
Member status should only affect the operating status field and never the provisioning status.

Also note, you should never edit the statuses in the database. PENDING_* means one of your controller processes has ownership of the load balancer object and is actively working on it. Some actions have very long retry timeouts (retrying nova failures, for example). After those timeouts expire, the object should go back to ACTIVE or ERROR, depending on whether we could resolve the failure.
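
To make the distinction concrete, here is a minimal openstacksdk sketch (cloud name and load balancer ID are placeholders) showing the two separate status fields:

import openstack

conn = openstack.connect(cloud='mycloud')  # placeholder cloud name
lb = conn.load_balancer.get_load_balancer('LB_ID')

# provisioning_status tracks controller ownership of the object
# (ACTIVE, PENDING_CREATE/UPDATE/DELETE, ERROR); operating_status
# reflects the observed health (ONLINE, OFFLINE, DEGRADED, ...).
print(lb.provisioning_status, lb.operating_status)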

Can you provide the configuration settings of this load balancer?
Listeners, protocols, health monitor settings, etc?

Also, can you provide the worker and health manager logs?

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

No update since February; I'm marking it as Incomplete.

Changed in octavia:
status: New → Incomplete
Revision history for this message
Nikita Koltsov (nkoltsov) wrote :

We were able to get the same behavior, and we have a way to reproduce it. Here are the scenarios provided by the customer:

Scenario 1:
1 - Create a load balancer
2 - Create a listener
3 - Create a Pool
4 - Create two members in a pool with the same IP & PORT, using the Terraform script found at https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/lb_members_v2
Conclusion: the customer gets a 500 Internal Server Error upon execution of step 4.

Scenario 2:
1 - Create a load balancer
2 - Create a listener
3 - Create a Pool
4 - Create one member in the pool with an IP address, using https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/lb_members_v2
5 - Update the existing load balancer by adding a second member with the same IP address as the one added in step 4.
Conclusion: all resources, including the load balancer, go to PENDING_UPDATE and cannot be deleted.
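
For reference, the batch call behind the linked lb_members_v2 resource boils down to a single PUT with both members in one body. A minimal sketch with placeholder values (the token, endpoint and pool ID are assumptions):

import json

import requests

token = 'TOKEN'                  # placeholder Keystone token
endpoint = 'http://OCTAVIA_API'  # placeholder Octavia endpoint
pool_id = 'POOL_ID'              # placeholder pool ID

# Two members with the same address and protocol_port in one batch update.
members = [
    {'address': '192.0.2.10', 'protocol_port': 80},
    {'address': '192.0.2.10', 'protocol_port': 80},
]
r = requests.put(f'{endpoint}/v2.0/lbaas/pools/{pool_id}/members',
                 headers={'X-Auth-Token': token,
                          'Content-Type': 'application/json'},
                 data=json.dumps({'members': members}))
print(r.status_code)  # 500 on the affected version; 409 once the fix is present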

I also noticed that they assigned the same pool to two listeners in the same load balancer. This led to the behavior described above, and we again had to perform database surgery to remove this load balancer.

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Hi @nkoltsov

A few questions:
- what is the version of Octavia in your environment?
- do you have the backtrace that appears in the Octavia logs when the 500 is returned?
- do you know what is sent to the API endpoints by Terraform when creating/updating the members? (the JSON doc)

Revision history for this message
Giuseppe Petralia (peppepetra) wrote (last edit ):

Hello,

Octavia version is 1:10.0.0-0ubuntu1

The error is:

2024-03-04T13:43:32.779+0100 [INFO] provider.terraform-provider-openstack_v1.54.1: 2024/03/04 13:43:32 [DEBUG] OpenStack Response Code: 500: timestamp="2024-03-04T13:43:32.779+0100"
2024-03-04T13:43:32.779+0100 [INFO] provider.terraform-provider-openstack_v1.54.1: 2024/03/04 13:43:32 [DEBUG] OpenStack Response Headers:
Content-Length: 105
Content-Type: application/json
Date: Mon, 04 Mar 2024 12:43:32 GMT
Server: Apache/2.4.52 (Ubuntu)
X-Openstack-Request-Id: req-7d9b92e1-d73d-4385-a642-d93806d029c4: timestamp="2024-03-04T13:43:32.779+0100"
2024-03-04T13:43:32.779+0100 [INFO] provider.terraform-provider-openstack_v1.54.1: 2024/03/04 13:43:32 [DEBUG] OpenStack Response Body: {
  "debuginfo": null,
  "faultcode": "Server",
  "faultstring": "'NoneType' object has no attribute 'to_dict'"
}: timestamp="2024-03-04T13:43:32.779+0100"

Unfortunately, I don't have the full JSON sent by Terraform.

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

It looks like https://storyboard.openstack.org/#!/story/2009128

The bug was fixed in https://review.opendev.org/c/openstack/octavia/+/847555

The commit is available in Octavia 10.1.0 and 10.1.1.

Revision history for this message
Nikita Koltsov (nkoltsov) wrote (last edit ):

I tested the issue and was able to reproduce it and trace it back to an API call. The issue is triggered when Terraform uses a batch update on pool members. The openstack CLI does not support such calls, so I can't provide a simple openstack CLI command to reproduce it.

Here is the documentation for this API call: https://docs.openstack.org/api-ref/load-balancer/v2/#batch-update-members

When two members with the same IP address are added to the load balancer in one call, only one member is added, and the load balancer gets stuck in the PENDING_UPDATE state and can't be removed or altered.

It's also worth mentioning that Terraform uses an old version of the OpenStack client. The current version no longer uses batch update for load balancers (here is the link to the PR with the changes: https://github.com/gophercloud/gophercloud/pull/2560).

The issue can be reproduced quite easily using Terraform:
1. Create a VM and install nginx on it.
2. Install Terraform (for example, snap install terraform).
3. Create a folder and put the following text into a main.tf file (https://pastebin.ubuntu.com/p/nCZQSvxYkn/).
4. Specify the correct instance ID for the VM and the subnet name in the file.
5. Run terraform init.
6. Run terraform apply.

Terraform will wait for the load balancer to go online and eventually time out.

I believe it's not 100% related to the bug above; should I create another bug for it?

For testing I used jammy/yoga with Octavia version 10.1.0 (10.1.0-0ubuntu1).

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

I just ran a test with 10.1.0 (from pip) and an openstacksdk script:

import json

import requests

# endpoint, conn and pool are assumed to be created earlier in the script
# (an openstacksdk connection and an existing pool on the load balancer).

members = [
    {
        'address': '10.0.0.2',
        'protocol_port': 80,
        'weight': 60
    },
    {
        'address': '10.0.0.2',
        'protocol_port': 80,
        'weight': 40
    },
]

r = requests.put(f'{endpoint}/v2.0/lbaas/pools/{pool.id}/members',
                 headers={
                     'X-Auth-Token': conn.auth_token,
                     'Content-Type': 'application/json'
                 },
                 data=json.dumps({'members': members}))

Adding the members is blocked by the API:

DEBUG wsme.api [None req-712d9664-62f1-48c0-ae63-6e83be7c7277 demo admin] Client-side error: Another member on this pool is already using ip 10.0.0.2 on protocol_port 80 {{(pid=879452) format_exception /usr/local/lib/python3.9/site-packages/wsme/api.py:223}}

Then the API returns a 409 error (the load balancer is not created).

Can you check manually that your octavia API package contains this patch: https://review.opendev.org/c/openstack/octavia/+/847555 ?

What's the full backtrace when you're facing the issue?

Revision history for this message
Nikita Koltsov (nkoltsov) wrote :

The patch is there, I have already checked it. We are using a setup with three Octavia instances; here are the access logs related to the broken member:
https://pastebin.ubuntu.com/p/3HgkSJBj3C/

Also, here is everything related to this pool from the Octavia logs: https://pastebin.ubuntu.com/p/3Zwh8MkH2t/
An interesting fact: I can't find any messages connected to the moment when the pool switched to the PENDING_UPDATE status. I can share the full set of logs as well if needed.
