LBaaSv2: Can't delete the Load balancer and also dependent entities if the load balancer provisioning_status is in PENDING_UPDATE

Bug #1498130 reported by hgangwx
This bug affects 6 people
Affects: octavia
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

If the load balancer's provisioning_status is PENDING_UPDATE, you cannot delete the load balancer or its dependent entities, such as a listener or pool:

 neutron -v lbaas-listener-delete 6f9fdf3a-4578-4e3e-8b0b-f2699608b7e6
DEBUG: keystoneclient.session REQ: curl -g -i -X GET http://9.197.47.200:5000/v2.0 -H "Accept: application/json" -H "User-Agent: python-keystoneclient"
DEBUG: keystoneclient.session RESP: [200] content-length: 338 vary: X-Auth-Token connection: keep-alive date: Mon, 21 Sep 2015 18:35:55 GMT content-type: application/json x-openstack-request-id: req-952f21b0-81bf-4e0f-a6c8-b3fc13ac4cd2
RESP BODY: {"version": {"status": "stable", "updated": "2014-04-17T00:00:00Z", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v2.0+json"}], "id": "v2.0", "links": [{"href": "http://9.197.47.200:5000/v2.0/", "rel": "self"}, {"href": "http://docs.openstack.org/", "type": "text/html", "rel": "describedby"}]}}

DEBUG: neutronclient.neutron.v2_0.lb.v2.listener.DeleteListener run(Namespace(id=u'6f9fdf3a-4578-4e3e-8b0b-f2699608b7e6', request_format='json'))
DEBUG: keystoneclient.auth.identity.v2 Making authentication request to http://9.197.47.200:5000/v2.0/tokens
DEBUG: keystoneclient.session REQ: curl -g -i -X GET http://9.197.47.200:9696/v2.0/lbaas/listeners.json?fields=id&id=6f9fdf3a-4578-4e3e-8b0b-f2699608b7e6 -H "User-Agent: python-neutronclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}9ea944020f06fa79f4b6db851dbd9e69aca65d58"
DEBUG: keystoneclient.session RESP: [200] date: Mon, 21 Sep 2015 18:35:56 GMT connection: keep-alive content-type: application/json; charset=UTF-8 content-length: 346 x-openstack-request-id: req-fd7ee22b-f776-4ebd-94c6-7548a5aff362
RESP BODY: {"listeners": [{"protocol_port": 100, "protocol": "TCP", "description": "", "sni_container_ids": [], "admin_state_up": true, "loadbalancers": [{"id": "ab8f76ec-236f-4f4c-b28e-cd7bfee48cd2"}], "default_tls_container_id": null, "connection_limit": 100, "default_pool_id": null, "id": "6f9fdf3a-4578-4e3e-8b0b-f2699608b7e6", "name": "listener100"}]}

DEBUG: keystoneclient.session REQ: curl -g -i -X DELETE http://9.197.47.200:9696/v2.0/lbaas/listeners/6f9fdf3a-4578-4e3e-8b0b-f2699608b7e6.json -H "User-Agent: python-neutronclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}9ea944020f06fa79f4b6db851dbd9e69aca65d58"
DEBUG: keystoneclient.session RESP:
DEBUG: neutronclient.v2_0.client Error message: {"NeutronError": {"message": "Invalid state PENDING_UPDATE of loadbalancer resource ab8f76ec-236f-4f4c-b28e-cd7bfee48cd2", "type": "StateInvalid", "detail": ""}}
ERROR: neutronclient.shell Invalid state PENDING_UPDATE of loadbalancer resource ab8f76ec-236f-4f4c-b28e-cd7bfee48cd2
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/neutronclient/shell.py", line 766, in run_subcommand
    return run_command(cmd, cmd_parser, sub_argv)
  File "/usr/lib/python2.7/site-packages/neutronclient/shell.py", line 101, in run_command
    return cmd.run(known_args)
  File "/usr/lib/python2.7/site-packages/neutronclient/neutron/v2_0/__init__.py", line 581, in run
    obj_deleter(_id)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 102, in with_params
    ret = self.function(instance, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 932, in delete_listener
    return self.delete(self.lbaas_listener_path % (lbaas_listener))
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 289, in delete
    headers=headers, params=params)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 270, in retry_request
    headers=headers, params=params)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 211, in do_request
    self._handle_fault_response(status_code, replybody)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 185, in _handle_fault_response
    exception_handler_v20(status_code, des_error_body)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 70, in exception_handler_v20
    status_code=status_code)
StateInvalidClient: Invalid state PENDING_UPDATE of loadbalancer resource ab8f76ec-236f-4f4c-b28e-cd7bfee48cd2
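Since the API rejects mutating operations while the parent load balancer is in a PENDING_* state, one client-side workaround is to poll provisioning_status and only issue the delete once the state settles. A minimal sketch (the `get_status` callable is a hypothetical wrapper around a GET on the load balancer resource, not part of any client library):

```python
import time

PENDING_STATES = {"PENDING_CREATE", "PENDING_UPDATE", "PENDING_DELETE"}

def wait_for_active(get_status, timeout=300, interval=5):
    """Poll a load balancer's provisioning_status until it leaves PENDING_*.

    get_status: zero-argument callable returning the current status string.
    Returns the final status (e.g. ACTIVE or ERROR), or raises TimeoutError
    if the resource is still pending when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status not in PENDING_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError("load balancer still in a PENDING_* state")
```

This only helps when the controller eventually releases the lock; a load balancer genuinely stuck in PENDING_* (the subject of this bug) will simply time out here.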

Tags: lbaas
hgangwx (hgangwx)
summary: LBaaSv2: Can't delete the Load balancer and also dependant entities if
- the load balancer status is PENDING_UPDATE
+ the load balancer status is in PENDING_UPDATE
summary: LBaaSv2: Can't delete the Load balancer and also dependant entities if
- the load balancer status is in PENDING_UPDATE
+ the load balancer provisioning_status is in PENDING_UPDATE
Revision history for this message
Hong Hui Xiao (xiaohhui) wrote :

I think it is designed to be this way.

Revision history for this message
hgangwx (hgangwx) wrote :

In that case, how will you handle cases where the load balancer gets stuck in PENDING_UPDATE? You can't delete it or do any other operation on that particular namespace.

Revision history for this message
Richard Theis (rtheis) wrote :

I have encountered problems with LBaaS V2 and Octavia such that creating a load balancer gets stuck in PENDING_CREATE. I'm not experiencing the same problems with LBaaS V2 and HAProxy. But you bring up a good point...how are PENDING_* actions cleaned up? I'm not sure.

Richard Theis (rtheis)
Changed in neutron:
assignee: nobody → Richard Theis (rtheis)
Revision history for this message
Richard Theis (rtheis) wrote :

I've searched through the neutron-lbaas code and see that failure actions should result in the PENDING_* state getting cleared and replaced by ERROR or ACTIVE depending on the resource in-use. I also confirmed this in a test in which I had a load balancer creation time out.

Since there appears to be error handling code to avoid the general "stuck status case", I think that the general concern raised by this issue is working-as-designed. However, there may be specific scenarios where a load balancer resource gets stuck in a PENDING_* status. If you have such a scenario, please update this bug accordingly. Without a specific scenario, I plan to mark this bug incomplete.

Richard Theis (rtheis)
Changed in neutron:
assignee: Richard Theis (rtheis) → nobody
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Changed in neutron:
status: Expired → New
Revision history for this message
Spyros Trigazis (strigazi) wrote :

Any advice for this? I even modified the DB and it didn't work.

So for the use case:
In openstack/magnum we have an option to use lbaas for our clusters. Two load balancers are created, one for etcd and one for the API, via Heat. If for any reason (unrelated to lbaas) the Heat stack creation fails, we want to delete the stack, but that is impossible because we can't delete the load balancers.

One more thing I tried that failed: using an even smaller flavor than m1.amphora, with 512 MB of RAM, for the lbaas creation as part of the stack.

Revision history for this message
Michael Johnson (johnsom) wrote :

Marking this as invalid, as it is by design that actions are not allowed on load balancers in PENDING_* states.
PENDING_* means an action against that load balancer (DELETE or UPDATE) is already in progress.

As for load balancers getting stuck in a PENDING_* state, many bugs have been cleaned up for that situation. If you find a situation that leads to a load balancer stuck in a PENDING_* state, please report that as a new bug.
Operators can clear load balancers stuck in PENDING_* by manually updating the database record for the resource.
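The manual database fix described above can be sketched as follows. This is a last-resort illustration only: it uses sqlite3 as a stand-in for the real MySQL deployment, and the `load_balancer` table and `provisioning_status` column names are assumptions from the Octavia schema that should be verified against your deployment before touching a production database.

```python
import sqlite3

def force_error_status(conn, lb_id):
    """Last-resort manual fix: mark a stuck load balancer as ERROR so it can
    be deleted through the API. Table/column names are assumptions from the
    Octavia schema; verify them against your deployment first."""
    conn.execute(
        "UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE id = ?",
        (lb_id,),
    )
    conn.commit()
```

As the later comments in this thread stress, this bypasses the controller's ownership lock, so it should only be used when no controller is actively managing the object.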

affects: neutron → octavia
Changed in octavia:
status: New → Invalid
Revision history for this message
Oleksandr Savatieiev (osavatieiev) wrote :

This is a great approach: suggest updating the DB instead of introducing a way to clear this normally. That way we can drop some CLI functionality altogether. Who needs it anyway? Just update the DB.

Revision history for this message
Giondo (giondo) wrote :

I just created my account to +1 this comment:

Oleksandr Savatieiev (osavatieiev) wrote on 2017-05-25: #8
This is a great approach to suggest updating DB instead of introducing an approach on how to clear this in normal way. This way we can drop some CLI functionality at all. Who needs that anyway? Just update DB.

If we wanted to manage all the infra from a DB, why create APIs and CLIs?
Please please please do it the right way, stop being frontend developers

2017-09-07 and this issue is still here!

Revision history for this message
Ulises Alonso Camaró (dp26) wrote :

I completely agree with previous comments, the current situation is not practical operationally.

Revision history for this message
Yang Youseok (ileixe) wrote :

Agreed with #9. FYI, our case of getting stuck in a PENDING_* state was that the controller worker config 'amp_active_wait_sec' was greater than the API's wait interval. This is definitely a misuse case, though IMHO there should be some recovery API.

Revision history for this message
Oleksandr Savatieiev (osavatieiev) wrote :

Coming back here...
Since last time, we have had ~5 situations in DevOps that led to an LB stuck in PENDING_*. Yes, a DB update to 'ERROR' helps, but this is not something a simple DevOps engineer should have to do.
There is more to it: using a DB client is a security issue. Sometimes security is hard enough not to let you into the DB so easily.

In a normal world, an LB stuck in PENDING_* for more than a timeout could be reset to ERROR.
I suggest implementing an lb-reset-state function, with a configurable lb-reset-timeout parameter, that works similarly to the DB update.
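The reset-with-timeout idea suggested here can be sketched in a few lines. Everything below is hypothetical: the function name, the one-hour default, and the use of an `updated_at` timestamp are illustrations of the proposal, not anything Octavia implements.

```python
from datetime import datetime, timedelta, timezone

PENDING_STATES = {"PENDING_CREATE", "PENDING_UPDATE", "PENDING_DELETE"}

def reset_if_stale(status, updated_at, now=None, reset_timeout=timedelta(hours=1)):
    """Sketch of the suggested lb-reset-state behaviour: a load balancer that
    has sat in a PENDING_* state longer than reset_timeout is moved to ERROR
    (from which it can be deleted); anything else is left alone."""
    if now is None:
        now = datetime.now(timezone.utc)
    if status in PENDING_STATES and now - updated_at > reset_timeout:
        return "ERROR"
    return status
```

The maintainers' later reply in this thread explains why such a forced reset was deliberately not shipped: while a PENDING_* object is held by a live controller, overriding its state can trigger unwanted recovery paths.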

Revision history for this message
Oleksandr Savatieiev (osavatieiev) wrote :

Trying to raise this once again...

Changed in octavia:
status: Invalid → New
Revision history for this message
Michael Johnson (johnsom) wrote :

Hi folks,

Sorry we were not aware that this conversation was continuing. When a bug is in a closed state (such as Invalid) it removes it from our dashboards and stops sending notifications of comments.

Also note, the OpenStack foundation has migrated Octavia off of Launchpad and onto Storyboard (https://storyboard.openstack.org/). All OpenStack projects are being migrated. Because of that, I will re-close this bug (if I can, as Launchpad bugs were disabled for this project as part of the migration).

That said I am sorry my comments were not clear. Let me try to clarify.

All objects in Octavia should and will end in a consistent state: ACTIVE or ERROR. The only time an object should be in a PENDING_* state is when a controller has ownership of the object and is actively managing it. The timing for when a controller will give up and transition to ERROR is configurable in the octavia.conf file. The defaults are quite long due to the low performance of some development systems (VirtualBox, for one).

At no time should an object be "stuck" in PENDING_*. A controller should have ownership of the object and be actively managing it. This is why you should never interrupt or circumvent these states. Doing so will likely lead to the system going into alternate recovery paths, such as failover, that may not be the desired outcome. PENDING_* means the object is actively being worked on by a controller that has locked the object to make sure others do not make changes to this object while the controller is working on the object.

If you look at the Octavia code, all paths lead back to either ACTIVE or ERROR (Example: https://docs.openstack.org/octavia/latest/_images/ListenerFlows-get_create_listener_flow.svg). In the ERROR state users or operators can recover by deleting the object and recreating.

In the past there were a few bugs in the code that could lead to an abandoned object in a PENDING_* state. We have aggressively worked to resolve those bugs and do not have any outstanding bugs detailing an object stuck in PENDING_*. The only other path that could lead to a PENDING_* would be an unsafe shutdown (kill -9 or hard power off) of a controller that had ownership of the object.

You should never need to access or change the database, but "stuff" does happen. This would only be the last resort.

We have evaluated an admin tool to "force" a delete on objects in PENDING_*, but found that it was abused because the ramifications were not clear to the people using it. It became the "universal" screwdriver (also known as a sledgehammer used to install a thumb tack). This led to issues in other services and very unhappy customers, because they lost their VIP addresses or had quota still in use for abandoned resources in the other services.

Because we were no longer seeing this issue, or getting bug reports of objects stuck in "PENDING_*" we have opted to not "do more harm than good" and decided to not enable a very dangerous "force" option.

As I requested in the original response, if you are seeing objects stuck in "PENDING_*", please open a bug in Storyboard for us. We need to understand how it got there and the rate of occurrence. If you are seeing object...


Changed in octavia:
status: New → Invalid
Revision history for this message
Oleksandr Savatieiev (osavatieiev) wrote :

Michael, thanks for the explanation.

Revision history for this message
Adrian Vladu (avladu) wrote :

Hello,

The bug is easy to reproduce in a dev environment:

* Create a loadbalancer
* Wait for the amphora vm to be created
* Restart the octavia worker

Now the load balancer will be left in an undeletable PENDING_CREATE state.

The octavia worker restart step can be replaced by a connectivity failure, etc. (underlying networking issues).

Thank you,
Adrian.
