Unable to delete LoadBalancers that are stuck in certain states

Bug #1856331 reported by Ryan Farrell
This bug affects 1 person
Affects: OpenStack Octavia Charm
Status: Triaged
Importance: Undecided
Assigned to: Unassigned

Bug Description

Occasionally we encounter load balancers that get stuck in the PENDING_CREATE, PENDING_UPDATE, or ERROR state. Subsequent attempts to delete the LB via the CLI then fail, and we have to resort to manually editing the database to set the status back to ACTIVE before the LB can be deleted.
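
For reference, the manual workaround looks roughly like the sketch below. This is only a sketch and makes assumptions: the table and column names (load_balancer.provisioning_status in the octavia database) match a stock Octavia schema, and pymysql is available. Verify both against your deployment before touching the database.

    # Sketch: mark a stuck load balancer ACTIVE in the Octavia DB so that a
    # normal "openstack loadbalancer delete <id>" can proceed afterwards.
    import pymysql

    LB_ID = "REPLACE-WITH-LB-UUID"  # the stuck load balancer's UUID

    conn = pymysql.connect(host="octavia-db-host", user="octavia",
                           password="secret", database="octavia")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE load_balancer SET provisioning_status = %s WHERE id = %s",
                ("ACTIVE", LB_ID),
            )
        conn.commit()
    finally:
        conn.close()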

Is it possible to implement a --force flag on the CLI to facilitate deletion?

Also created this as a story on the Octavia StoryBoard:
https://storyboard.openstack.org/#!/story/2007022

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote:

It doesn't seem like a charm bug EXCEPT for the comment by Michael Johnson on the linked StoryBoard around killing Octavia processes or restarts. Reproducing it here for convenience:

Marking this story as invalid as it duplicates existing stories on the topic.

That said I can give a summary of some of the other stories for you:

1. ERROR states will not block deletes and manual recovery. In some cases the failover APIs will also recover those (soon it will fix almost all of those scenarios).
2. PENDING_* states only occur when a controller is forcefully interrupted while it has ownership of the resource.
   a. Make sure your systemd units do not kill -9 Octavia controller processes. Let them finish processing and gracefully shut down.
   b. Don't kill -9 Octavia processes. kill -15 them to allow them to gracefully shut down and release the resources.
3. There are still issues with a controller suddenly losing power, someone accidentally running kill -9, etc.
   a. The Octavia team is implementing TaskFlow JobBoard to resolve this issue. It will allow sub-flow recovery of in-process activities in the event of a power loss, etc. This is currently WIP with open patches and we hope to land it for the Ussuri release.
4. Many times objects in PENDING_* are not stuck, but the controller is still attempting retries to resolve a problem with the underlying cloud (typically nova failures).
   a. The default timeouts are very lengthy; in some cases the controllers will retry for hours. At the end of this timeout it will either set the object to ACTIVE if it successfully resolved the issue, or it will mark it in ERROR and you can attempt the other recovery tools (delete and failover) once the cloud is healthy again.
   b. If you do not want the controller to try so long to work around a cloud failure, adjust these defaults down to more "user friendly" values.
   c. This is the most common case when someone claims objects are "stuck" in PENDING_*. Be sure to check the controllers to make sure none of them have ownership and are currently retrying action on them.
5. Adding a --force option is very dangerous, as it is very likely that one of the controllers has ownership of that resource and is acting on it. You are likely to do more damage than improve the situation by adding a --force option. This is why the community decided not to do this.

In particular, the last point about "--force" being "very dangerous" suggests this will be a won't-fix upstream (particularly as they closed the story).

However, I'm leaving this open to enable discussion about whether the octavia charm might be doing something wrong (my initial investigation didn't reveal anything).
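
As a side note for anyone triaging this on their own cloud, point 4c above amounts to checking whether any load balancers are still in a PENDING_* state while a controller is retrying on them. A minimal sketch of that check, assuming openstacksdk is installed and a clouds.yaml entry named "mycloud" exists (both assumptions, adjust to your environment):

    # Sketch: list load balancers still in a PENDING_* provisioning state,
    # so you can check whether a controller still owns them before assuming
    # they are stuck.
    import openstack

    conn = openstack.connect(cloud="mycloud")  # assumed clouds.yaml entry

    for lb in conn.load_balancer.load_balancers():
        if lb.provisioning_status.startswith("PENDING_"):
            print(lb.id, lb.name, lb.provisioning_status)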

Andrew McLeod (admcleod)
Changed in charm-octavia:
status: New → Triaged