Clusters can stay in Deleting state forever

Bug #1647411 reported by Luigi Toscano
This bug affects 7 people
Affects   Status    Importance   Assigned to   Milestone
Sahara    Triaged   Medium       Unassigned

Bug Description

I have observed many times that clusters get stuck in the Deleting state. This can happen both when a cluster is deleted right after a successful creation and when an already Active cluster is removed. It is not easy to give a consistent reproducer.
In those cases, heat shows an empty stack (tested with `openstack stack list --hidden --nested`).

I suspect that the engine misses the cleanup notification from heat and therefore does not switch the status.
I'm not sure whether, apart from heat cleaning up the stack, the engine has to perform other steps to clean up a cluster. If so, those additional steps should be (re)triggered as well when the engine detects that the heat stack is empty.

The configuration key cleanup_time_for_incomplete_clusters does not apply in this case, as Deleting is considered a "final" state for it.

Luigi Toscano (ltoscano) wrote :

Found in many versions: I remember it for sure since Liberty, and I hit it frequently on Newton (Red Hat OpenStack Platform on Red Hat Enterprise Linux).

Vitalii Gridnev (vgridnev) wrote :

What kind of solution can we propose? There are many possible reasons why a cluster can stay in Deleting status for a long time. In most of them, the heat stack can't be deleted. We do not subscribe to heat stack events, we just poll its status, and if the heat stack goes to a delete-failed status there is no way to retry anything, I think. We could do something like force deletion, but that needs additional API changes.

Luigi Toscano (ltoscano) wrote :

When heat returns an error and Deleting is kept as a final state, can't we add a periodic recheck (even if only every few hours, or even every 24) to see whether the heat resource is still in an error state or was deleted in the meantime?
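
The periodic recheck proposed here could be sketched roughly as follows. This is only an illustration, not Sahara code: the cluster dictionaries, the `last_checked` field, the `RECHECK_INTERVAL` constant and the function name are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical interval; the comment above suggests "every few hours, even 24".
RECHECK_INTERVAL = timedelta(hours=24)


def clusters_due_for_recheck(clusters, now, interval=RECHECK_INTERVAL):
    """Return the names of Deleting clusters not checked within `interval`.

    `clusters` is a list of dicts with hypothetical keys "name", "status"
    and "last_checked" (a datetime); it stands in for whatever the engine
    stores about each cluster.
    """
    due = []
    for cluster in clusters:
        if cluster["status"] != "Deleting":
            continue  # only clusters stuck in Deleting are of interest
        if now - cluster["last_checked"] >= interval:
            due.append(cluster["name"])
    return due


if __name__ == "__main__":
    now = datetime(2017, 1, 2, 0, 0)
    sample = [
        {"name": "old", "status": "Deleting",
         "last_checked": datetime(2017, 1, 1, 0, 0)},
        {"name": "recent", "status": "Deleting",
         "last_checked": datetime(2017, 1, 1, 23, 0)},
    ]
    print(clusters_due_for_recheck(sample, now))
```

A periodic task would call such a function and then query heat only for the clusters it returns, which keeps the extra load on heat small even with a long list of clusters.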

Vitalii Gridnev (vgridnev) wrote :

If the heat stack is in DELETE_FAILED, we will probably need to retry the deletion, ... or what?

Changed in sahara:
importance: Undecided → Medium
status: New → Triaged
Luigi Toscano (ltoscano) wrote :

Retry if the last registered state of the heat stack was DELETE_FAILED. We should avoid the case where the resource no longer exists in heat and the cluster is still Deleting.
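
The rule described in this comment can be written down as a small decision function. This is a minimal sketch under assumptions, not the engine's actual logic: the function name and the returned action names are hypothetical, and `None` is used here to mean "heat no longer knows the stack". The status strings are heat's stack statuses for the delete action.

```python
def next_action(stack_status):
    """Decide what to do with a cluster that is in Deleting state.

    `stack_status` is the last known heat stack status, or None when the
    stack no longer exists in heat (an assumption of this sketch).
    """
    if stack_status is None or stack_status == "DELETE_COMPLETE":
        # Nothing is left in heat: the cluster must not stay in Deleting.
        return "finalize"
    if stack_status == "DELETE_FAILED":
        # The last registered state was a failure: try the deletion again.
        return "retry_delete"
    # Any other status (for example DELETE_IN_PROGRESS): keep polling.
    return "wait"


if __name__ == "__main__":
    for status in (None, "DELETE_FAILED", "DELETE_IN_PROGRESS"):
        print(status, "->", next_action(status))
```

Treating a missing stack the same way as DELETE_COMPLETE is what covers the situation reported in this bug, where heat shows an empty stack list while the cluster is still Deleting.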

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to sahara-specs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/513230

Shi Yan (yanshi-403) wrote :

The review drafts Sahara cluster force deletion, but shouldn't we first change the state of the cluster rather than leave it hanging in DELETING forever when something goes wrong during deletion?
If the heat stack is in DELETE_FAILED, the Sahara cluster should also be in DELETE_FAILED, right?

Luigi Toscano (ltoscano) wrote :

Please comment on the review!

OpenStack Infra (hudson-openstack) wrote : Related fix merged to sahara-specs (master)

Reviewed: https://review.openstack.org/513230
Committed: https://git.openstack.org/cgit/openstack/sahara-specs/commit/?id=cde9b00587b4386ac4f5718c14756b60d1337523
Submitter: Zuul
Branch: master

commit cde9b00587b4386ac4f5718c14756b60d1337523
Author: Jeremy Freudberg <email address hidden>
Date: Thu Oct 19 02:48:41 2017 +0000

    Spec for force deletion of clusters

    This specification details how Heat's stack abandon feature will be
    leveraged to accomplish the force deletion of clusters.

    APIImpact

    Partially-Implements: bp sahara-force-delete
    Related-Bug: #1647411

    Change-Id: Iaf992fd6c89d46668e88bec536002bfffb1854dc

OpenStack Infra (hudson-openstack) wrote : Fix merged to sahara (master)

Reviewed: https://review.openstack.org/529792
Committed: https://git.openstack.org/cgit/openstack/sahara/commit/?id=6850bb8ba89f466eb9182e86d40f718725b5c713
Submitter: Zuul
Branch: master

commit 6850bb8ba89f466eb9182e86d40f718725b5c713
Author: Jeremy Freudberg <email address hidden>
Date: Fri Dec 22 05:38:55 2017 +0000

    Force deletion of clusters

    * Add force delete operation to engine using stack abandon
    * Basic unit tests for force delete
    * Necessary removal of "direct engine" code
    * Change schema of APIv2 cluster delete to allow force delete
    * Unit tests for new cluster delete schema
    * Basic docs about force delete
    * Release note

    bp sahara-force-delete
    Partial-Bug: #1647411

    Change-Id: Ida72677c0a4110cb78edf9d62d8330cd4608ff76
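
To illustrate what "allow force delete" in the APIv2 cluster delete schema might mean in practice, here is a minimal sketch of validating an optional delete request body and picking the heat operation from it. It is not taken from the merged change: the `force` field name, both function names and the error messages are assumptions made for this example, and the validation is hand-written rather than using Sahara's schema machinery.

```python
def parse_delete_body(body):
    """Return True when the request asks for force deletion.

    Assumes (hypothetically) an optional JSON object with a single
    boolean field named "force"; a missing body means a normal delete.
    """
    if body is None:
        return False
    if not isinstance(body, dict):
        raise ValueError("delete body must be an object")
    unknown = set(body) - {"force"}
    if unknown:
        raise ValueError("unexpected fields: %s" % sorted(unknown))
    force = body.get("force", False)
    if not isinstance(force, bool):
        raise ValueError("'force' must be a boolean")
    return force


def choose_stack_operation(force):
    """Pick the heat operation: stack abandon for force delete, else delete."""
    return "abandon" if force else "delete"


if __name__ == "__main__":
    for body in (None, {}, {"force": True}):
        print(body, "->", choose_stack_operation(parse_delete_body(body)))
```

Keeping the flag optional and defaulting it to false means existing clients that send no body keep the regular delete behaviour.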
