Periodic cleanup of non-final clusters moves the cluster into Error instead of removing it

Bug #1468722 reported by Luigi Toscano
Affects: Sahara
Status: Fix Released
Importance: High
Assigned to: Elise Gafford
Milestone: 3.0.0

Bug Description

Since Kilo, Sahara can run a periodic job to clean up stale clusters, i.e. clusters that are not in a final state (Active, Error, or Deleting).
http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html

To test this, the cleanup_time_for_incomplete_clusters key was set to 1 (i.e. 1 hour).
A suggested way to trigger the "reaper" is to restart the sahara-engine daemon while the cluster is in the initialization phase ("Waiting" state).
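
For reference, a minimal configuration sketch for this test setup; the option name comes from the spec linked above, while its placement in the [DEFAULT] section of sahara.conf is an assumption here:

    [DEFAULT]
    # Section placement assumed; the option name is from the periodic-cleanup spec.
    # A value of 1 makes the periodic task reap clusters that have been stuck
    # in a non-final state for more than one hour.
    cleanup_time_for_incomplete_clusters = 1
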
What I observed is that the cleanup process is indeed triggered 1 hour after the restart of the sahara-engine daemon, but something goes wrong and the cluster is moved into the "Error" state instead of being removed:

2015-06-17 20:35:51.343 29763 DEBUG sahara.service.periodic [-] Terminating old clusters in non-final state terminate_incomplete_clusters /usr/lib/python2.7/site-packages/sahara/service/periodic.py:174
2015-06-17 20:35:51.444 29763 DEBUG sahara.service.periodic [-] Terminating incomplete cluster cccc2 in "Waiting" state with id 55715c6c-b5dd-4ae1-905b-92c8089ac142 terminate_cluster /usr/lib/python2.7/site-packages/sahara/service/periodic.py:90
2015-06-17 20:35:51.771 29763 ERROR sahara.service.ops [-] Error during operating on cluster cccc2 (reason: Service "compute" not found in service catalog
Error ID: 0367176a-933b-447a-8b91-30bf0a7eef63)
2015-06-17 20:35:52.008 29763 ERROR sahara.service.ops [-] Error during rollback of cluster cccc2 (reason: Service "compute" not found in service catalog
Error ID: 78e16133-8a20-417e-86a5-ebb9b29f7eda)

This issue was originally filed on the Red Hat Bugzilla against the Kilo-based product version, because it was not clear whether it was environment-specific. It seems (thanks to Ethan Gafford for the investigation) that the issue always applies, including on current master.

In a nutshell, the credentials needed for this operation (the trust_id) are not properly populated: the trust_id is populated only for transient clusters, not for long-running ones.

Changed in sahara:
milestone: none → liberty-2
Changed in sahara:
importance: Undecided → High
Elise Gafford (egafford)
Changed in sahara:
assignee: nobody → Ethan Gafford (egafford)
Revision history for this message
Elise Gafford (egafford) wrote :

Analysis to date (copied from downstream Red Hat bug for visibility):

Essentially, the issue here is that:
1) The periodic cluster cleanup task depends on the existence of a trust_id (see sahara.service.trusts.trusts.use_os_admin_auth_token and sahara.context.get_admin_context).
2) The trust_id is only populated for transient clusters (see sahara.service.ops._provision_cluster).
3) The periodic cluster cleanup task selects clusters based only on CONF.use_identity_api_v3, without regard to whether the cluster has a trust_id (see sahara.service.periodic.terminate_cluster).

So the logical gate to create a trust seems to be more restrictive than the logical gate to enter the periodic cleanup flow, which depends on the existence of that trust to populate its context.
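
To make the mismatch concrete, here is a minimal, self-contained toy model of the two gates described above (this is not Sahara code; the names are illustrative stand-ins for the provisioning and periodic-cleanup paths):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Cluster:
        name: str
        is_transient: bool
        status: str = "Waiting"
        trust_id: Optional[str] = None

    def provision(cluster: Cluster) -> None:
        # Gate 1 (pre-fix behaviour): a trust is created only for transient clusters.
        if cluster.is_transient:
            cluster.trust_id = "trust-" + cluster.name

    def periodic_cleanup(clusters, use_identity_api_v3: bool = True) -> None:
        # Gate 2: stale clusters are selected without checking trust_id, so the
        # terminate path fails for long-running clusters and marks them Error.
        for c in clusters:
            if use_identity_api_v3 and c.status not in ("Active", "Error", "Deleting"):
                try:
                    if c.trust_id is None:
                        raise RuntimeError("no trust_id: cannot build the admin context")
                    c.status = "Deleting"
                except RuntimeError:
                    c.status = "Error"  # what the reporter observed instead of removal

    long_running = Cluster("cccc2", is_transient=False)
    provision(long_running)
    periodic_cleanup([long_running])
    print(long_running.status)  # -> "Error"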

Will discuss with alazarev re: intent; I see a few possible solutions.

Thanks,
Ethan

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to sahara-specs (master)

Fix proposed to branch: master
Review: https://review.openstack.org/200275

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to sahara (master)

Fix proposed to branch: master
Review: https://review.openstack.org/200719

Changed in sahara:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to sahara-specs (master)

Reviewed: https://review.openstack.org/200275
Committed: https://git.openstack.org/cgit/openstack/sahara-specs/commit/?id=6c2b9bf33859e13cd0897c4f1773921d48948b6c
Submitter: Jenkins
Branch: master

commit 6c2b9bf33859e13cd0897c4f1773921d48948b6c
Author: Ethan Gafford <email address hidden>
Date: Thu Jul 9 17:28:30 2015 -0400

    Store trusts for all clusters in DB

    As per team meeting on 7/9/2015, in order to successfully cleanup
    stale clusters from a periodic task, we must store trust ids in the
    database until cluster activation in the case of long-running
    clusters. This spec update formalizes that plan and adds
    commentary on the alternative of in-memory (context) trust_id
    storage.

    Change-Id: If273d99a320f08b22cc0663ac369cc06c13206d8
    Addresses-bug: #1468722

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to sahara (master)

Reviewed: https://review.openstack.org/200719
Committed: https://git.openstack.org/cgit/openstack/sahara/commit/?id=242232f52272267af47427bedb510fa231c63c77
Submitter: Jenkins
Branch: master

commit 242232f52272267af47427bedb510fa231c63c77
Author: Ethan Gafford <email address hidden>
Date: Fri Jul 10 16:50:15 2015 -0400

    Cluster creation with trust

    This change creates trusts between the admin and tenant for all clusters,
    storing the trust_ids in the DB until they are no longer needed.

    Creating all clusters (including long-running ones) with trusts will allow:
    1) Long-running cluster operations to complete
    2) Administrative periodic cluster cleanup to delete stale clusters
    3) Better support of intra-operation HA in future revisions

    Change-Id: I3a0c31913ce76579570513a478b1f55d546c122d
    Implements: blueprint cluster-creation-with-trust
    Closes-bug: 1468722
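
For illustration only, a rough, self-contained sketch of the approach this change describes (the function names are hypothetical stand-ins, not the actual sahara.service.trusts API): every cluster, long-running or transient, gets an admin-to-tenant trust at provisioning time, its id is persisted on the cluster record, and it is cleared once it is no longer needed.

    import uuid
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Cluster:
        name: str
        status: str = "Waiting"
        trust_id: Optional[str] = None

    def provision_with_trust(cluster: Cluster) -> None:
        # With the fix, transient AND long-running clusters get a trust,
        # and its id is stored with the cluster record.
        cluster.trust_id = "trust-" + uuid.uuid4().hex

    def terminate_stale(cluster: Cluster) -> None:
        # The periodic cleanup can now always build the context it needs.
        assert cluster.trust_id is not None
        cluster.status = "Deleting"
        cluster.trust_id = None  # trust revoked once no longer needed

    stale = Cluster("cccc2")
    provision_with_trust(stale)
    terminate_stale(stale)
    print(stale.status)  # -> "Deleting": the cluster is removed, not Errored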

Changed in sahara:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in sahara:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in sahara:
milestone: liberty-2 → 3.0.0