Periodic cleanup of non-final clusters moves the cluster into Error instead of removing it
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Sahara | Fix Released | High | Ethan Gafford | 3.0.0 |
Bug Description
Since Kilo, Sahara can run a periodic job to clean up stale clusters, i.e. clusters that are not in a final state (Active, Error, or Deleting).
http://
In order to test this, the cleanup_
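For context, here is a minimal sketch of what such a periodic "reaper" does, under assumed names (this is not Sahara's actual code; `reap_stale_clusters`, the attribute names, and the one-hour cutoff are illustrative only):

```python
from datetime import datetime, timedelta

# States treated as "final"; anything else is a candidate for cleanup.
FINAL_STATES = {'Active', 'Error', 'Deleting'}

# Hypothetical cutoff mirroring a "clean up after N hours" style option.
STALE_AFTER = timedelta(hours=1)


def reap_stale_clusters(clusters, terminate, now=None):
    """Terminate clusters stuck in a non-final state for too long."""
    now = now or datetime.utcnow()
    for cluster in clusters:
        if cluster.status in FINAL_STATES:
            continue
        if now - cluster.updated_at < STALE_AFTER:
            continue
        # The real termination needs credentials (e.g. a trust) to talk to
        # the other OpenStack services on behalf of the cluster's owner.
        terminate(cluster)
```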
A suggested way to trigger the "reaper" is to restart the -engine daemon while the cluster is in the initialization phase ("Waiting" state).
What I observed is that the cleanup process is triggered one hour after the restart of the -engine daemon, but something goes wrong and the cluster is moved into the "Error" state:
2015-06-17 20:35:51.343 29763 DEBUG sahara.
periodic.py:174
2015-06-17 20:35:51.444 29763 DEBUG sahara.
r/lib/python2.
2015-06-17 20:35:51.771 29763 ERROR sahara.service.ops [-] Error during operating on cluster cccc2 (reason: Service "compute" not found in service catalog
Error ID: 0367176a-
2015-06-17 20:35:52.008 29763 ERROR sahara.service.ops [-] Error during rollback of cluster cccc2 (reason: Service "compute" not found in service catalog
Error ID: 78e16133-
This issue was originally filed in Red Hat Bugzilla against the Kilo-based product version, because it was not clear whether it was environment-specific. It turns out (thanks to Ethan Gafford for the investigation) that the issue is general and also applies to current master.
In a nutshell, the credentials needed for this operation (the trust_id) are not properly populated: they are populated only for transient clusters, not for long-running clusters.
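A simplified illustration of the failure mode (the names below are assumptions, not Sahara's actual API): the cleanup path builds a privileged context from the cluster's trust_id, so a cluster whose trust was never created cannot resolve the service catalog and fails much like the "compute" lookup in the log above.

```python
class MissingTrustError(Exception):
    """The cluster has no trust to impersonate its owner with."""


def trusted_session(trust_id):
    # Stand-in for authenticating against Keystone with a stored trust;
    # only a valid trust yields a session with a usable service catalog.
    return {"trust_id": trust_id, "service_catalog": ["compute", "network"]}


def admin_context_for(cluster):
    """Build the context the periodic cleanup needs for this cluster."""
    if not cluster.trust_id:
        # Long-running clusters never get trust_id populated, so the
        # cleanup ends up here and the termination/rollback fails.
        raise MissingTrustError("cluster %s has no trust_id" % cluster.id)
    return trusted_session(cluster.trust_id)
```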
Changed in sahara:
  milestone: none → liberty-2
Changed in sahara:
  importance: Undecided → High
Changed in sahara:
  assignee: nobody → Ethan Gafford (egafford)
Changed in sahara:
  status: Fix Committed → Fix Released
Changed in sahara:
  milestone: liberty-2 → 3.0.0
Analysis to date (copied from downstream Red Hat bug for visibility):
Essentially, the issue here is that:
1) The periodic cluster cleanup task depends on the existence of a trust_id (see sahara.service.trusts.use_os_admin_auth_token and sahara.context.get_admin_context).
2) The trust_id is only populated for transient clusters (see sahara.service.ops._provision_cluster).
3) The periodic cluster cleanup task selects clusters based only on CONF.use_identity_api_v3, without regard to whether the cluster has a trust_id (see sahara.service.periodic.terminate_cluster).
So the logical gate to create a trust seems to be more restrictive than the logical gate to enter the periodic cleanup flow, which depends on the existence of that trust to populate its context.
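One possible direction, purely as an illustration of aligning the two gates (the actual fix is still to be discussed below): either create a trust for long-running clusters at provisioning time when periodic cleanup is enabled, or refuse to enter the cleanup flow for clusters that lack one, e.g.:

```python
import logging

LOG = logging.getLogger(__name__)

FINAL_STATES = {'Active', 'Error', 'Deleting'}


def maybe_terminate(cluster, terminate):
    """Attempt periodic termination only when the required trust exists."""
    if cluster.status in FINAL_STATES:
        return  # already final, nothing to reap
    if not cluster.trust_id:
        # Creating the trust was gated more strictly than entering this
        # flow; skipping here avoids failing later with
        # 'Service "compute" not found in service catalog'.
        LOG.warning("Skipping cleanup of cluster %s: no trust_id", cluster.id)
        return
    terminate(cluster)
```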
Will discuss with alazarev re: intent; I see a few possible solutions.
Thanks,
Ethan