Failed to delete spawning cluster
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Sahara | Fix Released | Low | Andrew Lazarev |
Bug Description
Steps to repro:
1. Create cluster
2. Click 'Delete cluster' in the middle of creation process
Observed behavior: cluster moved to 'Error' state
Direct engine is used.
Stacktrace:
2014-11-13 13:14:37.013 5894 ERROR sahara.service.ops [-] Error during operating cluster 'al-test3' (reason: Instance could not be found (HTTP 404) (Request-ID: req-48309e51-
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops Traceback (most recent call last):
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops f(cluster_id, *args, **kwds)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops INFRA.create_
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops self._assign_
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops node_group.
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops nova.client(
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops return self._get(
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops _resp, body = self.api.
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops return self._cs_
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops resp, body = self._time_
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops resp, body = self.request(url, method, **kwargs)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops raise exceptions.
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops NotFound: Instance could not be found (HTTP 404) (Request-ID: req-48309e51-
...
2014-11-13 13:14:41.554 5894 INFO sahara.
Changed in sahara: | |
importance: | Undecided → Low |
status: | New → Confirmed |
Changed in sahara: | |
assignee: | nobody → Andrew Lazarev (alazarev) |
milestone: | none → kilo-1 |
Changed in sahara: | |
status: | Fix Committed → Fix Released |
Changed in sahara: | |
milestone: | kilo-1 → 2015.1.0 |
ops._provision_cluster is already protected against errors caused by an ALREADY deleted cluster. The problem here is that cluster deletion is not an instant operation.
Here is how cluster deletion happens:
1. Send signal to nova to terminate instances
2. Wait for actual termination of ALL instances
3. Delete related objects (security groups, etc.)
4. Delete cluster object
As we can see, there is plenty of time between the deletion of the first instance and the deletion of the cluster object. So an error is possible when part of the cluster has already been deleted but the cluster object still exists.
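The window described above can be reproduced with a small in-memory model. This is only an illustrative sketch, not Sahara code: `FakeCompute`, `delete_cluster_steps`, `provision_step` and `demo_race` are hypothetical names, and the cluster is assumed to be a plain dict.

```python
class InstanceNotFound(Exception):
    """Stand-in for the 404 NotFound raised by the compute client."""


class FakeCompute:
    """Hypothetical in-memory replacement for the compute service."""

    def __init__(self, instance_ids):
        self.instances = set(instance_ids)

    def get(self, instance_id):
        if instance_id not in self.instances:
            raise InstanceNotFound(instance_id)
        return instance_id

    def terminate(self, instance_id):
        self.instances.discard(instance_id)


def delete_cluster_steps(cluster_store, compute, cluster_id):
    """Yield after each deletion step so other work can be interleaved."""
    cluster = cluster_store[cluster_id]
    for instance_id in cluster["instances"]:
        compute.terminate(instance_id)       # steps 1-2: terminate instances
        yield "terminated " + instance_id
    yield "deleted related objects"          # step 3: nothing to do in this model
    del cluster_store[cluster_id]            # step 4: cluster object goes last
    yield "deleted cluster object"


def provision_step(cluster_store, compute, cluster_id):
    """One provisioning action: look up every instance of the cluster."""
    if cluster_id not in cluster_store:
        return "cluster already deleted, skipping"   # the existing protection
    for instance_id in cluster_store[cluster_id]["instances"]:
        compute.get(instance_id)                     # may raise InstanceNotFound
    return "ok"


def demo_race():
    store = {"c1": {"instances": ["i-1", "i-2"]}}
    compute = FakeCompute(["i-1", "i-2"])
    deletion = delete_cluster_steps(store, compute, "c1")
    next(deletion)  # only the first instance has been terminated so far
    try:
        return provision_step(store, compute, "c1")
    except InstanceNotFound:
        return "error: instance missing while cluster object still exists"
```

In this model the check for an already deleted cluster only helps once step 4 has run; between the first termination and step 4 the provisioning step still fails.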
Proposed solution: mark the cluster object as being deleted before terminating any instances. This can be done either via the status or via a separate field (a separate field is better because the status can occasionally be overwritten by another process).
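A minimal sketch of that idea, assuming a dict-based cluster and a set of live instance ids in place of nova. The `is_being_deleted` field and the `delete_cluster` and `provision` functions are hypothetical names chosen for illustration, not the actual fix that was merged.

```python
class InstanceNotFound(Exception):
    """Stand-in for the 404 NotFound raised by the compute client."""


def delete_cluster(cluster, live_instances):
    """Mark the cluster first, then terminate its instances."""
    # Hypothetical dedicated field, kept separate from "status" so that
    # another process updating the status cannot overwrite it.
    cluster["is_being_deleted"] = True
    for instance_id in cluster["instances"]:
        live_instances.discard(instance_id)


def provision(cluster, live_instances):
    """Check every instance; tolerate missing ones only during deletion."""
    try:
        for instance_id in cluster["instances"]:
            if instance_id not in live_instances:
                raise InstanceNotFound(instance_id)
    except InstanceNotFound:
        if cluster.get("is_being_deleted"):
            # Deletion is in progress: stop quietly, leave the status alone.
            return "aborted"
        cluster["status"] = "Error"
        raise
    cluster["status"] = "Active"
    return "active"
```

With the flag set before the first termination, a provisioning thread that hits a missing instance can tell an expected deletion from a genuine failure and skip the move to 'Error'.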