Sahara

Failed to delete spawning cluster

Bug #1392490 reported by Andrew Lazarev on 2014-11-13

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Sahara	Fix Released	Low	Andrew Lazarev	Sahara 2015.1.0 "kilo"

Bug Description

Steps to repro:
1. Create cluster
2. Click 'Delete cluster' in the middle of creation process

Observed behavior: cluster moved to 'Error' state
Direct engine is used.

Stacktrace:
2014-11-13 13:14:37.013 5894 ERROR sahara.service.ops [-] Error during operating cluster 'al-test3' (reason: Instance could not be found (HTTP 404) (Request-ID: req-48309e51-fb14-4b24-b35b-4f20b258508c))
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops Traceback (most recent call last):
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/ops.py", line 141, in wrapper
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops f(cluster_id, *args, **kwds)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/ops.py", line 226, in _provision_cluster
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops INFRA.create_cluster(cluster)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/direct_engine.py", line 59, in create_cluster
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops self._assign_floating_ips(instances)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/direct_engine.py", line 368, in _assign_floating_ips
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops node_group.floating_ip_pool)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/networks.py", line 92, in assign_floating_ip
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops nova.client().servers.get(instance_id).add_floating_ip(ip)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/v1_1/servers.py", line 563, in get
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops return self._get("/servers/%s" % base.getid(server), "server")
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/base.py", line 93, in _get
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops _resp, body = self.api.client.get(url)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/client.py", line 487, in get
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops return self._cs_request(url, 'GET', **kwargs)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/client.py", line 465, in _cs_request
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops resp, body = self._time_request(url, method, **kwargs)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/client.py", line 439, in _time_request
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops resp, body = self.request(url, method, **kwargs)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/client.py", line 433, in request
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops raise exceptions.from_response(resp, body, url, method)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops NotFound: Instance could not be found (HTTP 404) (Request-ID: req-48309e51-fb14-4b24-b35b-4f20b258508c)
...
2014-11-13 13:14:41.554 5894 INFO sahara.utils.general [-] Cluster status has been changed: id=fb3df1b4-eab0-47fa-8324-8099aac28b15, New status=Error

Andrew Lazarev (alazarev) on 2014-11-18

Changed in sahara:
importance:	Undecided → Low
status:	New → Confirmed

Andrew Lazarev (alazarev) wrote on 2014-11-20:

ops._provision_cluster is already protected against errors caused by an ALREADY deleted cluster. The problem here is that cluster deletion is not an instant operation.

Here is how cluster deletion happens:
1. Send signal to nova to terminate instances
2. Wait for actual termination of ALL instances
3. Delete related objects (security groups, etc.)
4. Delete cluster object

As we can see, there is plenty of time between the moment the first instance is deleted and the moment the cluster object is deleted. So it is possible to hit an error when part of the cluster has already been deleted but the cluster object still exists.
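
The window described above can be illustrated with a small in-memory simulation. This is only a sketch: FakeCloud, delete_cluster and record_window are hypothetical names invented for illustration and are not Sahara or novaclient code. It records, after each deletion step, whether the cluster object still exists and how many instances are left.

```python
class FakeCloud:
    """Hypothetical in-memory stand-in for nova plus the Sahara database."""

    def __init__(self, instance_ids):
        self.instances = set(instance_ids)
        self.security_groups = {"sg-cluster"}
        self.cluster_exists = True


def delete_cluster(cloud, observe):
    """Run the four deletion steps, calling observe(step, cloud) after each."""
    # Steps 1-2: request termination and wait until every instance is gone.
    for instance_id in sorted(cloud.instances):
        cloud.instances.discard(instance_id)
        observe("instance terminated", cloud)
    # Step 3: delete related objects (security groups, etc.).
    cloud.security_groups.clear()
    observe("related objects deleted", cloud)
    # Step 4: only now does the cluster object itself disappear.
    cloud.cluster_exists = False
    observe("cluster object deleted", cloud)


def record_window(log):
    """Build an observer that logs what a concurrent provisioner would see."""
    def observe(step, cloud):
        log.append((step, cloud.cluster_exists, len(cloud.instances)))
    return observe


log = []
delete_cluster(FakeCloud(["i-1", "i-2"]), record_window(log))
for entry in log:
    print(entry)
```

Every entry except the last one shows an existing cluster object with missing instances, which is the state in which a provisioning step that looks up an instance would get the 404 from the stacktrace.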

Proposed solution: mark the cluster object as being deleted before terminating any instances. This can be done either via the status or via some other field (another field is better because the status can be accidentally overwritten by another process).
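
A minimal sketch of this idea follows. It is not the actual patch: the Cluster class, the is_being_deleted field, the InstanceNotFound exception and both functions are assumptions made for illustration. The point is only that the flag is set before any instance is terminated, so a failing provisioning step can tell an expected deletion apart from a real error.

```python
class InstanceNotFound(Exception):
    """Hypothetical stand-in for the 404 raised when an instance is gone."""


class Cluster:
    """Hypothetical cluster record with a dedicated deletion marker."""

    def __init__(self, name):
        self.name = name
        self.status = "Spawning"
        # Separate field, so other writers of `status` cannot clobber it.
        self.is_being_deleted = False


def start_deletion(cluster, terminate_instances):
    # Mark first, then terminate: provisioning can see the flag in time.
    cluster.is_being_deleted = True
    terminate_instances(cluster)


def provision(cluster, assign_floating_ips):
    try:
        assign_floating_ips(cluster)
    except InstanceNotFound:
        if cluster.is_being_deleted:
            # Expected during deletion: stop quietly, no 'Error' state.
            return "aborted"
        cluster.status = "Error"
        raise
    cluster.status = "Active"
    return "provisioned"


def missing_instance(cluster):
    raise InstanceNotFound("Instance could not be found")


deleting = Cluster("al-test3")
start_deletion(deleting, lambda cluster: None)
result = provision(deleting, missing_instance)
print(result, deleting.status)
```

With the flag set, the provisioning call returns without moving the cluster to 'Error'. Without the flag, the same failure is still reported as an error.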

Andrew Lazarev (alazarev) on 2014-11-20

Changed in sahara:
assignee:	nobody → Andrew Lazarev (alazarev)
milestone:	none → kilo-1

OpenStack Infra (hudson-openstack) wrote on 2014-11-20: Fix proposed to sahara (master)

Fix proposed to branch: master
Review: https://review.openstack.org/136180

Changed in sahara:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2014-11-24: Fix merged to sahara (master)

Reviewed: https://review.openstack.org/136180
Committed: https://git.openstack.org/cgit/openstack/sahara/commit/?id=00401e0f5b0a082d229330888a93891cd2b54b62
Submitter: Jenkins
Branch: master

commit 00401e0f5b0a082d229330888a93891cd2b54b62
Author: Andrew Lazarev <email address hidden>
Date: Thu Nov 20 15:31:47 2014 -0800

Added checks on deleting cluster

Checks if cluster is going to be deleted can prevent race condition
problems.

Change-Id: I85e8bcd2d61ff60ba7eadce1242b0489e354889d
Closes-Bug: #1392490

Changed in sahara:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2014-12-17

Changed in sahara:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-04-30

Changed in sahara:
milestone:	kilo-1 → 2015.1.0
