Failed to delete spawning cluster

Bug #1392490 reported by Andrew Lazarev
Affects: Sahara
Status: Fix Released
Importance: Low
Assigned to: Andrew Lazarev
Milestone: 2015.1.0

Bug Description

Steps to reproduce:
1. Create a cluster
2. Click 'Delete cluster' in the middle of the creation process

Observed behavior: the cluster moved to the 'Error' state.
The direct engine is used.

Stacktrace:
2014-11-13 13:14:37.013 5894 ERROR sahara.service.ops [-] Error during operating cluster 'al-test3' (reason: Instance could not be found (HTTP 404) (Request-ID: req-48309e51-fb14-4b24-b35b-4f20b258508c))
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops Traceback (most recent call last):
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/ops.py", line 141, in wrapper
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops f(cluster_id, *args, **kwds)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/ops.py", line 226, in _provision_cluster
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops INFRA.create_cluster(cluster)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/direct_engine.py", line 59, in create_cluster
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops self._assign_floating_ips(instances)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/direct_engine.py", line 368, in _assign_floating_ips
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops node_group.floating_ip_pool)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/sahara/service/networks.py", line 92, in assign_floating_ip
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops nova.client().servers.get(instance_id).add_floating_ip(ip)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/v1_1/servers.py", line 563, in get
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops return self._get("/servers/%s" % base.getid(server), "server")
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/base.py", line 93, in _get
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops _resp, body = self.api.client.get(url)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/client.py", line 487, in get
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops return self._cs_request(url, 'GET', **kwargs)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/client.py", line 465, in _cs_request
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops resp, body = self._time_request(url, method, **kwargs)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/client.py", line 439, in _time_request
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops resp, body = self.request(url, method, **kwargs)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops File "/Users/alazarev/openstack/sahara/.tox/venv/lib/python2.7/site-packages/novaclient/client.py", line 433, in request
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops raise exceptions.from_response(resp, body, url, method)
2014-11-13 13:14:37.013 5894 TRACE sahara.service.ops NotFound: Instance could not be found (HTTP 404) (Request-ID: req-48309e51-fb14-4b24-b35b-4f20b258508c)
...
2014-11-13 13:14:41.554 5894 INFO sahara.utils.general [-] Cluster status has been changed: id=fb3df1b4-eab0-47fa-8324-8099aac28b15, New status=Error

Changed in sahara:
importance: Undecided → Low
status: New → Confirmed
Revision history for this message
Andrew Lazarev (alazarev) wrote :

ops._provision_cluster is already protected against errors on a cluster that has ALREADY been deleted. The problem here is that cluster deletion is not an instant operation.

Here is how cluster deletion happens (a sketch follows the list):
1. Send signal to nova to terminate instances
2. Wait for actual termination of ALL instances
3. Delete related objects (security groups, etc.)
4. Delete cluster object
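
A minimal sketch of this sequence, assuming hypothetical helper and client names (nova, conductor, ctx) rather than Sahara's actual direct-engine code:

import time


def _instance_gone(nova, instance_id):
    try:
        nova.servers.get(instance_id)
        return False
    except Exception:  # novaclient raises NotFound once the VM is deleted
        return True


def _delete_aux_objects(cluster):
    pass  # security groups, key pairs, volumes, ... (omitted here)


def shutdown_cluster(cluster, nova, conductor, ctx):
    # 1. Ask nova to terminate every instance in the cluster.
    for instance in cluster.instances:
        nova.servers.delete(instance.instance_id)

    # 2. Wait until ALL instances are actually gone - this can take a
    #    while, and it is exactly the window described below.
    while not all(_instance_gone(nova, i.instance_id)
                  for i in cluster.instances):
        time.sleep(1)

    # 3. Clean up related objects.
    _delete_aux_objects(cluster)

    # 4. Only now is the cluster object itself removed from the database.
    conductor.cluster_destroy(ctx, cluster)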

As we can see, there is plenty of time between when the first instance is deleted and when the cluster object is deleted. So it is possible to hit an error while part of the cluster has already been deleted but the cluster object still exists.

Proposed solution: mark the cluster object as being deleted before any instances are terminated. This can be done either via the status or via some other field (a separate field is better, because the status can occasionally be overwritten by another process).
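
A minimal sketch of that idea, assuming a hypothetical is_being_deleted field and a simple conductor-like API; this is only an illustration, not the patch that was eventually merged:

class ClusterDeletingError(Exception):
    """Raised when provisioning notices the cluster is being deleted."""


def check_cluster_not_deleting(conductor, ctx, cluster_id):
    cluster = conductor.cluster_get(ctx, cluster_id)
    # A dedicated flag is safer than the status field, since another
    # process may overwrite the status at any moment.
    if cluster is None or getattr(cluster, 'is_being_deleted', False):
        raise ClusterDeletingError()
    return cluster


def provision_cluster(conductor, ctx, cluster_id, steps):
    # Re-check the flag between long-running phases (spawn instances,
    # assign floating IPs, configure services, ...).
    for step in steps:
        cluster = check_cluster_not_deleting(conductor, ctx, cluster_id)
        step(cluster)


def delete_cluster(conductor, ctx, cluster_id):
    # Set the marker BEFORE terminating any instances, so a concurrent
    # provisioning thread stops gracefully instead of hitting HTTP 404.
    conductor.cluster_update(ctx, cluster_id, {'is_being_deleted': True})
    # ... then proceed with the normal termination/cleanup sequence.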

Changed in sahara:
assignee: nobody → Andrew Lazarev (alazarev)
milestone: none → kilo-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to sahara (master)

Fix proposed to branch: master
Review: https://review.openstack.org/136180

Changed in sahara:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to sahara (master)

Reviewed: https://review.openstack.org/136180
Committed: https://git.openstack.org/cgit/openstack/sahara/commit/?id=00401e0f5b0a082d229330888a93891cd2b54b62
Submitter: Jenkins
Branch: master

commit 00401e0f5b0a082d229330888a93891cd2b54b62
Author: Andrew Lazarev <email address hidden>
Date: Thu Nov 20 15:31:47 2014 -0800

    Added checks on deleting cluster

    Checks if cluster is going to be deleted can prevent race condition
    problems.

    Change-Id: I85e8bcd2d61ff60ba7eadce1242b0489e354889d
    Closes-Bug: #1392490

Changed in sahara:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in sahara:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in sahara:
milestone: kilo-1 → 2015.1.0