Heat is unable to detach volumes from servers

Bug #1298350 reported by Alexander Ignatov on 2014-03-27
This bug affects 4 people
Affects                    Status        Importance  Assigned to
OpenStack Heat             Fix Released  High        Pavlo Shchelokovskyy
OpenStack Heat (Icehouse)  Fix Released  Undecided   Unassigned

Bug Description

Steps to reproduce:
1) Create stack with 2 servers and 2 volumes attached per instance;
     The template could be as follows: http://paste.openstack.org/show/74431/

2) After stack moved to CREATE_COMPLETE try to delete this stack;

Expected result:
The stack is completely removed, along with all provisioned resources.

Actual behaviour:
The stack is stuck in the DELETE_IN_PROGRESS or DELETE_FAILED state, and some of the volumes and instances are not removed.

Note:
To increase reproducibility you can increase the number of servers and/or volumes per instance.

Changed in heat:
assignee: nobody → Sergey Kraynev (skraynev)
Alexander Ignatov (aignatov) wrote :

Actually, this issue blocks Sahara from using Heat as its provisioning engine.
Related Sahara bug: https://bugs.launchpad.net/sahara/+bug/1293962

It should be noted that Sahara's direct provisioning code works well in this case.

Alexander Ignatov (aignatov) wrote :

Also I looked at the VolumeDetachTask https://github.com/openstack/heat/blob/master/heat/engine/resources/volume.py#L232

and noticed that its code uses the Nova client to detach volumes. It would probably be better to use the Cinder client there, since it has a detach() action as well.
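As a rough sketch of that suggestion (this is not the actual heat code; StubVolume is a hypothetical stand-in for a cinderclient volume object), the detach could go through the Cinder client instead of nova's delete_server_volume:

```python
# Hedged sketch: detaching via the cinder side instead of nova.
# StubVolume is a hypothetical stand-in for a python-cinderclient
# volume object; a real implementation would call the client's
# volumes.detach() and refresh status with get().

class StubVolume:
    """Minimal stand-in for a cinderclient volume."""

    def __init__(self, volume_id):
        self.id = volume_id
        self.status = 'in-use'

    def detach(self):
        # Emulates a successful cinder-side detach: in-use -> available.
        self.status = 'available'


def detach_via_cinder(volume):
    """Detach through the cinder API rather than nova (sketch only)."""
    if volume.status in ('in-use', 'detaching'):
        volume.detach()
    return volume.status


vol = StubVolume('vol-1234')
assert detach_via_cinder(vol) == 'available'
```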

Pavlo Shchelokovskyy (pshchelo) wrote :

Check my https://review.openstack.org/#/c/72681, maybe that will solve your problem.

Pavlo Shchelokovskyy (pshchelo) wrote :

I reworked VolumeDetachTask there, and I think I made it more robust.

description: updated
description: updated
Changed in heat:
milestone: none → juno-1
importance: Undecided → High
Pavlo Shchelokovskyy (pshchelo) wrote :

I just wonder why you have mountpoint: null in your templates?

Sergey Kraynev (skraynev) wrote :

Pavlo, as an explanation for your comment about the null mountpoint, see https://bugs.launchpad.net/heat/+bug/1260345

Pavlo Shchelokovskyy (pshchelo) wrote :

I was able to reproduce this on a local devstack with the provided template, but not every time. One instance was left in the ERROR state due to "not able to connect to Neutron, max retries exceeded" (according to nova show). Looks like a race somewhere.

Pavlo Shchelokovskyy (pshchelo) wrote :

I have tried it using https://review.openstack.org/#/c/72681/ - it does not solve this problem.

Alexander Ignatov (aignatov) wrote :

See my note in the bug description: using more resources increases the probability of hitting this issue.

Ramashri Umale (ramashri) wrote :

I was facing this issue consistently with devstack and a single volume attached to an instance. I found the root cause to be twofold.

1) Heat issues multiple detach commands for the same volume, triggering a race condition;
2) there is a race condition in the nova volume detach itself that puts the volume in an inconsistent state. It has been explained very well in bug https://bugs.launchpad.net/cinder/+bug/1238093.

Anyway, the end result of the bug is that the heat stack shows the VolumeAttachment as DELETE_FAILED. A look into the cinder db shows the volume status to be 'in-use' and 'detached' instead of 'available' and 'detached'. A look inside the instance to which the volume was attached shows that the volume has already been detached.

From a heat perspective, I found that removing the unnecessary call to detach the same volume again resolved the issue for me.

In the __call__ method of the VolumeDetachTask class in heat/engine/resources/volume.py,

change
            while vol.status in ('in-use', 'detaching'):
                logger.debug(_('%s - volume still in use') % str(self))
                yield

                try:
                    server_api.delete_server_volume(self.server_id,
                                                    self.volume_id)
                except (clients.novaclient.exceptions.BadRequest,
                        clients.novaclient.exceptions.NotFound):
                    pass
                vol.get()

to

            while vol.status in ('in-use', 'detaching'):
                logger.debug(_('%s - volume still in use') % str(self))
                yield
                vol.get()
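The before/after difference boils down to issuing the detach once and then only polling, instead of re-issuing it on every iteration. A self-contained sketch of the corrected wait loop (FakeVolume is a hypothetical stand-in for the cinderclient volume; the real loop yields to heat's task scheduler instead of counting polls):

```python
# Sketch of the corrected wait loop: detach is requested once elsewhere,
# then we only refresh status until the volume leaves the transient states.

class FakeVolume:
    """Emulates a cinder volume that takes a few refreshes to detach."""

    def __init__(self):
        self._states = ['in-use', 'detaching', 'available']
        self._i = 0
        self.status = self._states[0]

    def get(self):
        # Refresh status from the (simulated) cinder API.
        self._i = min(self._i + 1, len(self._states) - 1)
        self.status = self._states[self._i]


def wait_for_detach(vol, max_polls=10):
    """Poll until the volume is no longer 'in-use'/'detaching' (no re-detach)."""
    polls = 0
    while vol.status in ('in-use', 'detaching'):
        polls += 1
        if polls > max_polls:
            raise RuntimeError('volume stuck detaching')
        vol.get()
    return vol.status


vol = FakeVolume()
assert wait_for_detach(vol) == 'available'
```

The point of the fix is that the second and later delete_server_volume calls in the original loop could race with a detach already in flight; pure polling avoids that.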

Alexander Ignatov (aignatov) wrote :

Ramashri, good finding! I've also looked at the history of VolumeDetachTask and noticed that it uses Nova client actions to work with volumes. I think it was designed this way because, at the time VolumeDetachTask was developed, the Cinder client did not yet have all the needed REST calls. Now Cinder and its client contain all the needed functions, and Cinder::Volume and Cinder::VolumeAttachment should probably be rewritten to use them. As proof, the Sahara provisioning code works fine in a similar case with just this code:
https://github.com/openstack/sahara/blob/master/sahara/service/volumes.py#L157

Changed in heat:
status: New → Triaged
milestone: juno-1 → icehouse-rc2
Changed in heat:
assignee: Sergey Kraynev (skraynev) → nobody
Changed in heat:
assignee: nobody → Pavlo Shchelokovskyy (pshchelo)
Changed in heat:
milestone: icehouse-rc2 → juno-1

Related fix proposed to branch: master
Review: https://review.openstack.org/86638

Changed in heat:
status: Triaged → In Progress

I have abandoned https://review.openstack.org/86620 (the first one above) in favor of https://review.openstack.org/86638 (the second one above). The latter uses native cinder API calls for attachment and detachment of volumes and works even on my local devstack, so I think that is the way to fix this bug.

Reviewed: https://review.openstack.org/86638
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e5853057142313b3ed99b66786d1d3164675f211
Submitter: Jenkins
Branch: master

commit e5853057142313b3ed99b66786d1d3164675f211
Author: Pavlo Shchelokovskyy <email address hidden>
Date: Wed Apr 9 14:58:02 2014 +0000

    Use cinder API for volume attach and detach

    Replace nova calls for attaching and detaching volumes
    with native cinder API calls.

    Closes-Bug: #1298350
    Change-Id: I628b4a0dd559389f66ab25b6030a9ad6fc927e39

Changed in heat:
status: In Progress → Fix Committed
tags: added: icehouse-backport-potential
Thierry Carrez (ttx) on 2014-06-11
Changed in heat:
status: Fix Committed → Fix Released

Hi guys,

   I applied the patch to Icehouse and tried it. The volume gets attached only in the database! What I mean by this is that tgt never creates the iSCSI session and thus never attaches the volume to the instance... I'm still trying to understand what's missing! I'm still running cinder from Icehouse, so perhaps that's my issue! Otherwise, something's missing somewhere.

Dave


Hi guys,

    Me again. I've adapted this patch to actually do something, but from the API documentation, attach and detach only change the metadata. I think we actually need to create the iSCSI connection and tear it down once completed. Also, my patch is not clean, because it exploits a validation issue in the metadata: the volume is no longer attached according to the metadata, and thus the instance can actually be deleted. I tried to understand how to actually delete the iSCSI session (and create it), but it seems a bit complicated for my knowledge of OpenStack and Python (I don't know Python, after all).

Dave
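The observation above can be made concrete: cinder's os-attach/os-detach volume actions only update cinder's database, while the data path (the iSCSI session) is set up and torn down separately, normally by nova via the connection-handshake calls. A hedged sketch of the call sequence a complete detach requires, using a recording stub whose method names mirror python-cinderclient's volume actions (the stub and full_detach helper are illustrative, not real heat or cinder code):

```python
# Hedged sketch: which cinder volume-action calls a correct consumer of
# the cinder API would have to make for a complete detach. The client is
# a recording stub; method names mirror python-cinderclient's volume
# actions (begin_detaching, terminate_connection, detach).

class RecordingCinderClient:
    def __init__(self):
        self.calls = []

    def begin_detaching(self, volume_id):
        # State machine: in-use -> detaching.
        self.calls.append(('begin_detaching', volume_id))

    def terminate_connection(self, volume_id, connector):
        # The data-path step: tears down the iSCSI export on the target.
        # Skipping it leaves the session dangling even though the DB
        # will later say the volume is detached.
        self.calls.append(('terminate_connection', volume_id))

    def detach(self, volume_id):
        # DB-only state change: detaching -> available.
        self.calls.append(('detach', volume_id))


def full_detach(client, volume_id, connector):
    """Sketch of a complete detach: state machine + data path + DB update."""
    client.begin_detaching(volume_id)
    client.terminate_connection(volume_id, connector)
    client.detach(volume_id)


client = RecordingCinderClient()
full_detach(client, 'vol-1', {'initiator': 'iqn.2014-01.example:node1'})
assert [name for name, _ in client.calls] == [
    'begin_detaching', 'terminate_connection', 'detach']
```

Calling only detach() (the last step) is exactly the "attached only in the database" symptom described above.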

FYI

From: David Hill
Sent: 30-Aug-14 6:33 PM
To: openstack (<email address hidden>)
Subject: Heat patch for Icehouse vs Volume deletion.

Hi guys,

I’ve written a simple patch that takes care of the volume deletion, and it’s ugly. I’m posting it here so anybody with better knowledge of the heat code can base their patch on it, if it’s really that ugly.

For some reason, sometimes an instance will fail to delete; this takes care of it and will retry until it disappears:
--- instance.py.orig	2014-08-30 22:07:45.259328109 +0000
+++ instance.py	2014-08-30 22:23:03.180527084 +0000
@@ -587,8 +587,13 @@
                     self.resource_id_set(None)
                     break
                 elif server.status == "ERROR":
-                    raise exception.Error(_("Deletion of server %s failed.") %
-                                          server.id)
+                    while server.status == "ERROR":
+                        nova_utils.refresh_server(server)
+                        server.reset_state()
+                        server.delete()
+                        if server.status == "DELETED":
+                            self.resource_id_set(None)
+                            break
             except clients.novaclient.exceptions.NotFound:
                 self.resource_id_set(None)
                 break

This patch is based on the following bug: https://bugs.launchpad.net/heat/+bug/1298350 => https://review.openstack.org/#/c/86638/
It didn’t work for me, but this one works flawlessly.

--- volume.py.icehouse	2014-08-30 00:59:49.844290344 +0000
+++ volume.py	2014-08-30 21:58:08.769813146 +0000
@@ -151,11 +151,11 @@
                 if vol.status == 'in-use':
                     logger.warn(_('can not delete volume when in-use'))
                     raise exception.Error(_('Volume in use'))
-
-                vol.delete()
-                while True:
-                    yield
-                    vol.get()
+                if vol.status != 'deleting':
+                    vol.delete()
+                    while True:
+                        yield
+                        vol.get()
             except clients.cinderclient.exceptions.NotFound:
                 self.resource_id_set(None)
@@ -227,69 +227,187 @@

         logger.info(_('%s - complete') % str(self))

-
-class VolumeDetachTask(object):
+class VolumeDeleteTask(object):
     """A task for detaching a volume from a Nova server."""

-    def __init__(self, stack, server_id, attachment_id):
+    def __init__(sel...

Steve Baker (steve-stevebaker) wrote :

Hi David

Could you please submit this patch as a gerrit review on http://review.openstack.org? (even if it is WIP)

This will make it easier for us to see and discuss what your change is doing.

Ok

David, that is the wrong patch to backport. I was the original author, but then we discovered that the cinder API, confusingly, operates only on the cinder db and not on nova servers. The patch was reverted and rewritten. Please see the following patch that really fixed it:

https://review.openstack.org/#/c/89796/

(there is also a bug reference)

Change abandoned by David Hill (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/118517
Reason: Useless... There are so many commits in here that we easily get lost. ;)

Thierry Carrez (ttx) on 2014-10-16
Changed in heat:
milestone: juno-1 → 2014.2
Zane Bitter (zaneb) wrote :

Pavlo, can you backport the correct patch to icehouse/stable?

Zane, will do

One more question - should I also align the patch for AWS (we mistakenly allow updates to AWS Volume Attachments in Icehouse), or would that be too much for the backport?

Zane Bitter (zaneb) wrote :

Probably better not to make changes like that in a backport if you can easily avoid it.

Reviewed: https://review.openstack.org/139972
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=0b984b8c13338cf2ec3836e21e71c3c955a148b4
Submitter: Jenkins
Branch: stable/icehouse

commit 0b984b8c13338cf2ec3836e21e71c3c955a148b4
Author: Sushil Kumar <email address hidden>
Date: Wed Apr 23 11:18:31 2014 +0000

    Call server volume detach only once

    Changes:
     - Removed duplicate delete calls to prevent race condition.
     - Removed duplicate mocked calls from unit-test.

    Closes-Bug: #1311533
    Closes-Bug: #1298350

    Conflicts:
     heat/engine/resources/volume.py
     heat/tests/test_volume.py

    Change-Id: I5f16c528652f12440160f03b92f41b76d1c9100c
    (cherry picked from commit d1ffbd4bfde0cd6b7a82a48b9d4f59cc8b310bd8)

tags: added: in-stable-icehouse