Nova doesn't allow cleanup of volumes stuck in 'attaching' or 'detaching' status

Bug #1449221 reported by Scott DAngelo on 2015-04-27
This bug affects 9 people
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Unassigned

Bug Description

Cinder volumes can get stuck in an 'attaching' or 'detaching' state and need to be cleaned up, otherwise they cannot be used again. This is not possible at the moment, as Nova doesn't allow any actions on volumes in an '-ing' status.
For detaching a volume, Nova should do three things:
1. Detach the volume from the instance.
2. Inform Cinder about the detach.
3. Delete the record in the Nova BDM table.

At the moment, if step 1 fails we do a rollback; if step 2 fails we are stuck with a volume in 'detaching' status. Nova shouldn't stop completing the detach on its side just because it gets errors from Cinder.
What we can do is modify the Nova code to handle a potential error coming from Cinder: log it and go ahead with the deletion of the BDM record; an operator can then try to fix the Cinder side with the appropriate Cinder call, such as force-delete.
Basically, as long as there is a BDM record in Nova, we allow the user to call the volume detach as many times as they like.
Nova will delete the BDM record only if the call to Cinder's "terminate_connection" succeeds.

This bug has been discussed in a spec: https://review.openstack.org/84048
where we agreed that a spec is not required and that this change should be treated as a bug fix.
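
A minimal, hypothetical sketch of the behaviour described above (not the actual Nova code; context, bdm, virt_driver, volume_api and connector are stand-ins for Nova internals):

import logging

LOG = logging.getLogger(__name__)


def detach_volume(context, bdm, virt_driver, volume_api, connector):
    # 1. Detach the volume from the instance (still rolled back on failure,
    #    as today).
    virt_driver.detach_volume(bdm.instance_uuid, bdm.device_name)

    # 2. Inform Cinder. The BDM record is removed only once
    #    terminate_connection has succeeded; other Cinder errors are logged
    #    so an operator can repair the Cinder side (e.g. with force-delete)
    #    while the user is free to retry the detach as long as the BDM exists.
    volume_api.terminate_connection(context, bdm.volume_id, connector)
    try:
        volume_api.detach(context, bdm.volume_id)
    except Exception:
        LOG.exception("Cinder detach failed for volume %s", bdm.volume_id)

    # 3. Delete the record in the Nova BDM table.
    bdm.destroy()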

Changed in nova:
assignee: nobody → Scott DAngelo (scott-dangelo)
Changed in nova:
assignee: Scott DAngelo (scott-dangelo) → nobody
Changed in nova:
assignee: nobody → Andrea Rosa (andrea-rosa-m)
Andrea Rosa (andrea-rosa-m) wrote :

I am wondering if this should be marked as "Wishlist"; what do you think?

Andrea Rosa (andrea-rosa-m) wrote :

To reproduce the issue:
- nova boot --image <image_id> --flavor <flavor_id> test
- cinder create 1
- nova volume-attach <server_id> <volume_id> /dev/vdb
- kill/stop the cinder-volume service
- nova volume-detach <server_id> <volume_id>
- restart the cinder-volume service

At this point the volume is reported in "detaching" status and it is not possible to recover from this situation.
If you try to delete the volume you get:

Delete for volume <volume_id> failed: Volume <volume_id> is still attached, detach volume first. (HTTP 400)

and the detach fails as well:

ERROR (BadRequest): Invalid input received: Invalid volume: Unable to detach volume. Volume status must be 'in-use' and attach_status must be 'attached' to detach. Currently: status: 'detaching', attach_status: 'attached.' (HTTP 400)
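
One way an operator can often recover in this specific reproduce scenario (once cinder-volume is back) is to reset the volume status on the Cinder side and then retry the detach. A sketch with python-cinderclient, assuming admin credentials; the username, password, project and auth URL below are placeholders, and the equivalent CLI is "cinder reset-state":

from cinderclient import client

cinder = client.Client('2',
                       'admin',                      # username (placeholder)
                       'secret',                     # password (placeholder)
                       'admin',                      # project (placeholder)
                       'http://keystone:5000/v2.0')  # auth URL (placeholder)

volume_id = '<volume_id>'
# Put the volume back to 'in-use' so that "nova volume-detach" can be retried.
cinder.volumes.reset_state(volume_id, 'in-use')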

Changed in nova:
status: New → In Progress
summary: - Nova volume-detach lacks '--force' command for cleanup
+ Nova doesn't allow to cleanup volumes stuck in 'attaching' or
+ 'detaching' status
description: updated

About this "Nova will delete the BDM record only if the call to cinder "terminate_connection" will success".

There is another option IMO: Nova cleans up the BDM regardless of any terminate_connection exception, and the admin/user then calls the force-detach API on the Cinder side to make sure the volume is no longer exported and to detach it.

What do you guys suggest about this option?

Scott DAngelo (scott-dangelo) wrote :

wanghao, I think the problem with ignoring the success of cinder's terminate_connection was pointed out by Walt_Boring:

" If Nova only calls libvirt volume's disconnect_volume, without Cinder's terminate_connection being called, then volumes may show back up on the nova host. Specifically for iSCSI volumes.

If an iSCSI session from the compute host to the storage backend still exists (because other volumes are connected), then the volume you just removed will show back up on the next scsi bus rescan."

So, the user should not think that the detach succeeded until the terminate_connection succeeds. Since terminate_connection is asynchronous, the Nova volume-detach will have to verify this somehow.
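
As a rough illustration of that point (assumed commands, run as root on the compute host; not part of any Nova or Cinder workflow):

import subprocess

# List the iSCSI sessions still open from the compute host to the storage
# backend (other attached volumes keep the session alive).
subprocess.check_call(["iscsiadm", "-m", "session"])

# Rescan those sessions: any LUN the backend still exports -- because
# terminate_connection was never called -- shows back up as a /dev/sdX
# device, even though Nova believes the volume was detached.
subprocess.check_call(["iscsiadm", "-m", "session", "--rescan"])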

Andrea Rosa (andrea-rosa-m) wrote :

wanghao the problem is what Scott said in comment #5.

@scott you raised an interesting point about the fact that terminate_connection is async.
At the moment Nova considers the call to have succeeded if it can send the request without any errors, but it doesn't check whether the connection has actually been terminated on the Cinder side.
Is there a cinder call we can make to get the status of the connection from cinder?
If so, we could check the status in a small fixed-interval loop before deleting the BDM record, even if I do not like this solution as it seems a bit hacky.
Any other ideas?
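
To make the idea concrete, a hypothetical fixed-interval poll could look like the sketch below; the volume_api.get() call and the 'attach_status' field are assumptions, since Cinder does not expose a dedicated "connection status" query today:

import time


def wait_for_detach(volume_api, context, volume_id, retries=10, interval=2):
    # Poll Cinder at a fixed interval until the volume looks detached.
    for _ in range(retries):
        volume = volume_api.get(context, volume_id)
        if volume['attach_status'] == 'detached':
            return True
        time.sleep(interval)
    return False

# Only delete the BDM record once the poll confirms the termination, e.g.:
#     if wait_for_detach(volume_api, context, bdm.volume_id):
#         bdm.destroy()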

Changed in nova:
importance: Undecided → High
tags: added: volumes
Changed in nova:
assignee: Andrea Rosa (andrea-rosa-m) → John Garbutt (johngarbutt)
Changed in nova:
assignee: John Garbutt (johngarbutt) → Andrea Rosa (andrea-rosa-m)
Changed in nova:
assignee: Andrea Rosa (andrea-rosa-m) → wanghao (wanghao749)
Changed in nova:
assignee: wanghao (wanghao749) → Andrea Rosa (andrea-rosa-m)
Scott DAngelo (scott-dangelo) wrote :

Proposed fix:
https://review.openstack.org/#/c/184537/9

I think that the proposed fix should have been automatically linked to this bug, but it was not for some reason.

summary: - Nova doesn't allow to cleanup volumes stuck in 'attaching' or
+ Nova doesn't allow cleanup of volumes stuck in 'attaching' or
'detaching' status
Changed in nova:
status: In Progress → Confirmed
assignee: Andrea Rosa (andrea-rosa-m) → nobody

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/184537
Reason: This code hasn't been updated in a long time, and is in merge conflict. I am going to abandon this review, but feel free to restore it if you're still working on this.

Tang Chen (tangchen) wrote :

Hi,

Is anyone still working on this bug ? And do we still need this patch ? If we need, I'd like to go on with it if you don't mind.

Thanks.

Tang Chen (tangchen) on 2016-07-22
Changed in nova:
assignee: nobody → Tang Chen (tangchen)
srividyaketharaju (srividya) wrote :

Hi,

Is anyone still working on this bug ? And do we still need this patch ? If we need, I'd like to go on with it if you don't mind.

Thanks.

Nazeema Begum (nazeema123) wrote :

Hi,

Is anyone still working on this bug ? And do we still need this patch ? If we need, I'd like to go on with it if you don't mind.

Thanks.

Changed in nova:
assignee: Tang Chen (tangchen) → Nazeema Begum (nazeema123)
Nazeema Begum (nazeema123) wrote :

I request the bug reporter to close this bug, as it is already fixed in the Mitaka version. Here is my analysis of this bug and the delta between Liberty and Mitaka.

Analysis:

In Liberty:
There is no proper volume attach/detach handling in compute/api.py in Liberty. Also, there is no local cleanup of the BDM table.

Fix in Mitaka:
Here, three new methods were added in compute/api.py to handle volume attach/detach:
1) _attach_volume_shelved_offloaded - handles attaching volumes while the instance is in shelved_offloaded state.
2) _detach_volume_shelved_offloaded - handles detaching volumes while the instance is in shelved_offloaded state, via a terminate_connection call.
3) _local_cleanup_bdm_volumes - deletes the BDM record and takes care of cleaning up the volumes (see the sketch below).

The above is also mentioned in the Mitaka release notes, in the new features list:
'''It is possible to call attach and detach volume API operations for instances which are in shelved and shelved_offloaded state. For an instance in shelved_offloaded state Nova will set to None the value for the device_name field, the right value for that field will be set once the instance will be unshelved as it will be managed by a specific compute manager.'''
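
A rough, hypothetical sketch of the local-cleanup idea behind _local_cleanup_bdm_volumes (item 3 above); this is an illustration, not the actual Mitaka code:

import logging

LOG = logging.getLogger(__name__)


def local_cleanup_bdm_volumes(context, bdms, volume_api):
    for bdm in bdms:
        if bdm.is_volume:
            try:
                volume_api.terminate_connection(context, bdm.volume_id,
                                                connector=None)
                volume_api.detach(context, bdm.volume_id)
            except Exception:
                # Cinder errors are logged, not fatal: the point is that the
                # Nova-side record is always cleaned up.
                LOG.exception("Failed to clean up volume %s in Cinder",
                              bdm.volume_id)
        # Remove the BDM record regardless, so the volume can be managed
        # again from the Nova side.
        bdm.destroy()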

REFERRED FILES:

 /opt/stack/nova/nova/compute/api.py
 /opt/stack/nova/nova/compute/manager.py
 /opt/stack/nova/nova/tests/unit/compute/test_compute.py

Sean Dague (sdague) on 2017-06-23
Changed in nova:
assignee: Nazeema Begum (nazeema123) → nobody