Nova doesn't allow cleanup of volumes stuck in 'attaching' or 'detaching' status

Bug #1449221 reported by Scott DAngelo
This bug affects 15 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

Cinder volumes can get stuck in an 'attaching' or 'detaching' state and need to be cleaned up, otherwise they cannot be used. This is not possible at the moment, as Nova doesn't allow any actions on volumes in an '-ing' status.
To detach a volume, Nova needs to do three things:
1. Detach the volume from the instance.
2. Inform Cinder about the detach.
3. Delete the record in the Nova BDM (block device mapping) table.

At the moment, if step 1 fails we roll back; if step 2 fails we are left with a volume stuck in 'detaching' status. Nova shouldn't refuse to complete the detach on its side just because it gets errors from Cinder.
What we can do is modify the Nova code to handle a potential error coming from Cinder: log it and go ahead with the deletion of the BDM record; an operator can then try to fix the Cinder side by making the appropriate Cinder call, like force-delete.
Basically, as long as there is a BDM record in Nova, we allow the user to call volume-detach as many times as they like.
Nova will delete the BDM record only if the call to Cinder's terminate_connection succeeds.
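
A minimal sketch of that behaviour, using hypothetical helper and parameter names (volume_api, virt_driver, bdm, connector) rather than Nova's actual code, could look like this:

import logging

LOG = logging.getLogger(__name__)

def detach_volume(context, instance, bdm, connector, volume_api, virt_driver):
    # Step 1: detach the volume from the instance on the hypervisor side.
    virt_driver.detach_volume(instance, bdm.device_name)

    # Step 2: tell Cinder about the detach. A failure here is logged and the
    # BDM record is kept, so the user can simply retry volume-detach later.
    try:
        volume_api.terminate_connection(context, bdm.volume_id, connector)
        volume_api.detach(context, bdm.volume_id)
    except Exception:
        LOG.exception("Cinder failed to terminate the connection for volume "
                      "%s; keeping the BDM record so the detach can be "
                      "retried", bdm.volume_id)
        return

    # Step 3: only after terminate_connection succeeded do we drop the record
    # from the BDM table.
    bdm.destroy()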

This bug has been discussed in a spec: https://review.openstack.org/84048
where we agreed that a spec is not required and that this change should be treated as a bug fix.

Tags: volumes
Changed in nova:
assignee: nobody → Scott DAngelo (scott-dangelo)
Changed in nova:
assignee: Scott DAngelo (scott-dangelo) → nobody
Changed in nova:
assignee: nobody → Andrea Rosa (andrea-rosa-m)
Revision history for this message
Andrea Rosa (andrea-rosa-m) wrote :

I am wondering if this should be marked as "Wishlist". What do you think?

Revision history for this message
Andrea Rosa (andrea-rosa-m) wrote :

To reproduce the issue:
- nova boot --image <image_id> --flavor <flavor_id> test
- cinder create 1
- nova volume-attach <server_id> <volume_id> /dev/vdb
- kill/stop the cinder-volume service
- nova volume-detach <server_id> <volume_id>
- restart the cinder-volume service

At this point the volume is reported in "detaching" status and it is not possible to recover from this situation.
If you try to delete the volume you get:

Delete for volume <volume_id> failed: Volume <volume_id> is still attached, detach volume first. (HTTP 400)

and the detach fails as well:

ERROR (BadRequest): Invalid input received: Invalid volume: Unable to detach volume. Volume status must be 'in-use' and attach_status must be 'attached' to detach. Currently: status: 'detaching', attach_status: 'attached.' (HTTP 400)

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184537

summary: - Nova volume-detach lacks '--force' command for cleanup
+ Nova doesn't allow to cleanup volumes stuck in 'attaching' or
+ 'detaching' status
description: updated
Revision history for this message
wanghao (wanghao749) wrote : Re: Nova doesn't allow to cleanup volumes stuck in 'attaching' or 'detaching' status

About this statement: "Nova will delete the BDM record only if the call to Cinder's terminate_connection succeeds".

There is another option IMO: Nova cleans up the BDM regardless of any terminate_connection exception, and then the admin/user calls the force-detach API on the Cinder side to make sure the volume is no longer exported and to detach it.

What are your suggestions about this option?

Revision history for this message
Scott DAngelo (scott-dangelo) wrote :

wanghao, I think the problem with ignoring the success of cinder's terminate_connection was pointed out by Walt_Boring:

" If Nova only calls libvirt volume's disconnect_volume, without Cinder's terminate_connection being called, then volumes may show back up on the nova host. Specifically for iSCSI volumes.

If an iSCSI session from the compute host to the storage backend still exists (because other volumes are connected), then the volume you just removed will show back up on the next scsi bus rescan."

So, the user should not think that the detach succeeded until the terminate_connection succeeds. Since terminate_connection is asynchronous, the Nova volume-detach will have to verify this somehow.

Revision history for this message
Andrea Rosa (andrea-rosa-m) wrote :

wanghao, the problem is what Scott said in comment #5.

@scott you raised an interesting point about the fact that terminate_connection is async.
At the moment Nova considers the call to have succeeded if it can send the request without any errors, but it doesn't check whether the connection has actually been terminated on the Cinder side.
Is there a cinder call we can make to get the status of the connection from cinder?
If so, we could check the status in a small fixed-interval loop before deleting the BDM record, even though I do not like this solution as it seems a bit hacky.
Any other ideas?
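
For illustration, the fixed-interval loop idea could look roughly like the sketch below. Note that get_connection_status() is a hypothetical call; Cinder does not expose such an API today.

import time

def wait_for_connection_terminated(volume_api, context, volume_id,
                                   interval=2, max_attempts=15):
    """Poll a (hypothetical) Cinder status call until the connection is gone."""
    for _ in range(max_attempts):
        # Hypothetical call: Cinder would need to expose the connection status.
        if volume_api.get_connection_status(context, volume_id) == 'terminated':
            return True
        time.sleep(interval)
    return False

# The BDM record would only be deleted once this returns True:
# if wait_for_connection_terminated(volume_api, context, bdm.volume_id):
#     bdm.destroy()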

Changed in nova:
importance: Undecided → High
tags: added: volumes
Changed in nova:
assignee: Andrea Rosa (andrea-rosa-m) → John Garbutt (johngarbutt)
Changed in nova:
assignee: John Garbutt (johngarbutt) → Andrea Rosa (andrea-rosa-m)
Changed in nova:
assignee: Andrea Rosa (andrea-rosa-m) → wanghao (wanghao749)
Changed in nova:
assignee: wanghao (wanghao749) → Andrea Rosa (andrea-rosa-m)
Revision history for this message
Scott DAngelo (scott-dangelo) wrote :

Proposed fix:
https://review.openstack.org/#/c/184537/9

I think the proposed fix should have been automatically linked to this bug, but it was not for some reason.

summary: - Nova doesn't allow to cleanup volumes stuck in 'attaching' or
+ Nova doesn't allow cleanup of volumes stuck in 'attaching' or
'detaching' status
Changed in nova:
status: In Progress → Confirmed
assignee: Andrea Rosa (andrea-rosa-m) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/184537
Reason: This code hasn't been updated in a long time, and is in merge conflict. I am going to abandon this review, but feel free to restore it if you're still working on this.

Revision history for this message
Tang Chen (tangchen) wrote :

Hi,

Is anyone still working on this bug? And do we still need this patch? If we do, I'd like to go on with it if you don't mind.

Thanks.

Tang Chen (tangchen)
Changed in nova:
assignee: nobody → Tang Chen (tangchen)
Revision history for this message
srividyaketharaju (srividya) wrote :

Hi,

Is anyone still working on this bug? And do we still need this patch? If we do, I'd like to go on with it if you don't mind.

Thanks.

Revision history for this message
Nazeema Begum (nazeema123) wrote :

Hi,

Is anyone still working on this bug? And do we still need this patch? If we do, I'd like to go on with it if you don't mind.

Thanks.

Changed in nova:
assignee: Tang Chen (tangchen) → Nazeema Begum (nazeema123)
Revision history for this message
Nazeema Begum (nazeema123) wrote :

I request the bug reporter to close this bug, as it is already fixed in the Mitaka version. Here is my analysis of the bug and the delta between Liberty and Mitaka.

Analysis:

In Liberty:
There is no proper volume attach/detach handling in compute/api.py in Liberty, and there is no local cleanup of the BDM table.

Fix in Mitaka:
Here, three new methods were added to handle volume attach/detach in compute/api.py:
1) _attach_volume_shelved_offloaded - handles attaching volumes to an instance in the shelved_offloaded state.
2) _detach_volume_shelved_offloaded - handles detaching volumes from an instance in the shelved_offloaded state on the terminate_connection call.
3) _local_cleanup_bdm_volumes - deletes the BDM record and takes care of the cleanup of the volumes (see the sketch after the file list below).

The same is also mentioned in the Mitaka release notes under new features:
'''It is possible to call attach and detach volume API operations for instances which are in shelved and shelved_offloaded state. For an instance in shelved_offloaded state Nova will set to None the value for the device_name field, the right value for that field will be set once the instance will be unshelved as it will be managed by a specific compute manager.'''

REFERENCED FILES:

 /opt/stack/nova/nova/compute/api.py
 /opt/stack/nova/nova/compute/manager.py
 /opt/stack/nova/nova/tests/unit/compute/test_compute.py
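
For reference, the idea behind _local_cleanup_bdm_volumes can be sketched roughly as follows. This is a simplified illustration with placeholder names (volume_api, connector, bdms), not the actual Mitaka code.

import logging

LOG = logging.getLogger(__name__)

def local_cleanup_bdm_volumes(bdms, context, volume_api, connector=None):
    for bdm in bdms:
        if bdm.is_volume:
            try:
                # Best-effort clean-up on the Cinder side.
                volume_api.terminate_connection(context, bdm.volume_id,
                                                connector)
                volume_api.detach(context, bdm.volume_id)
            except Exception:
                LOG.warning("Ignoring Cinder error while cleaning up volume "
                            "%s", bdm.volume_id)
        # The BDM record is removed in any case, so Nova's view of the
        # instance stays consistent.
        bdm.destroy()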

Sean Dague (sdague)
Changed in nova:
assignee: Nazeema Begum (nazeema123) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/571472

Changed in nova:
assignee: nobody → Chen (chenn2)
status: Confirmed → In Progress
Revision history for this message
Mike Chen (chenn2) wrote :

I can still reproduce this bug following the steps in comment #2 (environment: queens, nova version 17.0.1).

The status of the volume gets stuck in "detaching".

When trying to detach again (nova volume-detach vm_id volume_id):

ERROR (BadRequest): Invalid volume: Invalid input received: Invalid volume: Unable to detach volume. Volume status must be 'in-use' and attach_status must be 'attached' to detach. (HTTP 400)

When trying to delete the volume (cinder delete volume_id):

Delete for volume volume_id failed: Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer. (HTTP 400)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/571472
Reason: WIP for some time. It seems this has stalled so abandoning.

Revision history for this message
wang (yunhua) wrote :

stay tuned

Lee Yarwood (lyarwood)
Changed in nova:
status: In Progress → Confirmed
assignee: Mike Chen (chenn2) → nobody
Maurice Wei (mauricewei)
Changed in nova:
assignee: nobody → Maurice Wei (mauricewei)
Changed in nova:
assignee: Maurice Wei (mauricewei) → nobody
assignee: nobody → HanGuangyu (hanguangyu)
Changed in nova:
assignee: HanGuangyu (hanguangyu) → nobody