VMAX: _unlink_volume failed after 30 tries

Bug #1800008 reported by Carl Pecinovsky
This bug affects 1 person
Affects: Cinder
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Discussed this some with the driver team. When deletion of a VMAX volume is requested and the volume has one or more snapVX sessions linking it to a source volume, those sessions need to be cleaned up before the volume can be deallocated.

The REST driver attempts to remove copy sessions sequentially with a looping call thread that polls for completion of each one. This does not appear to work very well at scale.

1. The retry and interval values are hardcoded.
UNLINK_INTERVAL = 15
UNLINK_RETRIES = 30

This results in a failure after 450 seconds (in the case where REST calls respond quickly) if the copy session has not completed. Other requests are controlled by configurable retry and interval options; this unlink flow should either use those configured options or get its own configurable interval and retries.
The reason is that 450 seconds is not adequate in some environments, especially hybrid systems. For example, a server creation may fail late in the flow because of a networking problem, and the volumes then need to be detached and deleted right away during rollback. Volume deletion fails in this case.
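For reference, a minimal sketch of what a configurable unlink loop could look like, in place of the hardcoded constants. The function name and helper callbacks here are illustrative only, not the actual VMAX driver API:

```python
import time


# Hypothetical sketch: the unlink poll driven by caller-supplied values
# instead of the hardcoded UNLINK_INTERVAL = 15 / UNLINK_RETRIES = 30.
# `is_unlinked` and `try_unlink` stand in for the driver's REST checks.
def unlink_volume(is_unlinked, try_unlink, interval=15, retries=30,
                  sleep=time.sleep):
    """Attempt to break the snapVX link, polling up to `retries` times
    at `interval`-second spacing (roughly interval * retries seconds of
    total wait when each REST call returns quickly)."""
    for _ in range(retries):
        if is_unlinked():
            return True
        try_unlink()
        sleep(interval)
    # One final check after the last sleep before giving up.
    return is_unlinked()
```

With `interval` and `retries` sourced from configuration, operators with slower hybrid arrays could raise the budget beyond 450 seconds without a code change.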

2. Because of a REST capability limitation, the looping call thread is attempting to modify the temporary snapshot in order to see if the copy session has completed and remove the link. Even at small scale, the PUT call can outlast the 15 second interval between calls by a large amount. See pasted log snippet below...

2018-10-25 13:51:53.762 106815 DEBUG cinder.volume.drivers.dell_emc.vmax.rest [-] PUT request to httpspvc://9.3.233.170:8443/univmax/restapi/private/84/replication/symmetrix/000196800573/snapshot/temp-02A99-volumoot-0 has returned with a status code of: 500. request /usr/lib/python2.7/site-packages/cinder/volume/drivers/dell_emc/vmax/rest.py:138
2018-10-25 13:51:53.763 106815 WARNING oslo.service.loopingcall [-] Function 'cinder.volume.drivers.dell_emc.vmax.provision._unlink_vol' run outlasted interval by 134.19 sec
2018-10-25 13:51:54.340 106815 DEBUG cinder.volume.drivers.dell_emc.vmax.rest [-] PUT request to httpspvc://9.3.233.170:8443/univmax/restapi/private/84/replication/symmetrix/000196800573/snapshot/temp-02AA1-volumoot-0 has returned with a status code of: 500. request /usr/lib/python2.7/site-packages/cinder/volume/drivers/dell_emc/vmax/rest.py:138
2018-10-25 13:51:54.340 106815 WARNING oslo.service.loopingcall [-] Function 'cinder.volume.drivers.dell_emc.vmax.provision._unlink_vol' run outlasted interval by 85.58 sec
2018-10-25 13:52:28.065 106815 DEBUG cinder.volume.drivers.dell_emc.vmax.rest [-] PUT request to httpspvc://9.3.233.170:8443/univmax/restapi/private/84/replication/symmetrix/000196800573/snapshot/temp-02A94-volumoot-0 has returned with a status code of: 500. request /usr/lib/python2.7/site-packages/cinder/volume/drivers/dell_emc/vmax/rest.py:138
2018-10-25 13:52:28.066 106815 WARNING oslo.service.loopingcall [-] Function 'cinder.volume.drivers.dell_emc.vmax.provision._unlink_vol' run outlasted interval by 102.40 sec
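Because each PUT can outlast the polling interval, a count-based loop gives an unpredictable total wait. One possible mitigation (a sketch only, not the driver's current code) is to bound the wait by wall-clock time instead of attempt count, so the budget is the same whether REST calls return in milliseconds or take minutes:

```python
import time


# Hypothetical sketch: wait for the copy session by wall-clock deadline
# rather than a fixed number of attempts. `is_unlinked` stands in for
# the driver's completion check against the array.
def wait_for_unlink(is_unlinked, timeout=450.0, interval=15.0,
                    clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout
    while True:
        if is_unlinked():
            return True
        if clock() >= deadline:
            return False
        # Never sleep past the deadline.
        sleep(min(interval, deadline - clock()))
```

The `clock` and `sleep` parameters are injected here only so the behavior is testable; the point is that slow REST calls consume the shared time budget instead of silently changing how long the loop runs overall.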

This sort of function is not ideal. I believe other Cinder drivers are able to put volumes requested for deletion on a "deallocation queue", so that a separate driver thread or backend process handles the volumes on the queue: mark them deleted, wait for the copy session(s) to complete, break the linkage, deallocate, and deprovision. Please check into the possibility of such a capability, so that the delete request can return more quickly and all of the cleanup is handled asynchronously.
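To make the suggestion concrete, here is a minimal sketch of such a deallocation queue, assuming a `cleanup_fn` that performs the wait/unlink/deallocate steps. The class name and callback are invented for illustration and are not part of any existing driver:

```python
import queue
import threading


class DeallocationQueue:
    """Hypothetical sketch: volumes flagged for deletion are queued and a
    background worker performs the slow cleanup (wait for copy sessions,
    break linkage, deallocate), so the delete call can return at once."""

    def __init__(self, cleanup_fn):
        self._q = queue.Queue()
        self._cleanup = cleanup_fn
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, volume_id):
        # Caller marks the volume deleted and returns immediately.
        self._q.put(volume_id)

    def _run(self):
        while True:
            volume_id = self._q.get()
            self._cleanup(volume_id)  # slow snapVX unlink + deallocate
            self._q.task_done()

    def drain(self):
        # Block until all queued volumes have been cleaned up.
        self._q.join()
```

A real implementation would also need persistence across driver restarts and error handling per volume, but the shape shows how the API-facing delete path decouples from the slow unlink polling.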

Helen Walsh (walshh2)
Changed in cinder:
status: New → Fix Released