VMAX: _unlink_volume failed after 30 tries
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Cinder | Fix Released | Undecided | Unassigned |
Bug Description
Discussed this some with the driver team. When deletion of a VMAX volume is requested and the volume has one or more SnapVX sessions linking it to a source volume, those sessions must be cleaned up before the volume can be deallocated.
The REST driver attempts to remove copy sessions sequentially with a looping call thread that polls for completion of each one. This does not appear to work very well at scale.
1. The retry and interval values are hardcoded.
UNLINK_INTERVAL = 15
UNLINK_RETRIES = 30
With fast REST responses, this results in failure after at most 450 seconds (30 retries × 15 s) if the copy session has not completed. Other requests are governed by configurable retry and interval options; the unlink flow should either reuse those options or expose its own configurable interval and retries.
A fixed 450 seconds is not adequate in some environments, especially on hybrid systems: for example, if a server creation fails later in the flow due to a networking problem, the volumes must be detached and deleted immediately during rollback, and volume deletion fails in this case.
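A minimal sketch of the polling pattern described in item 1, assuming hypothetical names: the hardcoded `UNLINK_INTERVAL`/`UNLINK_RETRIES` constants become parameters that could be wired to configurable options (`wait_for_unlink` and its signature are illustrative, not the actual driver API):

```python
import time

# Defaults mirror the hardcoded values reported above (assumptions
# about how configurable options might replace them).
DEFAULT_UNLINK_INTERVAL = 15   # seconds between polls
DEFAULT_UNLINK_RETRIES = 30    # attempts before giving up


def wait_for_unlink(check_unlinked,
                    interval=DEFAULT_UNLINK_INTERVAL,
                    retries=DEFAULT_UNLINK_RETRIES):
    """Poll check_unlinked() until it returns True or retries run out.

    Returns the number of attempts used; raises TimeoutError after
    `retries` attempts, i.e. roughly interval * retries seconds.
    """
    for attempt in range(1, retries + 1):
        if check_unlinked():
            return attempt
        time.sleep(interval)
    raise TimeoutError("unlink did not complete after %d tries (%d s)"
                       % (retries, retries * interval))
```

With the defaults this gives up after 30 × 15 = 450 seconds, matching the failure described above; exposing the interval and retry count as options would let operators tune that budget per environment.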
2. Because of a REST capability limitation, the looping call thread must modify the temporary snapshot in order to check whether the copy session has completed and to remove the link. Even at small scale, that PUT call can outlast the 15-second interval between polls by a large margin. See the pasted log snippet below:
2018-10-25 13:51:53.762 106815 DEBUG cinder.
2018-10-25 13:51:53.763 106815 WARNING oslo.service.
2018-10-25 13:51:54.340 106815 DEBUG cinder.
2018-10-25 13:51:54.340 106815 WARNING oslo.service.
2018-10-25 13:52:28.065 106815 DEBUG cinder.
2018-10-25 13:52:28.066 106815 WARNING oslo.service.
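A rough sketch of why those warnings appear, assuming the fixed-interval polling pattern described in item 2 (this is an illustration, not the actual oslo.service implementation): when one poll call takes longer than the interval, the next run starts late and the overrun can only be logged, not avoided.

```python
import time


def fixed_interval_loop(task, interval, max_runs):
    """Run task() every `interval` seconds until it returns True.

    Returns a list of overruns: for each run whose duration exceeded
    the interval, the amount of time by which it outlasted it
    (analogous to the oslo.service warnings in the log snippet).
    """
    overruns = []
    for _ in range(max_runs):
        start = time.monotonic()
        if task():  # e.g. the PUT that modifies the temporary snapshot
            return overruns
        elapsed = time.monotonic() - start
        if elapsed > interval:
            overruns.append(elapsed - interval)   # task outlasted interval
        else:
            time.sleep(interval - elapsed)
    return overruns
```

If the PUT regularly takes longer than the interval, every cycle overruns and the effective wall-clock budget stretches well past interval × retries.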
This polling approach is not ideal. I believe other Cinder drivers are able to place volumes requested for deletion on a "deallocation queue", so that a separate driver thread or backend process handles the queued volumes: mark deleted, wait for copy session(s) to complete, break the linkage, deallocate, and deprovision. Please check into the possibility of such a capability, so that the delete request can return more quickly and all cleanup is handled asynchronously.
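The suggested deallocation queue can be sketched as follows, under the assumption of a single worker thread draining a FIFO queue (class and method names are hypothetical, not from any Cinder driver):

```python
import queue
import threading


class DeallocationQueue:
    """Sketch: delete() returns immediately; a worker thread performs
    the slow cleanup (wait for copy sessions, unlink, deallocate)."""

    def __init__(self, cleanup):
        self._q = queue.Queue()
        self._cleanup = cleanup                 # slow per-volume callable
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def delete(self, volume):
        """Enqueue the volume and return quickly to the caller."""
        self._q.put(volume)

    def _run(self):
        while True:
            volume = self._q.get()
            if volume is None:                  # sentinel: stop the worker
                break
            self._cleanup(volume)               # unlink, deallocate, etc.

    def stop(self):
        self._q.put(None)
        self._worker.join()
```

The delete request path only pays the cost of an enqueue; the retry budget and copy-session waits move entirely into the background worker.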
Changed in cinder:
status: New → Fix Released