long flashcopy operations in the storwize_svc driver will block in _delete_vdisk()

Bug #1203152 reported by Jay Bryant on 2013-07-19
This bug affects 2 people
Affects: Cinder
Importance: High
Assigned to: Avishay Traeger

Bug Description

There is a loop inside the _delete_vdisk() function in cinder/volume/drivers/storwize_svc.py that
waits for flashcopy to finish before the vdisk can be deleted. If a cinder volume that was created
from a snapshot or from another volume is deleted before the flashcopy finishes, the volume
service process loops and waits for the flashcopy to complete. While the code is blocked
in _delete_vdisk(), the volume service does not respond to REST API requests
or update its status, so the service is marked offline.

I am waiting for the person who found this bug to test a change that puts the while loop into an
inline function that I then run with FixedIntervalLoopingCall.

I hope to have a patch to post here later today once we have been able to test the code I wrote.
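
For readers following along, the approach described above (move the polling loop into an inner function and hand it to a fixed-interval timer) can be sketched roughly as follows. This is not the driver code. FixedIntervalPoller is a hypothetical stdlib stand-in written for this note so the example runs without OpenStack installed; it only mirrors the start(interval)/wait() shape and the LoopingCallDone convention of the OpenStack looping call helper. The copy_finished callable is likewise an assumption standing in for the FlashCopy status check.

```python
import threading
import time


class LoopingCallDone(Exception):
    """Raised by the polled function to stop the loop (name mirrors OpenStack's)."""

    def __init__(self, retvalue=True):
        super().__init__()
        self.retvalue = retvalue


class FixedIntervalPoller:
    """Hypothetical stand-in: calls func every `interval` seconds in a worker thread."""

    def __init__(self, func):
        self._func = func
        self._done = threading.Event()
        self._result = None
        self._error = None

    def start(self, interval):
        def _run():
            try:
                while True:
                    self._func()
                    time.sleep(interval)
            except LoopingCallDone as done:
                self._result = done.retvalue
            except Exception as exc:  # surface unexpected errors to wait()
                self._error = exc
            finally:
                self._done.set()

        threading.Thread(target=_run, daemon=True).start()
        return self

    def wait(self, timeout=None):
        """Block the caller until the loop stops, then return its value."""
        if not self._done.wait(timeout):
            raise TimeoutError("poll did not finish in time")
        if self._error is not None:
            raise self._error
        return self._result


def wait_for_flashcopy(copy_finished, interval=0.01):
    """Poll copy_finished() in the background until it returns True."""

    def _check():
        if copy_finished():
            raise LoopingCallDone(retvalue=True)

    return FixedIntervalPoller(_check).start(interval)
```

The point of the pattern is that the status poll no longer runs as a tight loop in the caller; other work (such as a service heartbeat) can proceed between polls, and the caller decides when to wait() for the result.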

Jay Bryant (jsbryant) on 2013-07-19
Changed in cinder:
assignee: nobody → Jay Bryant (jsbryant)
Kun Huang (academicgareth) wrote :

# Timeout after 5 seconds
@timeout(5)
def long_running_function2():
    ...

Could this be helpful for self._get_flashcopy_mapping_attributes()?
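
The snippet above does not say where timeout comes from. One possible way to build such a decorator with only the standard library is sketched below, as an assumption rather than what the commenter had in mind. It runs the call in a worker thread and stops waiting after the given number of seconds. Note the limitation: the underlying call is not cancelled, it keeps running in the background. The body of long_running_function2 here is a placeholder for a slow storage call.

```python
import functools
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timeout(seconds):
    """Decorator: raise FutureTimeout if the call takes longer than `seconds`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            pool = ThreadPoolExecutor(max_workers=1)
            try:
                future = pool.submit(func, *args, **kwargs)
                # Stop waiting after `seconds`; the worker itself keeps running.
                return future.result(timeout=seconds)
            finally:
                pool.shutdown(wait=False)

        return wrapper

    return decorator


@timeout(0.05)
def long_running_function2():
    time.sleep(0.5)  # placeholder for a slow storage call
    return "finished"


@timeout(1)
def quick_function():
    return "ok"
```

A caller would catch FutureTimeout around long_running_function2() and decide whether to retry or report an error.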

tags: added: drivers storwize-svc
Changed in cinder:
status: New → Confirmed
importance: Undecided → High
Jay Bryant (jsbryant) wrote :

Moving this to 'Invalid'. I am not able to recreate the problem reported by the user and they also are unable to recreate the problem.

I am able to get one thread of execution in the _ensure_vdisk_no_fc_mappings function. It will sit there waiting for the vdisk to be in a state where it can be deleted. I can start a second delete request and it will get to the _ensure_vdisk_no_fc_mappings function and also wait.

So, given that, if the problem does still exist I don't think that the problem could be at this point in the code. We can always reopen the bug if the problem reappears.

Changed in cinder:
status: Confirmed → Invalid
Alan Jiang (ajiang) wrote :

Jay

I think the user who reported the problem already has my internal fix. The problem has to be recreated while
a long-running flashcopy clone is in progress, especially when there are multiple flashcopies from the same source vdisk.

Alan

Jay Bryant (jsbryant) wrote :

I just had a chat with Alan. I was not aware that they were still able to recreate this problem fairly easily. The person I had been working with wasn't able to recreate. Perhaps, as he noted, they were already running with the fix.

Alan is going to push up a fix based on the debug code I asked him to try.

I am reopening this bug so that it can be used to check the code in against.

Changed in cinder:
status: Invalid → Confirmed
tags: added: grizzly-backport-potential

Fix proposed to branch: master
Review: https://review.openstack.org/49647

Changed in cinder:
assignee: Jay Bryant (jsbryant) → Alan Jiang (ajiang)
status: Confirmed → In Progress
tags: added: havana-rc-potential
Changed in cinder:
assignee: Alan Jiang (ajiang) → Avishay Traeger (avishay-il)

Reviewed: https://review.openstack.org/49647
Committed: http://github.com/openstack/cinder/commit/7aa4f65a8c17aa037deff0f5b534ed694c17e62a
Submitter: Jenkins
Branch: master

commit 7aa4f65a8c17aa037deff0f5b534ed694c17e62a
Author: Alan Jiang <email address hidden>
Date: Thu Oct 3 17:03:09 2013 -0500

    long flashcopy operation may block volume service

    Storwize family uses flashcopy for snapshot or volume clone. The
    volume delete has to wait until flashcopy finishes or errors out.
    The _delete_vdisk() will poll volume FlashCopy status in a loop.
    This may block the volume service heartbeat since it is in the same
    thread. The solution is to use openstack FixedIntervalLoopingCall
    to run the FlashCopy status poll in a timer thread.

    The cinder volume manager will resume the delete operation for those
    volumes that are in the deleting state during volume service startup.
    Since Storwize volume delete may wait for a long time, this can cause
    volume service to have long delay before it becomes available.
    A greenpool is used to offload those volume delete operations.

    Change-Id: Ie01a441a327e1e318fa8da0040ae130731b7a686
    Closes-Bug: #1203152
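
The second paragraph of the commit message describes offloading the resumed deletes to a pool so that service startup does not wait on them. A rough illustration of that idea follows. It is not the committed code: the commit uses a greenpool, while this sketch substitutes the standard library's ThreadPoolExecutor so it runs anywhere, and resume_pending_deletes, the volume dictionaries, and delete_volume are hypothetical names for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def resume_pending_deletes(volumes, delete_volume, pool):
    """Queue a delete for every volume left in 'deleting' and return at once.

    Returns the list of futures so the caller can inspect results later;
    startup itself does not wait for any delete to finish.
    """
    return [
        pool.submit(delete_volume, vol["id"])
        for vol in volumes
        if vol["status"] == "deleting"
    ]
```

With this shape, a delete that waits a long time on FlashCopy only occupies a pool worker, and the service can report itself available as soon as the deletes are queued.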

Changed in cinder:
status: In Progress → Fix Committed
Changed in cinder:
milestone: none → havana-rc2

Reviewed: https://review.openstack.org/50984
Committed: http://github.com/openstack/cinder/commit/8a2a3d691fa54c07d14b3e32558641f43b69c040
Submitter: Jenkins
Branch: milestone-proposed

commit 8a2a3d691fa54c07d14b3e32558641f43b69c040
Author: Alan Jiang <email address hidden>
Date: Thu Oct 3 17:03:09 2013 -0500

    long flashcopy operation may block volume service

    Storwize family uses flashcopy for snapshot or volume clone. The
    volume delete has to wait until flashcopy finishes or errors out.
    The _delete_vdisk() will poll volume FlashCopy status in a loop.
    This may block the volume service heartbeat since it is in the same
    thread. The solution is to use openstack FixedIntervalLoopingCall
    to run the FlashCopy status poll in a timer thread.

    The cinder volume manager will resume the delete operation for those
    volumes that are in the deleting state during volume service startup.
    Since Storwize volume delete may wait for a long time, this can cause
    volume service to have long delay before it becomes available.
    A greenpool is used to offload those volume delete operations.

    Change-Id: Ie01a441a327e1e318fa8da0040ae130731b7a686
    Closes-Bug: #1203152
    (cherry picked from commit 7aa4f65a8c17aa037deff0f5b534ed694c17e62a)

Changed in cinder:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-10-17
Changed in cinder:
milestone: havana-rc2 → 2013.2
Alan Pevec (apevec) on 2014-03-31
tags: removed: grizzly-backport-potential
tags: removed: havana-rc-potential