Race condition when deleting iscsi devices

Bug #1297635 reported by Sam Morrison
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Suppose you have two instances on the same compute node, each with a volume attached (using the iSCSI backend).

If you delete both of them, triggering a disconnect volume for each, the following happens:

The first request deletes the device:
echo 1 > /sys/block/sdr/device/delete

The second request triggers an iscsi_rescan which then rediscovers the device.

The volume is then deleted from the Cinder backend.

Now you have a device which is pointing back to a deleted volume.

This is using a NetApp device where all the devices are in the same IQN, with multipath, on stable/havana.
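
To make the interleaving above easier to follow, here is a toy model of it. It is only a sketch and touches no real devices: the `Backend` and `Host` classes, the LUN names, and the device naming are all hypothetical, and the "ordered" variant is just one ordering that avoids the stale device, not a description of what nova does.

```python
# Toy model of the reported sequence. All names here are hypothetical.

class Backend:
    """Stands in for the storage backend: the set of LUNs it still exports."""

    def __init__(self, luns):
        self.exported = set(luns)

    def delete_volume(self, lun):
        self.exported.discard(lun)


class Host:
    """Stands in for the compute node: block devices keyed by LUN."""

    def __init__(self, backend):
        self.backend = backend
        self.devices = {}

    def rescan(self):
        # A rescan (re)creates a device for every LUN still exported.
        for lun in self.backend.exported:
            self.devices.setdefault(lun, f"dev-for-{lun}")

    def delete_device(self, lun):
        # Models writing 1 to /sys/block/<dev>/device/delete.
        self.devices.pop(lun, None)

    def stale_devices(self):
        # Devices whose LUN no longer exists on the backend.
        return sorted(l for l in self.devices if l not in self.backend.exported)


def racy_teardown():
    backend = Backend({"lun-a", "lun-b"})
    host = Host(backend)
    host.rescan()                    # both volumes attached
    host.delete_device("lun-a")      # request 1 deletes its device
    host.rescan()                    # request 2 rescans; lun-a comes back
    host.delete_device("lun-b")      # request 2 deletes its own device
    backend.delete_volume("lun-a")   # backend deletion happens only now
    backend.delete_volume("lun-b")
    return host.stale_devices()


def ordered_teardown():
    backend = Backend({"lun-a", "lun-b"})
    host = Host(backend)
    host.rescan()
    host.delete_device("lun-a")
    backend.delete_volume("lun-a")   # request 1 fully finishes first
    host.rescan()                    # rescan no longer sees lun-a
    host.delete_device("lun-b")
    backend.delete_volume("lun-b")
    return host.stale_devices()
```

In this model `racy_teardown()` leaves a device for `lun-a` behind, while `ordered_teardown()` leaves none.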

Tracy Jones (tjones-i)
tags: added: volumes
Nikola Đipanov (ndipanov) wrote :

If I understand correctly, the deletion happens only if the volume is set to "delete_on_termination". Otherwise - yes, this seems like something we want to serialize in the libvirt iSCSI volume driver.

Changed in nova:
importance: Undecided → High
importance: High → Medium
Nikola Đipanov (ndipanov) wrote :

Hmmm - I haven't tried to reproduce yet - so will leave the bug on "New" for now - but just by looking at the code, I can't figure out where the rescan happens.

Sam Morrison (sorrison) wrote :

The rescan happens when the next volume is deleted. It happens too fast: the first volume hasn't been deleted by cinder yet, so the target is still discoverable.

Sam Morrison (sorrison) wrote :

OK, I've just worked out that this is only a problem when using multipath.

Nikola Đipanov (ndipanov) wrote :

Yep - after looking at the code - it does seem that there is a race when using multipath. A likely fix is to make an instance-wide mutex on libvirt volume detach.
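
As a rough illustration of the mutex idea, the sketch below serializes connect and disconnect calls behind a single process-wide lock, so one request's rescan cannot run in the middle of another request's device removal. It is an assumption-laden sketch, not nova's code: `FakeISCSIDriver`, `serialized`, and `_volume_lock` are hypothetical names, and a plain `threading.Lock` stands in for whatever locking primitive the real driver would use.

```python
import functools
import threading

# Hypothetical host-wide lock shared by all volume connect/disconnect calls.
_volume_lock = threading.Lock()


def serialized(func):
    """Run the wrapped method while holding the shared volume lock."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _volume_lock:
            return func(*args, **kwargs)

    return wrapper


class FakeISCSIDriver:
    """Stand-in for a volume driver; it only records the order of steps."""

    def __init__(self):
        self.log = []

    @serialized
    def connect_volume(self, name):
        self.log.append(("connect-start", name))
        # A real driver would log in to the target and rescan here.
        self.log.append(("connect-end", name))

    @serialized
    def disconnect_volume(self, name):
        self.log.append(("disconnect-start", name))
        # A real driver would rescan and remove the device here.
        self.log.append(("disconnect-end", name))
```

With the lock in place, the start and end entries of each call are always adjacent in `log`, even when several threads call `disconnect_volume` at once. Note that this only orders the host-side steps; it does not by itself wait for the backend to delete the volume.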

Changed in nova:
status: New → Triaged
Ihor Kaharlichenko (madkinder) wrote :

The same problem happens with fibre channel connected devices that use multipath.

Sam Morrison (sorrison) wrote :

Great to know others have this issue! This is a serious issue for us, as it's causing volumes to get into really bad states, and the only way to fix it is to reboot the compute node.

tags: added: multipath
Sean Dague (sdague)
Changed in nova:
status: Triaged → Confirmed
Matt Riedemann (mriedem) wrote :

Is this still an issue in Liberty? Otherwise see comment 6 in bug 1492026 - in Mitaka I'd like to add some event callback code to the libvirt driver so that we can make the volume device attach/detach synchronous before we call off to cinder/os-brick to do the iSCSI connect/disconnect volume work.
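
One way to picture the "synchronous detach" idea is a helper that blocks until a device-removed event arrives before the disconnect work starts. The sketch below is only a guess at the shape of such code: `DetachWaiter`, `detach_then_disconnect`, and the callback names are hypothetical and are not libvirt, nova, or os-brick APIs.

```python
import threading


class DetachWaiter:
    """Hypothetical helper that tracks 'device removed' events per device."""

    def __init__(self):
        self._events = {}
        self._lock = threading.Lock()

    def _event_for(self, device):
        with self._lock:
            return self._events.setdefault(device, threading.Event())

    def on_device_removed(self, device):
        # Would be invoked from the hypervisor's event callback.
        self._event_for(device).set()

    def wait_removed(self, device, timeout=5.0):
        # Returns True if the event arrived, False on timeout.
        return self._event_for(device).wait(timeout)


def detach_then_disconnect(waiter, device, request_detach,
                           disconnect_volume, timeout=5.0):
    """Ask for a detach, wait for confirmation, then disconnect the volume."""
    request_detach(device)
    if not waiter.wait_removed(device, timeout):
        raise TimeoutError(f"device {device} was not removed in time")
    disconnect_volume(device)
```

Here `request_detach` and `disconnect_volume` are caller-supplied callables, so the ordering logic can be exercised without a hypervisor.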

Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still-supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: Medium → Undecided
status: Confirmed → Expired