os-brick: flush of a multipath device fails if queuing is enabled on it

Bug #1592520 reported by Raunak Kumar on 2016-06-14
Affects    Importance  Assigned to
Cinder     Undecided   Unassigned
os-brick   Undecided   Raunak Kumar

Bug Description

When queue_if_no_path is enabled on a multipath device, flush_multipath_device fails for Fibre Channel devices.
flush_multipath_device is called after:
1. Flushing the I/Os on each individual SCSI path
2. Deleting the SCSI paths

When the flush is then issued against the multipath device, the device no longer has any paths underneath it, so on some vendors' arrays it queues I/O indefinitely and the flush never returns.

Proposed Solution:
Disable queuing on multipath device before flushing it.
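A minimal sketch of the proposed order of operations, assuming a helper name and an injectable command runner that are not part of os-brick (the real code lives in os_brick/initiator/linuxscsi.py and uses its own rootwrap-based execute helper):

```python
import subprocess


def flush_multipath_device(map_name, run=subprocess.run):
    """Disable queuing on a multipath map, then flush it.

    ``map_name`` and the ``run`` hook are illustrative assumptions,
    not os-brick's actual API.
    """
    # Turn off queue_if_no_path so a flush issued when no paths remain
    # fails fast instead of queueing I/O forever.
    run(["dmsetup", "message", map_name, "0", "fail_if_no_path"], check=True)
    # With queuing disabled, the flush can return an error rather than hang.
    run(["multipath", "-f", map_name], check=True)
```

The injectable `run` hook is only there so the ordering can be exercised without real device-mapper maps; the essential point is that the `dmsetup message` must precede the `multipath -f`.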

Raunak Kumar (rkumar-b) on 2016-06-14
Changed in os-brick:
assignee: nobody → Raunak Kumar (rkumar-b)
status: New → In Progress
Rodrigo Freire (rbs-j) wrote :

Ok, I have found a reproducer for this issue.

0. Set up an LVM volume on the system. In this case it is 4x1 TB PVs (LUNs 3EB, 3EA, 3E9 and 3E8), combined into a single 4 TB LV.

1. Present three 20 GB volumes (LUNs 038, 037 and 034) to a tenant. From the compute node's perspective:
   [root@compute-0 ~]# multipath -ll
   (<redacted>00000038) dm-6 HITACHI ,DF600F
   size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
   |-+- policy='round-robin 0' prio=1 status=active
   | |- 1:0:1:3 sdac 65:192 active ready running
   | `- 1:0:0:3 sdu 65:64 active ready running
   `-+- policy='round-robin 0' prio=0 status=enabled
     |- 4:0:1:3 sdm 8:192 active ready running
     `- 4:0:0:3 sde 8:64 active ready running
   (<redacted>000003eb) dm-0 HITACHI ,DF600F
   size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
   |-+- policy='round-robin 0' prio=1 status=active
   | |- 4:0:0:203 sdi 8:128 active ready running
   | `- 4:0:1:203 sdq 65:0 active ready running
   `-+- policy='round-robin 0' prio=0 status=enabled
     |- 1:0:1:203 sdag 66:0 active ready running
     `- 1:0:0:203 sdy 65:128 active ready running
   (<redacted>000003ea) dm-2 HITACHI ,DF600F
   size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
   |-+- policy='round-robin 0' prio=1 status=active
   | |- 1:0:0:202 sdx 65:112 active ready running
   | `- 1:0:1:202 sdaf 65:240 active ready running
   `-+- policy='round-robin 0' prio=0 status=enabled
     |- 4:0:0:202 sdh 8:112 active ready running
     `- 4:0:1:202 sdp 8:240 active ready running
   (<redacted>000003e9) dm-4 HITACHI ,DF600F
   size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
   |-+- policy='round-robin 0' prio=1 status=active
   | |- 4:0:1:201 sdo 8:224 active ready running
   | `- 4:0:0:201 sdg 8:96 active ready running
   `-+- policy='round-robin 0' prio=0 status=enabled
     |- 1:0:1:201 sdae 65:224 active ready running
     `- 1:0:0:201 sdw 65:96 active ready running
   (<redacted>00000031) dm-5 HITACHI ,DF600F
   size=270G features='1 queue_if_no_path' hwhandler='0' wp=rw
   |-+- policy='round-robin 0' prio=1 status=active
   | |- 4:0:0:0 sdb 8:16 active ready running
   | `- 4:0:1:0 sdj 8:144 active ready running
   `-+- policy='round-robin 0' prio=0 status=enabled
     |- 1:0:1:0 sdz 65:144 active ready running
     `- 1:0:0:0 sdr 65:16 active ready running
   (<redacted>000003e8) dm-1 HITACHI ,DF600F
   size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
   |-+- policy='round-robin 0' prio=1 status=active
   | |- 1:0:1:200 sdad 65:208 active ready running
   | `- 1:0:0:200 sdv 65:80 active ready running
   `-+- policy='round-robin 0' prio=0 status=enabled
     |- 4:0:0:200 sdf 8:80 active ready running
     `- 4:0:1:200 sdn 8:208 active ready running
   (<redacted>00000034) dm-7 HITACHI ,DF600F
   size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
   |-+- policy='round-robin 0' prio=1 status=active
   | |- 1:0:1:1 sdaa 65:160 active ready running
   | `- 1:0:0:1 sds 65:32 active ready running
   `-+- policy='round-robin 0' prio=0 status=enabled
     |- 4:0:1:1 sdk 8:160 active ready running
     `- 4:0:0:1 ...
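Every map in the output above advertises features='1 queue_if_no_path', which is what triggers the hang. As an illustration only (not os-brick code), spotting affected maps in `multipath -ll` output can be sketched as:

```python
def maps_with_queuing(multipath_ll_output):
    """Return dm names whose features line includes queue_if_no_path.

    Relies on the two-line header each map prints in `multipath -ll`
    output: a "(wwid) dm-N vendor" line followed by a features line.
    Purely illustrative parsing, not taken from os-brick.
    """
    queued = []
    current = None
    for line in multipath_ll_output.splitlines():
        if " dm-" in line:
            # e.g. "(<wwid>) dm-6 HITACHI ,DF600F" -> remember "dm-6"
            current = line.split()[1]
        elif "queue_if_no_path" in line and current:
            queued.append(current)
            current = None
    return queued
```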


Rodrigo Freire (rbs-j) wrote :

And I have found the solution too.

When detaching the LUNs, multipath -f <WWID> can be racy, leaving these lingering devices.
The proposed solution involves:

1. Disable lvmetad on the compute node:
   * Edit /etc/lvm/lvm.conf and change:
     use_lvmetad = 1
   to
     use_lvmetad = 0

2. Insert a sleep just after multipath -f <WWID> in os_brick/initiator/linuxscsi.py.
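The second step can be sketched as follows; the helper name, the delay value, and the injectable hooks are assumptions for illustration, not the code actually proposed in the Gerrit review:

```python
import subprocess
import time


def flush_with_settle(wwid, run=subprocess.run, sleep=time.sleep):
    """Flush a multipath map, then pause to let device-mapper settle.

    Illustrative only: the 2-second delay and the run/sleep hooks are
    assumptions, not the change proposed in the actual review.
    """
    run(["multipath", "-f", wwid], check=True)
    # Give udev/device-mapper time to tear the map down before the
    # caller proceeds to remove the underlying SCSI paths, which is
    # where the race described above bites.
    sleep(2)
```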

Raunak Kumar (rkumar-b) wrote :

Hi Rodrigo

Can you paste the content of your /etc/multipath.conf ?

Rodrigo Freire (rbs-j) wrote :

Hello, Raunak!

Sure! Please find it attached.

For what it's worth, this problem only happens when detaching two or more volumes; I could not reproduce it when detaching a single volume.

Rodrigo Freire (rbs-j) wrote :

Raunak,

For what it's worth, the proposed patch is now being tested in Gerrit.

Patchset at https://review.openstack.org/#/c/331375/1/os_brick/initiator/linuxscsi.py

HTH
- RF.

Raunak Kumar (rkumar-b) wrote :

My issue concerns disabling the multipath queue before doing the flush.
For example, if the mpath device mpatha has queuing enabled, you need to run the following first:

dmsetup message mpatha 0 fail_if_no_path
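Whether the message took effect can be confirmed from the map's `dmsetup table` line: the features field should no longer contain queue_if_no_path. A minimal check, assuming the standard multipath table line format (illustrative, not os-brick code):

```python
def queuing_enabled(dmsetup_table_line):
    """Return True if a multipath table line still has queue_if_no_path.

    Expects one line of `dmsetup table <map>` output; a substring check
    suffices because the feature name only appears when it is enabled.
    """
    return "queue_if_no_path" in dmsetup_table_line
```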

Gorka Eguileor (gorka) wrote :

I believe this could be connected with this other bug: https://bugs.launchpad.net/os-brick/+bug/1502979

Raunak, if you have a way to reproduce the bug I can revive and update that patch so you can test it and see if it fixes the bug.

Change abandoned by Sean McGinnis (<email address hidden>) on branch: master
Review: https://review.openstack.org/331375
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Change abandoned by Sean McGinnis (<email address hidden>) on branch: master
Review: https://review.openstack.org/342421
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
