[iSCSI Multipath] Thousands of "multipath -ll <mp-id>" calls are executed during volume detachment when multiple LUNs are exposed on the same target

Bug #1454978 reported by Tina Tang
This bug affects 5 people
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Undecided
Assigned to: Tina Tang
Milestone: (none listed)

Bug Description

iSCSI multipath has a performance issue on volume detachment when multiple LUNs are exposed via a single target (IQN).

1. We are using VNX as the Cinder backend. VNX exposes multiple LUNs via a single IQN, and each LUN is exposed via several IQNs for multipathing. Nova uses the libvirt driver, and the virt_type is kvm.

2. After attaching 100 volumes to VMs and then detaching them in batch, we noticed that thousands of "multipath -ll <mp-id>" calls are executed per volume detachment. In our environment, one "multipath -ll <mp-id>" call takes about 0.2s, so the performance is poor.

3. Why are so many "multipath -ll <mp-id>" calls triggered?
To find all paths of a multipath device, the code goes through every device under /dev/disk/by-path that uses the same IQN and executes "multipath -ll" on each of them to get its multipath ID. If a device's multipath ID matches that of the volume being detached, the device is a path of that volume. When each IQN exposes only one LUN, this code has no performance problem. However, when multiple LUNs are exposed via a single IQN, the problem appears.

Assuming that we have n LUNs attached and each LUN has m IQNs for multipathing, there will be m*n devices under /dev/disk/by-path, sharing m IQNs. Then:
    -- Code lines 623-644 will trigger O(n*m) calls of "multipath -ll <mp-id>"
    -- Code lines 648-649 will trigger O((n*m)^2) calls of "multipath -ll <mp-id>"
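
To get a feel for how these two bounds grow, here is a small arithmetic sketch. It only evaluates the n*m and (n*m)^2 terms from the estimate above together with the reported 0.2s per call. The function name call_growth and the value m = 4 are assumptions chosen for illustration, not taken from the environment described in this report, and real call counts will be lower wherever the loops exit early.

```python
SECONDS_PER_CALL = 0.2  # per-call cost of "multipath -ll <mp-id>" reported above


def call_growth(n_luns, paths_per_lun):
    """Return the (linear, quadratic) growth terms from the estimate above.

    These are order-of-magnitude terms, not measured call counts.
    """
    entries = n_luns * paths_per_lun  # devices under /dev/disk/by-path
    return entries, entries ** 2


if __name__ == "__main__":
    # paths_per_lun = 4 is a hypothetical value, used only for illustration.
    for n_luns in (1, 10, 100):
        linear, quadratic = call_growth(n_luns, 4)
        print("n=%d m=4: up to ~%d + ~%d calls, ~%.0f s" % (
            n_luns, linear, quadratic,
            (linear + quadratic) * SECONDS_PER_CALL))
```

The quadratic term dominates as soon as more than a few LUNs share the same IQNs, which is why the slowdown only shows up on backends that expose many LUNs per target.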

nova/nova/virt/libvirt/volume.py
LibvirtISCSIVolumeDriver._disconnect_volume_multipath_iscsi

 618 out = self._run_iscsiadm_discover(iscsi_properties)
 619
 620 # Extract targets for the current multipath device.
 621 ips_iqns = []
 622 entries = self._get_iscsi_devices()
 623 for ip, iqn in self._get_target_portals_from_iscsiadm_output(out):
 624     ip_iqn = "%s-iscsi-%s" % (ip.split(",")[0], iqn)
 625     for entry in entries:
 626         entry_ip_iqn = entry.split("-lun-")[0]
 627         if entry_ip_iqn[:3] == "ip-":
 628             entry_ip_iqn = entry_ip_iqn[3:]
 629         elif entry_ip_iqn[:4] == "pci-":
 630             # Look at an offset of len('pci-0000:00:00.0')
 631             offset = entry_ip_iqn.find("ip-", 16, 21)
 632             entry_ip_iqn = entry_ip_iqn[(offset + 3):]
 633         if (ip_iqn != entry_ip_iqn):
 634             continue
 635         entry_real_path = os.path.realpath("/dev/disk/by-path/%s" %
 636                                            entry)
 637         entry_mpdev = self._get_multipath_device_name(entry_real_path)
 638         if entry_mpdev == multipath_device:
 639             ips_iqns.append([ip, iqn])
 640             break
 641
 642 if not devices:
 643     # disconnect if no other multipath devices
 644     self._disconnect_mpath(iscsi_properties, ips_iqns)
 645     return
 646
 647 # Get a target for all other multipath devices
 648 other_iqns = [self._get_multipath_iqn(device)
 649               for device in devices]

====================Code version =====================
stack@openstack-performance:~/tina/nova_iscsi_mp/nova$ git log -1
commit f4504f3575b35ec14390b4b678e441fcf953f47b
Merge: 3f21f60 5fbd852
Author: Jenkins <email address hidden>
Date: Tue May 12 22:46:43 2015 +0000

    Merge "Remove db layer hard-code permission checks for network_get_all_by_host"

Tina Tang (tina-tang)
description: updated
Tina Tang (tina-tang) wrote :

The code logic can be improved:

1. Improve the way to find all paths of a multipath device
   "multipath -ll <mp-id>" prints the device name of each path (sdd, sdf and sdh in the example below).
   # sudo multipath -ll 3600601602ba03400278103ca73f8e411
   3600601602ba03400278103ca73f8e411 dm-1 DGC,VRAID
   size=3.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
   |-+- policy='round-robin 0' prio=130 status=active
   | |- 44:0:0:23 sdd 8:48 active ready running
   | `- 45:0:0:23 sdf 8:80 active ready running
   `-+- policy='round-robin 0' prio=10 status=enabled
     `- 46:0:0:23 sdh 8:112 active ready running

   Go through each device under /dev/disk/by-path. If it links to one of the device names listed for the multipath device <mp-id>, it is a path of the volume. No additional "multipath -ll" call is needed.
   # ls -l /dev/disk/by-path
   total 0
   ip-192.168.3.50:3260-iscsi-<iqna>-lun-0 -> ../../sdg
   ip-192.168.3.50:3260-iscsi-<iqna>-lun-23 -> ../../sdh
   ip-192.168.3.51:3260-iscsi-<iqnb>-lun-0 -> ../../sdc
   ip-192.168.3.51:3260-iscsi-<iqnb>-lun-23 -> ../../sdd
   ip-192.168.4.51:3260-iscsi-<iqnc>-lun-0 -> ../../sde
   ip-192.168.4.51:3260-iscsi-<iqnc>-lun-23 -> ../../sdf

2. To check whether an IQN is used by other devices, we do not need to find the IQNs used by all the other devices. We can mark the IQN as used as soon as we find one device that uses it, and return immediately.
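
The two suggestions above could be sketched roughly as follows. This is not the nova code and not the patch proposed for this bug; all function names here are hypothetical, and the regular expression is an assumption based only on the "multipath -ll" output format shown in the example above. The idea is to run "multipath -ll <mp-id>" once, collect the path device names from its output, and then answer both questions from the /dev/disk/by-path symlinks alone.

```python
import os
import re

# A path line looks like "| |- 44:0:0:23 sdd 8:48 active ready running".
# Assumption: H:C:T:L address, device name, then major:minor.
_PATH_LINE = re.compile(r"\d+:\d+:\d+:\d+\s+(\S+)\s+\d+:\d+\s")


def parse_path_devices(multipath_ll_output):
    """Return the device names (e.g. {'sdd', 'sdf'}) listed as paths."""
    return {m.group(1) for m in _PATH_LINE.finditer(multipath_ll_output)}


def read_by_path_links(directory="/dev/disk/by-path"):
    """Map each by-path entry name to the device its symlink resolves to."""
    return {entry: os.path.realpath(os.path.join(directory, entry))
            for entry in os.listdir(directory)}


def find_volume_paths(by_path_links, path_devices):
    """Suggestion 1: pick entries whose link target is one of the paths."""
    return sorted(entry for entry, target in by_path_links.items()
                  if os.path.basename(target) in path_devices)


def iqn_of(entry):
    """Extract the IQN from 'ip-<portal>-iscsi-<iqn>-lun-<n>', else None."""
    _, sep, rest = entry.partition("-iscsi-")
    return rest.rsplit("-lun-", 1)[0] if sep else None


def iqn_used_elsewhere(iqn, by_path_links, volume_paths):
    """Suggestion 2: stop at the first other device that uses the IQN."""
    own = set(volume_paths)
    # any() short-circuits, so the scan ends at the first match.
    return any(iqn_of(entry) == iqn
               for entry in by_path_links if entry not in own)
```

With the example listing above, find_volume_paths would select the three lun-23 entries, and iqn_used_elsewhere would report each of the three IQNs as still in use because of the lun-0 entries. Neither step needs another "multipath -ll" call.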

Tina Tang (tina-tang)
Changed in nova:
assignee: nobody → Tina Tang (tina-tang)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184005

Changed in nova:
status: New → In Progress
Tina Tang (tina-tang)
description: updated
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/184005
Reason: Sounds like we're going with os-brick here.

Matt Riedemann (mriedem) wrote :

Can this be retried against Liberty or Mitaka nova, which use the os-brick library? os-brick has fixes for multipath issues that nova's own code did not have.

Changed in nova:
status: In Progress → Incomplete
Lee Yarwood (lyarwood) wrote :

Moving to Invalid, as this should no longer reproduce against Liberty or Mitaka after the move to os-brick. Please reopen and reassign to os-brick if this issue persists.

Changed in nova:
status: Incomplete → Invalid
Preston L. Bannister (preston-bannister) wrote :

While I do not know the accepted conventions for OpenStack bugs, calling this "invalid" seems wrong.

This was a severe issue up until Liberty. You are right in that the problem does not exist in Liberty (and later).

I believe there was an exceptionally large patch in the last update to Kilo, specifically to address a small zoo of bugs in this area. Somehow it seems this bug report was missed. While I have not yet verified, I suspect this bug was fixed in the last Kilo update.
