Nova falling back to single path when multipath device already exists

Bug #1742682 reported by Damian Cikowski
Affects: os-brick
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Our environment is Ubuntu 16.04 running OpenStack 16.0.5.

Nova is configured to use multipath, and we're trying to start a previously stopped instance that has volumes attached.

Before starting the instance, things on the compute node look as follows:
* iSCSI session is still up from the previous run of the instance
* Device Mapper dm-X multipath device remains in /dev/
* multipath -ll shows that all paths for dm-X are active and running
* There are links in /dev/disk/by-id (scsi-* and wwn-*) that point to /dev/dm-X
  * However, no links by WWN to the underlying block devices (sd*) exist in /dev/disk/by-id (because of udev rules, see below)

While starting the instance, nova performs iSCSI discovery and obtains a list of 4 (ip, iqn, lun) tuples that it should use for constructing a multipath device.
4 iSCSI sessions are started, and the LUNs appear as block devices in Linux - let's call them /dev/sda, /dev/sdb, /dev/sdc and /dev/sdd.
Then, apparently, os-brick tries to find the WWN of any of our devices by globbing for "/dev/disk/by-id/scsi-*", iterating over all found symlinks, and seeing if any of them leads to any of our 4 devices.
The above fails (yields no result), because no links from /dev/disk/by-id point to our devices anymore - udev replaces those with links to /dev/dm-* devices because of udev rules.
The lack of a WWN causes os-brick's multipath logic to be skipped completely (the multipath branch is only entered if at least one device's WWN can be found).
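The failure mode above can be simulated in a few lines. This is a deliberately simplified sketch, not os-brick's actual code: the function name and the dict standing in for /dev/disk/by-id are illustrative assumptions; the point is only that a lookup which relies on scsi-* symlinks resolving to sd* devices returns nothing once udev has repointed those links at the dm device.

```python
def find_wwn_by_id_links(by_id_links, candidate_devices):
    """Hypothetical simplification of the WWN lookup: scan scsi-* links
    under /dev/disk/by-id (modeled as a name -> target dict) and return
    the WWN encoded in the first link that resolves to one of the newly
    logged-in block devices."""
    for name, target in by_id_links.items():
        if not name.startswith('scsi-'):
            continue
        if target in candidate_devices:
            # The link name carries the WWN, e.g. scsi-3600a0980...
            return name[len('scsi-'):]
    return None

# First attach: udev links still point at the raw sd* devices, so the
# lookup succeeds.
fresh = {'scsi-3600a098038303742': '/dev/sda'}
assert find_wwn_by_id_links(fresh, {'/dev/sda', '/dev/sdb'}) == '3600a098038303742'

# Re-attach: udev has repointed the link at the surviving dm device, so
# nothing resolves to sda..sdd and the lookup yields None - exactly the
# condition that makes os-brick skip its multipath branch.
stale = {'scsi-3600a098038303742': '/dev/dm-0'}
assert find_wwn_by_id_links(stale, {'/dev/sda', '/dev/sdb'}) is None
```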

In nova-compute.log, this manifests itself as this particular log entry: "No dm was created, connection to volume is probably bad and will perform poorly."
(In fact, of course, the dm device is still there, only its parts are not represented by proper links in /dev/disk/by-id/.)

All of the above causes os-brick to attach our volume in a "degraded" mode - in virsh, it can be observed that only one of the underlying "single-path" devices, such as /dev/sda, is used as the volume backing store, instead of using the multipath device.

Note that:
* The WWN discovery succeeds if this is the first iSCSI connection to these LUNs, because there is no multipath device yet and so there are still links from /dev/disk/by-id/scsi-* to /dev/sd*.
* Some other links of the form /dev/disk/by-id/scsi-* (in our case something like /dev/disk/by-id/scsi-SNETAPP-01010101001) may accidentally cause the WWN-finding procedure to succeed, because such a link still points to /dev/sda and is not removed by udev when a multipath device appears (we don't know why, though).

In short, this bug silently degrades multipath attachments into single-path volume attachments, due to os-brick's unfounded reliance on udev-managed symlinks.

Revision history for this message
Tamas Pasztor (tomcsi) wrote :

I created a patch. It is a temporary solution, but it works.

--- /openstack/venvs/nova-16.0.8/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py 2018-03-12 18:02:48.973477364 +0100
+++ /openstack/venvs/nova-16.0.8/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py.tomcsi 2018-03-12 18:04:44.133474261 +0100
@@ -705,7 +705,10 @@
                      data['failed_logins'])):
             # We have devices but we don't know the wwn yet
             if not wwn and found:
-                wwn = self._linuxscsi.get_sysfs_wwn(found)
+                if not mpath:
+                    wwn = self._linuxscsi.get_sysfs_wwn(self._linuxscsi.find_sysfs_multipath_dm(found))
+                else:
+                    wwn = self._linuxscsi.get_sysfs_wwn(found)
             # We have the wwn but not a multipath
             if wwn and not mpath:
                 mpath = self._linuxscsi.find_sysfs_multipath_dm(found)
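The control flow of the patch can be sketched in isolation. The stub callables below merely stand in for the `LinuxSCSI` helpers; their signatures and return values are assumptions for illustration, not os-brick's real API. The idea is simply: when no multipath device is known yet, reach the WWN via the sysfs dm node first, since the /dev/disk/by-id links may already point at dm-X instead of sd*.

```python
def resolve_wwn(found, wwn, mpath, get_sysfs_wwn, find_sysfs_multipath_dm):
    """Patched lookup order (sketch): prefer deriving the WWN from the
    existing dm device when mpath is unknown; otherwise fall back to the
    original lookup against the found sd* devices."""
    if not wwn and found:
        if not mpath:
            wwn = get_sysfs_wwn(find_sysfs_multipath_dm(found))
        else:
            wwn = get_sysfs_wwn(found)
    # Original logic: once the wwn is known, locate the dm device.
    if wwn and not mpath:
        mpath = find_sysfs_multipath_dm(found)
    return wwn, mpath

# Simulate a host where dm-0 already exists for the rediscovered paths:
# the dm-based lookup recovers the WWN even though no by-id link would.
wwn, mpath = resolve_wwn(
    found=('sda', 'sdb'), wwn=None, mpath=None,
    get_sysfs_wwn=lambda dev: '3600a0980' if dev == 'dm-0' else None,
    find_sysfs_multipath_dm=lambda devs: 'dm-0')
assert (wwn, mpath) == ('3600a0980', 'dm-0')
```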

Revision history for this message
Tamas Pasztor (tomcsi) wrote :

The newest kernel solves this problem.

Revision history for this message
Damian Cikowski (dcdamien) wrote :

Tamas, do you know how the new kernel fixes this? Which specific kernel version are you referring to?

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

This seems to be a duplicate of [0], which is indeed caused by a kernel bug fixed somewhere between 4.4.0-176 and 4.8.0-58. See comment #21 on [0].

[0] https://launchpad.net/bugs/1815844
