Nova falling back to single path when multipath device already exists

Bug #1742682 reported by Damian Cikowski
Affects: os-brick
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Our environment is Ubuntu 16.04 running OpenStack 16.0.5.

Nova is configured to use multipath, and we're trying to start a previously stopped instance that has volumes attached.

Before starting the instance, things on the compute node look as follows:
* iSCSI session is still up from the previous run of the instance
* Device Mapper dm-X multipath device remains in /dev/
* multipath -ll shows that all paths for dm-X are active and running
* There are links in /dev/disk/by-id (scsi-* and wwn-*) that point to /dev/dm-X
  * However, no links by WWN to the underlying block devices (sd*) exist in /dev/disk/by-id (because of udev rules, see below)

While starting the instance, nova performs iSCSI discovery and obtains a list of 4 (ip, iqn, lun) tuples that it should use for constructing a multipath device.
4 iSCSI sessions are started, and the LUNs appear as block devices in Linux - let's call them /dev/sda, /dev/sdb, /dev/sdc and /dev/sdd.
Then, apparently, os-brick tries to find the WWN of any of our devices by globbing for "/dev/disk/by-id/scsi-*", iterating over all found symlinks, and seeing if any of them leads to any of our 4 devices.
The above fails (yields no result), because no links from /dev/disk/by-id point to our devices anymore - udev replaces those with links to /dev/dm-* devices because of udev rules.
The lack of a WWN causes os-brick's multipath logic to be skipped completely (the multipath branch is only entered if at least one device's WWN can be found).
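The failure mode above can be simulated in a few lines. This is a deliberately simplified sketch, not os-brick's actual code: the function name and the dict standing in for /dev/disk/by-id are illustrative assumptions; the point is only that a lookup which relies on scsi-* symlinks resolving to sd* devices returns nothing once udev has repointed those links at the dm device.

```python
def find_wwn_by_id_links(by_id_links, candidate_devices):
    """Hypothetical simplification of the WWN lookup: scan scsi-* links
    under /dev/disk/by-id (modeled as a name -> target dict) and return
    the WWN encoded in the first link that resolves to one of the newly
    logged-in block devices."""
    for name, target in by_id_links.items():
        if not name.startswith('scsi-'):
            continue
        if target in candidate_devices:
            # The link name carries the WWN, e.g. scsi-3600a0980...
            return name[len('scsi-'):]
    return None

# First attach: udev links still point at the raw sd* devices, so the
# lookup succeeds.
fresh = {'scsi-3600a098038303742': '/dev/sda'}
assert find_wwn_by_id_links(fresh, {'/dev/sda', '/dev/sdb'}) == '3600a098038303742'

# Re-attach: udev has repointed the link at the surviving dm device, so
# nothing resolves to sda..sdd and the lookup yields None - exactly the
# condition that makes os-brick skip its multipath branch.
stale = {'scsi-3600a098038303742': '/dev/dm-0'}
assert find_wwn_by_id_links(stale, {'/dev/sda', '/dev/sdb'}) is None
```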

In nova-compute.log, this manifests itself as this particular log entry: "No dm was created, connection to volume is probably bad and will perform poorly."
(In fact, of course, the dm device is still there, only its parts are not represented by proper links in /dev/disk/by-id/.)

All of the above causes os-brick to attach our volume in a "degraded" mode - in virsh, it can be observed that only one of the underlying "single-path" devices, such as /dev/sda, is used as the volume backing store, instead of using the multipath device.

Note that:
* The WWN discovery succeeds if this is the first iSCSI connection to these LUNs, because there is no multipath device yet and so there are still links from /dev/disk/by-id/scsi-* to /dev/sd*.
* Some other links of the form /dev/disk/by-id/scsi-* (in our case something like /dev/disk/by-id/scsi-SNETAPP-01010101001) may accidentally cause the WWN-finding procedure to succeed, because such a link still points to /dev/sda and is not removed by udev when a multipath device appears (we don't know why, though).

In short, this bug silently degrades multipath attachments into single-path volume attachments, due to os-brick's unfounded reliance on udev-managed symlinks.

Revision history for this message
Tamas Pasztor (tomcsi) wrote :

I created a patch. It is a temporary solution, but it works.

--- /openstack/venvs/nova-16.0.8/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py 2018-03-12 18:02:48.973477364 +0100
+++ /openstack/venvs/nova-16.0.8/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py.tomcsi 2018-03-12 18:04:44.133474261 +0100
@@ -705,7 +705,10 @@
                      data['failed_logins'])):
             # We have devices but we don't know the wwn yet
             if not wwn and found:
-                wwn = self._linuxscsi.get_sysfs_wwn(found)
+                if not mpath:
+                    wwn = self._linuxscsi.get_sysfs_wwn(self._linuxscsi.find_sysfs_multipath_dm(found))
+                else:
+                    wwn = self._linuxscsi.get_sysfs_wwn(found)
             # We have the wwn but not a multipath
             if wwn and not mpath:
                 mpath = self._linuxscsi.find_sysfs_multipath_dm(found)
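The control flow of the patch can be sketched in isolation. The stub callables below merely stand in for the `LinuxSCSI` helpers; their signatures and return values are assumptions for illustration, not os-brick's real API. The idea is simply: when no multipath device is known yet, reach the WWN via the sysfs dm node first, since the /dev/disk/by-id links may already point at dm-X instead of sd*.

```python
def resolve_wwn(found, wwn, mpath, get_sysfs_wwn, find_sysfs_multipath_dm):
    """Patched lookup order (sketch): prefer deriving the WWN from the
    existing dm device when mpath is unknown; otherwise fall back to the
    original lookup against the found sd* devices."""
    if not wwn and found:
        if not mpath:
            wwn = get_sysfs_wwn(find_sysfs_multipath_dm(found))
        else:
            wwn = get_sysfs_wwn(found)
    # Original logic: once the wwn is known, locate the dm device.
    if wwn and not mpath:
        mpath = find_sysfs_multipath_dm(found)
    return wwn, mpath

# Simulate a host where dm-0 already exists for the rediscovered paths:
# the dm-based lookup recovers the WWN even though no by-id link would.
wwn, mpath = resolve_wwn(
    found=('sda', 'sdb'), wwn=None, mpath=None,
    get_sysfs_wwn=lambda dev: '3600a0980' if dev == 'dm-0' else None,
    find_sysfs_multipath_dm=lambda devs: 'dm-0')
assert (wwn, mpath) == ('3600a0980', 'dm-0')
```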

Revision history for this message
Tamas Pasztor (tomcsi) wrote :

The newest kernel solves this problem.

Revision history for this message
Damian Cikowski (dcdamien) wrote :

Tamas, do you know how the new kernel fixes this? Which specific kernel version are you referring to?

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

This seems to be a duplicate of [0], which is indeed caused by a kernel bug fixed somewhere between 4.4.0-176 and 4.8.0-58. See comment #21 on [0].

[0] https://launchpad.net/bugs/1815844
