iscsi multipath dm-N device only used on first volume attachment

Bug #1815844 reported by John George
This bug affects 5 people
Affects                       Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)      Invalid  Undecided   Unassigned
OpenStack Nova Compute Charm  Invalid  High        Unassigned
os-brick                      Invalid  Undecided   Unassigned
linux (Ubuntu)                Expired  Undecided   Unassigned

Bug Description

With nova-compute from cloud:xenial-queens and use-multipath=true, iSCSI multipath is configured and the dm-N device is used on the first attachment, but subsequent attachments only use a single path.

The back-end storage is a Purestorage array.
The multipath.conf is attached.
The issue is easily reproduced as shown below:

jog@pnjostkinfr01:~⟫ openstack volume create pure2 --size 10 --type pure
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-02-13T23:07:40.000000 |
| description | None |
| encrypted | False |
| id | e286161b-e8e8-47b0-abe3-4df411993265 |
| migration_status | None |
| multiattach | False |
| name | pure2 |
| properties | |
| replication_status | None |
| size | 10 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| type | pure |
| updated_at | None |
| user_id | c1fa4ae9a0b446f2ba64eebf92705d53 |
+---------------------+--------------------------------------+

jog@pnjostkinfr01:~⟫ openstack volume show pure2
+--------------------------------+--------------------------------------+
| Field | Value |
+--------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-02-13T23:07:40.000000 |
| description | None |
| encrypted | False |
| id | e286161b-e8e8-47b0-abe3-4df411993265 |
| migration_status | None |
| multiattach | False |
| name | pure2 |
| os-vol-host-attr:host | cinder@cinder-pure#cinder-pure |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 9be499fd1eee48dfb4dc6faf3cc0a1d7 |
| properties | |
| replication_status | None |
| size | 10 |
| snapshot_id | None |
| source_volid | None |
| status | available |
| type | pure |
| updated_at | 2019-02-13T23:07:41.000000 |
| user_id | c1fa4ae9a0b446f2ba64eebf92705d53 |
+--------------------------------+--------------------------------------+

Add the volume to an instance:
jog@pnjostkinfr01:~⟫ openstack server add volume T1 pure2
jog@pnjostkinfr01:~⟫ openstack server show T1
+-------------------------------------+----------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | pnjostkcompps1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | pnjostkcompps1.maas |
| OS-EXT-SRV-ATTR:instance_name | instance-00000001 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-02-08T22:08:49.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | test-net=192.168.0.3 |
| config_drive | |
| created | 2019-02-08T22:08:29Z |
| flavor | test (986ce042-27e5-4a45-8edf-3df704c7db6f) |
| hostId | 50e26a44ba01548369a53578c817e7e1d99aed184261d203353840d3 |
| id | dfe2704c-8419-41e8-97c4-53f3e8ad00a3 |
| image | 0db099d0-9d72-4d15-878c-b86b439d6a99 |
| key_name | None |
| name | T1 |
| progress | 0 |
| project_id | 9be499fd1eee48dfb4dc6faf3cc0a1d7 |
| properties | |
| security_groups | name='default' |
| status | ACTIVE |
| updated | 2019-02-08T23:14:15Z |
| user_id | c1fa4ae9a0b446f2ba64eebf92705d53 |
| volumes_attached | id='e286161b-e8e8-47b0-abe3-4df411993265' |
+-------------------------------------+----------------------------------------------------------+

Check the device name used in the libvirt domain xml:
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/dm-0'/> ## NOTE multipath device
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <serial>e286161b-e8e8-47b0-abe3-4df411993265</serial>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>

Show the dm device and its paths:
ubuntu@pnjostkcompps1:/var/log/nova$ sudo dmsetup info /dev/dm-0
Name: 3624a9370150c5d6aef724e2d00012029
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 2
Event number: 0
Major, minor: 252, 0
Number of targets: 1
UUID: mpath-3624a9370150c5d6aef724e2d00012029

ubuntu@pnjostkcompps1:/var/log/nova$ sudo dmsetup ls --tree
3624a9370150c5d6aef724e2d00012029 (252:0)
 ├─ (8:64)
 ├─ (8:48)
 ├─ (8:32)
 └─ (8:16)
ubuntu@pnjostkcompps1:/var/log/nova$ sudo multipath -ll
3624a9370150c5d6aef724e2d00012029 dm-0 PURE,FlashArray
size=10G features='0' hwhandler='1 alua' wp=rw
`-+- policy='queue-length 0' prio=50 status=active
  |- 19:0:0:1 sdb 8:16 active ready running
  |- 20:0:0:1 sdc 8:32 active ready running
  |- 21:0:0:1 sdd 8:48 active ready running
  `- 22:0:0:1 sde 8:64 active ready running

Remove the volume:
jog@pnjostkinfr01:~⟫ openstack server remove volume T1 pure2
jog@pnjostkinfr01:~⟫ openstack server show T1
+-------------------------------------+----------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | pnjostkcompps1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | pnjostkcompps1.maas |
| OS-EXT-SRV-ATTR:instance_name | instance-00000001 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-02-08T22:08:49.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | test-net=192.168.0.3 |
| config_drive | |
| created | 2019-02-08T22:08:29Z |
| flavor | test (986ce042-27e5-4a45-8edf-3df704c7db6f) |
| hostId | 50e26a44ba01548369a53578c817e7e1d99aed184261d203353840d3 |
| id | dfe2704c-8419-41e8-97c4-53f3e8ad00a3 |
| image | 0db099d0-9d72-4d15-878c-b86b439d6a99 |
| key_name | None |
| name | T1 |
| progress | 0 |
| project_id | 9be499fd1eee48dfb4dc6faf3cc0a1d7 |
| properties | |
| security_groups | name='default' |
| status | ACTIVE |
| updated | 2019-02-08T23:14:15Z |
| user_id | c1fa4ae9a0b446f2ba64eebf92705d53 |
| volumes_attached | |
+-------------------------------------+----------------------------------------------------------+

Add the volume back (openstack server add volume T1 pure2):

Check the device name used in the libvirt domain xml:
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/sdb'/> ## NOTE single path device
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <serial>e286161b-e8e8-47b0-abe3-4df411993265</serial>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>

Nova log:
2019-02-13 23:19:09.472 45238 INFO nova.compute.manager [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] [instance: dfe2704c-8419-41e8-97c4-53f3e8ad00a3] Attaching volume e286161b-e8e8-47b0-abe3-4df411993265 to /dev/vdb
2019-02-13 23:19:10.896 45238 INFO os_brick.initiator.connectors.iscsi [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] Trying to connect to iSCSI portal 192.168.19.20:3260
2019-02-13 23:19:10.897 45238 INFO os_brick.initiator.connectors.iscsi [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] Trying to connect to iSCSI portal 192.168.19.21:3260
2019-02-13 23:19:10.899 45238 INFO os_brick.initiator.connectors.iscsi [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] Trying to connect to iSCSI portal 192.168.19.22:3260
2019-02-13 23:19:10.900 45238 INFO os_brick.initiator.connectors.iscsi [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] Trying to connect to iSCSI portal 192.168.19.23:3260
2019-02-13 23:19:11.409 45238 WARNING os_brick.initiator.connectors.iscsi [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] Couldn't find iscsi sessions because iscsiadm err: iscsiadm: No active sessions.

2019-02-13 23:19:11.446 45238 WARNING os_brick.initiator.connectors.iscsi [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] Couldn't find iscsi sessions because iscsiadm err: iscsiadm: No active sessions.

2019-02-13 23:19:11.488 45238 WARNING os_brick.initiator.connectors.iscsi [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] Couldn't find iscsi sessions because iscsiadm err: iscsiadm: No active sessions.

2019-02-13 23:19:11.526 45238 WARNING os_brick.initiator.connectors.iscsi [req-cfbbc316-b456-4a03-8742-937559cd1de1 c1fa4ae9a0b446f2ba64eebf92705d53 9be499fd1eee48dfb4dc6faf3cc0a1d7 - e69140fe01214a39bcc6560b7b2e70e0 e69140fe01214a39bcc6560b7b2e70e0] Couldn't find iscsi sessions because iscsiadm err: iscsiadm: No active sessions.

2019-02-13 23:19:16.483 45238 INFO nova.compute.resource_tracker [req-4e84cf0b-619b-44a0-8ea8-389ae9725297 - - - - -] Final resource view: name=pnjostkcompps1.maas phys_ram=64388MB used_ram=16872MB phys_disk=274GB used_disk=20GB total_vcpus=24 used_vcpus=4 pci_stats=[]

The multipath device is still configured but not used by Nova:
ubuntu@pnjostkcompps1:/var/log/nova$ sudo iscsiadm -m node
192.168.19.20:3260,-1 iqn.2010-06.com.purestorage:flasharray.401a4a5a9b723cc8
192.168.19.23:3260,-1 iqn.2010-06.com.purestorage:flasharray.401a4a5a9b723cc8
192.168.19.22:3260,-1 iqn.2010-06.com.purestorage:flasharray.401a4a5a9b723cc8
192.168.19.21:3260,-1 iqn.2010-06.com.purestorage:flasharray.401a4a5a9b723cc8

ubuntu@pnjostkcompps1:/var/log/nova$ sudo dmsetup info /dev/dm-0
Name: 3624a9370150c5d6aef724e2d00012029
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 0
Event number: 0
Major, minor: 252, 0
Number of targets: 1
UUID: mpath-3624a9370150c5d6aef724e2d00012029

ubuntu@pnjostkcompps1:/var/log/nova$ sudo dmsetup ls --tree
3624a9370150c5d6aef724e2d00012029 (252:0)
 ├─ (8:64)
 ├─ (8:48)
 ├─ (8:32)
 └─ (8:16)

ubuntu@pnjostkcompps1:/var/log/nova$ sudo multipath -ll
3624a9370150c5d6aef724e2d00012029 dm-0 PURE,FlashArray
size=10G features='0' hwhandler='1 alua' wp=rw
`-+- policy='queue-length 0' prio=50 status=active
  |- 23:0:0:1 sdb 8:16 active ready running
  |- 24:0:0:1 sdc 8:32 active ready running
  |- 25:0:0:1 sdd 8:48 active ready running
  `- 26:0:0:1 sde 8:64 active ready running

Revision history for this message
John George (jog) wrote :

Subscribed this bug to field-high. A dev environment is accessible at the customer site for approximately one week, which includes an attachment to the Purestorage array.

John George (jog)
tags: added: field-high
Revision history for this message
Liam Young (gnuoy) wrote :

I don't think this is related to the charm; it looks like a bug in upstream nova.

no longer affects: nova (Ubuntu)
Revision history for this message
Matt Riedemann (mriedem) wrote :

The new-style volume attachment flow with the volume attachments API was introduced in Queens, so maybe something regressed there:

https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/cinder-new-attach-apis.html

What you should probably look for is the connection_info in the volume attachments table for the problem attachments and see if it has multipath information in it (it could also be in the block_device_mappings.connection_info column for the associated instance in the nova database).
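
For illustration, a minimal sketch of such a check, assuming you have already dumped the connection_info JSON from the database (the exact field names, e.g. 'target_portals' vs 'target_portal', can vary between releases, so treat this as a rough guide rather than a reference):

import json
import sys

def summarize_connection_info(blob):
    """Print whether a dumped connection_info blob looks multipath-capable.

    Sketch only: assumes the usual nova/os-brick iSCSI layout with a
    'data' dict holding the target portal information.
    """
    info = json.loads(blob)
    data = info.get('data', info)
    portals = data.get('target_portals') or [data.get('target_portal')]
    portals = [p for p in portals if p]
    print('driver_volume_type:', info.get('driver_volume_type'))
    print('portals:           ', portals)
    print('multipath capable: ', len(portals) > 1)
    print('device_path:       ', data.get('device_path'))

if __name__ == '__main__':
    # e.g. pipe in the output of a query such as:
    #   SELECT connection_info FROM nova.block_device_mappings
    #    WHERE instance_uuid = '<uuid>' AND deleted = 0;
    summarize_connection_info(sys.stdin.read())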

Ryan Beisner (1chb1n)
Changed in charm-nova-compute:
status: New → Invalid
Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

Yes, as Matt said, that's probably the first place we should look. I should have access to that env soon and will be able to confirm it.

Related to the multipath bit, I looked at the code and we have a specific function to preserve it [0]. To save that information we have to satisfy the condition here [1], which, based on the bug description, does not look to be satisfied since the status is 'available'.

That said, there are a lot of paths to go down here, so this might not be related. I will let you know my progress.

[0] https://github.com/openstack/nova/blob/master/nova/virt/block_device.py#L450
[1] https://github.com/openstack/nova/blob/master/nova/virt/block_device.py#L479

Changed in nova:
assignee: nobody → Sahid Orentino (sahid-ferdjaoui)
Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

OK, so my previous comment was not right; we use the new Cinder API. Also, I am starting to think that the issue may not be related to Nova, since we don't reuse BDM details previously stored in the database, and it seems that Nova was able to attach and detach the volume without any issues.

Revision history for this message
John George (jog) wrote :

https://review.openstack.org/#/c/636226/ is related to failing to attach. In the case for which this bug was filed, the volume successfully attaches but the multipath device is not used.

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

OK, so I was able to make things work by applying the patch below to os-brick, but it's probably not the right fix. One thing seems clear: this is not related to Nova.

diff --git a/os_brick/initiator/connectors/iscsi.py b/os_brick/initiator/connectors/iscsi.py
index 45a474e..3688345 100644
--- a/os_brick/initiator/connectors/iscsi.py
+++ b/os_brick/initiator/connectors/iscsi.py
@@ -735,10 +735,10 @@ class ISCSIConnector(base.BaseLinuxConnector, base_iscsi.BaseISCSIConnector):
                    (mpath and len(ips_iqns_luns) == data['num_logins'] +
                     data['failed_logins'])):
             # We have devices but we don't know the wwn yet
-            if not wwn and found:
+            if wwn is None and found:
                 wwn = self._linuxscsi.get_sysfs_wwn(found)
             # We have the wwn but not a multipath
-            if wwn and not mpath:
+            if wwn is not None and not mpath:
                 mpath = self._linuxscsi.find_sysfs_multipath_dm(found)
                 if not (mpath or wwn_added):
                     # Tell multipathd that this wwn is a multipath and hint

Basically it seems that the WWNs are not present under /dev/disk/by-id/. I don't know if that is a particularity of Purestorage when devices are attached via iscsid, or perhaps an issue in the multipathd configuration. I will continue my investigation.

root@pnjostkcompps1:~# multipath -l
3624a9370150c5d6aef724e2d00012029 dm-1 PURE,FlashArray
size=10G features='0' hwhandler='1 alua' wp=rw
`-+- policy='queue-length 0' prio=0 status=active
  |- 55:0:0:2 sdf 8:80 active undef running
  |- 56:0:0:2 sdg 8:96 active undef running
  |- 57:0:0:2 sdh 8:112 active undef running
  `- 58:0:0:2 sdi 8:128 active undef running
3624a9370150c5d6aef724e2d00012028 dm-0 PURE,FlashArray
size=10G features='0' hwhandler='1 alua' wp=rw
`-+- policy='queue-length 0' prio=0 status=active
  |- 55:0:0:1 sdb 8:16 active undef running
  |- 56:0:0:1 sdc 8:32 active undef running
  |- 57:0:0:1 sdd 8:48 active undef running
  `- 58:0:0:1 sde 8:64 active undef running

root@pnjostkcompps1:~# ls -l /dev/disk/by-id/scsi-*
lrwxrwxrwx 1 root root 9 Feb 13 23:29 /dev/disk/by-id/scsi-0HP_LOGICAL_VOLUME_00000000 -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 13 23:29 /dev/disk/by-id/scsi-0HP_LOGICAL_VOLUME_00000000-part1 -> ../../sda1
lrwxrwxrwx 1 root root 9 Feb 13 23:29 /dev/disk/by-id/scsi-3600508b1001ce1392ff51a489b34fda0 -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 13 23:29 /dev/disk/by-id/scsi-3600508b1001ce1392ff51a489b34fda0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Feb 21 18:02 /dev/disk/by-id/scsi-3624a9370150c5d6aef724e2d00012028 -> ../../dm-0
lrwxrwxrwx 1 root root 10 Feb 21 18:02 /dev/disk/by-id/scsi-3624a9370150c5d6aef724e2d00012029 -> ../../dm-1
lrwxrwxrwx 1 root root 9 Feb 13 23:29 /dev/disk/by-id/scsi-SHP_LOGICAL_VOLUME_001438030E74160 -> ../../sda
lrwxrwxrwx 1 root root 10 Feb 13 23:29 /dev/disk/by-id/scsi-SHP_LOGICAL_VOLUME_001438030E74160-part1 -> ../../sda1
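
For reference, a minimal diagnostic sketch (not os-brick code) of the lookup that is going wrong here: it reads the device's wwid from sysfs and shows which /dev/disk/by-id/scsi-* links resolve to the sd device versus a dm device. The device name 'sdb' is just an example taken from the output above.

import os

def where_does_by_id_point(dev='sdb'):
    """Show the sysfs wwid for an sd device and where the by-id links point."""
    wwid_path = '/sys/block/%s/device/wwid' % dev
    if os.path.exists(wwid_path):
        with open(wwid_path) as f:
            print('sysfs wwid:', f.read().strip())
    else:
        print('sysfs wwid attribute missing:', wwid_path)

    byid = '/dev/disk/by-id'
    for name in sorted(os.listdir(byid)):
        if not name.startswith('scsi-'):
            continue
        target = os.path.basename(os.path.realpath(os.path.join(byid, name)))
        if target == dev or target.startswith('dm-'):
            print('%-55s -> %s' % (name, target))

if __name__ == '__main__':
    where_does_by_id_point('sdb')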

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

Patch proposed against os-brick here [0]

[0] https://review.openstack.org/#/c/638639/

Changed in os-brick:
assignee: nobody → Sahid Orentino (sahid-ferdjaoui)
Changed in nova:
status: New → Invalid
Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

Basically the issue is related to 'find_multipaths "yes"' in /etc/multipath.conf. The patch I proposed fixes the issue but adds more complexity to an algorithm which is already a bit tricky, so let's see whether upstream is going to accept it.

At least we should document that multipath should only be used with multipathd configured with:

   find_multipaths "no"

I'm re-adding charm-nova-compute to this bug so we can add a note about it in the documentation of the option.
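
For reference, that setting lives in the defaults section of /etc/multipath.conf; a minimal illustrative snippet (the device and blacklist sections of the attached multipath.conf are site-specific and omitted here):

defaults {
    find_multipaths "no"
}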

Changed in charm-nova-compute:
status: Invalid → New
James Page (james-page)
Changed in charm-nova-compute:
status: New → Triaged
importance: Undecided → High
milestone: none → 19.04
Changed in os-brick:
status: New → In Progress
Changed in charm-nova-compute:
assignee: nobody → Sahid Orentino (sahid-ferdjaoui)
Revision history for this message
Gorka Eguileor (gorka) wrote :

I looked into this and the problem is not specific to "find_multipaths yes"; it can happen with the option set to "no" as well.

The problem only happens on backends that present multiple designators on page 0x83, and only when the system doesn't have enough CPU cycles and is starved a little.

In this situation the os-brick process and threads don't have enough CPU time to check the links for the individual volumes that appear under '/dev/disk/by-id/scsi-*' before the multipath device-mapper device is created and triggers the udev rule that overwrites the existing '/dev/disk/by-id/scsi-*' link to make it point to the DM device.

The easiest solution is to modify the "get_sysfs_wwn" method so that, if the individual search for links fails, it checks whether any of the devices are part of a multipath, and if they are, whether the multipath is in "wwn_paths".

Revision history for this message
Gorka Eguileor (gorka) wrote :

Sorry, not exactly check if it's in "wwn_paths"; we must look for a match against the real paths in wwn_paths.

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

Thanks for working on this, Gorka, I really appreciate it.

Unfortunately I don't think your statement is correct. If 'find_multipaths' is set to 'no' there is no reason to see a multipath device created before the individual volumes appear under '/dev/disk/by-id/scsi-*'. The creation will be done by the os-brick iSCSI connector itself at line L751 [0]. That's why I think you are probably suffering from a different issue.

Besides that, the fix you are proposing for "get_sysfs_wwn" is basically what I'm proposing: falling back to looking at the DM device. The point is that I don't think it's correct to update get_sysfs_wwn(found_device); as mentioned in the review, the purpose of this method seems clear based on its signature.

[0] https://github.com/openstack/os-brick/blob/master/os_brick/initiator/connectors/iscsi.py#L751

Revision history for this message
Gorka Eguileor (gorka) wrote :

That is not how "find_multipaths" works. If it's set to "no", multipathd will create a DM for ALL non-blacklisted devices that appear in the system, even if only one device appears.

I added the code that forces the creation of a multipath for the "find_multipaths yes" case; when it's set to "no" that code is pretty much useless. That way the multipath is created on the first device that appears, helping us move faster if one of the links is having delays or transmission errors.

I wrote most of the code we are discussing when I refactored the iSCSI connect and disconnect mechanism, including the "get_sysfs_wwn", and it is the right place to add the fix.

The method should return the WWN for any of the devices that are passed in, without making any call to "iscsiadm". So it first tries to get it by reading the wwid from sysfs and checking whether that link exists, and if it's not there (which happens when there are multiple designators on page 0x83) it tries to see whether any of the links that do exist point to any of the devices passed in the arguments.

The code I suggest basically adds a new step: what happens if the link has been overwritten by the triggering of the DM udev rule? In that case we check whether there's a holder of the device (this would be the DM) and whether there is a link to that device that will give us the WWN.

So the method will perform the same function; it will just have one more way to find the WWN.
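
In code terms, the extra step described above looks roughly like the sketch below. This is illustrative only, not the actual os-brick change; the function name is made up, and it simply walks sysfs holders to recover the WWN from the DM device's by-id link.

import glob
import os

def wwn_via_dm_holder(device):
    """Fallback sketch: if the per-path /dev/disk/by-id link has already
    been retargeted by the DM udev rule, derive the WWN from the device's
    device-mapper holder instead.

    'device' is a path such as '/dev/sdb'.
    """
    name = os.path.basename(device)
    # A path device that belongs to a multipath lists its dm-N as a holder.
    holders_dir = '/sys/block/%s/holders' % name
    holders = os.listdir(holders_dir) if os.path.isdir(holders_dir) else []
    for holder in holders:  # e.g. 'dm-0'
        for link in glob.glob('/dev/disk/by-id/scsi-*'):
            # A scsi-* link pointing at the holder carries the WWN/WWID
            # that multipathd used to name the map.
            if os.path.basename(os.path.realpath(link)) == holder:
                return os.path.basename(link)[len('scsi-'):]
    return None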

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

OK, fair enough, perhaps I'm wrong. I'm describing here what I'm seeing in my env. There are several factors; it might also be that there is a bug in the version of multipath that we are shipping. I don't know.

The more important point is that it looks like we agree there is a bug in os-brick, and we also agree we should fall back to looking at the multipath device if we are not able to find a correct WWN in sysfs from the connected volumes.

I'm not happy with the proposed solution because, as indicated, it adds more logic to a utility function, and this way we still pass the condition at L738 [0] and call 'self._linuxscsi.find_sysfs_multipath_dm(found)' a second time, which can be avoided since we already know that we have a viable mpath.
At the same time, I can understand your point about having this function do its job of finding a WWN in any possible way.

Let's discuss the proper fix in the review, if you don't mind.

Thanks for your input.

[0] https://github.com/openstack/os-brick/blob/master/os_brick/initiator/connectors/iscsi.py#L738

David Ames (thedac)
Changed in charm-nova-compute:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-nova-compute:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-nova-compute:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-nova-compute:
milestone: 20.01 → 20.05
Changed in nova:
assignee: Sahid Orentino (sahid-ferdjaoui) → nobody
Changed in charm-nova-compute:
assignee: Sahid Orentino (sahid-ferdjaoui) → Aurelien Lourot (aurelien-lourot)
Changed in os-brick:
assignee: Sahid Orentino (sahid-ferdjaoui) → Aurelien Lourot (aurelien-lourot)
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

I can successfully create, attach, detach and re-attach Purestorage volumes with multipath several times and they remain multipath (i.e. I can't reproduce the issue). I tried the following setups, with and without high CPU load on the nova-compute node running the instance to which I'm attaching the volume:
- bionic-train (with os-brick 2.10.0-0ubuntu1~cloud0 installed on nova-compute)
- bionic-stein (with os-brick 2.8.1-0ubuntu1~cloud0 installed on nova-compute, which should be approximately the version that the original reporter was using at that time)
- bionic-rocky (with os-brick 2.5.3-0ubuntu1~cloud0 installed on nova-compute)

I'm wondering whether I can't reproduce the issue because I'm always lucky and the symlinks mentioned above always appear in time, or because my setup doesn't fulfil the conditions, e.g. maybe my backend doesn't present multiple designators on page 0x83? On nova-compute:

$ sudo multipath -ll
3624a93708726b5033af2433d00011456 dm-0 PURE,FlashArray
size=10G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 2:0:0:1 sda 8:0 active ready running
  `- 3:0:0:1 sdb 8:16 active ready running
$ sudo iscsiadm -m node
10.246.112.10:3260,-1 iqn.2010-06.com.purestorage:flasharray.3f95c59b49b7b2cd
10.246.112.11:3260,-1 iqn.2010-06.com.purestorage:flasharray.3f95c59b49b7b2cd
$ sudo dmsetup info /dev/dm-0
Name: 3624a93708726b5033af2433d00011456
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 2
Event number: 0
Major, minor: 253, 0
Number of targets: 1
UUID: mpath-3624a93708726b5033af2433d00011456
$ sudo /lib/udev/scsi_id --whitelisted --page=0x83 --device=/dev/dm-0
3624a93708726b5033af2433d00011456
$ sudo /lib/udev/scsi_id --whitelisted --page=0x83 --device=/dev/sda
3624a93708726b5033af2433d00011456
$ sudo /lib/udev/scsi_id --whitelisted --page=0x83 --device=/dev/sdb
3624a93708726b5033af2433d00011456

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

As mentioned here [0], it seems to be a kernel bug. It is indeed reproducible on xenial (kernel 4.4.0-176-generic) but not on bionic (4.15.0-91-generic). I can consistently reproduce it on xenial-queens without CPU load. For the record, on that deployment we get os-brick 2.3.0-0ubuntu1~cloud0, but the problem is clearly outside of os-brick: on this setup, when re-attaching a volume, the following files no longer show up: /sys/block/sda/device/wwid and /sys/block/sdb/device/wwid

So this seems to be an issue in the driver which has been fixed somewhere between kernel 4.4.0 and 4.15.0.

Also for the record, on that setup page 0x83 isn't presenting multiple designators, so that isn't needed in order to reproduce the issue.

[0] https://bugs.launchpad.net/os-brick/+bug/1742682
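
A quick way to check for this symptom on a compute node is sketched below; it just walks /sys/block and reports sd devices whose wwid attribute is absent (the attribute path is the one mentioned above; behaviour may differ between kernels).

import os

def report_missing_wwid():
    """List sd* block devices and whether their sysfs 'wwid' attribute exists.

    On the affected xenial 4.4 kernel the attribute disappears on
    re-attach, which breaks the WWN lookup described earlier.
    """
    for dev in sorted(os.listdir('/sys/block')):
        if not dev.startswith('sd'):
            continue
        wwid = '/sys/block/%s/device/wwid' % dev
        print('%-6s wwid %s' % (dev, 'present' if os.path.exists(wwid) else 'MISSING'))

if __name__ == '__main__':
    report_missing_wwid()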

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

On xenial, if I install kernel 4.8.0 instead of 4.4.0 on the nova-compute nodes, the problem vanishes:

$ sudo apt install linux-image-4.8.0-58-generic linux-image-extra-4.8.0-58-generic linux-headers-4.8.0-58-generic

For /boot/grub/menu.lst choose "Keep the local version currently installed". Then tweak it manually:

$ sudo sed -i 's/4.4.0-176/4.8.0-58/g' /boot/grub/menu.lst
$ sudo reboot

I think we can close this bug?

Changed in os-brick:
status: In Progress → Invalid
Changed in charm-nova-compute:
assignee: Aurelien Lourot (aurelien-lourot) → nobody
Changed in os-brick:
assignee: Aurelien Lourot (aurelien-lourot) → nobody
David Ames (thedac)
Changed in charm-nova-compute:
milestone: 20.05 → 20.08
Revision history for this message
James Page (james-page) wrote :

Marking the charm task as invalid, as this is an issue with the xenial release kernel.

An Ubuntu/Linux bug task has been raised for further progression in case updating to the latest HWE kernel on Xenial is not an option.

Changed in charm-nova-compute:
status: Triaged → Invalid
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1815844

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on os-brick (master)

Change abandoned by Sean McGinnis (<email address hidden>) on branch: master
Review: https://review.opendev.org/638639
Reason: After reading the bug report, I'm going to abandon this. Feel free to restore and update if there is anything else to be done.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-nova-compute (master)

Change abandoned by "Alex Kavanagh <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-nova-compute/+/639719
Reason: Abandoning as submitter put a hold on it but hasn't revisited in 4 years. Please feel free to re-open it if it is still valid.
