NVMe-oF: Does not properly wait for a block device

Bug #1964395 reported by Gorka Eguileor
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
os-brick
Fix Released
Medium
Gorka Eguileor

Bug Description

The os-brick code skips half the subsystems when looking for the paths in the connected subsystems, and the half that doesn't skip just uses the wrong paths in the search.

Offending code is:

# Find nvme name among subsystems
for i in range(0, int(len(subsystems) / 2)):
    subsystem = subsystems[i * 2]
    if 'NQN' in subsystem and subsystem['NQN'] == conn_nqn:
        for path in subsystems[i * 2 + 1]['Paths']:
            if (path['Transport'] == nvme_transport_type
                    and path['Address'] == nvme_address):
                nvme_name = path['Name']
                break

This is because the code was written based on the output from an nvme client version that had a bug and returned the wrong json output, it returned each subsystem as 2 different elements on the list on the "nvme list-subsys -o json" command run:
{
  "Subsystems" : [
    {
      "Name" : "nvme-subsys0",
      "NQN" : "nqn.2016-06.io.spdk:cnode1"
    },
    {
      "Paths" : [
        {
          "Name" : "nvme0",
          "Transport" : "rdma",
          "Address" : "traddr=10.0.2.15 trsvcid=4420",
          "Status" : "live"
        }
      ]
    }
  ]
}

Instead of returning the right output:
{
  "Subsystems" : [
    {
      "Name" : "nvme-subsys0",
      "NQN" : "nqn.2016-06.io.spdk:cnode1"
      "Paths" : [
        {
          "Name" : "nvme0",
          "Transport" : "rdma",
          "Address" : "traddr=10.0.2.15 trsvcid=4420",
          "Status" : "live"
        }
      ]
    }
  ]
}

But this was fixed on commit 2a2d5b438cbd4c345cf6a4a79ad81e3413f9cc8f, so if the nvme version present in the system has that commit the os-brick code will not only be skipping half the entries, but it will also mix the paths from 2 different subsystems.

Tags: nvme nvmeof
Gorka Eguileor (gorka)
description: updated
Changed in os-brick:
importance: Undecided → Medium
Changed in os-brick:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/836060

Changed in os-brick:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/836060
Committed: https://opendev.org/openstack/os-brick/commit/4c21b40df20ef497c7a3c82ffcf32fe21621034c
Submitter: "Zuul (22348)"
Branch: master

commit 4c21b40df20ef497c7a3c82ffcf32fe21621034c
Author: Gorka Eguileor <email address hidden>
Date: Tue Jun 7 18:26:13 2022 +0200

    NVMe-oF: Consolidate code paths

    We currently have 2 different connection information formats: The
    initial one and the new one.

    Within the new format we have 2 variants: Replicated and Not replicated.

    Currently the nvmeof connector has multiple code paths and we end up
    finding bugs that affect one path and not the other, and bugs that are
    present in both may end up getting fixed in only one of them.

    This patch consolidates both formats by converting the connection
    information into a common internal representation, thus reducing the
    number of code paths.

    Thanks to this consolidation the Cinder drivers are less restricted on
    how they can identify a volume in the connection information. They are
    no longer forced to pass the NVMe UUID (in case they cannot get that
    information or the backend doesn't support it) and they can provide
    other information instead (nguid or namespace id).

    This connection properties format consolidation is explained in the new
    NVMeOFConnProps class docstring.

    By consolidating the code paths a number of bugs get fixed
    automatically, because they were broken in one path but not in the
    other. As examples, the old code path didn't have rescans but the new
    one did, and the old code path had device flush but the new one didn't.

    Given the big refactoring needed to consolidate everything this patch
    also accomplishes the following things:

    - Uses Portal, Target, and NVMeOFConnProps classes instead of using the
      harder to read and error prone dictionaries and tuples.

    - Adds method annotations.

    - Documents most methods (exect the raid ones which I'm not familiar
      with)

    - Adds the connector to the docs.

    - Makes method signatures conform with Cinder team standards: no more
      static methods passing self and no more calling class methods using
      the class name instead of self.

    Closes-Bug: #1964590
    Closes-Bug: #1964395
    Closes-Bug: #1964388
    Closes-Bug: #1964385
    Closes-Bug: #1964380
    Closes-Bug: #1964383
    Closes-Bug: #1965954
    Closes-Bug: #1903032
    Related-Bug: #1961102
    Change-Id: I6c2a952f7e286f3e3890e3f2fcb2fdd1063f0c17

Changed in os-brick:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 6.1.0

This issue was fixed in the openstack/os-brick 6.1.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.