NVMe-oF doesn't disconnect

Bug #1961102 reported by Gorka Eguileor
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
os-brick
Fix Released
Medium
Unassigned

Bug Description

The current os-brick code doesn't disconnect NVMe-oF volumes when the ``disconnect_volume`` method is called on the nvme connector, as it is currently relying on the Cinder driver not removing the subsystem, but drivers like the LVM that don't share the subsystem for multiple namespaces (volumes) delete the subsystem on disconnect.

This results in the kernel reporting errors like this:

  Feb 15 14:50:16 localhost.localdomain kernel: nvme nvme0: starting error recovery
  Feb 15 14:50:16 localhost.localdomain kernel: nvme nvme0: Reconnecting in 10 seconds...
  Feb 15 14:50:27 localhost.localdomain kernel: nvme nvme0: Connect rejected: status 8 (invalid service ID).
  Feb 15 14:50:27 localhost.localdomain kernel: nvme nvme0: rdma connection establishment failed (-104)
  Feb 15 14:50:27 localhost.localdomain kernel: nvme nvme0: Failed reconnect attempt 1
  Feb 15 14:50:27 localhost.localdomain kernel: nvme nvme0: Reconnecting in 10 seconds...

And by default there are 60 retries, one each 10 seconds, which take 10 minutes to complete.

During this time we cannot reconnect that same volume to that same host.

The issue was introduced in Change-Id I5f73ec4287f6ab2a0a5ae83748947fa56a0eeb9d

Tags: nvme
Changed in os-brick:
importance: Undecided → Medium
Revision history for this message
Gorka Eguileor (gorka) wrote :

There's a patch up for review to fix this issue: https://review.opendev.org/c/openstack/os-brick/+/825310

description: updated
Revision history for this message
Axel Sündermann (cubitusinthecloud) wrote :

A comment on the fix, i thought that "nvme disconnect -d" is no longer working?

See:
https://github.com/openstack/os-brick/commit/6ea276c22e204531653110089759d800fd4f7bbf

Revision history for this message
Gorka Eguileor (gorka) wrote :

That command is trying to disconnect a specific namespace device (which is mostly pointless since linux doesn't provide any control on the subsystem scans). Passing the controller device instead should work as intended in all versions.

I am testing my wip disconnect code with nvme 1.16 and it works as expected.

Revision history for this message
Axel Sündermann (cubitusinthecloud) wrote :

Thank you for the clarification!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/836060

Changed in os-brick:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/836062

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/836063

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (master)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/836060
Committed: https://opendev.org/openstack/os-brick/commit/4c21b40df20ef497c7a3c82ffcf32fe21621034c
Submitter: "Zuul (22348)"
Branch: master

commit 4c21b40df20ef497c7a3c82ffcf32fe21621034c
Author: Gorka Eguileor <email address hidden>
Date: Tue Jun 7 18:26:13 2022 +0200

    NVMe-oF: Consolidate code paths

    We currently have 2 different connection information formats: The
    initial one and the new one.

    Within the new format we have 2 variants: Replicated and Not replicated.

    Currently the nvmeof connector has multiple code paths and we end up
    finding bugs that affect one path and not the other, and bugs that are
    present in both may end up getting fixed in only one of them.

    This patch consolidates both formats by converting the connection
    information into a common internal representation, thus reducing the
    number of code paths.

    Thanks to this consolidation the Cinder drivers are less restricted on
    how they can identify a volume in the connection information. They are
    no longer forced to pass the NVMe UUID (in case they cannot get that
    information or the backend doesn't support it) and they can provide
    other information instead (nguid or namespace id).

    This connection properties format consolidation is explained in the new
    NVMeOFConnProps class docstring.

    By consolidating the code paths a number of bugs get fixed
    automatically, because they were broken in one path but not in the
    other. As examples, the old code path didn't have rescans but the new
    one did, and the old code path had device flush but the new one didn't.

    Given the big refactoring needed to consolidate everything this patch
    also accomplishes the following things:

    - Uses Portal, Target, and NVMeOFConnProps classes instead of using the
      harder to read and error prone dictionaries and tuples.

    - Adds method annotations.

    - Documents most methods (exect the raid ones which I'm not familiar
      with)

    - Adds the connector to the docs.

    - Makes method signatures conform with Cinder team standards: no more
      static methods passing self and no more calling class methods using
      the class name instead of self.

    Closes-Bug: #1964590
    Closes-Bug: #1964395
    Closes-Bug: #1964388
    Closes-Bug: #1964385
    Closes-Bug: #1964380
    Closes-Bug: #1964383
    Closes-Bug: #1965954
    Closes-Bug: #1903032
    Related-Bug: #1961102
    Change-Id: I6c2a952f7e286f3e3890e3f2fcb2fdd1063f0c17

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/836062
Committed: https://opendev.org/openstack/os-brick/commit/05a4c05c14506df40be93f8d949a4fef4b746deb
Submitter: "Zuul (22348)"
Branch: master

commit 05a4c05c14506df40be93f8d949a4fef4b746deb
Author: Gorka Eguileor <email address hidden>
Date: Fri Apr 29 12:44:32 2022 +0200

    NVMe-oF: Disconnect subsystems

    Current code doesn't disconnect NVMe-oF subsystems when doing a
    disconnect_volume or on connect_volume failure.

    This is very problematic for systems that don't share subsytems for
    multiple namespaces, because both the device (i.e., /dev/nvme0n1) and
    the subsystem (i.e., /dev/nvme0) will stay forever (now that we connect
    with the controller loss timeout set to infinite, before it was for 10
    minutes) in the system (until manually removed) while the host keeps
    trying to connect to the remote subsystem, but it won't be able to
    connect because in this case drivers usually destroy both the namespace
    and the subsystem simultaneously (so there's no AER message to indicate
    the change in available namespaces within the subsystem).

    We'll experience multiple issues with all these leftover devices, such
    as an ever increasing number of kernel log messages with the connection
    retries, possible exhaustion of number of connected NVMe subsystems
    and/or files in /dev, and so on.

    This patch makes sure the nvmeof connector disconnects a subsystem when
    there is no longer a namespace present or when the only namespace
    present is the one we are disconnecting. This is done both on the
    disconnect_volume call as well as failures during connect_volume.

    This is not a full solution to the problem of leaving leftover devices,
    because for drivers that share the subsystem there are race conditions
    between unexport/unmap of volumes on the cinder side and os-brick
    disconnect_volume calls. To fully prevent this situation Cinder needs
    to start reporting the shared_targets value for NVMe volumes (something
    it's already doing for iSCSI).

    Partial-Bug: #1961102
    Change-Id: Ia00be53420307d6ac1f100420d039da7b65dc349

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/836063
Committed: https://opendev.org/openstack/os-brick/commit/1cb6d3e1f16d9bd63186fe1e7fcf706d084fa992
Submitter: "Zuul (22348)"
Branch: master

commit 1cb6d3e1f16d9bd63186fe1e7fcf706d084fa992
Author: Gorka Eguileor <email address hidden>
Date: Wed May 11 12:15:27 2022 +0200

    Support shared_targets tristate value

    Cinder microversion 3.69 adds an additional value to shared_targets
    beyond true and false. Now null/None is also a valid value that can be
    used to force locking.

    So we now have 3 possible values:

    - True ==> Lock if iSCSI initiator doesn't support manual scans
    - False ==> Never lock.
    - None ==> Always lock.

    This patch updates the guard_connection context manager to support the
    3 possible values.

    With this Cinder can now force locking for NVMe-oF drivers that share
    subsystems.

    Closes-Bug: #1961102
    Depends-On: I8cda6d9830f39e27ac700b1d8796fe0489fd7c0a
    Change-Id: Id872543ce08c934cefbbbdaff6ddc61e3828b1f1

Changed in os-brick:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 6.1.0

This issue was fixed in the openstack/os-brick 6.1.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.