NVMe-oF cannot connect

Bug #2035695 reported by Gorka Eguileor
This bug affects 1 person
Affects: os-brick
Status: Fix Released
Importance: Undecided
Assigned to: Gorka Eguileor

Bug Description

When an nvme subsystem has all of its portals in 'connecting' state and we try to attach a new volume to that same subsystem, the attach will fail.

One way to reproduce this issue is to configure LVM+nvmet with:
  use_multipath_for_image_xfer = true
  iscsi_secondary_ip_addresses = 127.0.0.1
  target_secondary_ip_addresses = 127.0.0.1
  lvm_share_target = true
  nvmeof_conn_info_version = 2

Then attach 2 volumes to an instance and delete the instance (this leaves the subsystem in 'connecting' state).

Next, create another instance and try to attach a volume; the attach will fail.
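
One way to confirm that precondition is to look at the controller states the Linux nvme driver exposes under sysfs; after deleting the first instance, every controller of the shared subsystem should be stuck in 'connecting'. A minimal sketch, assuming only the standard /sys/class/nvme layout (this helper is not part of os-brick):

  # Hypothetical helper, not part of os-brick: list NVMe controller states
  # from sysfs to confirm every portal of a subsystem is stuck 'connecting'.
  import glob
  import os

  def nvme_controller_states(subsysnqn=None):
      """Return {controller_name: state}, optionally filtered by subsystem NQN."""
      states = {}
      for ctrl in glob.glob('/sys/class/nvme/nvme*'):
          try:
              with open(os.path.join(ctrl, 'subsysnqn')) as f:
                  nqn = f.read().strip()
              with open(os.path.join(ctrl, 'state')) as f:
                  state = f.read().strip()
          except OSError:
              continue  # controller went away while we were reading
          if subsysnqn is None or nqn == subsysnqn:
              states[os.path.basename(ctrl)] = state
      return states

  if __name__ == '__main__':
      for name, state in sorted(nvme_controller_states().items()):
          print(f'{name}: {state}')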

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/895193

Changed in os-brick:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/895193
Committed: https://opendev.org/openstack/os-brick/commit/ec22c32de6820184d7737c5af70e573c0634cd38
Submitter: "Zuul (22348)"
Branch: master

commit ec22c32de6820184d7737c5af70e573c0634cd38
Author: Gorka Eguileor <email address hidden>
Date: Thu Sep 14 12:19:26 2023 +0200

    NVMe-oF: Fix attach when reconnecting

    When an nvme subsystem has all its portals in 'connecting' state and we
    try to attach a new volume to that same subsystem, the attach will fail.

    We can reproduce it with LVM+nvmet if we configure it to share targets
    and then:
    - Create instance
    - Attach 2 volumes
    - Delete instance (this leaves the subsystem in connecting state [1])
    - Create instance
    - Attach volume <== FAILS

    The problem comes from the '_connect_target' method, which ignores
    subsystems in 'connecting' state, so if they are all in that state it
    treats this as equivalent to all portals being inaccessible.

    This patch changes that behavior: if we cannot connect to a target but
    we have portals in 'connecting' state, we wait for the next retry of
    the nvme Linux driver. Specifically, we wait 10 seconds more than the
    interval between retries.

    [1]: https://bugs.launchpad.net/nova/+bug/2035375

    Closes-Bug: #2035695
    Change-Id: Ife710f52c339d67f2dcb160c20ad0d75480a1f48
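
The behavior the commit describes can be sketched roughly as follows; the names, structure, and interval constant below are illustrative assumptions, not the actual os-brick '_connect_target' code:

  # Rough sketch of the retry behavior described above; not the real os-brick code.
  import time

  NVME_RECONNECT_INTERVAL = 10  # assumed driver retry interval, in seconds
  EXTRA_WAIT = 10               # the commit waits 10 seconds more than that

  def connect_target(portals, try_connect):
      """Retry while any portal is still 'connecting' instead of failing fast."""
      while True:
          any_connecting = False
          for portal in portals:
              state = try_connect(portal)  # e.g. 'live', 'connecting', or None
              if state == 'live':
                  return portal            # connected, done
              any_connecting = any_connecting or state == 'connecting'
          if not any_connecting:
              # Nothing is even trying to reconnect: the target really is
              # inaccessible, so give up as before.
              raise RuntimeError('Unable to reach any portal of the target')
          # Some portal is still reconnecting in the kernel, so wait for the
          # nvme Linux driver's next retry plus a safety margin and recheck.
          time.sleep(NVME_RECONNECT_INTERVAL + EXTRA_WAIT)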

Changed in os-brick:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/os-brick/+/905230

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 6.6.0

This issue was fixed in the openstack/os-brick 6.6.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/905230
Committed: https://opendev.org/openstack/os-brick/commit/7419306d2669568b8ef1aac6283e680da841d82f
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 7419306d2669568b8ef1aac6283e680da841d82f
Author: Gorka Eguileor <email address hidden>
Date: Thu Sep 14 12:19:26 2023 +0200

    NVMe-oF: Fix attach when reconnecting

    When an nvme subsystem has all its portals in 'connecting' state and we
    try to attach a new volume to that same subsystem, the attach will fail.

    We can reproduce it with LVM+nvmet if we configure it to share targets
    and then:
    - Create instance
    - Attach 2 volumes
    - Delete instance (this leaves the subsystem in connecting state [1])
    - Create instance
    - Attach volume <== FAILS

    The problem comes from the '_connect_target' method, which ignores
    subsystems in 'connecting' state, so if they are all in that state it
    treats this as equivalent to all portals being inaccessible.

    This patch changes that behavior: if we cannot connect to a target but
    we have portals in 'connecting' state, we wait for the next retry of
    the nvme Linux driver. Specifically, we wait 10 seconds more than the
    interval between retries.

    [1]: https://bugs.launchpad.net/nova/+bug/2035375

    Closes-Bug: #2035695
    Change-Id: Ife710f52c339d67f2dcb160c20ad0d75480a1f48
    (cherry picked from commit ec22c32de6820184d7737c5af70e573c0634cd38)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/os-brick/+/905991

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/905991
Committed: https://opendev.org/openstack/os-brick/commit/c0fded9fcd6bd58f883868840df5d932a68b6bad
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit c0fded9fcd6bd58f883868840df5d932a68b6bad
Author: Gorka Eguileor <email address hidden>
Date: Thu Sep 14 12:19:26 2023 +0200

    NVMe-oF: Fix attach when reconnecting

    When an nvme subsystem has all its portals in 'connecting' state and we
    try to attach a new volume to that same subsystem, the attach will fail.

    We can reproduce it with LVM+nvmet if we configure it to share targets
    and then:
    - Create instance
    - Attach 2 volumes
    - Delete instance (this leaves the subsystem in connecting state [1])
    - Create instance
    - Attach volume <== FAILS

    The problem comes from the '_connect_target' method, which ignores
    subsystems in 'connecting' state, so if they are all in that state it
    treats this as equivalent to all portals being inaccessible.

    This patch changes that behavior: if we cannot connect to a target but
    we have portals in 'connecting' state, we wait for the next retry of
    the nvme Linux driver. Specifically, we wait 10 seconds more than the
    interval between retries.

    [1]: https://bugs.launchpad.net/nova/+bug/2035375

    Closes-Bug: #2035695
    Change-Id: Ife710f52c339d67f2dcb160c20ad0d75480a1f48
    (cherry picked from commit ec22c32de6820184d7737c5af70e573c0634cd38)
    (cherry picked from commit 7419306d2669568b8ef1aac6283e680da841d82f)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 6.4.1

This issue was fixed in the openstack/os-brick 6.4.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 6.2.3

This issue was fixed in the openstack/os-brick 6.2.3 release.
