LVM nvmet: Disconnects all volumes on map/unmap

Bug #1964391 reported by Gorka Eguileor
Affects: Cinder
Status: Fix Released
Importance: Low
Assigned to: Gorka Eguileor

Bug Description

When using the LVM driver with the nvmet target helper, any new attach (create_nvmeof_target) or detach (delete_nvmeof_target) makes ALL hosts connected to nvmet volumes lose their connection to ALL volumes for a number of seconds (around 15).

This is caused by the nvmet client clearing everything (ports, subsystems, namespaces) every time there is a change.
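
For reference, the old flow looked roughly like the sketch below (illustrative only; nvmetcli's save/restore subcommands are real, but the JSON edit and the subsystem entry are placeholders). The restore step is what tears down every host's connections:

    import json
    import subprocess
    import tempfile

    # Placeholder subsystem entry; the real driver builds this from the volume.
    new_subsystem = {'nqn': 'nqn.2014-08.org.nvmexpress:example-placeholder'}

    with tempfile.NamedTemporaryFile(mode='r+', suffix='.json') as f:
        # 1) Save the running nvmet configuration to a temporary file.
        subprocess.run(['nvmetcli', 'save', f.name], check=True)
        # 2) Edit the JSON data.
        cfg = json.load(f)
        cfg.setdefault('subsystems', []).append(new_subsystem)
        f.seek(0)
        json.dump(cfg, f)
        f.truncate()
        f.flush()
        # 3) Restore it: nvmetcli clears ports/subsystems/namespaces before
        #    recreating them, which disconnects every attached host.
        subprocess.run(['nvmetcli', 'restore', f.name], check=True)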

Kernel logs on the hosts connected to volumes will have entries like these:

Feb 15 16:56:00 localhost.localdomain kernel: nvme nvme0: starting error recovery
Feb 15 16:56:00 localhost.localdomain kernel: nvme nvme0: Reconnecting in 10 seconds...
Feb 15 16:56:10 localhost.localdomain kernel: nvme nvme0: failed to connect socket: -111
Feb 15 16:56:10 localhost.localdomain kernel: nvme nvme0: Failed reconnect attempt 1

Gorka Eguileor (gorka)
description: updated
Changed in cinder:
importance: Undecided → Low
tags: added: map unmap
Changed in cinder:
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/836071

Changed in cinder:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/836071
Committed: https://opendev.org/openstack/cinder/commit/00793ac09ba125c6af43bbbfeadc483de042815b
Submitter: "Zuul (22348)"
Branch: master

commit 00793ac09ba125c6af43bbbfeadc483de042815b
Author: Gorka Eguileor <email address hidden>
Date: Tue Mar 8 16:10:40 2022 +0100

    LVM-nvmet: Use nvmetcli as library instead of CLI

    The nvmet target was using the nvmetcli command to manage the NVMe-oF
    targets, but the command has a big limitation when it comes to
    controlling its behaviour through the command line: it can only
    restore the full configuration, and individual parts, such as ports or
    subsystems, cannot be directly modified without entering the
    interactive mode.

    Due to this limitation the nvmet target would:

    - Save current nvmet configuration
    - Make changes to the json data
    - Restore the updated configuration

    The problem with this process, besides being slow because it runs a CLI
    command and uses temporary files, is that the restoration completely
    deletes EVERYTHING before recreating it again. This means that all
    hosts that are connected to volumes suddenly lose access to them
    (because the namespaces and subsystems have disappeared) and keep
    retrying to connect. The reconnect succeeds after the configuration has
    been restored by nvmet, but that is 10 to 20 seconds during which hosts
    cannot access volumes (which may block things in VMs) and nvme error
    messages appear in the kernel logs.

    To fix all these issues, speed and disconnects, this patch stops using
    nvmetcli as a CLI and uses it as a Python library instead, since that
    is where most of nvmetcli's functionality lives.
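
As an illustration of the library approach, here is a minimal sketch assuming nvmetcli's nvmet Python module (Root, Subsystem, Namespace); the NQN and LVM device path are placeholder values, not the driver's:

    import nvmet  # Python package shipped with nvmetcli

    # Touch only the objects we need instead of clearing everything.
    root = nvmet.Root()
    subsys = nvmet.Subsystem(nqn='nqn.2014-08.org.nvmexpress:placeholder',
                             mode='create')
    subsys.set_attr('attr', 'allow_any_host', '1')
    ns = nvmet.Namespace(subsys, mode='create')  # picks the next free nsid
    ns.set_attr('device', 'path', '/dev/cinder-volumes/placeholder')
    ns.set_enable(1)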

    Querying the nvmet system can be done directly with the library, but
    making changes (creating/destroying ports, subsystems, and namespaces)
    requires privileges, so this patch adds a privsep wrapper for the
    operations we use that cannot be done as a normal user.
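
A minimal sketch of what such a privsep entrypoint can look like with oslo.privsep; the context name, config section, and helper function are illustrative, not Cinder's actual wrapper:

    from oslo_privsep import capabilities as caps
    from oslo_privsep import priv_context

    # Illustrative context; Cinder defines its own under cinder/privsep/.
    sys_admin_pctxt = priv_context.PrivContext(
        'example',
        cfg_section='example_sys_admin',
        pypath=__name__ + '.sys_admin_pctxt',
        capabilities=[caps.CAP_SYS_ADMIN],
    )

    @sys_admin_pctxt.entrypoint
    def create_subsystem(nqn):
        # Runs inside the privileged daemon, where configfs is writable.
        import nvmet
        nvmet.Subsystem(nqn=nqn, mode='create')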

    The nvmet wrapper doesn't provide privsep support for ALL operations,
    only for those that we currently use.

    Because privsep has a client-server architecture and nvmet takes
    non-primitive instances as parameters, the privsep wrapper needs custom
    serialization code to pass these instances across the boundary.
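
The idea, sketched below with hypothetical helpers, is to reduce each nvmet object to the primitive identifiers needed to look it up again on the privileged side:

    import nvmet

    # Hypothetical (de)serialization: nvmet objects cannot cross the privsep
    # boundary, but the identifiers needed to rebuild them can.
    def serialize(obj):
        if isinstance(obj, nvmet.Subsystem):
            return ('subsystem', obj.nqn)
        if isinstance(obj, nvmet.Port):
            return ('port', obj.portid)
        return ('primitive', obj)

    def deserialize(data):
        kind, value = data
        if kind == 'subsystem':
            return nvmet.Subsystem(nqn=value, mode='lookup')
        if kind == 'port':
            return nvmet.Port(nvmet.Root(), value, mode='lookup')
        return value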

    As a side effect of the refactoring we also fix a bug where we tried to
    create the port over and over again on each create_export call, which
    resulted in nvme kernel warning logs.
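
A minimal sketch of the idempotent pattern, assuming nvmetcli raises CFSNotFound for lookups of missing objects; the port id, transport, and address are placeholders:

    import nvmet

    root = nvmet.Root()
    try:
        # Reuse the port if a previous create_export already made it.
        port = nvmet.Port(root, 1, mode='lookup')
    except nvmet.nvme.CFSNotFound:
        port = nvmet.Port(root, 1, mode='create')
        port.set_attr('addr', 'trtype', 'tcp')        # placeholder transport
        port.set_attr('addr', 'traddr', '192.0.2.1')  # placeholder address
        port.set_attr('addr', 'trsvcid', '4420')      # placeholder service id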

    Closes-Bug: #1964391
    Closes-Bug: #1964394
    Change-Id: Icae9802713867fa148bc041c86beb010086dacc9

Changed in cinder:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 22.0.0.0rc1

This issue was fixed in the openstack/cinder 22.0.0.0rc1 release candidate.
