NVMe connect fails with multiple NICs on storage subnet

Bug #2077558 reported by Simon Dodsley
This bug affects 1 person
Affects: os-brick
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

If a server has multiple NICs connected to the storage subnet that provides connectivity to the NVMe backend, but not all of them are NVMe-capable interfaces (specifically for NVMe-RoCE, which requires the NIC to be an approved Mellanox card, e.g. a CX-5), then the nvme connect command will fail because the `-w` switch isn't specified to select the correct NIC for the RDMA transport.
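
For illustration, the general shape of a working command (the -w switch, i.e. --host-traddr, is the existing nvme-cli option that selects the local address to connect from; the placeholders below are not literal values) would be:

# nvme connect -t rdma -a <target portal IP> -s 4420 -n <subsystem NQN> -w <IP on the RoCE-capable NIC>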

Failure traceback:

Aug 21 16:23:27 np0000004186 cinder-volume[87682]: DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): nvme connect -a 192.168.1.68 -s 4420 -t rdma -n nqn.2010-06.com.purestorage:flasharray.4118da8397e32fc7 -Q 128 -l -1 {{(pid=88989) execute /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/processutils.py:390}}
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: DEBUG oslo_concurrency.processutils [-] CMD "nvme connect -a 192.168.1.68 -s 4420 -t rdma -n nqn.2010-06.com.purestorage:flasharray.4118da8397e32fc7 -Q 128 -l -1" returned: 104 in 0.004s {{(pid=88989) execute /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/processutils.py:428}}
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: DEBUG oslo_concurrency.processutils [-] 'nvme connect -a 192.168.1.68 -s 4420 -t rdma -n nqn.2010-06.com.purestorage:flasharray.4118da8397e32fc7 -Q 128 -l -1' failed. Not Retrying. {{(pid=88989) execute /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/processutils.py:479}}
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: DEBUG oslo.privsep.daemon [-] privsep: Exception during request[2bbd4fc8-342a-48e3-832c-d43a2db0cfc9]: Unexpected error while running command.
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Command: nvme connect -a 192.168.1.68 -s 4420 -t rdma -n nqn.2010-06.com.purestorage:flasharray.4118da8397e32fc7 -Q 128 -l -1
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Exit code: 104
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Stdout: ''
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Stderr: 'Failed to write to /dev/nvme-fabrics: Connection reset by peer\n' {{(pid=88989) _process_cmd /opt/stack/data/venv/lib/python3.10/site-packages/oslo_privsep/daemon.py:477}}
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Traceback (most recent call last):
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_privsep/daemon.py", line 474, in _process_cmd
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: ret = func(*f_args, **f_kwargs)
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_privsep/priv_context.py", line 274, in _wrap
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: return func(*args, **kwargs)
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: File "/opt/stack/os-brick/os_brick/privileged/rootwrap.py", line 197, in execute_root
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: return custom_execute(*cmd, shell=False, run_as_root=False, **kwargs)
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: File "/opt/stack/os-brick/os_brick/privileged/rootwrap.py", line 145, in custom_execute
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: return putils.execute(on_execute=on_execute,
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/processutils.py", line 444, in execute
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: raise ProcessExecutionError(exit_code=_returncode,
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Command: nvme connect -a 192.168.1.68 -s 4420 -t rdma -n nqn.2010-06.com.purestorage:flasharray.4118da8397e32fc7 -Q 128 -l -1
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Exit code: 104
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Stdout: ''
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: Stderr: 'Failed to write to /dev/nvme-fabrics: Connection reset by peer\n'
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: DEBUG oslo.privsep.daemon [-] privsep: reply[2bbd4fc8-342a-48e3-832c-d43a2db0cfc9]: (5, 'oslo_concurrency.processutils.ProcessExecutionError', ('', 'Failed to write to /dev/nvme-fabrics: Connection reset by peer\n', 104, 'nvme connect -a 192.168.1.68 -s 4420 -t rdma -n nqn.2010-06.com.purestorage:flasharray.4118da8397e32fc7 -Q 128 -l -1', None)) {{(pid=88989) _call_back /opt/stack/data/venv/lib/python3.10/site-packages/oslo_privsep/daemon.py:499}}
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: ERROR os_brick.initiator.connectors.nvmeof [None req-122c236e-49bd-46e0-af7c-20a64bb9f465 tempest-TestInstancesWithCinderVolumes-1040753460 None] Could not connect to Portal rdma at 192.168.1.68:4420 (ctrl: None): exit_code: 104, stdout: "", stderr: "Failed to write to /dev/nvme-fabrics: Connection reset by peer
Aug 21 16:23:27 np0000004186 cinder-volume[87682]: ",: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

For this example, the node running os-brick has these physical ports:

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether fa:16:3e:58:03:b7 brd ff:ff:ff:ff:ff:ff
    inet 10.241.128.141/24 brd 10.241.128.255 scope global dynamic enp3s0
       valid_lft 40152sec preferred_lft 40152sec
    inet6 fe80::f816:3eff:fe58:3b7/64 scope link
       valid_lft forever preferred_lft forever
3: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether fa:16:3e:56:fb:ee brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.146/24 brd 192.168.1.255 scope global dynamic enp4s0
       valid_lft 40152sec preferred_lft 40152sec
    inet6 fe80::f816:3eff:fe56:fbee/64 scope link
       valid_lft forever preferred_lft forever
4: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 6a:c2:78:28:09:50 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.246/24 scope global enp6s0
       valid_lft forever preferred_lft forever
    inet6 fe80::68c2:78ff:fe28:950/64 scope link
       valid_lft forever preferred_lft forever

Notice that enp4s0 and enp6s0 are both on the storage subnet, but only enp6s0 is a Mellanox RoCE-capable NIC; enp4s0 is a virtio NIC.
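
As a cross-check (shown purely for illustration; device names will differ between environments), the RDMA-capable interfaces can be listed from sysfs or with the iproute2 rdma tool, and on this node only enp6s0 would be expected to appear, backed by the ConnectX-5 VF:

# ls /sys/class/infiniband/*/device/net/
# rdma link show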

# lspci
...
03:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
04:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
05:00.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
06:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
...

The routing table has routes via both storage-subnet NICs to reach the NVMe ports on the backend:

# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 10.241.128.1 0.0.0.0 UG 0 0 0 enp3s0
10.241.128.0 0.0.0.0 255.255.255.0 U 0 0 0 enp3s0
169.254.169.254 192.168.1.100 255.255.255.255 UGH 0 0 0 enp4s0
172.24.5.0 0.0.0.0 255.255.255.0 U 0 0 0 br-ex
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 enp4s0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 enp6s0
192.168.122.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr0
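
Both 192.168.1.0/24 routes have the same metric, so which source interface the kernel selects for the portal address is outside os-brick's control. This can be checked with, for example:

# ip route get 192.168.1.68

If this resolves via enp4s0 (the virtio NIC) rather than enp6s0, the connection attempt is not going over the RoCE-capable interface, which is consistent with the connect failure in the traceback above.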

To resolve this, the nvme connect command needs to be given the IP address of the Mellanox interface via the -w (--host-traddr) switch so that the connection is made over the RoCE-capable NIC.
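
With the addresses from this environment, a working command would look something like (illustrative only):

# nvme connect -a 192.168.1.68 -s 4420 -t rdma -n nqn.2010-06.com.purestorage:flasharray.4118da8397e32fc7 -Q 128 -l -1 -w 192.168.1.246

where 192.168.1.246 is the address of enp6s0, the ConnectX-5 interface.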
