[SRU] Ubuntu 22.04 - NVMe TCP - Host fails to reconnect to target after link down/link up sequence

Bug #1989990 reported by Narendra K
This bug affects 1 person
Affects          Status        Importance  Assigned to   Milestone
linux (Ubuntu)   Invalid       Undecided   Michael Reed
Jammy            Fix Released  Medium      Michael Reed

Bug Description

[Impact]
An Ubuntu 22.04 host fails to reconnect to the NVMe TCP target after a link-down event if the number of queues has changed while the link was down.

[Fix]
The following upstream patch set addresses the issue.

1. nvmet: Expose max queues to configfs
https://git.infradead.org/nvme.git/commit/2c4282742d049e2a5ab874e2b359a2421b9377c2

2. nvme-tcp: Handle number of queue changes
https://git.infradead.org/nvme.git/commit/516204e486a19d03962c2757ef49782e6c1cacf4

3. nvme-rdma: Handle number of queue changes
https://git.infradead.org/nvme.git/commit/e800278c1dc97518eab1970f8f58a5aad52b0f86

Patch 2 above addresses the failure to reconnect in the NVMe TCP scenario.

In addition, the following patch fixes an error-code parsing issue in the reconnect sequence.

nvme-fabrics: parse nvme connect Linux error codes
https://git.infradead.org/nvme.git/commit/ec9e96b5230148294c7abcaf3a4c592d3720b62d
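
As an illustration of the error-code parsing issue, the Python sketch below shows how a negative Linux errno can turn into the odd value -16389 that appears in the dmesg log further down in this report. It assumes that the "error wo/DNR bit" log line prints the return value with the NVMe Do Not Retry bit (NVME_SC_DNR, 0x4000) masked off, and that the underlying return value was -EIO (-5), matching the "ret=-5" lines. Both are readings of the log, not statements from the patch authors, and the function name strip_dnr is hypothetical.

```python
import errno

NVME_SC_DNR = 0x4000  # "Do Not Retry" bit in an NVMe status code


def strip_dnr(errval: int) -> int:
    """Clear the DNR bit, as the "error wo/DNR bit" log line appears to do."""
    return errval & ~NVME_SC_DNR


# A positive NVMe status keeps its meaning once DNR is cleared.
nvme_status = 0x4182  # hypothetical status value with the DNR bit set
print(hex(strip_dnr(nvme_status)))

# A negative Linux errno is not an NVMe status, so masking it gives a
# misleading number instead of the original errno.
print(strip_dnr(-errno.EIO))
```

Under these assumptions, masking -5 gives -16389, which is why a negative errno is better reported as-is than pushed through the NVMe status path.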

[Test Plan]
1. Boot into the Ubuntu 22.04 kernel without the fix.

2. Establish a connection to the NVMe TCP target.

3. Take the NIC link down and bring it back up after 10 seconds. While the link is down, increase the number of queues assigned to the controller on the target.

4. Observe that the connection to the target is lost and that, after the link comes up, the controller on the host tries to re-establish the connection.

5. With the patch, reconnection succeeds with the higher number of queues.
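
The host side of step 3 could be scripted roughly as follows. This is a minimal sketch, not part of the original test plan: it assumes the iproute2 `ip link set dev <iface> down|up` commands are used to toggle the link, the function names are hypothetical, and the interface name is the one that appears in the dmesg output later in this report. The target-side queue change is not covered, since it depends on the target in use. The script defaults to a dry run that only prints the commands; running them for real requires root.

```python
import subprocess
import time


def link_toggle_commands(iface: str) -> list:
    """Return the iproute2 commands that take a link down and bring it back up."""
    return [
        ["ip", "link", "set", "dev", iface, "down"],
        ["ip", "link", "set", "dev", iface, "up"],
    ]


def toggle_link(iface: str, down_seconds: float = 10.0, dry_run: bool = True) -> list:
    """Take iface down, wait down_seconds, then bring it up again."""
    down_cmd, up_cmd = link_toggle_commands(iface)
    for cmd, pause in ((down_cmd, down_seconds), (up_cmd, 0.0)):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
            # While the link is down, the queue count is changed on the target.
            time.sleep(pause)
    return [down_cmd, up_cmd]


if __name__ == "__main__":
    # Interface name taken from the dmesg output in this report; adjust as needed.
    toggle_link("eno33np0", dry_run=True)
```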

[Where problems could occur]

Regression risk is low to medium.

[Other Info]

Test Kernel Source

https://code.launchpad.net/~mreed8855/ubuntu/+source/linux/+git/jammy/+ref/lp_1989990_nvme_tcp


Narendra K (knarendra)
information type: Public → Private
Michael Reed (mreed8855)
Changed in linux (Ubuntu):
assignee: nobody → Michael Reed (mreed8855)
Michael Reed (mreed8855) wrote :

I have created a test kernel in jammy with the patches listed in the description. Please test it.

https://people.canonical.com/~mreed/dell/nvme/bug_1989990/

description: updated
Michael Reed (mreed8855)
description: updated
summary: - Ubuntu 22.04 - NVMe TCP - Host fails to reconnect to target after link
- down/link up sequence
+ [SRU]Ubuntu 22.04 - NVMe TCP - Host fails to reconnect to target after
+ link down/link up sequence
summary: - [SRU]Ubuntu 22.04 - NVMe TCP - Host fails to reconnect to target after
+ [SRU] Ubuntu 22.04 - NVMe TCP - Host fails to reconnect to target after
link down/link up sequence
Changed in linux (Ubuntu Jammy):
assignee: nobody → Michael Reed (mreed8855)
status: New → In Progress
Changed in linux (Ubuntu):
status: New → In Progress
Narendra K (knarendra) wrote :

Hi Michael,

We tested the kernel provided in comment #1:

linux-image-unsigned-5.15.0-50-generic_5.15.0-50.56_amd64.deb

It helps resolve the issue.

Kernel without the patch:

dmesg log when the issue reproduces:

[421826.541979] nvme nvme68: Connect command failed, error wo/DNR bit: -16389
[421826.541982] nvme nvme69: Connect command failed, error wo/DNR bit: -16389
[421826.542746] nvme nvme68: failed to connect queue: 9 ret=-5
[421826.543455] nvme nvme69: failed to connect queue: 9 ret=-5
[421826.555194] nvme nvme69: Failed reconnect attempt 1
[421826.555359] nvme nvme69: Reconnecting in 10 seconds...
[421826.564122] nvme nvme70: Connect command failed, error wo/DNR bit: -16389
[421826.569191] nvme nvme68: Failed reconnect attempt 1
[421826.569580] nvme nvme68: Reconnecting in 10 seconds...
[421826.569591] nvme nvme70: failed to connect queue: 9 ret=-5
[421826.583034] nvme nvme70: Failed reconnect attempt 1
[421826.583152] nvme nvme70: Reconnecting in 10 seconds...
[421827.813932] nvme nvme4: creating 64 I/O queues.
[421827.834123] nvme nvme4: mapped 64/0/0 default/read/poll queues.
[421827.838274] nvme nvme4: Successfully reconnected (1 attempt)
[421836.773770] nvme nvme69: queue_size 128 > ctrl sqsize 64, clamping down
[421836.773828] nvme nvme69: creating 64 I/O queues.
[421836.774150] nvme nvme68: queue_size 128 > ctrl sqsize 64, clamping down
[421836.774222] nvme nvme68: creating 64 I/O queues.
[421836.777739] nvme nvme70: queue_size 128 > ctrl sqsize 64, clamping down
[421836.777800] nvme nvme70: creating 64 I/O queues.
[421836.781770] nvme nvme69: Connect command failed, error wo/DNR bit: -16389
[421836.781807] nvme nvme68: Connect command failed, error wo/DNR bit: -16389
[421836.782548] nvme nvme69: failed to connect queue: 9 ret=-5
[421836.783229] nvme nvme68: failed to connect queue: 9 ret=-5
[421836.791938] nvme nvme68: Failed reconnect attempt 2
[421836.792048] nvme nvme68: Reconnecting in 10 seconds...
[421836.808276] nvme nvme69: Failed reconnect attempt 2
[421836.808278] nvme nvme69: Reconnecting in 10 seconds...
[421836.808632] nvme nvme70: Connect command failed, error wo/DNR bit: -16389
[421836.812815] nvme nvme70: failed to connect queue: 9 ret=-5
[421836.814891] nvme nvme70: Failed reconnect attempt 2
[421836.814894] nvme nvme70: Reconnecting in 10 seconds...
[421847.013870] nvme nvme69: queue_size 128 > ctrl sqsize 64, clamping down
[421847.013901] nvme nvme68: queue_size 128 > ctrl sqsize 64, clamping down

Without the patch, nvme68, nvme69 and nvme70 fail to reconnect after a link down/up sequence.
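
A log like the one above can be summarized per controller with a short script. The sketch below is an illustration only: it assumes the two message formats seen in this report ("Failed reconnect attempt N" and "Successfully reconnected (N attempt)"), and the function name reconnect_summary is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "[421826.555194] nvme nvme69: Failed reconnect attempt 1"
LINE_RE = re.compile(
    r"nvme (nvme\d+): (Failed reconnect attempt|Successfully reconnected)"
)


def reconnect_summary(dmesg_text: str) -> dict:
    """Map each controller to its failed-attempt count and latest outcome."""
    summary = defaultdict(lambda: {"failed_attempts": 0, "reconnected": False})
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        ctrl, event = match.groups()
        if event.startswith("Failed"):
            summary[ctrl]["failed_attempts"] += 1
            summary[ctrl]["reconnected"] = False
        else:
            summary[ctrl]["reconnected"] = True
    return dict(summary)


# Three lines copied from the log above.
sample = """\
[421826.555194] nvme nvme69: Failed reconnect attempt 1
[421827.838274] nvme nvme4: Successfully reconnected (1 attempt)
[421836.808276] nvme nvme69: Failed reconnect attempt 2
"""
for ctrl, state in sorted(reconnect_summary(sample).items()):
    print(ctrl, state)
```

Fed the full output of dmesg, the same function would flag any controller whose latest event is a failed attempt.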

dmesg log with the test kernel that includes the fix:

[ 647.634154] nvme nvme70: queue 0: timeout request 0x0 type 4
[ 647.634163] nvme nvme70: starting error recovery
[ 647.634198] nvme nvme69: queue 0: timeout request 0x0 type 4
[ 647.634205] nvme nvme69: starting error recovery
[ 647.634210] nvme nvme68: queue 0: timeout request 0x0 type 4
[ 647.634212] nvme nvme68: starting error recovery
[ 647.634427] nvme nvme70: failed nvme_keep_alive_end_io error=10
[ 647.634452] nvme nvme68: failed nvme_keep_alive_end_io error=10
[ 647.634455] nvme nvme69: failed nvme_keep_alive_end_io error=10
[ 647.650152] nvme nvme69: Reconnectin...


Narendra K (knarendra) wrote :

Hi Michael,

Please help include the patches in the current SRU.

Narendra K (knarendra)
description: updated
description: updated
Michael Reed (mreed8855)
description: updated
information type: Private → Public
Stefan Bader (smb)
Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
Changed in linux (Ubuntu):
status: In Progress → Invalid
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.15.0-59.65 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux verification-needed-jammy
Narendra K (knarendra) wrote :

Hi,

We tried to reproduce the issue with kernel 'linux-image-unsigned-5.15.0-59-generic_5.15.0-59.65_amd64.deb' from the -proposed repository. The issue is not observed.

After a link down/up sequence, NVMe controllers 4, 68, 69 and 70 reconnect successfully.

[ 793.550538] nvme nvme4: queue 0: timeout request 0x0 type 4
[ 793.550544] nvme nvme4: starting error recovery
[ 793.552141] nvme nvme4: failed nvme_keep_alive_end_io error=10
[ 793.567947] nvme nvme4: Reconnecting in 10 seconds...
[ 794.574539] nvme nvme70: queue 0: timeout request 0x0 type 4
[ 794.574543] nvme nvme70: starting error recovery
[ 794.574544] nvme nvme68: queue 0: timeout request 0x0 type 4
[ 794.574548] nvme nvme69: queue 0: timeout request 0x0 type 4
[ 794.574549] nvme nvme68: starting error recovery
[ 794.574550] nvme nvme69: starting error recovery
[ 794.574768] nvme nvme70: failed nvme_keep_alive_end_io error=10
[ 794.574793] nvme nvme69: failed nvme_keep_alive_end_io error=10
[ 794.574877] nvme nvme68: failed nvme_keep_alive_end_io error=10
[ 794.591403] nvme nvme70: Reconnecting in 10 seconds...
[ 794.591628] nvme nvme69: Reconnecting in 10 seconds...
[ 794.594555] nvme nvme68: Reconnecting in 10 seconds...
[ 796.631586] IPv6: ADDRCONF(NETDEV_CHANGE): eno33np0: link becomes ready
[ 803.632108] nvme nvme4: creating 64 I/O queues.
[ 803.668542] nvme nvme4: mapped 64/0/0 default/read/poll queues.
[ 803.671517] nvme nvme4: Successfully reconnected (1 attempt)
[ 804.655794] nvme nvme70: queue_size 128 > ctrl sqsize 64, clamping down
[ 804.655886] nvme nvme70: creating 64 I/O queues.
[ 804.655961] nvme nvme68: queue_size 128 > ctrl sqsize 64, clamping down
[ 804.655994] nvme nvme69: queue_size 128 > ctrl sqsize 64, clamping down
[ 804.656042] nvme nvme68: creating 64 I/O queues.
[ 804.656043] nvme nvme69: creating 64 I/O queues.
[ 804.669742] nvme nvme69: mapped 64/0/0 default/read/poll queues.
[ 804.669761] nvme nvme70: mapped 64/0/0 default/read/poll queues.
[ 804.669773] nvme nvme68: mapped 64/0/0 default/read/poll queues.
[ 804.685893] nvme nvme70: Successfully reconnected (1 attempt)
[ 804.702605] nvme nvme69: Successfully reconnected (1 attempt)
[ 804.722602] nvme nvme68: Successfully reconnected (1 attempt)

tags: added: verification-done-jammy
removed: verification-needed-jammy
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.15.0-60.66

---------------
linux (5.15.0-60.66) jammy; urgency=medium

  * jammy/linux: 5.15.0-60.66 -proposed tracker (LP: #2003450)

  * Revoke & rotate to new signing key (LP: #2002812)
    - [Packaging] Revoke and rotate to new signing key

linux (5.15.0-59.65) jammy; urgency=medium

  * jammy/linux: 5.15.0-59.65 -proposed tracker (LP: #2001801)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * CVE-2022-47940
    - ksmbd: validate length in smb2_write()

  * Fix iosm: WWAN cannot build the connection (DW5823e) (LP: #1998115)
    - net: wwan: iosm: fix driver not working with INTEL_IOMMU disabled
    - [Config] CONFIG_IOSM update annotations on arm64 armhf ppc64el s390x

  * support for same series backports versioning numbers (LP: #1993563)
    - [Packaging] sameport -- add support for sameport versioning

  * [DEP-8] Run ADT regression suite for lowlatency kernels Jammy and later
    (LP: #1999528)
    - [DEP-8] Fix regression suite to run on lowlatency

  * Micron NVME storage failure [1344,5407] (LP: #1998883)
    - nvme: add a bogus subsystem NQN quirk for Micron MTFDKBA2T0TFH

  * Jammy update: v5.15.78 upstream stable release (LP: #1998843)
    - scsi: lpfc: Rework MIB Rx Monitor debug info logic
    - serial: ar933x: Deassert Transmit Enable on ->rs485_config()
    - KVM: x86: Trace re-injected exceptions
    - KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1)
    - drm/amd/display: explicitly disable psr_feature_enable appropriately
    - mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page
    - HID: playstation: add initial DualSense Edge controller support
    - KVM: x86: Protect the unused bits in MSR exiting flags
    - KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()
    - KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTER
    - RDMA/cma: Use output interface for net_dev check
    - IB/hfi1: Correctly move list in sc_disable()
    - RDMA/hns: Remove magic number
    - RDMA/hns: Use hr_reg_xxx() instead of remaining roce_set_xxx()
    - RDMA/hns: Disable local invalidate operation
    - NFSv4: Fix a potential state reclaim deadlock
    - NFSv4.1: Handle RECLAIM_COMPLETE trunking errors
    - NFSv4.1: We must always send RECLAIM_COMPLETE after a reboot
    - SUNRPC: Fix null-ptr-deref when xps sysfs alloc failed
    - NFSv4.2: Fixup CLONE dest file size for zero-length count
    - nfs4: Fix kmemleak when allocate slot failed
    - net: dsa: Fix possible memory leaks in dsa_loop_init()
    - RDMA/core: Fix null-ptr-deref in ib_core_cleanup()
    - RDMA/qedr: clean up work queue on failure in qedr_alloc_resources()
    - net: dsa: fall back to default tagger if we can't load the one from DT
    - nfc: fdp: Fix potential memory leak in fdp_nci_send()
    - nfc: nxp-nci: Fix potential memory leak in nxp_nci_send()
    - nfc: s3fwrn5: Fix potential memory leak in s3fwrn5_nci_send()
    - nfc: nfcmrvl: Fix potential memory leak in nfcmrvl_i2c_nci_send()
    - net: fec: fix improper use of NETDEV_TX_BUSY
    - ata: pata_legacy: fix pdc20230_set_piomode()
    - net: sched: Fix use after free in red_...

Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.15.0-1034.41 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.


tags: added: kernel-spammed-jammy-linux-azure verification-needed-jammy
removed: verification-done-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws/5.15.0-1031.35 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.


tags: added: kernel-spammed-jammy-linux-aws
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-realtime/5.15.0-1033.36 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.


tags: added: kernel-spammed-jammy-linux-realtime
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws-5.15/5.15.0-1046.51~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-aws-5.15' to 'verification-done-focal-linux-aws-5.15'. If the problem still exists, change the tag 'verification-needed-focal-linux-aws-5.15' to 'verification-failed-focal-linux-aws-5.15'.


tags: added: kernel-spammed-focal-linux-aws-5.15-v2 verification-needed-focal-linux-aws-5.15
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-mtk/5.15.0-1030.34 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-mtk' to 'verification-done-jammy-linux-mtk'. If the problem still exists, change the tag 'verification-needed-jammy-linux-mtk' to 'verification-failed-jammy-linux-mtk'.


tags: added: kernel-spammed-jammy-linux-mtk-v2 verification-needed-jammy-linux-mtk