Call trace during manual controller reset on NVMe/RoCE array and switch reset on NVMe/IB array

Bug #1873952 reported by Jennifer Duong
This bug affects 2 people
Affects: nvme-cli (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

After manually resetting one of my E-Series NVMe/RoCE controllers, I hit the following call trace:

Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958231] workqueue: WQ_MEM_RECLAIM nvme-wq:nvme_rdma_reconnect_ctrl_work [nvme_rdma] is flushing !WQ_MEM_RECLAIM ib_addr:process_one_req [ib_core]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958244] WARNING: CPU: 11 PID: 6260 at kernel/workqueue.c:2610 check_flush_dependency+0x11c/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958245] Modules linked in: xfs nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rpcrdma rdma_ucm ib_iser ib_umad libiscsi ib_ipoib scsi_transport_iscsi intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm intel_cstate intel_rapl_perf joydev input_leds dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel nvme_rdma rdma_cm iw_cm ib_cm nvme_fabrics nvme_core sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs uas usb_storage ib_core hid_generic usbhid hid mgag200 crct10dif_pclmul drm_vram_helper crc32_pclmul i2c_algo_bit ttm ghash_clmulni_intel drm_kms_helper ixgbe aesni_intel syscopyarea sysfillrect mxm_wmi xfrm_algo sysimgblt crypto_simd mlx5_core fb_sys_fops dca cryptd drm glue_helper mdio pci_hyperv_intf ahci lpc_ich tg3
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958305] tls libahci mlxfw wmi scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958315] CPU: 11 PID: 6260 Comm: kworker/u34:3 Not tainted 5.4.0-24-generic #28-Ubuntu
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958316] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.8.0 005/17/2018
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958321] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958326] RIP: 0010:check_flush_dependency+0x11c/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958329] Code: 8d 8b b0 00 00 00 48 8b 50 18 4d 89 e0 48 8d b1 b0 00 00 00 48 c7 c7 40 f8 75 9d 4c 89 c9 c6 05 f1 d9 74 01 01 e8 1f 14 fe ff <0f> 0b e9 07 ff ff ff 80 3d df d9 74 01 00 75 92 e9 3c ff ff ff 66
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958331] RSP: 0018:ffffb34bc4e87bf0 EFLAGS: 00010086
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958333] RAX: 0000000000000000 RBX: ffff946423812400 RCX: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958334] RDX: 0000000000000089 RSI: ffffffff9df926a9 RDI: 0000000000000046
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958336] RBP: ffffb34bc4e87c10 R08: ffffffff9df92620 R09: 0000000000000089
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958337] R10: ffffffff9df92a00 R11: 000000009df9268f R12: ffffffffc09be560
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958338] R13: ffff9468238b2f00 R14: 0000000000000001 R15: ffff94682dbbb700
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958341] FS: 0000000000000000(0000) GS:ffff94682fd40000(0000) knlGS:0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958344] CR2: 00007ff61cbf4ff8 CR3: 000000040a40a001 CR4: 00000000003606e0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958345] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958347] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958348] Call Trace:
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958355] __flush_work+0x97/0x1d0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958360] __cancel_work_timer+0x10e/0x190
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958368] ? dev_printk_emit+0x4e/0x65
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958371] cancel_delayed_work_sync+0x13/0x20
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958387] rdma_addr_cancel+0x8a/0xb0 [ib_core]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958393] cma_cancel_operation+0x72/0x1e0 [rdma_cm]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958398] rdma_destroy_id+0x56/0x2f0 [rdma_cm]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958402] nvme_rdma_alloc_queue.cold+0x28/0x5b [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958406] nvme_rdma_setup_ctrl+0x37/0x720 [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958412] ? try_to_wake_up+0x224/0x6a0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958416] nvme_rdma_reconnect_ctrl_work+0x27/0x40 [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958419] process_one_work+0x1eb/0x3b0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958422] worker_thread+0x4d/0x400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958427] kthread+0x104/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958430] ? process_one_work+0x3b0/0x3b0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958432] ? kthread_park+0x90/0x90
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958439] ret_from_fork+0x35/0x40
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958442] ---[ end trace 859f78e32cc2aa61 ]---
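For reference, the WARNING comes from check_flush_dependency() in kernel/workqueue.c: a work item running on a WQ_MEM_RECLAIM workqueue (here nvme-wq, running nvme_rdma_reconnect_ctrl_work) is waiting on work queued to a workqueue created without WQ_MEM_RECLAIM (here ib_addr, running process_one_req). Under memory pressure the non-reclaim queue has no rescuer thread and may be unable to make forward progress, so the reclaim-capable queue could deadlock waiting on it, which is why the kernel warns. The minimal out-of-tree module sketch below reproduces that pattern; all names (demo_reclaim_wq, demo_plain_wq, demo_*_fn) are made up for illustration and are not taken from nvme_rdma or ib_core.

// Hypothetical module illustrating the check_flush_dependency() rule.
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_reclaim_wq;  /* like nvme-wq: WQ_MEM_RECLAIM  */
static struct workqueue_struct *demo_plain_wq;    /* like ib_addr: !WQ_MEM_RECLAIM */

static void demo_plain_fn(struct work_struct *w) { }
static DECLARE_WORK(demo_plain_work, demo_plain_fn);

static void demo_reclaim_fn(struct work_struct *w)
{
	/*
	 * Running on a WQ_MEM_RECLAIM workqueue while flushing work that
	 * lives on a !WQ_MEM_RECLAIM workqueue is exactly what triggers the
	 * "is flushing !WQ_MEM_RECLAIM" WARNING seen above: demo_plain_wq
	 * has no rescuer and may stall under memory pressure while
	 * demo_reclaim_wq sits waiting on it.
	 */
	flush_work(&demo_plain_work);
}
static DECLARE_WORK(demo_reclaim_work, demo_reclaim_fn);

static int __init demo_init(void)
{
	demo_reclaim_wq = alloc_workqueue("demo_reclaim_wq", WQ_MEM_RECLAIM, 0);
	/* The usual fix is to create the flushed queue with WQ_MEM_RECLAIM too. */
	demo_plain_wq = alloc_workqueue("demo_plain_wq", 0, 0);
	if (!demo_reclaim_wq || !demo_plain_wq) {
		if (demo_reclaim_wq)
			destroy_workqueue(demo_reclaim_wq);
		if (demo_plain_wq)
			destroy_workqueue(demo_plain_wq);
		return -ENOMEM;
	}

	queue_work(demo_plain_wq, &demo_plain_work);
	queue_work(demo_reclaim_wq, &demo_reclaim_work);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_reclaim_wq);
	destroy_workqueue(demo_plain_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");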

This consistently occurs on my direct-connect host but not on my fabric-attached hosts. I'm running Ubuntu 20.04 with kernel 5.4.0-24. I have the following Mellanox cards installed:

MCX416A-CCAT FW 12.27.1016
MCX4121A-ACAT FW 14.27.1016

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: nvme-cli 1.9-1
ProcVersionSignature: Ubuntu 5.4.0-24.28-generic 5.4.30
Uname: Linux 5.4.0-24-generic x86_64
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
CasperMD5CheckResult: skip
Date: Fri Apr 17 14:28:50 2020
InstallationDate: Installed on 2020-04-15 (2 days ago)
InstallationMedia: Ubuntu-Server 20.04 LTS "Focal Fossa" - Alpha amd64 (20200124)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: nvme-cli
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.nvme.hostnqn: ictm1611s01h4-hostnqn
mtime.conffile..etc.nvme.hostnqn: 2020-04-15T13:43:48.076829

Revision history for this message
Jennifer Duong (jduong) wrote :
summary: - Call trace during manual controller reset
+ Call trace during manual controller reset on NVMe/RoCE array
Revision history for this message
Jennifer Duong (jduong) wrote : Re: Call trace during manual controller reset on NVMe/RoCE array

I am still seeing this with Ubuntu 20.04 LTS

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvme-cli (Ubuntu):
status: New → Confirmed
Revision history for this message
Jennifer Duong (jduong) wrote :

This call trace is also seen while manually resetting an NVIDIA Mellanox InfiniBand switch that is connected to an NVMe/IB EF600 storage array. The server has an MCX354A-FCBT installed running FW 2.42.5000. The system is connected to a QM8700 and an SB7800; both switches are running MLNX-OS 3.9.2110. The message logs have been attached.

summary: - Call trace during manual controller reset on NVMe/RoCE array
+ Call trace during manual controller reset on NVMe/RoCE array and switch
+ reset on NVMe/IB array