Call trace during manual controller reset on NVMe/RoCE array and switch reset on NVMe/IB array

Bug #1873952 reported by Jennifer Duong
This bug affects 2 people
Affects: nvme-cli (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

After manually resetting one of my E-Series NVMe/RoCE controllers, I hit the following call trace:

Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958231] workqueue: WQ_MEM_RECLAIM nvme-wq:nvme_rdma_reconnect_ctrl_work [nvme_rdma] is flushing !WQ_MEM_RECLAIM ib_addr:process_one_req [ib_core]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958244] WARNING: CPU: 11 PID: 6260 at kernel/workqueue.c:2610 check_flush_dependency+0x11c/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958245] Modules linked in: xfs nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rpcrdma rdma_ucm ib_iser ib_umad libiscsi ib_ipoib scsi_transport_iscsi intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm intel_cstate intel_rapl_perf joydev input_leds dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel nvme_rdma rdma_cm iw_cm ib_cm nvme_fabrics nvme_core sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs uas usb_storage ib_core hid_generic usbhid hid mgag200 crct10dif_pclmul drm_vram_helper crc32_pclmul i2c_algo_bit ttm ghash_clmulni_intel drm_kms_helper ixgbe aesni_intel syscopyarea sysfillrect mxm_wmi xfrm_algo sysimgblt crypto_simd mlx5_core fb_sys_fops dca cryptd drm glue_helper mdio pci_hyperv_intf ahci lpc_ich tg3
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958305] tls libahci mlxfw wmi scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958315] CPU: 11 PID: 6260 Comm: kworker/u34:3 Not tainted 5.4.0-24-generic #28-Ubuntu
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958316] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.8.0 005/17/2018
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958321] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958326] RIP: 0010:check_flush_dependency+0x11c/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958329] Code: 8d 8b b0 00 00 00 48 8b 50 18 4d 89 e0 48 8d b1 b0 00 00 00 48 c7 c7 40 f8 75 9d 4c 89 c9 c6 05 f1 d9 74 01 01 e8 1f 14 fe ff <0f> 0b e9 07 ff ff ff 80 3d df d9 74 01 00 75 92 e9 3c ff ff ff 66
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958331] RSP: 0018:ffffb34bc4e87bf0 EFLAGS: 00010086
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958333] RAX: 0000000000000000 RBX: ffff946423812400 RCX: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958334] RDX: 0000000000000089 RSI: ffffffff9df926a9 RDI: 0000000000000046
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958336] RBP: ffffb34bc4e87c10 R08: ffffffff9df92620 R09: 0000000000000089
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958337] R10: ffffffff9df92a00 R11: 000000009df9268f R12: ffffffffc09be560
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958338] R13: ffff9468238b2f00 R14: 0000000000000001 R15: ffff94682dbbb700
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958341] FS: 0000000000000000(0000) GS:ffff94682fd40000(0000) knlGS:0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958344] CR2: 00007ff61cbf4ff8 CR3: 000000040a40a001 CR4: 00000000003606e0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958345] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958347] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958348] Call Trace:
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958355] __flush_work+0x97/0x1d0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958360] __cancel_work_timer+0x10e/0x190
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958368] ? dev_printk_emit+0x4e/0x65
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958371] cancel_delayed_work_sync+0x13/0x20
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958387] rdma_addr_cancel+0x8a/0xb0 [ib_core]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958393] cma_cancel_operation+0x72/0x1e0 [rdma_cm]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958398] rdma_destroy_id+0x56/0x2f0 [rdma_cm]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958402] nvme_rdma_alloc_queue.cold+0x28/0x5b [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958406] nvme_rdma_setup_ctrl+0x37/0x720 [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958412] ? try_to_wake_up+0x224/0x6a0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958416] nvme_rdma_reconnect_ctrl_work+0x27/0x40 [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958419] process_one_work+0x1eb/0x3b0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958422] worker_thread+0x4d/0x400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958427] kthread+0x104/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958430] ? process_one_work+0x3b0/0x3b0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958432] ? kthread_park+0x90/0x90
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958439] ret_from_fork+0x35/0x40
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958442] ---[ end trace 859f78e32cc2aa61 ]---
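For reference, the WARNING comes from check_flush_dependency() in kernel/workqueue.c: a work item running on a WQ_MEM_RECLAIM workqueue (here nvme-wq, running nvme_rdma_reconnect_ctrl_work) is waiting on work queued to a workqueue created without WQ_MEM_RECLAIM (here ib_addr, running process_one_req). Under memory pressure the non-reclaim queue has no rescuer thread and may be unable to make forward progress, so the reclaim-capable queue could deadlock waiting on it, which is why the kernel warns. The minimal out-of-tree module sketch below reproduces that pattern; all names (demo_reclaim_wq, demo_plain_wq, demo_*_fn) are made up for illustration and are not taken from nvme_rdma or ib_core.

// Hypothetical module illustrating the check_flush_dependency() rule.
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_reclaim_wq;  /* like nvme-wq: WQ_MEM_RECLAIM  */
static struct workqueue_struct *demo_plain_wq;    /* like ib_addr: !WQ_MEM_RECLAIM */

static void demo_plain_fn(struct work_struct *w) { }
static DECLARE_WORK(demo_plain_work, demo_plain_fn);

static void demo_reclaim_fn(struct work_struct *w)
{
	/*
	 * Running on a WQ_MEM_RECLAIM workqueue while flushing work that
	 * lives on a !WQ_MEM_RECLAIM workqueue is exactly what triggers the
	 * "is flushing !WQ_MEM_RECLAIM" WARNING seen above: demo_plain_wq
	 * has no rescuer and may stall under memory pressure while
	 * demo_reclaim_wq sits waiting on it.
	 */
	flush_work(&demo_plain_work);
}
static DECLARE_WORK(demo_reclaim_work, demo_reclaim_fn);

static int __init demo_init(void)
{
	demo_reclaim_wq = alloc_workqueue("demo_reclaim_wq", WQ_MEM_RECLAIM, 0);
	/* The usual fix is to create the flushed queue with WQ_MEM_RECLAIM too. */
	demo_plain_wq = alloc_workqueue("demo_plain_wq", 0, 0);
	if (!demo_reclaim_wq || !demo_plain_wq) {
		if (demo_reclaim_wq)
			destroy_workqueue(demo_reclaim_wq);
		if (demo_plain_wq)
			destroy_workqueue(demo_plain_wq);
		return -ENOMEM;
	}

	queue_work(demo_plain_wq, &demo_plain_work);
	queue_work(demo_reclaim_wq, &demo_reclaim_work);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_reclaim_wq);
	destroy_workqueue(demo_plain_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");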

This consistently occurs on my direct-connect host but not on my fabric-attached hosts. I'm running Ubuntu 20.04 with kernel 5.4.0-24. I have the following Mellanox cards installed:

MCX416A-CCAT FW 12.27.1016
MCX4121A-ACAT FW 14.27.1016

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: nvme-cli 1.9-1
ProcVersionSignature: Ubuntu 5.4.0-24.28-generic 5.4.30
Uname: Linux 5.4.0-24-generic x86_64
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
CasperMD5CheckResult: skip
Date: Fri Apr 17 14:28:50 2020
InstallationDate: Installed on 2020-04-15 (2 days ago)
InstallationMedia: Ubuntu-Server 20.04 LTS "Focal Fossa" - Alpha amd64 (20200124)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: nvme-cli
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.nvme.hostnqn: ictm1611s01h4-hostnqn
mtime.conffile..etc.nvme.hostnqn: 2020-04-15T13:43:48.076829

Revision history for this message
Jennifer Duong (jduong) wrote :
summary: - Call trace during manual controller reset
+ Call trace during manual controller reset on NVMe/RoCE array
Revision history for this message
Jennifer Duong (jduong) wrote : Re: Call trace during manual controller reset on NVMe/RoCE array

I am still seeing this with Ubuntu 20.04 LTS

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvme-cli (Ubuntu):
status: New → Confirmed
Revision history for this message
Jennifer Duong (jduong) wrote :

This call trace is also seen while manually resetting an NVIDIA Mellanox InfiniBand switch that is connected to an NVMe/IB EF600 storage array. The server has an MCX354A-FCBT installed running FW 2.42.5000. The system is connected to a QM8700 and an SB7800; both switches are running MLNX-OS 3.9.2110. The message logs have been attached.

summary: - Call trace during manual controller reset on NVMe/RoCE array
+ Call trace during manual controller reset on NVMe/RoCE array and switch
+ reset on NVMe/IB array