Call trace during manual controller reset on NVMe/RoCE array and switch reset on NVMe/IB array
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
nvme-cli (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
After manually resetting one of my E-Series NVMe/RoCE controllers, I hit the following call trace:
```
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958231] workqueue: WQ_MEM_RECLAIM nvme-wq:
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958244] WARNING: CPU: 11 PID: 6260 at kernel/
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958245] Modules linked in: xfs nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rpcrdma rdma_ucm ib_iser ib_umad libiscsi ib_ipoib scsi_transport_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958305] tls libahci mlxfw wmi scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958315] CPU: 11 PID: 6260 Comm: kworker/u34:3 Not tainted 5.4.0-24-generic #28-Ubuntu
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958316] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.8.0 005/17/2018
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958321] Workqueue: nvme-wq nvme_rdma_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958326] RIP: 0010:check_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958329] Code: 8d 8b b0 00 00 00 48 8b 50 18 4d 89 e0 48 8d b1 b0 00 00 00 48 c7 c7 40 f8 75 9d 4c 89 c9 c6 05 f1 d9 74 01 01 e8 1f 14 fe ff <0f> 0b e9 07 ff ff ff 80 3d df d9 74 01 00 75 92 e9 3c ff ff ff 66
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958331] RSP: 0018:ffffb34bc4
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958333] RAX: 0000000000000000 RBX: ffff946423812400 RCX: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958334] RDX: 0000000000000089 RSI: ffffffff9df926a9 RDI: 0000000000000046
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958336] RBP: ffffb34bc4e87c10 R08: ffffffff9df92620 R09: 0000000000000089
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958337] R10: ffffffff9df92a00 R11: 000000009df9268f R12: ffffffffc09be560
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958338] R13: ffff9468238b2f00 R14: 0000000000000001 R15: ffff94682dbbb700
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958341] FS: 000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958344] CR2: 00007ff61cbf4ff8 CR3: 000000040a40a001 CR4: 00000000003606e0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958345] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958347] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958348] Call Trace:
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958355] __flush_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958360] __cancel_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958368] ? dev_printk_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958371] cancel_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958387] rdma_addr_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958393] cma_cancel_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958398] rdma_destroy_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958402] nvme_rdma_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958406] nvme_rdma_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958412] ? try_to_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958416] nvme_rdma_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958419] process_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958422] worker_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958427] kthread+0x104/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958430] ? process_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958432] ? kthread_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958439] ret_from_
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958442] ---[ end trace 859f78e32cc2aa61 ]---
```
This occurs consistently on my direct-connect host but not on my fabric-attached hosts. I'm running Ubuntu 20.04 with kernel 5.4.0-24. I have the following Mellanox cards installed:
MCX416A-CCAT FW 12.27.1016
MCX4121A-ACAT FW 14.27.1016
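For anyone trying to reproduce this, the manual controller reset can be issued through the controller's `reset_controller` sysfs attribute. The sketch below is mine, not taken from the report: it assumes the affected controller enumerates as `nvme0` (check `nvme list` on your host), and it only prints the path rather than performing the reset.

```shell
# Reproducer sketch, assuming the affected controller is nvme0 (placeholder).
# Writing 1 to reset_controller asks the kernel to reset that controller;
# on NVMe/RDMA this tears down and re-creates the queues, which is the
# path where the warning above fires.
ctrl=nvme0
reset_path="/sys/class/nvme/${ctrl}/reset_controller"
echo "would write 1 to ${reset_path}"
# On the affected host, the actual reset would be:
#   echo 1 | sudo tee "${reset_path}"
```

Watching `dmesg -w` in a second terminal while issuing the reset shows whether the trace reappears.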
```
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: nvme-cli 1.9-1
ProcVersionSign
Uname: Linux 5.4.0-24-generic x86_64
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
CasperMD5CheckR
Date: Fri Apr 17 14:28:50 2020
InstallationDate: Installed on 2020-04-15 (2 days ago)
InstallationMedia: Ubuntu-Server 20.04 LTS "Focal Fossa" - Alpha amd64 (20200124)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: nvme-cli
UpgradeStatus: No upgrade log present (probably fresh install)
modified.
mtime.conffile.
```
I am still seeing this with Ubuntu 20.04 LTS.