Comment 7 for bug 1905574

Amir Tzin (amirtz) wrote :

Hi Jeff,

Upstream commit
50b2412b7e78 net/mlx5: Avoid possible free of command entry while timeout comp handler
was picked into the Ubuntu-5.4.0-56.62 kernel
(hash bcd6e98bef76cc8a49a1b736b0fefffbffb75c30)
(v5.4.71 upstream stable release, https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1902110 )

Now a new issue arises:
reloading the mlx5 modules produces an error message in the kernel log buffer:
"cmd_work_handler:887:(pid 292): failed to allocate command entry"

Reproduction:
# modprobe -r mlx5_ib mlx5_core
# modprobe mlx5_core mlx5_ib
# dmesg
[ 142.638490] mlx5_core 0000:08:00.1: E-Switch: cleanup
[ 143.734339] mlx5_core 0000:08:00.0: E-Switch: cleanup
[ 164.171511] mlx5_core: unknown parameter 'mlx5_ib' ignored
[ 164.173501] mlx5_core 0000:08:00.0: firmware version: 16.28.1002
[ 164.173576] mlx5_core 0000:08:00.0: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
[ 164.457342] mlx5_core 0000:08:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 164.457365] mlx5_core 0000:08:00.0: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)
[ 164.484659] port_module: 5 callbacks suppressed
[ 164.484665] mlx5_core 0000:08:00.0: Port module event: module 0, Cable plugged
[ 164.485112] mlx5_core 0000:08:00.0: mlx5_pcie_event:294:(pid 8): PCIe slot advertised sufficient power (75W).
[ 164.494771] mlx5_core 0000:08:00.1: firmware version: 16.28.1002
[ 164.494844] mlx5_core 0000:08:00.1: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
[ 164.779534] mlx5_core 0000:08:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 164.779552] mlx5_core 0000:08:00.1: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)
[ 164.808886] mlx5_core 0000:08:00.1: Port module event: module 1, Cable plugged
[ 164.809228] mlx5_core 0000:08:00.1: mlx5_pcie_event:294:(pid 292): PCIe slot advertised sufficient power (75W).
[ 164.840667] mlx5_core 0000:08:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 165.081342] mlx5_core 0000:08:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 165.282793] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
[ 165.438226] mlx5_core 0000:08:00.0: cmd_work_handler:887:(pid 292): failed to allocate command entry
[ 165.442506] infiniband rocep8s0f0: reg_mr_callback:104:(pid 292): async reg mr failed. status -11
#
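
The check for the failure in the log above can be scripted. This is only a minimal sketch: it assumes the failure signature is exactly the message text quoted above, and the function name mlx5_cmd_alloc_failures is hypothetical. It reads kernel log text on stdin so it can be fed from dmesg.

```shell
# Hypothetical helper (not part of any driver tooling).
# Reads kernel log text on stdin and prints only the lines that carry the
# "failed to allocate command entry" signature.
# Exit status is grep's: 0 if at least one line matched, non-zero otherwise.
mlx5_cmd_alloc_failures() {
    grep -F 'failed to allocate command entry'
}
```

Possible usage after reloading the modules: dmesg | mlx5_cmd_alloc_failures && echo "issue reproduced"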

The following commits fix this issue:
410bd754cd73 net/mlx5: Add retry mechanism to the command entry index allocation (upstream 5.9)
1d5558b1f0de net/mlx5: poll cmd EQ in case of command timeout (upstream 5.9)
d43b7007dbd1 net/mlx5: Fix a race when moving command interface to events mode (upstream 5.7-rc7)
3ed879965cc4 net/mlx5: Use async EQ setup cleanup helpers for multiple EQs (upstream 5.6-rc1)

These are on the master-next branch of the focal tree, also synced from linux stable
(v5.4.79 upstream stable release, https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907151 ).

# git log --oneline Ubuntu-5.4.0-59.65..master-next
....
400ec5bb2816 net/mlx5: Add retry mechanism to the command entry index allocation
2bd608898edd net/mlx5: Fix a race when moving command interface to events mode
bec07c488db0 net/mlx5: poll cmd EQ in case of command timeout
0c9bfdf598e1 net/mlx5: Use async EQ setup cleanup helpers for multiple EQs
.....
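
Since the hashes in the focal tree differ from the upstream ones (for example 410bd754cd73 upstream versus 400ec5bb2816 on master-next), a check for the four fixes has to match on subject lines rather than hashes. A rough sketch of such a check follows; the function name missing_mlx5_fixes is hypothetical, and it assumes the backports keep the subject lines shown above.

```shell
# Hypothetical helper: reads `git log --oneline` output on stdin and reports
# which of the four fix subjects are absent from it.
# Returns 0 when all four are present, 1 when at least one is missing.
missing_mlx5_fixes() {
    log=$(cat)      # slurp the whole log so it can be searched four times
    status=0
    for subject in \
        'net/mlx5: Add retry mechanism to the command entry index allocation' \
        'net/mlx5: Fix a race when moving command interface to events mode' \
        'net/mlx5: poll cmd EQ in case of command timeout' \
        'net/mlx5: Use async EQ setup cleanup helpers for multiple EQs'
    do
        case "$log" in
            *"$subject"*) ;;                                  # literal substring match
            *) printf 'missing: %s\n' "$subject"; status=1 ;;
        esac
    done
    return "$status"
}
```

Possible usage: git log --oneline Ubuntu-5.4.0-59.65..master-next | missing_mlx5_fixes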

I compiled master-next, booted the system with it and the issue is resolved.