Devlink reload hangs: fix race and lock issue

Bug #2039869 reported by William Tu
Affects                         Status         Importance  Assigned to  Milestone
linux-bluefield (Ubuntu)        Invalid        Undecided   Unassigned
linux-bluefield (Ubuntu Jammy)  Fix Committed  Undecided   Unassigned

Bug Description

Summary:
Machine hangs when doing devlink reload

How to reproduce:
Host:
[root@bu-lab24v ~]# echo '2' > /sys/class/net/ens2f0np0/device/sriov_numvfs

Arm:
root@bu-lab24v-oob:~# uname -r
5.15.0-1027-bluefield
root@bu-lab24v-oob:~# devlink dev eswitch set pci/0000:03:00.0 mode switchdev
root@bu-lab24v-oob:~# devlink dev reload pci/0000:03:00.0
*Hangs*

Arm dmesg:
[ 1089.747409] INFO: task devlink:8753 blocked for more than 120 seconds.
[ 1089.760560] Tainted: G OE 5.15.0-1027-bluefield #29-Ubuntu
[ 1089.775086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1089.790829] task:devlink state:D stack: 0 pid: 8753 ppid: 5090 flags:0x00000004
[ 1089.790838] Call trace:
[ 1089.790840] __switch_to+0xf8/0x150
[ 1089.790857] __schedule+0x2b8/0x790
[ 1089.790865] schedule+0x64/0x140
[ 1089.790870] schedule_preempt_disabled+0x18/0x24
[ 1089.790874] __mutex_lock.constprop.0+0x1a0/0x680
[ 1089.790878] __mutex_lock_slowpath+0x40/0x90
[ 1089.790883] mutex_lock+0x64/0x70
[ 1089.790887] devl_lock+0x1c/0x30
[ 1089.790893] mlx5_detach_device+0x58/0x190 [mlx5_core]
[ 1089.791055] mlx5_unload_one+0x40/0xe4 [mlx5_core]
[ 1089.791177] mlx5_devlink_reload_down+0x184/0x270 [mlx5_core]
[ 1089.791318] devlink_reload+0x214/0x290

Fixes:
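Checking the OFED source code, we found that the devl trap groups change
is missing and also needs to be backported to avoid the deadlock. With it,
mlx5_detach_device() asserts that the devlink lock is already held instead
of taking it again: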

void mlx5_detach_device(struct mlx5_core_dev *dev, bool suspend)
{
...
#ifdef HAVE_DEVL_PORT_REGISTER
#ifdef HAVE_DEVL_TRAP_GROUPS_REGISTER
        devl_assert_locked(priv_to_devlink(dev));
#else
        devl_lock(devlink);
#endif /* HAVE_DEVL_TRAP_GROUPS_REGISTER */
#endif /* HAVE_DEVL_PORT_REGISTER */
        mutex_lock(&mlx5_intf_mutex);
#ifdef HAVE_DEVL_PORT_REGISTER

Related issue:
#2032378 Devlink backport: fix race and lock issue

So cherry-pick the patch below:
commit 852e85a704c2e11c050bdea286bc438aba4f4a22
Author: Jiri Pirko <email address hidden>
Date: Sat Jul 16 13:02:34 2022 +0200

    net: devlink: add unlocked variants of devling_trap*() functions

    Add unlocked variants of devl_trap*() functions to be used in drivers
    called-in with devlink->lock held.
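
For illustration only, here is a minimal userspace sketch of why the split into locked and unlocked variants matters. It is not kernel code: struct fake_devlink and every fake_* function are hypothetical stand-ins invented for this example, and a pthread error-checking mutex is assumed in place of devlink->lock so that a second lock attempt by the owner returns EDEADLK instead of blocking forever, as the devlink task does in the trace above.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* for PTHREAD_MUTEX_ERRORCHECK */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct devlink and its lock. */
struct fake_devlink {
	pthread_mutex_t lock;
	bool lock_held;   /* bookkeeping for this single-threaded sketch only */
	int trap_groups;  /* number of "registered" trap groups */
};

static int fake_devlink_init(struct fake_devlink *dl)
{
	pthread_mutexattr_t attr;
	int err = pthread_mutexattr_init(&attr);

	if (err)
		return -err;
	/* ERRORCHECK: relocking by the owner fails with EDEADLK, no hang. */
	err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!err)
		err = pthread_mutex_init(&dl->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	dl->lock_held = false;
	dl->trap_groups = 0;
	return -err;
}

/* Unlocked variant: the caller must already hold the lock. */
static int fake_devl_trap_groups_register(struct fake_devlink *dl, int n)
{
	assert(dl->lock_held); /* loosely analogous to devl_assert_locked() */
	dl->trap_groups += n;
	return 0;
}

/* Locking wrapper: takes the lock itself, then calls the unlocked variant. */
static int fake_devlink_trap_groups_register(struct fake_devlink *dl, int n)
{
	int err = pthread_mutex_lock(&dl->lock);

	if (err)
		return -err; /* -EDEADLK when the caller already owns the lock */
	dl->lock_held = true;
	err = fake_devl_trap_groups_register(dl, n);
	dl->lock_held = false;
	pthread_mutex_unlock(&dl->lock);
	return err;
}

/* Reload path: holds the lock while calling into the "driver". */
static int fake_reload(struct fake_devlink *dl, bool use_unlocked)
{
	int err = pthread_mutex_lock(&dl->lock);

	if (err)
		return -err;
	dl->lock_held = true;
	err = use_unlocked ? fake_devl_trap_groups_register(dl, 1)
			   : fake_devlink_trap_groups_register(dl, 1);
	dl->lock_held = false;
	pthread_mutex_unlock(&dl->lock);
	return err;
}
```

After fake_devlink_init(), calling fake_reload(&dl, false) goes through the locking wrapper while the lock is already held and returns -EDEADLK; with a plain kernel mutex the same pattern blocks instead. Calling fake_reload(&dl, true) uses the unlocked variant and succeeds.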

Changed in linux-bluefield (Ubuntu):
status: New → Invalid
Changed in linux-bluefield (Ubuntu Jammy):
status: New → Fix Committed
status: Fix Committed → In Progress
Changed in linux-bluefield (Ubuntu Jammy):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-bluefield/5.15.0-1029.31 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-bluefield' to 'verification-done-jammy-linux-bluefield'. If the problem still exists, change the tag 'verification-needed-jammy-linux-bluefield' to 'verification-failed-jammy-linux-bluefield'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-bluefield-v2 verification-needed-jammy-linux-bluefield
tags: added: verification-done-jammy-linux-bluefield
removed: verification-needed-jammy-linux-bluefield