Mlx5 kworker blocked Kernel 5.19 (Jammy HWE)

Bug #2009594 reported by DUFOUR Olivier
This bug affects 4 people
Affects              Status      Importance  Assigned to  Milestone
charm-ovn-chassis    Triaged     High        Unassigned
linux (Ubuntu)       Confirmed   Undecided   Unassigned

Bug Description

This is seen in particular with:
* Charmed Openstack with Jammy Yoga
* 5.19.0-35-generic (linux-generic-hwe-22.04/jammy-updates)
* Mellanox Connectx-6 card with mlx5_core module being used
* SR-IOV is being used with VF-LAG for the use of OVN hardware offloading

The servers quickly enter a very high load (around 75~100) during boot, with all processes relying on network communication through the Mellanox card being stuck or extremely slow.
Kernel logs report kworkers being blocked for more than 120 seconds.

The number of SR-IOV devices configured, both in the firmware and in the kernel, correlates strongly with the likelihood of this bug occurring.
Enabling more VFs greatly increases the risk of hitting it.

This does not happen systematically at every boot, but with 32 VFs on each PF, it occurs about 40% of the time.
To recover the server, a cold reboot is required.
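
For context, a minimal sketch of how the VFs are typically allocated on each PF via sysfs; the interface names below are illustrative placeholders, not taken from this environment:

```
# Illustrative only: enable 32 VFs on each PF of the ConnectX-6 card.
# Substitute the real PF interface names for enp65s0f0/enp65s0f1.
echo 32 > /sys/class/net/enp65s0f0/device/sriov_numvfs
echo 32 > /sys/class/net/enp65s0f1/device/sriov_numvfs

# Each write triggers VF creation and initialisation in mlx5_core,
# which is the work the kernel is still doing when the hang occurs.
```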

Looking at a quick sample of the trace, this seems to directly involve the mlx5 driver within the kernel:

Mar 07 05:24:56 nova-1 kernel: INFO: task kworker/0:1:19 blocked for more than 120 seconds.
Mar 07 05:24:56 nova-1 kernel: Tainted: P OE 5.19.0-35-generic #36~22.04.1-Ubuntu
Mar 07 05:24:56 nova-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 07 05:24:56 nova-1 kernel: task:kworker/0:1 state:D stack: 0 pid: 19 ppid: 2 flags:0x00004000
Mar 07 05:24:56 nova-1 kernel: Workqueue: events work_for_cpu_fn
Mar 07 05:24:56 nova-1 kernel: Call Trace:
Mar 07 05:24:56 nova-1 kernel: <TASK>
Mar 07 05:24:56 nova-1 kernel: __schedule+0x257/0x5d0
Mar 07 05:24:56 nova-1 kernel: schedule+0x68/0x110
Mar 07 05:24:56 nova-1 kernel: schedule_preempt_disabled+0x15/0x30
Mar 07 05:24:56 nova-1 kernel: __mutex_lock.constprop.0+0x4f1/0x750
Mar 07 05:24:56 nova-1 kernel: __mutex_lock_slowpath+0x13/0x20
Mar 07 05:24:56 nova-1 kernel: mutex_lock+0x3e/0x50
Mar 07 05:24:56 nova-1 kernel: mlx5_register_device+0x1c/0xb0 [mlx5_core]
Mar 07 05:24:56 nova-1 kernel: mlx5_init_one+0xe4/0x110 [mlx5_core]
Mar 07 05:24:56 nova-1 kernel: probe_one+0xcb/0x120 [mlx5_core]
Mar 07 05:24:56 nova-1 kernel: local_pci_probe+0x4b/0x90
Mar 07 05:24:56 nova-1 kernel: work_for_cpu_fn+0x1a/0x30
Mar 07 05:24:56 nova-1 kernel: process_one_work+0x21f/0x400
Mar 07 05:24:56 nova-1 kernel: worker_thread+0x200/0x3f0
Mar 07 05:24:56 nova-1 kernel: ? rescuer_thread+0x3a0/0x3a0
Mar 07 05:24:56 nova-1 kernel: kthread+0xee/0x120
Mar 07 05:24:56 nova-1 kernel: ? kthread_complete_and_exit+0x20/0x20
Mar 07 05:24:56 nova-1 kernel: ret_from_fork+0x22/0x30
Mar 07 05:24:56 nova-1 kernel: </TASK>

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2009594

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Frode Nordahl (fnordahl)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
DUFOUR Olivier (odufourc) wrote :

Subscribed ~Field-critical

This impacts deployments on Prodstack-6.
The only workaround is to reduce the number of VFs used by OVN, which is not the expected behaviour and greatly reduces the number of instances that can be deployed with offloading for the whole environment.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

This does indeed appear to be a serious issue, but it is unfortunately not a charm issue.

It appears to me to be an issue with the mlx5 driver in the kernel which surfaces in this specific environment for reasons yet to be uncovered.

Changed in charm-ovn-chassis:
status: New → Invalid
Revision history for this message
DUFOUR Olivier (odufourc) wrote :

After more investigations, I've found the root cause that triggers the issue.

#
# The TL;DR:
#
With many VFs to initialise, systemd-networkd-wait-online.service's 2-minute timeout is reached.
This means services like Nova, Neutron and OVN/OVS start while the kernel is still creating and activating VFs.
systemd-networkd-wait-online.service needs to be overridden to increase the default timeout to something like 5 or 10 minutes.
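
A quick way to confirm the default timeout (2 minutes, as mentioned above) and whether the unit timed out on a given boot; a sketch using standard systemd tooling:

```
# Show the unit as shipped, including its ExecStart and default timeout.
systemctl cat systemd-networkd-wait-online.service

# Check how long the unit took, and whether it failed, on the current boot.
systemctl status systemd-networkd-wait-online.service
journalctl -b -u systemd-networkd-wait-online.service
```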

#
# The lengthy explanation:
#
When looking at the global logs of the servers, I noticed that all failing boots had one thing in common:
systemd-networkd-wait-online.service was timing out after more than 2 minutes.
On successful boots this error would not be seen, which is interesting because systemd waits for this service to succeed before starting the other services ordered after the network-online target.

When allocating 24 VFs on 2 interfaces, it still worked most of the time, but on closer inspection it was getting quite close to the 2-minute limit.
In the attached file systemd-networkd-wait-24VFs.txt we can see that network initialisation is fairly inconsistent and varies a lot in terms of timing.

After testing again with 32 VFs, I could frequently see the network taking longer than 2 minutes to be set up, with all remaining services on the host being started while the kernel was still working on the VFs.
I decided to override the systemd-networkd-wait-online service to extend the default timeout to 4 minutes.

Here is the test on the same server with 32 VFs:
```
# before override in systemd unit
2min 127ms systemd-networkd-wait-online.service --> timed out
--> has the kernel failing

# reboot after override
2min 33.617s systemd-networkd-wait-online.service
--> works fine

1min 55.140s systemd-networkd-wait-online.service
--> works fine
```
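
For reference, the per-unit timings above are the kind of figures reported by `systemd-analyze blame`; a sketch of how to repeat the check after a boot:

```
# List units ordered by initialisation time; the wait-online entry shows
# how close a given boot came to the 2-minute limit.
systemd-analyze blame | grep systemd-networkd-wait-online

# Confirm whether the unit ultimately succeeded or timed out on that boot.
systemctl is-failed systemd-networkd-wait-online.service
```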

Although the kernel issue remains as well, I think it seriously needs to be considered, from the ovn-chassis charm's point of view, to extend the default timeout on systemd-networkd-wait-online.service, especially since other software depending on network connectivity can probably run into errors or other unknown bugs.

And this is failing with the initialisation of only 64 VFs in total. In scenarios where current network cards can handle 1000 VFs on a single port, or where there are many network cards, the initialisation of VFs can take a while. Maybe having a configuration option to choose the timeout value for systemd-networkd-wait-online.service could be useful.

Changed in charm-ovn-chassis:
status: Invalid → New
Revision history for this message
DUFOUR Olivier (odufourc) wrote :

#
# The tested workaround
#

The current workaround on ovn-chassis is to override systemd-networkd-wait-online.service to extend the timeout, here to 4 minutes:
(The 2 lines of ExecStart are mandatory, this is not a typo)

```
juju run --model openstack --app ovn-chassis \
"sudo mkdir /etc/systemd/system/systemd-networkd-wait-online.service.d
sudo tee /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --timeout=240
EOF"
```
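
As a follow-up to the override above, systemd may need to reload its unit definitions before the drop-in takes effect, and it is worth verifying the merged unit; a minimal sketch, assuming the same unit and drop-in path:

```
# Reload unit definitions so the drop-in is picked up, then confirm the
# merged unit now carries the extended --timeout=240.
sudo systemctl daemon-reload
systemctl cat systemd-networkd-wait-online.service
```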

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Excellent work on finding more information about the cause and a workaround, thank you for that!

I think it would be appropriate to tackle this as part of the resolution of netplan.io bug 1988018 though, so the charm task would still be invalid unfortunately.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

However, it will probably take some time to resolve bug 1988018 so we should probably look into what interim solutions the charm could provide until we get there.

Changed in charm-ovn-chassis:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Billy Olsen (billy-olsen) wrote :

The work-around identified in comment #9 can be used to bypass this. It delays further services from starting up and attempting to interact with the mlnx cards, which appears to be what triggers the kernel hung tasks given the kernel's hung-task timeout of 120 seconds. I'm not convinced at this moment that managing the systemd service files from the charm is the correct thing to do here; notably, this would likely be a general problem on Ubuntu with VFs etc. It may end up being that increasing the timeout is a longer-term solution rather than a work-around, however we need to understand the problem better in order to address it in the right space.
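
For completeness, the 120 second figure mentioned above is the kernel's hung-task watchdog (also referenced in the trace in the bug description); a sketch of how to inspect or silence it while debugging, which does not fix the underlying stall:

```
# Show the current hung-task watchdog threshold (120 s by default).
sysctl kernel.hung_task_timeout_secs

# Setting it to 0 only disables the warning messages; the blocked
# kworkers and the high load remain.
# sudo sysctl -w kernel.hung_task_timeout_secs=0
```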
