cgroups-related kernel panics

Bug #1921355 reported by Nikita Nedvetskiy
This bug affects 4 people
Affects                 Status       Importance  Assigned to  Milestone
linux (Ubuntu)          Incomplete   Undecided   Unassigned
linux-hwe-5.4 (Ubuntu)  Confirmed    Undecided   Unassigned

Bug Description

Hi!

Over the last six months we've upgraded our hypervisor compute hosts from the Ubuntu Bionic 4.15.* kernel to the Ubuntu Bionic HWE 5.4 kernel.

This month we noticed that several nodes failed due to bugs in cgroups.
The trace was different almost every time, but all of them revolve around cgroups: either null-pointer dereferences or panics caught by the BUG_ON() macro. It looks as if some cgroup no longer existed but something still tried to access it, causing a kernel panic.
Please find the logs attached.

3 of 4 cases happened after a VM shutdown. We tried to spawn lots of VMs, load them, and shut them down, but didn't manage to reproduce the behavior.
Actually, every case is somewhat different: the patch kernel versions range from 5.4.0-42 to 5.4.0-66, and uptimes vary from 1 day to ~half a year. There are also lots of hosts with several months of uptime that show no issue. On 4.15 we never saw this behavior at all.
That's quite disturbing, as I don't want dozens of VMs to crash (due to a host outage) at random times for some vague reason...
I didn't manage to find any related bugs on the bug tracker, so I'm creating this one.

I wonder if anybody in the community came across something like that.
Could somebody give advice on how to debug this further, or where else to report it or look for similar cases?

Tags: cgroups
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-hwe-5.4 (Ubuntu):
status: New → Confirmed
Revision history for this message
TJ (tj) wrote :

CPU: 0 PID: 1 Comm: systemd Tainted: G OE 5.4.0-66-generic #74~18.04.2-Ubuntu

The stand-out information in the log fragments is that the kernel is tainted by out-of-tree (O), unsigned (E) modules, all GPL-licensed (G); a quick way to confirm the taint bits on a running host is sketched after the module list:

openvswitch(OE)
mlx5_core(OE)
mlxfw(OE)
mlx4_en(OE)
mlx4_ib(OE)
mlx4_core(OE)
mlx_compat(OE)
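
As a rough sketch for checking which taint bits are set on a live host (standard procfs/sysfs paths; the per-module sysfs attribute may not exist on every kernel):

# Read the kernel taint bitmask (0 means untainted)
cat /proc/sys/kernel/tainted

# Bit 0  (value 1)    = P: proprietary module loaded (not set here, hence the G)
# Bit 12 (value 4096) = O: out-of-tree module loaded
# Bit 13 (value 8192) = E: unsigned module loaded

# List per-module taint letters, where available
grep -H . /sys/module/*/taint 2>/dev/null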

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1921355

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Juerg Haefliger (juergh) wrote :

Can you collect and upload the logs per the previous comment? I've googled around a bit, but nothing jumped out. This will be difficult without a reproducer.

Have you tried the latest HWE kernel 5.4.0-71.79~18.04.1?

Is there any chance you can enable kdump?
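
In case it helps, a rough sketch of enabling kdump on Ubuntu 18.04 (the package and tool names are the standard Ubuntu ones; double-check the crashkernel reservation against your hosts' memory size):

# Install the crash dump tooling (pulls in kdump-tools and kexec-tools)
sudo apt install linux-crashdump

# Verify a crashkernel region is reserved and kdump is ready to capture
kdump-config show

# A reboot is needed for the crashkernel= reservation to take effect;
# captured dumps end up under /var/crash by default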

Revision history for this message
Nikita Nedvetskiy (n-nedvetskiy) wrote :

Thank you all for your ideas!

Sure, we do have some modules that are not from the kernel source tree. These are the Mellanox drivers (for our NICs) and Open vSwitch, as we've had some problems that were fixed in newer driver versions.

We don't have apport enabled, and actually the hypervisor nodes don't even have direct access to the internet (only some of the VMs on them do).
I checked on a test VM what kind of info apport collects, and it seems to be the architecture, the kernel version, and the stack trace. That kind of info is already attached manually; we have netconsole enabled, which collected it.
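
(For reference, netconsole is typically loaded roughly as in the sketch below; the IP addresses, ports, interface name, and MAC address are placeholders, not the configuration actually used here.)

# Forward kernel messages over UDP to a remote collector (placeholder values)
modprobe netconsole netconsole=6665@10.0.0.2/eth0,6666@10.0.0.1/aa:bb:cc:dd:ee:ff

# On the collector, capture the stream with something like:
nc -u -l 6666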

When the issue started, it occurred even on the then-latest kernel (5.4.0-66), so I'm not sure that simply upgrading will help.

Currently I'm working on integrating kdump into our infrastructure and trying to reproduce the issue again, and I'll also try to schedule migration + upgrade for our hypervisor nodes (that's not fast, though).

Revision history for this message
Nikita Nedvetskiy (n-nedvetskiy) wrote :

Hello!

Actually, we saw some surprising behavior.
Shortly after the discussion in this thread, the bug just disappeared for nearly two months.
We still had no luck reproducing it.

We used this opportunity to migrate and reboot part of our servers to activate kdump on them, and decided to wait.
A couple of days ago one of our hypervisors hung, and we got our crash kernel dump :)
Kernel version was 5.4.0-73-generic this time.

Now that we have it, could somebody please have a look at it?
The file is quite large: ~2.5 GB compressed (3.2 GB unpacked).
https://drive.google.com/file/d/1JVMWJpXNeou06UxqJwl5wjbLKzcb2rOq/view?usp=sharing
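
For anyone picking up the dump, a minimal sketch of opening it with the crash utility (assumes the matching debug symbols from Ubuntu's ddebs repository are available; the dump path below is a placeholder):

# Install the crash utility and the debug symbols matching the crashed kernel
sudo apt install crash linux-image-5.4.0-73-generic-dbgsym

# Open the vmcore against the debug vmlinux
crash /usr/lib/debug/boot/vmlinux-5.4.0-73-generic /var/crash/<timestamp>/dump.<timestamp>

# Inside crash: 'log' shows the panic messages, 'bt' the backtrace of the
# crashing task, and 'bt -a' backtraces for all CPUs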

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please give 5.4.0-80.90 a try?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Artem Kustikov (kustikov) wrote :

Greetings!
No luck with 5.4.0-80.90; we're still getting the same bug as before, even on kernel version 5.4.0-86. Still no clue how to reproduce it: hypervisor nodes just crash at random. I have attached the dmesg from the most recent occurrence, but it seems identical to the previous ones.

Here is a fresh crash dump: https://drive.google.com/file/d/1skA238DVtxpY8t8ANdzX1gBC8muChxto/view?usp=sharing
