cgroups related kernel panics
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Incomplete | Undecided | Unassigned |
linux-hwe-5.4 (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
Hi!
Over the last 6 months we've upgraded our hypervisor compute hosts from the Ubuntu Bionic 4.15.* kernel to the Ubuntu Bionic HWE 5.4 kernel.
This month we noticed that several nodes failed due to bugs in cgroups.
The trace was different almost every time, but they all revolve around cgroups: either NULL pointer dereferences or a panic caught by the BUG_ON() macro. It looks as if some cgroup no longer existed but something still tried to access it, causing the kernel panic.
Please find the logs attached.
3 of the 4 cases happened after a VM shutdown. We tried spawning lots of VMs, putting them under load, and shutting them down, but didn't manage to reproduce the behavior.
Actually, every case is somewhat different: the kernel patch versions (5.4.0-42 to 5.4.0-66) and the uptimes (from 1 day to about half a year) vary. There are also lots of hosts with several months of uptime and no issues. On 4.15 we never saw this behavior at all.
That's quite disturbing, as I don't want dozens of VMs to crash (due to a host outage) at random times for some vague reason...
I didn't manage to find any related bugs on the bug tracker, so I'm creating this one.
I wonder if anybody in the community has come across something like this.
Could somebody advise how to debug this further, or where else to report it or look for a similar case?
Status changed to 'Confirmed' because the bug affects multiple users.