sched: Prevent CPU lockups when task groups take longer than the period
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Fix Released | Medium | Matthew Ruffell |
Bug Description
BugLink: https:/
[Impact]
On machines with extremely high CPU usage, parent task groups which have a large number of children can make the for loop in sched_cfs_
In this particular case, it is unlikely that the call to hrtimer_
The large number of children makes do_sched_
The kernel will produce this call trace:
CPU: 51 PID: 0 Comm: swapper/51 Tainted: P OELK 4.15.0-50-generic #54-Ubuntu
Call Trace:
<IRQ>
? sched_clock+
walk_tg_
? task_rq_
unthrottle_
distribute_
sched_cfs_
? sched_cfs_
__hrtimer_
hrtimer_
smp_apic_
apic_timer_
</IRQ>
This has been hit in production on a heavily utilised Hadoop cluster that powers an analytics platform. About 30% of the machines in the cluster hit this issue every week, and each affected machine needs a manual reboot to come back online.
[Fix]
This was fixed in 5.1 upstream with the below commit:
commit 2e8e19226398db8
Author: Phil Auld <email address hidden>
Date: Tue Mar 19 09:00:05 2019 -0400
subject: sched/fair: Limit sched_cfs_
This commit adds a check to see if the loop has run too many times, and if it
has, scales up the period and quota, so the timer can complete before the
next period expires, which enables the task to be rescheduled normally.
Note, 2e8e19226398db8
This patch requires minor backporting for 4.15, so please cherry pick
d069fe4844f8d79
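To illustrate the idea described in the commit message, here is a minimal userspace sketch in C. It is not the kernel code: the struct, the function names, the loop limit of 3, the growth factor of roughly 15%, and the 1 second cap are all assumptions made for this example. It only shows the shape of the technique: count passes through the timer loop, and once the count exceeds a limit, grow the period and scale the quota by the same ratio so the group keeps the same share of CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch only; not taken from the bug report. */
#define MAX_PERIOD_NS 1000000000ULL /* hypothetical cap on the period: 1 s */
#define LOOP_LIMIT    3             /* hypothetical passes allowed before scaling */

/* Hypothetical stand-in for a task group's bandwidth settings. */
struct cfs_bw {
    uint64_t period_ns;
    uint64_t quota_ns;
};

/*
 * Grow the period by about 15% (capped at MAX_PERIOD_NS) and scale the
 * quota by the same ratio, so quota/period stays roughly constant.
 * Assumes quota_ns is small enough (under about 18 s) that
 * quota_ns * grown does not overflow 64 bits.
 */
static void scale_up(struct cfs_bw *b)
{
    uint64_t old = b->period_ns;
    uint64_t grown;

    assert(old > 0 && old <= MAX_PERIOD_NS);

    grown = old * 147 / 128;
    if (grown > MAX_PERIOD_NS)
        grown = MAX_PERIOD_NS;

    b->period_ns = grown;
    b->quota_ns = b->quota_ns * grown / old;
}

/*
 * Simulate a period timer whose handler needs work_ns per pass. While the
 * work takes at least a full period, the timer has expired again by the
 * time the pass finishes, so the loop runs another pass. Every time more
 * than LOOP_LIMIT passes accumulate, the period and quota are scaled up.
 * Returns the total number of passes taken.
 */
static int run_period_timer(struct cfs_bw *b, uint64_t work_ns)
{
    int passes = 0;
    int since_scale = 0;

    while (work_ns >= b->period_ns) {
        passes++;
        if (++since_scale > LOOP_LIMIT) {
            if (b->period_ns == MAX_PERIOD_NS)
                break; /* cannot grow any further; give up in this sketch */
            scale_up(b);
            since_scale = 0;
        }
    }
    return passes;
}
```

Without the scaling step, the loop in this sketch would never terminate when work_ns is at least as large as the period, which mirrors the lockup described in the impact section. With it, the period eventually exceeds the time a pass takes and the loop exits.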
[Testcase]
This is hard to reproduce on demand, so the fix was tested on a production Hadoop cluster
with extremely high CPU load.
I built a test kernel, which is available here:
https:/
For unpatched kernels, expect the machine to lock up and print the call trace shown in the [Impact] section.
For patched kernels, if the machine hits the condition, it will print a warning to the kernel log with the new period and quota that are now in use.
Example from the same hadoop cluster with a machine running the test kernel:
% uname -a
4.15.0-50-generic #54+hf232784v20
% sudo grep cfs /var/log/kern.log.*
cfs_period_
cfs_period_
cfs_period_
cfs_period_
cfs_period_
cfs_period_
cfs_period_
cfs_period_
cfs_period_
cfs_period_
[Regression Potential]
This patch was accepted into upstream stable versions 4.4.179, 4.9.171,
4.14.114, 4.19.37, 5.0.10, and is thus treated as stable and trusted by the
community.
Xenial received this patch in 4.4.0-150.176, as per LP #1828420.
Disco will receive this patch in the next version, as per LP #1830922.
Eoan already has the patch, being based on 5.2.
While this does affect a core part of the kernel, the scheduler, the patch has been extensively tested and proven in production environments, so the overall risk is low.
tags: added: sts
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Matthew Ruffell (mruffell)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: Incomplete → Fix Released
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1836971
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.