BugLink: https://bugs.launchpad.net/bugs/
[Impact]
On machines with extremely high CPU usage, parent task groups which have a large
number of children can make the for loop in sched_cfs_period_timer() run until
the watchdog fires when the cfs_period_us setting is too short.
In this particular case, it is unlikely that the call to hrtimer_forward_now()
will return 0, meaning the for loop is never exited and tasks are never
rescheduled.
The large number of children makes do_sched_cfs_period_timer() take longer than
the period, which impacts calls to hrtimer_forward_now().
The kernel will produce this call trace:
CPU: 51 PID: 0 Comm: swapper/51 Tainted: P OELK 4.15.0-50-generic #54-Ubuntu
Call Trace:
<IRQ>
? sched_clock+0x9/0x10
walk_tg_tree_from+0x61/0xd0
? task_rq_unlock+0x30/0x30
unthrottle_cfs_rq+0xcb/0x1a0
distribute_cfs_runtime+0xd7/0x100
sched_cfs_period_timer+0xd9/0x1a0
? sched_cfs_slack_timer+0xe0/0xe0
__hrtimer_run_queues+0xdf/0x230
hrtimer_interrupt+0xa0/0x1d0
smp_apic_timer_interrupt+0x6f/0x130
apic_timer_interrupt+0x84/0x90
</IRQ>
This has been hit in production on a particularly highly utilised Hadoop cluster
powering an analytics platform. About 30% of the cluster experiences this issue
every week, and the affected machines need a manual reboot to get back online.
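For context, the period in question is the cgroup CPU bandwidth period. A group with a very short cfs_period_us, the kind that triggers this condition on busy machines, would be set up along these lines (illustrative cgroup v1 paths; the group name "shortperiod" is hypothetical):

```shell
# Create a cgroup and give it a very short CFS bandwidth period (cgroup v1).
# 1000 us is the minimum allowed period; values this low make the
# sched_cfs_period_timer() loop prone to overrun under heavy load.
mkdir /sys/fs/cgroup/cpu/shortperiod
echo 1000 > /sys/fs/cgroup/cpu/shortperiod/cpu.cfs_period_us  # 1 ms period
echo 1000 > /sys/fs/cgroup/cpu/shortperiod/cpu.cfs_quota_us   # enable bandwidth
echo $$   > /sys/fs/cgroup/cpu/shortperiod/tasks              # move shell in
```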
[Fix]
This was fixed in 5.1 upstream with the below commit:
commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8
Author: Phil Auld <email address hidden>
Date: Tue Mar 19 09:00:05 2019 -0400
Subject: sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
This commit adds a check for how many times the loop has run; if it has run too
many times, it scales up the period and quota so the timer can complete before
the next period expires, which enables tasks to be rescheduled normally.
Note, 2e8e19226398db8265a8e675fcc0118b9e80c9e8 was included in upstream stable
versions 4.4.179, 4.9.171, 4.14.114, 4.19.37 and 5.0.10.
Please cherry pick 2e8e19226398db8265a8e675fcc0118b9e80c9e8 to all bionic
kernels.
[Testcase]
This is hard to reproduce on demand, so it was tested on a production Hadoop
cluster with extremely high CPU load.
I built a test kernel, which is available here:
https://launchpad.net/~mruffell/+archive/ubuntu/sf232784-test
For unpatched kernels, expect the machine to lockup and print the call trace in
the impact section.
For patched kernels, if the machine hits the condition, it will print a warning
to the kernel log with the new period and quota that have been applied:
Example from the same hadoop cluster with a machine running the test kernel:
% uname -a
4.15.0-50-generic #54+hf232784v20190626b1-Ubuntu
% sudo grep cfs /var/log/kern.log.*
cfs_period_timer[cpu40]: period too short, scaling up (new cfs_period_us 67872, cfs_quota_us = 3475091)
cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us 22430, cfs_quota_us = 1148437)
cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us 25759, cfs_quota_us = 1318908)
cfs_period_timer[cpu68]: period too short, scaling up (new cfs_period_us 29583, cfs_quota_us = 1514684)
cfs_period_timer[cpu49]: period too short, scaling up (new cfs_period_us 33974, cfs_quota_us = 1739519)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 39017, cfs_quota_us = 1997729)
cfs_period_timer[cpu10]: period too short, scaling up (new cfs_period_us 44809, cfs_quota_us = 2294267)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 51460, cfs_quota_us = 2634823)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 59099, cfs_quota_us = 3025929)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 67872, cfs_quota_us = 3475091)
[Regression Potential]
This patch was accepted into upstream stable versions 4.4.179, 4.9.171,
4.14.114, 4.19.37, 5.0.10, and is thus treated as stable and trusted by the
community.
Xenial received this patch in 4.4.0-150.176, as per LP #1828420
Disco will receive this patch in the next version, as per LP #1830922
Eoan already has the patch, being based on 5.2.
While this does affect a core part of the kernel, the scheduler, the patch has
been extensively tested and proven in production environments, so the overall
risk is low.