sched: Prevent CPU lockups when task groups take longer than the period

Bug #1836971 reported by Matthew Ruffell on 2019-07-17
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Bionic
Medium
Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/1836971

[Impact]

On machines with extremely high CPU usage, parent task groups which have a large number of children can make the for loop in sched_cfs_period_timer() run until the watchdog fires when the cfs_period_us setting is too short.

In this particular case, it is unlikely that the call to hrtimer_forward_now() will return 0, meaning the for loop is never left, and tasks are never rescheduled.

The large number of children makes do_sched_cfs_period_timer() take longer than the period, which impacts calls to hrtimer_forward_now().

The kernel will produce this call trace:

CPU: 51 PID: 0 Comm: swapper/51 Tainted: P OELK 4.15.0-50-generic #54-Ubuntu

Call Trace:
<IRQ>
? sched_clock+0x9/0x10
walk_tg_tree_from+0x61/0xd0
? task_rq_unlock+0x30/0x30
unthrottle_cfs_rq+0xcb/0x1a0
distribute_cfs_runtime+0xd7/0x100
sched_cfs_period_timer+0xd9/0x1a0
? sched_cfs_slack_timer+0xe0/0xe0
__hrtimer_run_queues+0xdf/0x230
hrtimer_interrupt+0xa0/0x1d0
smp_apic_timer_interrupt+0x6f/0x130
apic_timer_interrupt+0x84/0x90
</IRQ>

This has been hit in production in a particularly highly utilised hadoop cluster which is powering an analytics platform. About 30% of the cluster experiences this issue every week, and the machines need a manual reboot to get back online.

[Fix]

This was fixed in 5.1 upstream with the below commit:

commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8
Author: Phil Auld <email address hidden>
Date: Tue Mar 19 09:00:05 2019 -0400
subject: sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup

This commit adds a check to see if the loop has run too many times, and if it
has, scales up the period and quota, so the timer can complete before the
next period expires, which enables the task to be rescheduled normally.

Note, 2e8e19226398db8265a8e675fcc0118b9e80c9e8 was included in upstream stable versions 4.4.179, 4.9.171, 4.14.114, 4.19.37, 5.0.10.

This patch requires minor backporting for 4.15, so please cherry pick
d069fe4844f8d799d771659a745fe91870c93fda from upstream stable 4.14.y, where the backport has been done by the original author, to all bionic kernels.

[Testcase]

Kind of hard to reproduce, so this was tested on a production hadoop cluster
with extremely high CPU load.

I built a test kernel, which is available here:

https://launchpad.net/~mruffell/+archive/ubuntu/sf232784-test

For unpatched kernels, expect the machine to lockup and print the call trace in the impact section.

For patched kernels, if the machine hits the condition, it will print a warning to the kernel log with the new period and quota which has been used:

Example from the same hadoop cluster with a machine running the test kernel:

% uname -a
4.15.0-50-generic #54+hf232784v20190626b1-Ubuntu
% sudo grep cfs /var/log/kern.log.*
cfs_period_timer[cpu40]: period too short, scaling up (new cfs_period_us 67872, cfs_quota_us = 3475091)
cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us 22430, cfs_quota_us = 1148437)
cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us 25759, cfs_quota_us = 1318908)
cfs_period_timer[cpu68]: period too short, scaling up (new cfs_period_us 29583, cfs_quota_us = 1514684)
cfs_period_timer[cpu49]: period too short, scaling up (new cfs_period_us 33974, cfs_quota_us = 1739519)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 39017, cfs_quota_us = 1997729)
cfs_period_timer[cpu10]: period too short, scaling up (new cfs_period_us 44809, cfs_quota_us = 2294267)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 51460, cfs_quota_us = 2634823)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 59099, cfs_quota_us = 3025929)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 67872, cfs_quota_us = 3475091)

[Regression Potential]
This patch was accepted into upstream stable versions 4.4.179, 4.9.171,
4.14.114, 4.19.37, 5.0.10, and is thus treated as stable and trusted by the
community.

Xenial received this patch in 4.4.0-150.176, as per LP #1828420
Disco will receive this patch in the next version, as per LP #1830922
Eoan already has the patch, being based on 5.2.

While this does effect a core part of the kernel, the scheduler, the patch has been extensively tested, and it has been proven in production environments, so the overall risk is low.

tags: added: sts
description: updated
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Matthew Ruffell (mruffell)
description: updated

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1836971

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
description: updated
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Matthew Ruffell (mruffell) wrote :

I installed 4.15.0-56 from -proposed and ran some CPU stress tests and
everything was stable and no problems detected.

The kernel was also installed onto a few machines from the production hadoop
cluster which experienced extremely high cpu load, and the kernel is stable, with
no CPU lockups in sight.

Since this is also an upstream -stable patch, and we have good test results from
the hadoop cluster which was suffering the issue, I am happy to mark this as
verified.

tags: added: verification-done-bionic
removed: verification-needed-bionic

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Matthew Ruffell (mruffell) wrote :

I installed linux-hwe 4.15.0-57 #63~16.04.1-Ubuntu on xenial, placed a heavy load on the system.

Everything seemed stable and no lockups occurred. Due to the successful tests in the hadoop environment under bionic and the nature of this patch being from upstream stable, I am happy to mark this as verified for xenial.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :
Download full text (171.3 KiB)

This bug was fixed in the package linux - 4.15.0-58.64

---------------
linux (4.15.0-58.64) bionic; urgency=medium

  * unable to handle kernel NULL pointer dereference at 000000000000002c (IP:
    iget5_locked+0x9e/0x1f0) (LP: #1838982)
    - Revert "ovl: set I_CREATING on inode being created"
    - Revert "new primitive: discard_new_inode()"

linux (4.15.0-57.63) bionic; urgency=medium

  * CVE-2019-1125
    - x86/cpufeatures: Carve out CQM features retrieval
    - x86/cpufeatures: Combine word 11 and 12 into a new scattered features word
    - x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
    - x86/speculation: Enable Spectre v1 swapgs mitigations
    - x86/entry/64: Use JMP instead of JMPQ
    - x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS

  * Packaging resync (LP: #1786013)
    - update dkms package versions

linux (4.15.0-56.62) bionic; urgency=medium

  * bionic/linux: 4.15.0-56.62 -proposed tracker (LP: #1837626)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] update helper scripts

  * CVE-2019-2101
    - media: uvcvideo: Fix 'type' check leading to overflow

  * hibmc-drm Causes Unreadable Display for Huawei amd64 Servers (LP: #1762940)
    - [Config] Set CONFIG_DRM_HISI_HIBMC to arm64 only
    - SAUCE: Make CONFIG_DRM_HISI_HIBMC depend on ARM64

  * Bionic: support for Solarflare X2542 network adapter (sfc driver)
    (LP: #1836635)
    - sfc: make mem_bar a function rather than a constant
    - sfc: support VI strides other than 8k
    - sfc: add Medford2 (SFC9250) PCI Device IDs
    - sfc: improve PTP error reporting
    - sfc: update EF10 register definitions
    - sfc: populate the timer reload field
    - sfc: update MCDI protocol headers
    - sfc: support variable number of MAC stats
    - sfc: expose FEC stats on Medford2
    - sfc: expose CTPIO stats on NICs that support them
    - sfc: basic MCDI mapping of 25/50/100G link speeds
    - sfc: support the ethtool ksettings API properly so that 25/50/100G works
    - sfc: add bits for 25/50/100G supported/advertised speeds
    - sfc: remove tx and MCDI handling from NAPI budget consideration
    - sfc: handle TX timestamps in the normal data path
    - sfc: add function to determine which TX timestamping method to use
    - sfc: use main datapath for HW timestamps if available
    - sfc: only enable TX timestamping if the adapter is licensed for it
    - sfc: MAC TX timestamp handling on the 8000 series
    - sfc: on 8000 series use TX queues for TX timestamps
    - sfc: only advertise TX timestamping if we have the license for it
    - sfc: simplify RX datapath timestamping
    - sfc: support separate PTP and general timestamping
    - sfc: support second + quarter ns time format for receive datapath
    - sfc: support Medford2 frequency adjustment format
    - sfc: add suffix to large constant in ptp
    - sfc: mark some unexported symbols as static
    - sfc: update MCDI protocol headers
    - sfc: support FEC configuration through ethtool
    - sfc: remove ctpio_dmabuf_start from stats
    - sfc: stop the TX queue before pushing new buffers

  * [18.04 FEAT] zKVM: Add hardwar...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers