kernel: Applications doing file I/O on nohz_full CPUs with preempt-rt kernel may cause vm.dirty_bytes threshold to be reached
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | M. Vefa Bicakci | | |
Bug Description
Brief Description
-----------------
Applications carrying out file I/O on nohz_full CPUs may cause the vm.dirty_bytes sysctl threshold to be reached.
We have received a report that running a test application on nohz_full CPUs causes the Dirty field in /proc/meminfo to grow until it reaches the limit set by the vm.dirty_bytes sysctl, at which point all applications carrying out disk I/O eventually block.
The issue was found to be in the handling of the vm_node_stat array, which is updated from multiple contexts:
* hard IRQ contexts (such as via quiet_vmstat, which is called from hard IRQ context)
* other contexts (such as via __mod_node_
We found that __mod_node_
This bug is opened as a placeholder so that a fix can be merged.
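The following is a toy model, not kernel code, of how a counter updated from two contexts can drift. It assumes a non-atomic read-modify-write of a per-CPU delta that an interrupting context folds into the global count between the read and the write. The class and method names are hypothetical, and the exact interleaving is an assumption made for illustration rather than something stated in this report.

```python
# Toy model only: names are illustrative, not kernel symbols.

class NodeStat:
    """One global counter plus one per-CPU pending delta."""

    def __init__(self):
        self.global_count = 0   # stands in for a vm_node_stat entry
        self.cpu_diff = 0       # stands in for this CPU's pending delta

    def fold(self):
        """What the interrupting context does: flush the delta globally."""
        self.global_count += self.cpu_diff
        self.cpu_diff = 0

    def mod(self, delta, interrupt=False):
        """Non-atomic read-modify-write of the per-CPU delta."""
        stale = self.cpu_diff          # 1. read the pending delta
        if interrupt:
            self.fold()                # 2. interrupt folds that same delta
        self.cpu_diff = stale + delta  # 3. write back, based on the stale read

    def total(self):
        return self.global_count + self.cpu_diff


def run(interrupt):
    stat = NodeStat()
    stat.mod(+5)                       # five pages dirtied
    stat.mod(+1, interrupt=interrupt)  # one more, possibly interrupted
    stat.mod(-6)                       # all six pages cleaned
    return stat.total()


print(run(False))  # 0: the counter returns to zero
print(run(True))   # 5: the folded delta was counted twice
```

In this model the five pages that were pending at the time of the interrupt are counted both globally and in the rewritten delta, so the total never returns to zero even after every page is cleaned.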
Severity
--------
Major: Certain workloads eventually result in system hangs
Steps to Reproduce
------------------
Applications that log to files from nohz_full CPUs and rotate those log files appear to trigger this issue. This description is admittedly vague. If there is interest, I can publish a cleaned-up test application.
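As a rough illustration of the kind of workload described above, the sketch below writes a log file, rotates it, and repeats. It is not the test application mentioned in this report. The function name, file names, and parameters are hypothetical, and whether this particular loop reproduces the issue has not been verified. The optional CPU pinning uses os.sched_setaffinity, which is available on Linux only.

```python
import os
import tempfile
import time


def write_and_rotate(directory, rotations, lines_per_file=100, cpu=None):
    """Write a log file, rotate it, and repeat; return the rotation count."""
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})  # pin this process to one CPU
    active = os.path.join(directory, "app.log")
    rotated = os.path.join(directory, "app.log.1")
    for _ in range(rotations):
        with open(active, "w") as log:
            for i in range(lines_per_file):
                log.write(f"log line {i}\n")
        os.replace(active, rotated)  # rotate, overwriting the previous file
        time.sleep(0)                # yield briefly between rotations
    return rotations


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_and_rotate(tmp, rotations=1000)
```

To approximate the reported scenario, the cpu argument would be set to one of the CPUs listed in the nohz_full kernel parameter of the system under test.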
Expected Behavior
------------------
Dirty field in /proc/meminfo should not increase without bounds.
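One way to watch for this condition is to compare the Dirty field against the vm.dirty_bytes limit. The sketch below is a minimal example with hypothetical helper names. It assumes the usual /proc/meminfo layout, where Dirty is reported in kB, and that vm.dirty_bytes is non-zero, as it is in StarlingX according to this report.

```python
def parse_dirty_kib(meminfo_text):
    """Return the Dirty value in KiB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("Dirty:"):
            return int(line.split()[1])
    raise ValueError("no Dirty field found")


def dirty_fraction(meminfo_text, dirty_bytes):
    """Fraction of the vm.dirty_bytes limit currently in use."""
    return parse_dirty_kib(meminfo_text) * 1024 / dirty_bytes


def live_dirty_fraction():
    """Read the live values; returns None if vm.dirty_bytes is 0."""
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    with open("/proc/sys/vm/dirty_bytes") as f:
        limit = int(f.read())
    # A value of 0 means vm.dirty_ratio is in effect instead.
    return dirty_fraction(meminfo, limit) if limit else None
```

Calling live_dirty_fraction() periodically on an affected system would be expected to show the fraction creeping toward 1.0 and staying there, whereas on a healthy system it should rise and fall as dirty pages are written back.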
Actual Behavior
----------------
Dirty field increases gradually, and eventually reaches the threshold set by vm.dirty_bytes sysctl. The value does not decrease even if the problematic/
Reproducibility
---------------
Reliably reproducible
System Configuration
-------
Reproduced on all-in-one-simplex and duplex with low-latency/
Branch/Pull Time/Commit
-------
Not applicable.
Last Pass
---------
StarlingX versions with 3.10-based kernels are not affected, as the issue was "introduced" with a commit that was merged in the v4.15-rc1 development time frame:
commit 62cb1188ed86a9c
Author: Peter Zijlstra <email address hidden>
Date: Tue Aug 29 15:07:54 2017 +0200
sched/idle: Move quiet_vmstate() into the NOHZ code
quiet_vmstat() is an expensive function that only makes sense when we
go into NOHZ.
Timestamp/Logs
--------------
Not applicable.
Test Activity
-------------
Normal use.
Workaround
----------
None.
Changed in starlingx:
assignee: nobody → M. Vefa Bicakci (vbicakci)
status: New → Confirmed
description: updated
Changed in starlingx:
status: Confirmed → In Progress
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.distro.other
Reviewed: https://review.opendev.org/c/starlingx/kernel/+/869382
Committed: https://opendev.org/starlingx/kernel/commit/436c7067d0e022d2053272e8b9c3d9c18473de5e
Submitter: "Zuul (22348)"
Branch: master
commit 436c7067d0e022d2053272e8b9c3d9c18473de5e
Author: M. Vefa Bicakci <email address hidden>
Date: Thu Jan 5 15:47:33 2023 +0000
kernel: Do not call quiet_vmstat from IRQ context
We received a bug report indicating that the "Dirty" field in
/proc/meminfo was increasing without bounds, to the point that the
number of dirty file pages would eventually reach what is enforced by
the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
StarlingX) and cause any task attempting to carry out disk I/O to get
blocked.
Upon further debugging, we noticed that this issue occurred on nohz_full
CPUs where a user application was carrying out disk I/O by writing to
and rotating log files. The issue was reproducible with the preempt-rt
patch set very reliably.
This commit addresses the issue in question, by reverting commit
62cb1188ed86 ("sched/idle: Move quiet_vmstate() into the NOHZ code"),
which was merged in the v4.15-rc1 time frame. The revert, in effect,
moves the quiet_vmstat function call from hard IRQ context back to the
start of the idle loop. Please see the patch description for a more
detailed overview.
Note that this commit does not introduce a "novel" change, as the
4.14.298-rt140 kernel, released on 2022-11-04 does not have the reverted
commit either, which should preclude the need for regression testing in
terms of functionality and performance.
I would like to acknowledge the extensive help and guidance provided by
Jim Somerville <email address hidden> during the debugging and
investigation of this issue.
Verification
- The issue was reproduced with an older CentOS-based StarlingX-based
system, running a StarlingX/linux-yocto preempt-rt kernel based on
v5.10.112-rt61 by running a test application for about 4~5 hours. In
this configuration, the issue becomes apparent within 1 hour or so,
where the Dirty field in /proc/meminfo reaches the threshold sysctl
vm.dirty_background_bytes (set to 600_000_000 bytes in StarlingX). By
the end of the test, the Dirty field was very close to the
vm.dirty_bytes threshold sysctl (800_000_000 bytes).
Afterwards, a kernel patched with this commit was found to no longer
reproduce the issue, by running the same test application for ~12.5
hours. (Note that the second test had Meltdown/Spectre mitigations
enabled by accident, but we are confident that this does not affect
the test results.) The Dirty value in /proc/meminfo stayed around
180_000 KiB for the duration of the test. A test re-run with the
Meltdown/Spectre mitigations disabled, for a duration of 1.75 hours,
had similar results.
The test application that reproduces this issue writes to and rotates
log files in a rapid manner, with a usleep(0) call between every log
file rot...