kernel: Applications doing file I/O on nohz_full CPUs with preempt-rt kernel may cause vm.dirty_bytes threshold to be reached

Bug #2002039 reported by M. Vefa Bicakci
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: M. Vefa Bicakci

Bug Description

Brief Description
-----------------
Applications carrying out file I/O on nohz_full CPUs may cause the vm.dirty_bytes sysctl threshold to be reached.

We have received a report that running a test application on nohz_full CPUs causes the Dirty field in /proc/meminfo to eventually reach the threshold value set by the vm.dirty_bytes threshold, and this causes all applications carrying out disk I/O to eventually block.

The issue was found to be in the handling of the vm_node_stat array, which is updated from multiple contexts:

* hard IRQ context (e.g. quiet_vmstat, which is called from hard IRQ context by the NOHZ code)
* other contexts (e.g. __mod_node_page_state, which is called from numerous other parts of the kernel)

We found that __mod_node_page_state and its sibling functions update vm_node_stat (and other arrays) in a non-IRQ-safe manner. When combined with the fact that quiet_vmstat is called from hard IRQ context, this appears to cause vm_node_stat and other statistics arrays to be incorrectly updated. (Recall that the preempt-rt kernel makes spin_lock_irqsave a sleeping lock that does *not* disable IRQs.)

This bug is opened as a placeholder so that a fix can be merged.

Severity
--------
Major: Certain workloads eventually result in system hangs

Steps to Reproduce
------------------
Applications that log to files from nohz_full CPUs and rotate those log files appear to trigger this issue. This description is admittedly vague; if there is interest, I can publish a cleaned-up test application.

Expected Behavior
------------------
The Dirty field in /proc/meminfo should not increase without bound.

Actual Behavior
----------------
The Dirty field increases gradually and eventually reaches the threshold set by the vm.dirty_bytes sysctl. The value does not decrease even if the problematic/triggering user-space application is killed.

Reproducibility
---------------
Reliably reproducible

System Configuration
--------------------
Reproduced on all-in-one-simplex and duplex with low-latency/preempt-rt kernel.

Branch/Pull Time/Commit
-----------------------
Not applicable.

Last Pass
---------
StarlingX versions with 3.10-based kernels are not affected, as the issue was "introduced" with a commit that was merged in the v4.15-rc1 development time frame:

commit 62cb1188ed86a9cf082fd2f757d4dd9b54741f24
Author: Peter Zijlstra <email address hidden>
Date: Tue Aug 29 15:07:54 2017 +0200

    sched/idle: Move quiet_vmstate() into the NOHZ code

    quiet_vmstat() is an expensive function that only makes sense when we
    go into NOHZ.

Timestamp/Logs
--------------
Not applicable.

Test Activity
-------------
Normal use.

Workaround
----------
None.

Changed in starlingx:
assignee: nobody → M. Vefa Bicakci (vbicakci)
status: New → Confirmed
description: updated
Changed in starlingx:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/869382
Committed: https://opendev.org/starlingx/kernel/commit/436c7067d0e022d2053272e8b9c3d9c18473de5e
Submitter: "Zuul (22348)"
Branch: master

commit 436c7067d0e022d2053272e8b9c3d9c18473de5e
Author: M. Vefa Bicakci <email address hidden>
Date: Thu Jan 5 15:47:33 2023 +0000

    kernel: Do not call quiet_vmstat from IRQ context

    We received a bug report indicating that the "Dirty" field in
    /proc/meminfo was increasing without bounds, to the point that the
    number of dirty file pages would eventually reach what is enforced by
    the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
    StarlingX) and cause any task attempting to carry out disk I/O to get
    blocked.

    Upon further debugging, we noticed that this issue occurred on nohz_full
    CPUs where a user application was carrying out disk I/O by writing to
    and rotating log files. The issue was reproducible with the preempt-rt
    patch set very reliably.

    This commit addresses the issue in question, by reverting commit
    62cb1188ed86 ("sched/idle: Move quiet_vmstate() into the NOHZ code"),
    which was merged in the v4.15-rc1 time frame. The revert, in effect,
    moves the quiet_vmstat function call from hard IRQ context back to the
    start of the idle loop. Please see the patch description for a more
    detailed overview.

    Note that this commit does not introduce a "novel" change, as the
    4.14.298-rt140 kernel, released on 2022-11-04 does not have the reverted
    commit either, which should preclude the need for regression testing in
    terms of functionality and performance.

    I would like to acknowledge the extensive help and guidance provided by
    Jim Somerville <email address hidden> during the debugging and
    investigation of this issue.

    Verification

    - The issue was reproduced with an older CentOS-based StarlingX-based
      system, running a StarlingX/linux-yocto preempt-rt kernel based on
      v5.10.112-rt61 by running a test application for about 4~5 hours. In
      this configuration, the issue becomes apparent within 1 hour or so,
      where the Dirty field in /proc/meminfo reaches the threshold sysctl
      vm.dirty_background_bytes (set to 600_000_000 bytes in StarlingX). By
      the end of the test, the Dirty field was very close to the
      vm.dirty_bytes threshold sysctl (800_000_000 bytes).

      Afterwards, a kernel patched with this commit was found to no longer
      reproduce the issue, by running the same test application for ~12.5
      hours. (Note that the second test had Meltdown/Spectre mitigations
      enabled by accident, but we are confident that this does not affect
      the test results.) The Dirty value in /proc/meminfo stayed around
      180_000 KiB for the duration of the test. A test re-run with the
      Meltdown/Spectre mitigations disabled, for a duration of 1.75 hours,
      had similar results.

      The test application that reproduces this issue writes to and rotates
      log files in a rapid manner, with a usleep(0) call between every log
      file rot...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.distro.other