kernel: Applications doing file I/O on nohz_full CPUs with preempt-rt kernel may cause vm.dirty_bytes threshold to be reached

Bug #2002039 reported by M. Vefa Bicakci
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: M. Vefa Bicakci

Bug Description

Brief Description
-----------------
Applications carrying out file I/O on nohz_full CPUs may cause the vm.dirty_bytes sysctl threshold to be reached.

We have received a report that running a test application on nohz_full CPUs causes the Dirty field in /proc/meminfo to eventually reach the threshold value set by the vm.dirty_bytes threshold, and this causes all applications carrying out disk I/O to eventually block.

The issue was found to be in the handling of the vm_node_stat array, which is updated from multiple contexts:

* hard IRQ context (e.g. quiet_vmstat, which is called from hard IRQ context by the NOHZ code)
* other contexts (e.g. __mod_node_page_state, which is called from numerous other parts of the kernel)

We found that __mod_node_page_state and its sibling functions update vm_node_stat (and other arrays) in a non-IRQ-safe manner. When combined with the fact that quiet_vmstat is called from hard IRQ context, this appears to cause vm_node_stat and other statistics arrays to be incorrectly updated. (Recall that the preempt-rt kernel makes spin_lock_irqsave a sleeping lock that does *not* disable IRQs.)

This bug is opened as a placeholder so that a fix can be merged.

Severity
--------
Major: Certain workloads eventually result in system hangs

Steps to Reproduce
------------------
Applications that log to files from nohz_full CPUs and rotate those log files appear to trigger this issue. This description is admittedly vague; if there is interest, I can publish a cleaned-up test application.

Expected Behavior
------------------
The Dirty field in /proc/meminfo should not increase without bound.

Actual Behavior
----------------
The Dirty field increases gradually and eventually reaches the threshold set by the vm.dirty_bytes sysctl. The value does not decrease even if the problematic/triggering user-space application is killed.

Reproducibility
---------------
Reliably reproducible

System Configuration
--------------------
Reproduced on all-in-one-simplex and duplex with low-latency/preempt-rt kernel.

Branch/Pull Time/Commit
-----------------------
Not applicable.

Last Pass
---------
StarlingX versions with 3.10-based kernels are not affected, as the issue was "introduced" with a commit that was merged in the v4.15-rc1 development time frame:

commit 62cb1188ed86a9cf082fd2f757d4dd9b54741f24
Author: Peter Zijlstra <email address hidden>
Date: Tue Aug 29 15:07:54 2017 +0200

    sched/idle: Move quiet_vmstate() into the NOHZ code

    quiet_vmstat() is an expensive function that only makes sense when we
    go into NOHZ.

Timestamp/Logs
--------------
Not applicable.

Test Activity
-------------
Normal use.

Workaround
----------
None.

Changed in starlingx:
assignee: nobody → M. Vefa Bicakci (vbicakci)
status: New → Confirmed
description: updated
Changed in starlingx:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/869382
Committed: https://opendev.org/starlingx/kernel/commit/436c7067d0e022d2053272e8b9c3d9c18473de5e
Submitter: "Zuul (22348)"
Branch: master

commit 436c7067d0e022d2053272e8b9c3d9c18473de5e
Author: M. Vefa Bicakci <email address hidden>
Date: Thu Jan 5 15:47:33 2023 +0000

    kernel: Do not call quiet_vmstat from IRQ context

    We received a bug report indicating that the "Dirty" field in
    /proc/meminfo was increasing without bounds, to the point that the
    number of dirty file pages would eventually reach what is enforced by
    the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
    StarlingX) and cause any task attempting to carry out disk I/O to get
    blocked.

    Upon further debugging, we noticed that this issue occurred on nohz_full
    CPUs where a user application was carrying out disk I/O by writing to
    and rotating log files. The issue was reproducible with the preempt-rt
    patch set very reliably.

    This commit addresses the issue in question, by reverting commit
    62cb1188ed86 ("sched/idle: Move quiet_vmstate() into the NOHZ code"),
    which was merged in the v4.15-rc1 time frame. The revert, in effect,
    moves the quiet_vmstat function call from hard IRQ context back to the
    start of the idle loop. Please see the patch description for a more
    detailed overview.

    Note that this commit does not introduce a "novel" change, as the
    4.14.298-rt140 kernel, released on 2022-11-04 does not have the reverted
    commit either, which should preclude the need for regression testing in
    terms of functionality and performance.

    I would like to acknowledge the extensive help and guidance provided by
    Jim Somerville <email address hidden> during the debugging and
    investigation of this issue.

    Verification

    - The issue was reproduced with an older CentOS-based StarlingX-based
      system, running a StarlingX/linux-yocto preempt-rt kernel based on
      v5.10.112-rt61 by running a test application for about 4~5 hours. In
      this configuration, the issue becomes apparent within 1 hour or so,
      where the Dirty field in /proc/meminfo reaches the threshold sysctl
      vm.dirty_background_bytes (set to 600_000_000 bytes in StarlingX). By
      the end of the test, the Dirty field was very close to the
      vm.dirty_bytes threshold sysctl (800_000_000 bytes).

      Afterwards, a kernel patched with this commit was found to no longer
      reproduce the issue, by running the same test application for ~12.5
      hours. (Note that the second test had Meltdown/Spectre mitigations
      enabled by accident, but we are confident that this does not affect
      the test results.) The Dirty value in /proc/meminfo stayed around
      180_000 KiB for the duration of the test. A test re-run with the
      Meltdown/Spectre mitigations disabled, for a duration of 1.75 hours,
      had similar results.

      The test application that reproduces this issue writes to and rotates
      log files in a rapid manner, with a usleep(0) call between every log
      file rot...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.distro.other