systemd excessively reads mountinfo and udev in dense container environments

Bug #1924686 reported by Li Zhou
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Li Zhou

Bug Description

Brief Description
-----------------
When testing a large number of pods (> 230), we occasionally observed a number of issues related to the systemd process:
    systemd ran continually at 90-100% CPU usage
    systemd memory usage increased rapidly (20 GB/hour)
    systemctl commands would always time out (Failed to get properties: Connection timed out)
    sm services failed and could not recover: open-ldap, registry-token-server, docker-distribution, etcd
    new pods could not start and got stuck in the ContainerCreating state
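For context on the scale involved: each running pod typically contributes several mount entries (overlay layers, tmpfs mounts for secrets and service-account tokens, shm), and systemd reparses the entire mount table on every mount or unmount event. A quick way to see the size of the table systemd must parse (PID 1's table needs root, so this falls back to the caller's own namespace):

```shell
# Count the mount entries systemd has to reparse on every mount/unmount event.
# /proc/1/mountinfo is only readable by root; fall back to our own table.
entries=$( (wc -l < /proc/1/mountinfo) 2>/dev/null || wc -l < /proc/self/mountinfo )
echo "mount entries: $entries"
```

With well over 230 pods, this count runs into the thousands, which is what turns each reparse into a measurable CPU cost.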

This is verified to be a known issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1819868

Severity
--------
Major

Steps to Reproduce
------------------
Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1819868

Expected Behavior
------------------
All of the pods are deployed successfully.

Actual Behavior
----------------
systemd ran continually at 90-100% CPU usage
systemd memory usage increased rapidly (20 GB/hour)
systemctl commands would always time out (Failed to get properties: Connection timed out)
sm services failed and could not recover: open-ldap, registry-token-server, docker-distribution, etcd
new pods could not start and got stuck in the ContainerCreating state

Reproducibility
---------------
Intermittent

System Configuration
--------------------
One node system

Test Activity
-------------
Developer Testing

Workaround
----------
None

Li Zhou (lzhou2)
Changed in starlingx:
assignee: nobody → Li Zhou (lzhou2)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/786599

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote : Re: systemd excessively reads mountinfo and udev in dense OpenShift environments

Marking for stx.6.0 as this is an issue with very large deployments

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.6.0 stx.distro.other
Ghada Khalil (gkhalil)
summary: - systemd excessively reads mountinfo and udev in dense OpenShift
+ systemd excessively reads mountinfo and udev in dense container
environments
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/786599
Committed: https://opendev.org/starlingx/integ/commit/ccfeeef59d39e42b2775bb5a216732c4999f6e42
Submitter: "Zuul (22348)"
Branch: master

commit ccfeeef59d39e42b2775bb5a216732c4999f6e42
Author: Li Zhou <email address hidden>
Date: Mon Apr 12 02:15:25 2021 -0400

    systemd: Prevent excessive /proc/1/mountinfo reparsing

    Backport the patches for this issue:
    https://bugzilla.redhat.com/show_bug.cgi?id=1819868

    We met such an issue:
    When testing a large number of pods (> 230), occasionally observed a
    number of issues related to systemd process:
        systemd ran continually 90-100% cpu usage
        systemd memory usage started increasing rapidly (20GB/hour)
        systemctl commands would always timeout (Failed to get properties:
            Connection timed out)
        sm services failed and can't recover: open-ldap,
            registry-token-server, docker-distribution, etcd
        new pods can't start, and got stuck in state ContainerCreating

    Those patches work to prevent excessive /proc/1/mountinfo reparsing.
    It has been verified that those patches can improve this performance
    greatly.

    16 commits are listed in sequence (from [1] to [16]) at below link
    for the issue:
    https://github.com/systemd-rhel/rhel-8/pull/154/commits

    [16](10)core: prevent excessive /proc/self/mountinfo parsing
    [15][Dropped-6]test: add ratelimiting test
    [14](9)sd-event: add ability to ratelimit event sources
    [13](8)sd-event: increase n_enabled_child_sources just once
    [12](7)sd-event: update state at the end in event_source_enable
    [11](6)sd-event: remove earliest_index/latest_index into common part of
    event source objects
    [10][Dropped-5]sd-event: follow coding style with naming return
    parameter
    [9] [Dropped-4]sd-event: ref event loop while in sd_event_prepare() ot
    sd_event_run()
    [8] (5)sd-event: refuse running default event loops in any other thread
    than the one they are default for
    [7] [Dropped-3]sd-event: let's suffix last_run/last_log with "_usec"
    [6] [Dropped-2]sd-event: fix delays assert brain-o (#17790)
    [5] (4)sd-event: split out code to add/remove timer event sources to
    earliest/latest prioq
    [4] (3)sd-event: split clock data allocation out of sd_event_add_time()
    [3] [Dropped-1]sd-event: mention that two debug logged events are
    ignored
    [2] (2)sd-event: split out enable and disable codepaths from
    sd_event_source_set_enabled()
    [1] (1)sd-event: split out helper functions for reshuffling prioqs

    I ported 10 of them back (from (1) to (10)) to fix this issue
    and dropped the other 6 (from [Dropped-1] to [Dropped-6]) for those
    reasons:
    [Dropped-1]Only changes error log.
    [Dropped-2]Fixes a bug introduced in a commit which doesn't exist in
    this version.
    [Dropped-3]Only changes vars' names and there is no functional change.
    [Dropped-4]More commits are needed for merging it, while I don't see
    any help on adding the rate-limiting ability.
    [Dropped-5]Change coding style for a function which isn't really u...
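The ratelimiting ability added in commit [14] is the mechanism that stops the reparse storm: an event source may fire at most a fixed number of times per interval, after which it is held off until the window ends. A deterministic sketch of the idea (interval/burst numbers and function names here are illustrative; systemd's real implementation is C inside sd-event):

```shell
# Interval/burst ratelimiter in the style of sd-event's ratelimit (commit [14]).
INTERVAL=1   # window length, seconds
BURST=5      # events allowed per window

window_start=-1
count=0
ratelimit_ok() {   # $1 = current time in whole seconds (passed in to keep the demo deterministic)
    now=$1
    if [ "$window_start" -lt 0 ] || [ $((now - window_start)) -ge "$INTERVAL" ]; then
        window_start=$now   # new window: reset the counter
        count=0
    fi
    count=$((count + 1))
    [ "$count" -le "$BURST" ]   # non-zero status once the burst is exhausted
}

# 20 events arrive in the same second; only the first BURST get through.
allowed=0; dropped=0
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    if ratelimit_ok 0; then allowed=$((allowed + 1)); else dropped=$((dropped + 1)); fi
done
echo "allowed=$allowed dropped=$dropped"   # prints: allowed=5 dropped=15
```

Applied to the mountinfo event source in commit [16], this caps how often systemd reparses /proc/self/mountinfo no matter how fast mount events arrive.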


Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 50
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 30
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
      type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>
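The fix in the commit above amounts to zeroing the tail of the volume right after `lvextend`, so stale signatures (such as leftover drbd metadata) can no longer shadow the real filesystem type. A sketch of the idea against a scratch file (the path and sizes are illustrative; on a real node the target would be the logical volume device, and the wipe happens inside the puppet `logical_volume` provider):

```shell
# Create a 64MB sparse scratch "volume" and plant a fake stale signature at its end.
DEV=/tmp/fake-lv.img
dd if=/dev/zero of="$DEV" bs=1M count=0 seek=64 2>/dev/null
printf 'STALE-DRBD-SIGNATURE' | dd of="$DEV" bs=1 seek=$((64 * 1024 * 1024 - 20)) conv=notrunc 2>/dev/null

# The fix: after lvextend, wipe the last 10MB so leftover metadata is gone.
size_mb=$(( $(stat -c %s "$DEV") / 1024 / 1024 ))
dd if=/dev/zero of="$DEV" bs=1M seek=$((size_mb - 10)) count=10 conv=notrunc 2>/dev/null

# Verify the tail is zeroed now (prints only "00" bytes).
tail -c 20 "$DEV" | od -An -tx1 | tr -d ' \n'
```

`conv=notrunc` is the important flag: it overwrites in place without truncating the device/file, matching what a wipe after a live resize must do.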

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/services.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/services.d/worker/ceph.sh,
    /etc/services.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...
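The guard described in the commit above can be sketched as follows (the `should_skip` helper and the personality string are illustrative names for the idea; the actual patch lives inside ceph.sh itself):

```shell
# Decide whether this copy of ceph.sh should bail out: on AIO nodes (which
# carry both controller and worker personalities) only the controller copy
# under /etc/services.d/controller should do any work.
should_skip() {
    script_path=$1      # path this copy of the script was installed at
    nodetype=$2         # node personalities, e.g. "controller,worker" on AIO
    case "$nodetype" in
        *controller*) : ;;   # controller personality present: maybe skip
        *) return 1 ;;       # worker-only node: its copy must run
    esac
    case "$script_path" in
        /etc/services.d/worker/*) return 0 ;;   # duplicate worker copy: skip
        *) return 1 ;;
    esac
}

if should_skip /etc/services.d/worker/ceph.sh "controller,worker"; then
    echo "AIO node: skipping duplicate worker copy of ceph.sh"
fi
```

This keeps the MTC client's behavior unchanged on worker-only nodes while ensuring the ceph service actions run exactly once on AIO controllers.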

tags: added: in-f-centos8