We encountered the following issue:
When testing with a large number of pods (> 230), we occasionally observed
several problems related to the systemd process:
systemd continually ran at 90-100% CPU usage
systemd memory usage increased rapidly (20GB/hour)
systemctl commands always timed out (Failed to get properties: Connection timed out)
sm services failed and could not recover: open-ldap, registry-token-server, docker-distribution, etcd
new pods could not start and got stuck in the ContainerCreating state
These patches prevent excessive /proc/1/mountinfo reparsing.
It has been verified that they greatly improve performance in this
scenario.
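To illustrate the idea behind the fix, here is a minimal sketch of interval/burst rate limiting applied to "mount table changed" notifications. It is not the systemd code: the class name, the function name, and the values of 1 second and 5 events are assumptions chosen for illustration. The sketch simply skips notifications over the limit, whereas a real event loop would have to defer the pending change and handle it once the window expires.

```python
class RateLimit:
    """Fixed-window limiter: admit at most `burst` events per `interval` seconds."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.begin = None  # start time of the current window
        self.count = 0     # events admitted in the current window

    def below(self, now):
        # Open a new window on first use or once the old one has expired.
        if self.begin is None or now - self.begin > self.interval:
            self.begin = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def count_reparses(change_times, interval=1.0, burst=5):
    """Return how many change notifications would trigger a reparse.

    `change_times` is a list of timestamps in seconds, in ascending order.
    The defaults are illustrative assumptions, not values from the patches.
    """
    limiter = RateLimit(interval, burst)
    return sum(1 for t in change_times if limiter.below(t))
```

For example, `count_reparses([i / 1000 for i in range(1000)])` admits only the first 5 of 1000 notifications that arrive within one second, while notifications spaced more than one second apart are all admitted.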
[16] (10) core: prevent excessive /proc/self/mountinfo parsing
[15] [Dropped-6] test: add ratelimiting test
[14] (9) sd-event: add ability to ratelimit event sources
[13] (8) sd-event: increase n_enabled_child_sources just once
[12] (7) sd-event: update state at the end in event_source_enable
[11] (6) sd-event: remove earliest_index/latest_index into common part of
     event source objects
[10] [Dropped-5] sd-event: follow coding style with naming return
     parameter
[9]  [Dropped-4] sd-event: ref event loop while in sd_event_prepare() ot
     sd_event_run()
[8]  (5) sd-event: refuse running default event loops in any other thread
     than the one they are default for
[7]  [Dropped-3] sd-event: let's suffix last_run/last_log with "_usec"
[6]  [Dropped-2] sd-event: fix delays assert brain-o (#17790)
[5]  (4) sd-event: split out code to add/remove timer event sources to
     earliest/latest prioq
[4]  (3) sd-event: split clock data allocation out of sd_event_add_time()
[3]  [Dropped-1] sd-event: mention that two debug logged events are
     ignored
[2]  (2) sd-event: split out enable and disable codepaths from
     sd_event_source_set_enabled()
[1]  (1) sd-event: split out helper functions for reshuffling prioqs
I backported 10 of them ((1) to (10)) to fix this issue and dropped
the other 6 ([Dropped-1] to [Dropped-6]) for the following reasons:
[Dropped-1] Only changes error logging.
[Dropped-2] Fixes a bug introduced by a commit that doesn't exist in
this version.
[Dropped-3] Only renames variables; there is no functional change.
[Dropped-4] Merging it requires additional commits, and it does not
help with adding the rate-limiting ability.
[Dropped-5] Changes coding style for a function that isn't really used
by anyone.
[Dropped-6] Only adds test cases.
Closes-Bug: #1924686
Signed-off-by: Li Zhou <email address hidden>
Change-Id: Ia4c8f162cb1a47b40d1b26cf4d604976b97e92d6
Reviewed: https://review.opendev.org/c/starlingx/integ/+/786599
Committed: https://opendev.org/starlingx/integ/commit/ccfeeef59d39e42b2775bb5a216732c4999f6e42
Submitter: "Zuul (22348)"
Branch: master
commit ccfeeef59d39e42b2775bb5a216732c4999f6e42
Author: Li Zhou <email address hidden>
Date: Mon Apr 12 02:15:25 2021 -0400
systemd: Prevent excessive /proc/1/mountinfo reparsing
Backport the patches for this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1819868
16 commits are listed in sequence (from [1] to [16]) at the link below
for the issue:
https://github.com/systemd-rhel/rhel-8/pull/154/commits