Kernel memory resource leak on hosts running containers performing sudo tasks
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla-ansible | Fix Released | High | Michal Nasiadka |
Rocky | Fix Released | High | Michal Nasiadka |
Stein | Fix Released | High | Michal Nasiadka |
Train | Fix Released | High | Michal Nasiadka |
Ussuri | Fix Released | High | Michal Nasiadka |
Bug Description
There appears to be an issue with an interaction between systemd, systemd-logind and cgroupfs that leads to the following symptoms:
In the systemd journal, recurring log messages of this form:
systemd[1]: Failed to start Session c314331 of user root.
In /proc/slabinfo, ever-increasing growth in the inode_cache slab:
# grep inode_cache /proc/slabinfo
inode_cache 3396340 3397020 592 55 8 : tunables 0 0 0 : slabdata 61764 61764 0
In /sys/fs/
# ls /sys/fs/
555543
Ultimately this is a resource leak that may result in memory exhaustion.
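To put the slab growth in terms of memory, the inode_cache line can be converted to bytes. The following is a minimal sketch, assuming the slabinfo version 2.1 column layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the tunables and slabdata groups); the function name `parse_slabinfo_line` is hypothetical and used only for illustration.

```python
SAMPLE = (
    "inode_cache 3396340 3397020 592 55 8 "
    ": tunables 0 0 0 : slabdata 61764 61764 0"
)


def parse_slabinfo_line(line):
    """Parse one data line of /proc/slabinfo (assumed version 2.1 layout)."""
    # The line has three groups separated by ':'; only the first and the
    # last ("slabdata") are used here.
    stats, _tunables, slabdata = (part.split() for part in line.split(":"))
    name = stats[0]
    active_objs, num_objs, objsize, objperslab, pagesperslab = map(int, stats[1:6])
    return {
        "name": name,
        "active_objs": active_objs,
        "num_objs": num_objs,
        "objsize": objsize,
        "num_slabs": int(slabdata[2]),
        # Memory held by allocated objects: object count times object size.
        "object_bytes": num_objs * objsize,
    }


if __name__ == "__main__":
    info = parse_slabinfo_line(SAMPLE)
    print(f"{info['name']}: {info['object_bytes'] / 2**30:.2f} GiB in objects")
```

Sampling this value periodically and comparing successive readings would show whether the cache is still growing.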
The issue appears to be that systemd is requested to create a cgroup for a PID that is not found. Here's an strace of systemd in action:
open("/
fcntl(60, F_GETFL) = 0x8001 (flags O_WRONLY|
fstat(60, {st_mode=
mmap(NULL, 4096, PROT_READ|
write(60, "28114\n", 6) = -1 ESRCH (No such process)
systemd does not handle this failure well: it does not clean up the cgroup state it has created up to this point. As a result, the number of files and directories in cgroupfs keeps growing (along with inode_cache).
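The missing cleanup can be illustrated with a short sketch. This is not systemd's code; it only shows the general pattern of creating a cgroup directory, attempting to attach a PID, and removing the directory again if the write fails with ESRCH. The file name `cgroup.procs` and the names `write_pid` and `attach_or_cleanup` are assumptions for illustration, since the path in the strace above is truncated.

```python
import errno
import os


def write_pid(procs_path, pid):
    """Write a PID to a cgroup's process list file (assumed: cgroup.procs)."""
    with open(procs_path, "w") as handle:
        handle.write(f"{pid}\n")


def attach_or_cleanup(cgroup_dir, pid, writer=write_pid):
    """Create cgroup_dir and attach pid to it.

    Returns True if the PID was attached. If the kernel reports ESRCH
    (no such process), the freshly created directory is removed again
    and False is returned. Any other error is re-raised.
    """
    os.mkdir(cgroup_dir)  # fails if the directory already exists
    try:
        writer(os.path.join(cgroup_dir, "cgroup.procs"), pid)
    except OSError as exc:
        if exc.errno != errno.ESRCH:
            raise
        # Undo the partial state so that nothing is left behind.
        os.rmdir(cgroup_dir)
        return False
    return True
```

The `writer` parameter exists only so the error path can be exercised without a real cgroup hierarchy; on an actual cgroupfs mount the default writer would be used.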
Systemd receives the instruction to attach a new process to a cgroup via a dbus message from systemd-logind. Here is an example from dbus-monitor --system:
method call time=1580407532
string "session-
string "fail"
array [
struct {
string "Slice"
variant string "user-0.slice"
}
struct {
string "Description"
variant string "Session c313601 of user root"
}
struct {
string "After"
variant array [
]
}
struct {
string "After"
variant array [
]
}
struct {
string "SendSIGHUP"
variant boolean true
}
struct {
string "PIDs"
variant array [
]
}
struct {
string "TasksMax"
variant uint64 184467440737095
}
]
array [
]
The PID referenced is not found. Systemd-logind appears to be relaying a CreateSession notification from another source on dbus. It appears this other source is transient, and is attached to dbus for the duration of a sudo command.
At the time the cgroup creation attempt occurs, the only process being executed appears to be a periodic neutron-rootwrap polling command, which runs from within the Neutron Kolla containers.
My theory is that references are made on dbus to processes in a different PID namespace from the host. These messages make their way to systemd on the host, which then cannot act accordingly as it is in a different process namespace. Systemd failing to clean up is a secondary issue to the root cause.
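One way to test this theory is to compare PID namespaces directly. The sketch below assumes a Linux host where `/proc/<pid>/ns/pid` is a symlink identifying the namespace; the function names are hypothetical, and reading the link for another user's process typically requires root.

```python
import os


def pid_namespace(pid="self"):
    """Return the PID namespace identifier for pid, or None if unreadable.

    On Linux the link target has the form 'pid:[<inode>]'; two processes
    share a PID namespace exactly when these identifiers match.
    """
    try:
        return os.readlink(f"/proc/{pid}/ns/pid")
    except OSError:
        # Process gone, insufficient privileges, or no procfs available.
        return None


def shares_pid_namespace(pid):
    """Return True only if pid is visibly in the caller's PID namespace."""
    own = pid_namespace()
    other = pid_namespace(pid)
    return own is not None and own == other


if __name__ == "__main__":
    print("own namespace:", pid_namespace())
    print("shares with self:", shares_pid_namespace(os.getpid()))
```

Run on the host, a PID reported over dbus from inside a container would either not resolve at all (returning None here) or resolve to an unrelated host process, which would be consistent with the ESRCH seen in the strace.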
I see this on the latest CentOS 7.7 with Rocky deployed via Kayobe and Kolla-Ansible. I believe it also affects Stein.
Changed in kolla-ansible:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Michal Nasiadka (mnasiadka)
Changed in kolla-ansible:
status: Triaged → In Progress
Could you paste your kernel and systemd versions? I am not seeing this behavior on Stein.