Comment 5 for bug 1987430

Chris Siebenmann (cks) wrote :

We've seen this on a wide variety of workloads, including general user logins with NFS mounts, SLURM head and cluster nodes, a Prometheus/Grafana server, a Grafana Loki server, two Exim servers, a Samba server, LDAP servers, Matlab license servers, and a monitoring machine that just runs conserver. It seems to be correlated with the number of processes and the amount of activity on a machine, as the two machines that leaked the most are our primary general use login server and our Prometheus server (which is constantly running a churn of monitoring and probe activity). As a result, I don't currently have any particular commands that reproduce this.

It may be relevant that we are auditing some system calls. The generated /etc/audit/audit.rules on our servers has:
-D
-b 8192
-f 1
--backlog_wait_time 60000
-a exit,always -F arch=b64 -S execve
-a exit,always -F arch=b32 -S execve

We also have audit messages logged only to files, by masking systemd-journald-audit.socket.

I will see if I can reproduce this in a VM by generating random activity (I'm going to try compiling something over and over), first in our standard configuration and then in a more minimal one. It will likely take at least a day or two to know one way or another.
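
For anyone who wants to try something similar, here is a rough sketch of one way to generate process churn. It is not the reproducer I'm describing above (that one is a repeated compile); it assumes that simply spawning many short-lived processes, each of which makes an audited execve call, is enough activity to matter. The function names, the default `true` command, and the batch size are all made up for illustration.

```python
import subprocess
import time


def spawn_churn(count, command=("true",)):
    """Run `command` `count` times in sequence.

    Each run is a fork plus an execve, so with the audit rules above
    every iteration should produce an audit record (an assumption about
    what triggers the leak, not something confirmed).
    Returns how many runs exited with status 0.
    """
    succeeded = 0
    for _ in range(count):
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        if result.returncode == 0:
            succeeded += 1
    return succeeded


def run_for(seconds, batch=100, command=("true",)):
    """Spawn batches of processes until `seconds` have elapsed.

    Returns the total number of processes started.
    """
    deadline = time.monotonic() + seconds
    total = 0
    while time.monotonic() < deadline:
        spawn_churn(batch, command)
        total += batch
    return total
```

Calling `run_for(3600)` would keep this going for an hour. Watching kernel memory use over that time (for example in /proc/meminfo or /proc/slabinfo) is how I'd expect to tell whether anything is leaking.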