juju 2.9.18 is leaking file descriptors to log files

Bug #1952942 reported by John A Meinel
This bug affects 2 people

Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Joseph Phillips (manadart)

Bug Description

We took an lsof dump of PS4.5 because we were seeing far too many open files.

Doing:

$ awk -F " " '{ print $10 }' lsof.2021-12-01.txt | sort | uniq -c | sort -n
...
   4176 /var/log/juju/models/admin-<model-A>.log
   4176 /var/log/juju/models/admin-<model-B>.log
   4176 /var/log/juju/models/admin-<model-C>.log
   4176 /var/log/juju/models/admin-<model-D>.log
   4176 /var/log/juju/models/admin-<model-E>.log
   4176 /var/log/juju/models/admin-<model-F>.log
   4176 /var/log/juju/models/prod-<model-G>.log
   4176 /var/log/juju/models/prod-<model-H>.log
   4176 /var/log/juju/models/stg-<model-I>.log
   4176 /var/log/juju/models/stg-<model-J>.log
   4176 /var/log/juju/models/stg-<model-K>.log
   4176 /var/log/juju/models/stg-<model-L>.log
   4234 /var/log/juju/models/admin-<model-M>.log
   4234 /var/log/juju/models/admin-<model-N>.log
   4234 /var/log/juju/models/stg-<model-O>.log
   4640 /var/log/juju/unit-controller-backup-14.log
   4640 /var/log/juju/unit-juju-introspection-proxy-2.log
   4640 /var/log/juju/unit-landscape-client-1.log
   4640 /var/log/juju/unit-livepatch-1.log
   4640 /var/log/juju/unit-nrpe-1.log
   4640 /var/log/juju/unit-ntp-1.log
   4640 /var/log/juju/unit-telegraf-1.log
   4640 /var/log/juju/unit-turku-agent-1.log
   4640 /var/log/juju/unit-ubuntu-1.log
   7205 (ESTABLISHED)
  30895

That says we have something like 7k network connections (which is in the ballpark of the number of unit agents expected to be connected to this controller), but we have the models/*.log files open around 4.2k times each and the /var/log/juju/unit-*.log files open around 4.6k times each.

Clearly something is not letting go of its logging.

It is interesting to note that the unit logs are open more times than the model logs. It looks as though we are not closing out something like a State object, and that object ends up holding open all of its associated log files.
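
A purely illustrative sketch of that hypothesis (made-up names, not actual Juju code): if something State-like owns one open writer per model/unit log and a discarded instance is never closed, every file it holds leaks at once, which would make the per-file counts climb in lockstep.

package main

import (
	"os"
	"path/filepath"
)

// logOwner stands in for an object (a State, say) that ends up owning one
// open writer per model/unit log file.
type logOwner struct {
	files []*os.File
}

func newLogOwner(dir string, names []string) (*logOwner, error) {
	o := &logOwner{}
	for _, name := range names {
		f, err := os.OpenFile(filepath.Join(dir, name+".log"),
			os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o640)
		if err != nil {
			o.Close() // release whatever was opened so far
			return nil, err
		}
		o.files = append(o.files, f)
	}
	return o, nil
}

// Close must run whenever the owner is discarded; skipping it leaks every
// file the owner holds.
func (o *logOwner) Close() {
	for _, f := range o.files {
		f.Close()
	}
	o.files = nil
}

var retained []*logOwner // something keeps references alive, so GC cannot step in

func main() {
	names := []string{"admin-model-a", "unit-ubuntu-1"}
	for i := 0; i < 10; i++ {
		o, err := newLogOwner(os.TempDir(), names)
		if err != nil {
			return
		}
		retained = append(retained, o) // never Closed: two descriptors leaked per iteration
	}
}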

Looking at a particular log file we see:
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
jujud 16993 root 566w REG 253,1 370194 268226 /var/log/juju/models/admin-<model-A>.log
with lots of different FD values, and occasionally different TID values.

I think lsof truncates the FD field once a value is >10000, because we see entries like 9733w and *513w.

Looking at the graphs, it appears we 'leak' file descriptors when mongo disconnects us: a subset of our internal processes gets restarted, but not everything. So it would appear to be bad cleanup of previous state when we do a soft restart.
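
A minimal sketch of that suspicion (again with made-up names, not the actual worker code): if the soft-restart path opens a fresh log sink without closing the one it replaces, each mongo disconnect adds one open descriptor per log file.

package main

import "os"

// logWorker stands in for a component that is soft-restarted when mongo
// drops the connection, re-opening its log sink each time.
type logWorker struct {
	sink *os.File
}

func (w *logWorker) restart(path string) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o640)
	if err != nil {
		return err
	}
	if w.sink != nil {
		w.sink.Close() // skip this and every soft restart leaks one descriptor
	}
	w.sink = f
	return nil
}

func main() {
	w := &logWorker{}
	for i := 0; i < 5; i++ { // five simulated "mongo disconnects"
		_ = w.restart(os.TempDir() + "/unit-ubuntu-1.log")
	}
	// With the Close above only one descriptor stays open; without it, five would.
}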

John A Meinel (jameinel) wrote:

An idea was floated that Go might not be reaping file descriptors when the File objects were garbage collected, but Arnaud located the specific code in Go where it sets a finalizer:
https://cs.opensource.google/go/go/+/refs/tags/go1.17.3:src/os/file_unix.go;l=187;drc=refs%2Ftags%2Fgo1.17.3

And that has been true for quite some time:
https://cs.opensource.google/go/go/+/refs/tags/go1.8:src/os/file_unix.go;l=57

So it is likely that something inside Juju itself is holding on to the File objects. (My personal guess is that we leak State objects as part of the workers restarting, but I wouldn't expect State objects to be holding onto the files that we are using for logging.)
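
For what it's worth, here is a small sketch of the behaviour those links describe (Linux-only, and finalizer timing is not deterministic): the finalizer only closes the descriptor once the *os.File becomes unreachable, so anything still holding a reference keeps the fd open indefinitely.

package main

import (
	"fmt"
	"os"
	"runtime"
	"time"
)

// openFDs counts this process's open descriptors via /proc (Linux-specific).
func openFDs() int {
	entries, _ := os.ReadDir("/proc/self/fd")
	return len(entries)
}

var held []*os.File // simulates a component that never lets go of its files

func main() {
	fmt.Println("fds at start:", openFDs())

	for i := 0; i < 100; i++ {
		f, err := os.Open("/etc/hostname")
		if err != nil {
			panic(err)
		}
		held = append(held, f) // still referenced: the finalizer can never run
	}
	runtime.GC()
	time.Sleep(100 * time.Millisecond)
	fmt.Println("fds while references are held:", openFDs()) // roughly 100 higher

	held = nil // drop the references
	runtime.GC()
	time.Sleep(100 * time.Millisecond) // give the finalizer goroutine a chance to run
	fmt.Println("fds after dropping references:", openFDs()) // back near the baseline
}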

Joseph Phillips (manadart) wrote:

Worth noting that I did a pass over the worker manifolds that use a state tracker, auditing for missing pool.Done calls.

Found 3 workers where a start-up error could leave the pool unreleased. Those are fixed under:
https://github.com/juju/juju/pull/13549

As mentioned above, this is unlikely to be the root cause. Good to fix in any case.
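
For context, the shape of the pattern being audited looks roughly like this (names are illustrative placeholders, not the actual manifold API): a start function that borrows pooled state has to release it on every early-return error path, not just on normal worker shutdown.

package main

import "errors"

// Placeholder types standing in for the real tracker/pool/worker.
type pool struct{}
type worker struct{ p *pool }

type stateTracker struct{ refs int }

func (t *stateTracker) Use() (*pool, error) { t.refs++; return &pool{}, nil }
func (t *stateTracker) Done()               { t.refs-- }

func newWorker(p *pool) (*worker, error) {
	return nil, errors.New("boom") // simulate a start-up failure
}

func startWorker(t *stateTracker) (*worker, error) {
	p, err := t.Use()
	if err != nil {
		return nil, err
	}
	w, err := newWorker(p)
	if err != nil {
		// This is the call the audit was checking for: without it, a failed
		// start-up strands the pool reference and everything it keeps open.
		t.Done()
		return nil, err
	}
	return w, nil
}

func main() {
	t := &stateTracker{}
	if _, err := startWorker(t); err != nil && t.refs != 0 {
		panic("pool reference leaked on start-up error")
	}
}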

Changed in juju:
importance: High → Low
assignee: nobody → Joseph Phillips (manadart)