juju 2.9.18 is leaking file descriptors to log files
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Triaged | Low | Joseph Phillips | |
Bug Description
We took an lsof dump of PS4.5 because we were seeing far too many open files.
Doing:
$ awk -F " " '{ print $10 }' lsof.2021-12-01.txt | sort | uniq -c | sort -n
...
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4176 /var/log/
4234 /var/log/
4234 /var/log/
4234 /var/log/
4640 /var/log/
4640 /var/log/
4640 /var/log/
4640 /var/log/
4640 /var/log/
4640 /var/log/
4640 /var/log/
4640 /var/log/
4640 /var/log/
7205 (ESTABLISHED)
30895
That says we have something like 7k network connections (which is in the ballpark of the number of expected unit agents connected to this controller),
but we have the models/*.log file opened 4k times and the /var/log/juju/unit* log files opened 4.6k times.
Clearly something is not letting go of its logging.
It is interesting to note that unit logs are open more often than the model logs. It suggests we are not closing out a State object, which ends up keeping all of its associated log files open.
Looking at a particular log file we see:
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
jujud 16993 root 566w REG 253,1 370194 268226 /var/log/
with lots of different FD values, and occasionally different TID values.
I think lsof truncates the FD column once the value exceeds 10000, because we see things like 9733w and *513w.
Looking at graphs, it appears we 'leak' file descriptors when mongo disconnects us: a subset of our internal workers gets restarted, but not everything. So it would appear to be bad cleanup of previous state when we do a soft restart.
Changed in juju: | |
importance: | High → Low |
assignee: | nobody → Joseph Phillips (manadart) |
We had an idea floated that Go might not be reaping file descriptors when the File objects were being garbage collected, but Arnaud specifically located the code in Go where it sets a finalizer: https://cs.opensource.google/go/go/+/refs/tags/go1.17.3:src/os/file_unix.go;l=187;drc=refs%2Ftags%2Fgo1.17.3
And that has been true for quite some time: https://cs.opensource.google/go/go/+/refs/tags/go1.8:src/os/file_unix.go;l=57
So it is likely that something inside Juju itself is holding on to the File objects. (My personal guess is that we leak State objects as part of the workers restarting, but I wouldn't expect State objects to be holding onto the files that we are using for logging.)