Growth of file descriptors on the juju controller
Bug #2052634 reported by Simon Richardson
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | Critical | Simon Richardson |
Bug Description
The number of file descriptors grows on juju controllers over time. Once it hits a certain threshold (defined by ulimit), the leader (in an HA setup) will start to refuse connections.
Log messages in the controller log will be of the form:
2024-01-13 10:03:25 WARNING juju.worker.
As a short-term remedy, restarting the controller will close all open file descriptors and release them appropriately.
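To watch for this condition before connections are refused, the open descriptor count can be compared against the limit. The following is a minimal sketch, not part of juju: it assumes a Linux host with /proc mounted, and the helper names (countOpenFDs, softLimit, nearLimit) and the 0.8 threshold are hypothetical. Note that softLimit reports the limit of the process running the check; the limit of another process, such as the controller, would have to be read from elsewhere (for example /proc/<pid>/limits).

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// countOpenFDs returns the number of entries in /proc/<pid>/fd (Linux only).
// pid may be "self" for the current process. Hypothetical helper.
func countOpenFDs(pid string) (int, error) {
	entries, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

// softLimit returns the soft RLIMIT_NOFILE of the current process.
func softLimit() (uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	return lim.Cur, nil
}

// nearLimit reports whether open/limit has reached the given threshold.
// A zero limit is treated as "unknown" and never reported as near.
func nearLimit(open int, limit uint64, threshold float64) bool {
	if limit == 0 {
		return false
	}
	return float64(open)/float64(limit) >= threshold
}

func main() {
	open, err := countOpenFDs("self")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot count descriptors:", err)
		os.Exit(1)
	}
	limit, err := softLimit()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read limit:", err)
		os.Exit(1)
	}
	// 0.8 is an arbitrary illustrative threshold.
	fmt.Printf("open=%d limit=%d near=%v\n", open, limit, nearLimit(open, limit, 0.8))
}
```

Sampling this count periodically and plotting it would show the steady growth described above, rather than the plateau expected of a healthy controller.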
tags: | added: canonical-is |
summary: |
- Growth of file discriptors on the juju controller + Growth of file descriptors on the juju controller |
Changed in juju: | |
status: | In Progress → Fix Committed |
Changed in juju: | |
status: | Fix Committed → Fix Released |
@manadart located the goroutine dumps from the day of the crash and from 6 days prior. After analysing the dumps, we noticed that the number of API clients was increasing. Note: each client has a monitor containing a goroutine that sends health checks to the server to ensure the connection stays open. This in turn keeps the facade(s) for the connection alive; once the connection has gone, the facade is closed, along with all of its resources.
Inspection of the dump identified the number of goroutines for the api client:
6 days old:
-----------+-------------------------------------------------------
     11685   runtime.gopark
             runtime.selectgo
             github.com/juju/juju/api.(*monitor).run
             runtime.goexit
-----------+-------------------------------------------------------
Day of crash:
-----------+-------------------------------------------------------
     16549   runtime.gopark
             runtime.selectgo
             github.com/juju/juju/api.(*monitor).run
             runtime.goexit
-----------+-------------------------------------------------------
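A count of goroutines per function can also be taken from a plain-text goroutine dump. The sketch below is illustrative only and is not how the figures above were produced: it assumes the dump is in the format written by the Go runtime's goroutine profile at debug level 2, where each goroutine's stack is separated by a blank line, and the helper names countGoroutinesIn and dumpGoroutines are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
	"strings"
)

// countGoroutinesIn counts the goroutines in a debug=2 style dump whose
// stack contains a frame for fn (a fully qualified function name).
// "created by" lines are skipped so that only goroutines actually running
// fn are counted, not goroutines that fn spawned.
func countGoroutinesIn(dump, fn string) int {
	n := 0
	for _, stack := range strings.Split(dump, "\n\n") {
		for _, line := range strings.Split(stack, "\n") {
			if strings.HasPrefix(line, "created by ") {
				continue
			}
			// Frame lines start with the function name followed by "(".
			if strings.HasPrefix(line, fn+"(") {
				n++
				break
			}
		}
	}
	return n
}

// dumpGoroutines captures the current process's goroutine stacks as text.
func dumpGoroutines() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	dump, err := dumpGoroutines()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot dump goroutines:", err)
		os.Exit(1)
	}
	fmt.Println("goroutines in main.main:", countGoroutinesIn(dump, "main.main"))
}
```

Applied to two dumps taken days apart, a count for github.com/juju/juju/api.(*monitor).run that keeps rising points at clients that are never closed.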
We would expect the number of goroutines to plateau based on the number of units and active models. In addition, the clients were being spawned by the controller itself: these are goroutines of the jujud controller. Normally, only workers spawn an API client, via the apicaller worker. Unless the apicaller worker, which has remained untouched for a long time, had suddenly broken, something else was at fault.
Investigation of the heap profile quickly identified the problem.
The secrets manager facade for cross model remote secrets wasn't correctly closing the remote client. A patch to address this will be proposed.
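The general shape of this kind of leak can be sketched as follows. This is not the juju code or the actual patch: remoteClient, fetchSecretLeaky and fetchSecretClosed are hypothetical stand-ins, and the counter exists only to make the leak observable. The sketch assumes a client that starts a monitor goroutine when opened and only stops it when Close is called.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// liveMonitors tracks running monitor goroutines (for illustration only).
var liveMonitors int64

func liveMonitorCount() int64 { return atomic.LoadInt64(&liveMonitors) }

// remoteClient is a hypothetical stand-in for an API client whose monitor
// goroutine keeps the connection (and its file descriptor) alive.
type remoteClient struct {
	done    chan struct{}
	stopped chan struct{}
	once    sync.Once
}

func newRemoteClient() *remoteClient {
	c := &remoteClient{done: make(chan struct{}), stopped: make(chan struct{})}
	atomic.AddInt64(&liveMonitors, 1)
	go c.monitor()
	return c
}

// monitor blocks until Close is called; a real monitor would also ping.
func (c *remoteClient) monitor() {
	defer close(c.stopped)
	defer atomic.AddInt64(&liveMonitors, -1)
	<-c.done
}

// Close stops the monitor and waits for it to exit. It is safe to call twice.
func (c *remoteClient) Close() error {
	c.once.Do(func() { close(c.done) })
	<-c.stopped
	return nil
}

func (c *remoteClient) GetSecret(uri string) (string, error) {
	return "value-for-" + uri, nil
}

// fetchSecretLeaky never closes the client, so its monitor lives forever.
func fetchSecretLeaky(uri string) (string, error) {
	client := newRemoteClient()
	return client.GetSecret(uri)
}

// fetchSecretClosed releases the client on every return path.
func fetchSecretClosed(uri string) (string, error) {
	client := newRemoteClient()
	defer client.Close()
	return client.GetSecret(uri)
}

func main() {
	for i := 0; i < 3; i++ {
		fetchSecretLeaky("secret:example")
	}
	fmt.Println("after leaky calls:", liveMonitorCount())
	for i := 0; i < 3; i++ {
		fetchSecretClosed("secret:example")
	}
	fmt.Println("after closed calls:", liveMonitorCount())
}
```

In the leaky variant every call leaves one monitor goroutine behind, which matches the steadily rising count seen in the dumps; deferring Close right after the client is created keeps the count flat.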