Growth of file descriptors on the juju controller

Bug #2052634 reported by Simon Richardson
Affects          Status         Importance   Assigned to         Milestone
Canonical Juju   Fix Released   Critical     Simon Richardson

Bug Description

The number of file descriptors grows on juju controllers over time. Once a certain threshold (defined by ulimit) is reached, the leader (in an HA setup) will start to refuse connections.

Log messages in the controller log will be of the form:

    2024-01-13 10:03:25 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files;

As a short-term remedy, restarting the controller closes all open file descriptors and releases them appropriately.
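
For diagnosis, a minimal sketch (assuming a Linux controller host with procfs, and inspecting the current process purely for illustration) of comparing a process's open file descriptor count against its RLIMIT_NOFILE:

    package main

    import (
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        // Each entry under /proc/self/fd is one open file descriptor.
        entries, err := os.ReadDir("/proc/self/fd")
        if err != nil {
            fmt.Fprintln(os.Stderr, "reading /proc/self/fd:", err)
            os.Exit(1)
        }

        // The soft limit is what accept4 runs up against ("too many open files").
        var lim syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
            fmt.Fprintln(os.Stderr, "getrlimit:", err)
            os.Exit(1)
        }

        fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n",
            len(entries), lim.Cur, lim.Max)
    }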

Tags: canonical-is
Tom Haddon (mthaddon)
tags: added: canonical-is
summary: - Growth of file discriptors on the juju controller
+ Growth of file descriptors on the juju controller
Revision history for this message
Simon Richardson (simonrichardson) wrote :

@manadart located the goroutine dumps from the day of the crash and from 6 days prior. After analysing the goroutine dumps, we noticed that the number of api clients was increasing. Note: each client has a monitor, which contains a goroutine that sends health checks to the server to ensure the connection stays open. This in turn keeps the facade(s) for the connection alive; once the connection has gone, the facade is closed, along with all of its resources.
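
For context, a simplified sketch of that monitor pattern (not the actual juju implementation; the ping function and channel names are made up for illustration), showing why every leaked client leaves one goroutine parked in a select:

    package main

    import (
        "fmt"
        "time"
    )

    // monitor is a simplified stand-in for the per-client health-check loop.
    type monitor struct {
        ping func() error  // hypothetical health check against the server
        gone chan struct{} // closed when the client/connection is torn down
    }

    // run parks in a select loop, which is the shape that shows up in the
    // goroutine dump as api.(*monitor).run -> runtime.selectgo -> runtime.gopark.
    func (m *monitor) run() {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                if err := m.ping(); err != nil {
                    return // connection unhealthy; its facades get closed
                }
            case <-m.gone:
                return // client closed normally
            }
        }
    }

    func main() {
        m := &monitor{ping: func() error { return nil }, gone: make(chan struct{})}
        go m.run() // one of these per api client; a leaked client leaves it parked forever
        time.Sleep(50 * time.Millisecond)
        close(m.gone) // a client that is never closed would never reach this point
        time.Sleep(50 * time.Millisecond)
        fmt.Println("monitor stopped")
    }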

Inspection of the dump identified the number of goroutines for the api client:

6 days before the crash:

-----------+-------------------------------------------------------
     11685 runtime.gopark
             runtime.selectgo
             github.com/juju/juju/api.(*monitor).run
             runtime.goexit
-----------+-------------------------------------------------------

Day of crash:

-----------+-------------------------------------------------------
     16549 runtime.gopark
             runtime.selectgo
             github.com/juju/juju/api.(*monitor).run
             runtime.goexit
-----------+-------------------------------------------------------
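
For reference, a short sketch of how goroutine dumps in this grouped format can be produced with the standard runtime/pprof package (a sketch of the standard-library mechanism, not necessarily how these particular dumps were collected):

    package main

    import (
        "fmt"
        "os"
        "runtime"
        "runtime/pprof"
    )

    func main() {
        // debug=1 groups identical stacks and prefixes each group with its count,
        // which is where figures like "11685 ... api.(*monitor).run" come from.
        if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 1); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }

        // The raw total is also available directly.
        fmt.Println("total goroutines:", runtime.NumGoroutine())
    }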

We would expect the number of goroutines to plateau based on the number of units and active models. Moreover, the clients were being spawned by the controller itself; these are the goroutines for the jujud controller. Normally, only workers spawn an api client, via the apicaller worker. Unless the apicaller worker, which has remained untouched for a long time, was suddenly broken, something else was at fault.

Investigation of the heap profile quickly identified the problem.

The secrets manager facade for cross-model remote secrets wasn't correctly closing the remote client. A patch to address this will be proposed.
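
As an illustration of this class of leak (using a hypothetical RemoteClient interface and fake dialler, not the code touched by the patch): an api client that is dialled per call but never closed keeps its connection, monitor goroutine, and file descriptor alive; the fix is to close the client once the call completes.

    package main

    import (
        "fmt"
        "io"
    )

    // RemoteClient stands in for a cross-model api connection.
    type RemoteClient interface {
        io.Closer
        GetSecret(uri string) (string, error)
    }

    // leaky dials a client and returns without closing it, so the underlying
    // connection, its monitor goroutine, and its file descriptor all live on.
    func leaky(dial func() (RemoteClient, error), uri string) (string, error) {
        client, err := dial()
        if err != nil {
            return "", err
        }
        return client.GetSecret(uri) // client is never closed
    }

    // fixed closes the client once the call completes.
    func fixed(dial func() (RemoteClient, error), uri string) (string, error) {
        client, err := dial()
        if err != nil {
            return "", err
        }
        defer client.Close()
        return client.GetSecret(uri)
    }

    // fakeClient is a stand-in so the sketch runs without a real controller.
    type fakeClient struct{}

    func (fakeClient) Close() error                         { return nil }
    func (fakeClient) GetSecret(uri string) (string, error) { return "s3cr3t", nil }

    func main() {
        dial := func() (RemoteClient, error) { return fakeClient{}, nil }
        if _, err := leaky(dial, "secret:example"); err != nil {
            fmt.Println("leaky:", err)
        }
        if _, err := fixed(dial, "secret:example"); err != nil {
            fmt.Println("fixed:", err)
        }
        fmt.Println("done")
    }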

Revision history for this message
Simon Richardson (simonrichardson) wrote :

PR to address the issue: https://github.com/juju/juju/pull/16905

I'm also looking to see if there are any more of these in the wild, just in case this is the tip of the iceberg.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released