Canonical Juju

juju controllers hit open file limit

Bug #2025160 reported by Ponnuvel Palaniyappan on 2023-06-27

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	New	Undecided	Unassigned

Bug Description

Juju controllers use a lot of socket connections that they frequently hit the open files limit 64000.

```
2023-06-07 11:35:41 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 320ms

2023-06-07 11:35:41 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 640ms

2023-06-07 11:35:42 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2023-06-07 11:35:43 ERROR juju.worker.dependency engine.go:695 "instance-poller" manifold worker returned unexpected error: unexpected: Get "http://10.134.171.3/MAAS/api/2.0/machines/?agent_name=fb611b32-8f03-4ab6-8e68-398a7f50ca86&id=we7mgc&id=amd4rr&id=4tgxyw&id=mar6yg&id=k4h8pt&id=tfsgh6&id=pwnn8y&id=4gt4ka": dial tcp 10.134.171.3:80: socket: too many open files
2023-06-07 11:35:43 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2023-06-07 11:35:44 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2023-06-07 11:35:45 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2023-06-07 11:35:46 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2023-06-07 11:35:47 ERROR juju.apiserver.metricsender metricsender.go:114 Post "https://api.jujucharms.com/omnibus/v3/metrics": dial tcp: lookup api.jujucharms.com on 127.0.0.53:53: dial udp 127.0.0.53:53: socket: too many open files
github.com/juju/juju/apiserver/facades/agent/metricsender.(*HTTPSender).Send:31:
2023-06-07 11:35:47 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2023-06-07 11:35:48 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2023-06-07 11:35:48 WARNING juju.apiserver.metricsmanager metricsmanager.go:242 failed to send metrics for model-b2b523b4-18d9-4b62-8615-d2f3d96e3e51: Post "https://api.jujucharms.com/omnibus/v3/metrics": dial tcp: lookup api.jujucharms.com on 127.0.0.53:53: dial udp 127.0.0.53:53: socket: too many open files
2023-06-07 11:35:49 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
```

At the time this was oberserved the controllers were at 2.9.42. Since then they've been upgraded to 2.9.43. I doubt this matters - just noting it.

After restarting the controllers, they started gradually going up again - in about 2 days the fd usage of one of the controllers reached ~15000 and growing.

This looks similar to https://bugs.launchpad.net/juju/+bug/1979957.

See original description

Ponnuvel Palaniyappan (pponnuvel) on 2023-06-27

description:	updated
description:	updated

Ponnuvel Palaniyappan (pponnuvel) on 2023-06-28

summary:

- juju controllers git open file limit
+ juju controllers hit open file limit

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-06-29:

Can you get the lsof output similarly to the bug you've linked, so we can see the source/destination of the connections?

Is this controller involved in CMRs like that one?

Changed in juju:
status:	New → Incomplete

Revision history for this message

Ponnuvel Palaniyappan (pponnuvel) wrote on 2023-06-29:

juju_engine_report Edit (4.0 MiB, text/plain)

lsof: https://pastebin.canonical.com/p/RPTVyTFbVH/
/proc/<jujud>/fd: https://pastebin.canonical.com/p/FGKJHtTyMJ/
Also attached juju_engine_report

Revision history for this message

Ponnuvel Palaniyappan (pponnuvel) wrote on 2023-06-30 (last edit on 2023-06-30):

And yes, there's CMR in use for LMA stack (not COS).

The leadership of controllers has been changed since the original issue was observed.

Here's the juju_metrics from a controller: https://pastebin.canonical.com/p/Rn9z9h6yyw/

Right now, it seems to be very low (process_open_fds is 2568). But the user reported it's slowly growing.

For 680 machines and 5600+ agents, it may very well be normal even if process_open_fds grows to another few 1000s.
So I'll update if it reaches suspiciously high (say 20K).

A possibility is that the sockets weren't closed sufficiently quickly and in time, it settles on its own.
So it may not be a "leak" in the usual sense. Essentially this "spike" could be temporary.

However, it's still a concern when it happens because:

1. If it hits 64000 limit, jujud has to be restarted.

2. Even if it doesn't hit 64000, the controller is very unresponsive when the open fds were too high.

So it'd be useful to figure out why the sockets weren't closed and some way to trace those open
sockets so that we could get more info when it happens the next time.

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-06-30:

Can you grep for errors in the controller logs?
I mean on the machines rather than from debug-log.

Revision history for this message

Ponnuvel Palaniyappan (pponnuvel) wrote on 2023-07-04:

machine-2-2023-06-08T16-44-14.977.log.gz Edit (4.0 MiB, application/octet-stream)

Attached the controller's log.

Revision history for this message

John A Meinel (jameinel) wrote on 2023-07-07:

I do believe there were bugs in older versions of Juju where certain types of failures would leave an open socket, that would then leak extra file descriptors over time.
We fixed the ones that we saw in the wild, but it is plausible that there is another.

Of the ones that are listed, a number of them are connections to our Mongo Database, such as:
10.aaa.98:42788->10.bbb.215:37017 (ESTABLISHED)

but that only accounts for 566 of them.
The vast majority of the ones in the lsof are connections to the API port (17070):
14250 of them. Such as:
10.aaa.98:17070->10.ccc.98:47918 (ESTABLISHED)

In fact, there were only 2 others that were TCP connections that weren't 17070:
10.aaa.98:42368->10.xxx.16:8774 (ESTABLISHED)
10.aaa.98:51798->10.xxx.17:9696 (ESTABLISHED)

And those don't make any particular sense to me, but I'm not particularly worried about those.

I do see a lot of connections that involve Ubuntu FAN addresses, eg:
252.98.0.1:56120->252.98.0.1:17070 (ESTABLISHED)

2152 of them are of the form 252...->252.98.0.1:17070
another 2152 of them are of the form:
252.98.0.1:17070->

and the other 498 of the form
252.98.0.1:*->252.144.0.1:17070 or 252.222.0.1:17070

Those should be connections between the HA controllers to each other on the FAN address. (IIRC .0.1 is the FAN address associated with the host machine.)

There are a lot more connections from this machine to 17070 (both on self and on others) than I would expect, but it may not be wrong (3472 conns of the form:
.*->.*:17070)
such as:
127.0.0.1:48788->127.0.0.1:17070 (ESTABLISHED)
10.aaa.98:40066->10.aaa.98:17070 (ESTABLISHED)
10.aaa.98:39090->10.bbb.215:17070 (ESTABLISHED)
10.aaa.98:36414->10.aaa.144:17070 (ESTABLISHED)

Note that I would expect some of those (potentially 1 per model), but 3000 of them seems a little surprising.
Best guess is that there is some sort of leak around a model's socket getting stuck open and another one getting created for it.

Revision history for this message

Ponnuvel Palaniyappan (pponnuvel) wrote on 2023-07-08 (last edit on 2023-07-08):

Thanks, jameinel, for taking a look. You seem to agree that there's a definitely some issue just not what it could be. As noted in the description, this wasn't the only instance it happened.

The user reported further increase since (there's currently one controller with 24K+ open fds). The one you looked at had only about 15K. I don't have many bright ideas on how to extract relevant info from the user environment (lsof, proc/<jujud>/fd, juju_metrics, etc tell "there's a problem" but not much more). I am currently looking into Delve [0]. But I don't know if the user would be willing to run on a production deployment.

If you have any suggestions on what info would help narrow down the problem, that'd be useful too.
I looked into all the "agent introspection" tools and nothing seem to be particularly good for this issue.
Otherwise I'll check if Delve is an option.

[0] https://github.com/go-delve/delve

Revision history for this message

Launchpad Janitor (janitor) wrote on 2023-09-07:

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status:	Incomplete → Expired

Revision history for this message

Ponnuvel Palaniyappan (pponnuvel) wrote on 2023-10-29:

FYI. This problem still persists and only regular reboots are help avoid hitting the FD limit.

Changed in juju:
status:	Expired → New

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-11-23:

#10

Is this controller still on 2.9.43?

2.9.44 received a fix that prevents the firewaller restarts indicated by:

2023-03-23 16:40:24 ERROR juju.worker.dependency engine.go:693 "firewaller" manifold worker returned unexpected error: invalid destination for ingress rule: invalid protocol "ipv6-icmp", expected "tcp", "udp", or "icmp"

One variable that could be eliminated.

Can you also attach a fresh capture of the controller's log?

Changed in juju:
status:	New → Incomplete

Revision history for this message

Ponnuvel Palaniyappan (pponnuvel) wrote on 2023-11-23:

#11

Thanks, Jo, for looking into this.

I'll get the controllers upgraded to 2.9.45 and get fresh logs.

Revision history for this message

Ponnuvel Palaniyappan (pponnuvel) wrote on 2023-11-30:

#12

The environment has been updated to 2.9.45 and so far, no issues. But it's only 3 days since. I'll followup a couple of weeks time if this reoccurs.

Revision history for this message

Benjamin Allot (ballot) wrote on 2024-01-15:

#13

fwiw we saw this on a controller 3.1.6:

```
ubuntu@juju-f90a8b-controller-2:~$ /var/lib/juju/tools/machine-2/jujud --version
3.1.6-ubuntu-amd64
```

Logs were showing this kind of logs:
```
2024-01-03 10:15:36 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2024-01-03 10:15:37 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2024-01-03 10:15:38 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2024-01-03 10:15:39 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

2024-01-03 10:15:40 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
```

Revision history for this message

Launchpad Janitor (janitor) wrote on 2024-03-16:

#14

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status:	Incomplete → Expired

Junien F (axino) on 2024-03-18

Changed in juju:
status:	Expired → New

Revision history for this message

Joseph Phillips (manadart) wrote on 2024-03-21:

#15

Can we get an update on where this is being observed today? Happy to dig to any introspection we have for Prodstack controllers.

Joseph Phillips (manadart) on 2024-04-18

Changed in juju:
status:	New → Incomplete

Revision history for this message

James Simpson (jsimpso) wrote on 2024-06-11:

#16

We experienced this today with one of the new 3.4.2 controllers. Per usual the controllers are deployed in a HA set of three, but only one of them is showing the issue.

https://pastebin.ubuntu.com/p/hr5rvNf62D/ - Juju debug output showing the error
https://pastebin.ubuntu.com/p/PPN2Vvx3m6/ - lsof comparison (summarised) between the three controllers

Looking through the lsof output I had, the vast majority of this appears to be sockets held open talking to a different set of controllers (3.1.8 - presumably for cross model relations between the two controller sets):

ubuntu@juju-3fe98c-controller-2:~$ wc -l lsof.log
1389690 lsof.log
ubuntu@juju-3fe98c-controller-2:~$ grep -E "10.141.22.(89|175|167):17070 $ESTABLISHED$" lsof.log | wc -l
1338277

This is controller/2 of "juju-controller-34-production-ps6", introspection reports available here:
https://juju-controller-reports.admin.canonical.com/ps6/juju-controller-34-production-ps6/controller-2/

The first recorded log we have of the issue was 2024-06-10T19:35:57.583Z

Changed in juju:
status:	Incomplete → New

Revision history for this message

Simon Richardson (simonrichardson) wrote on 2024-06-11:

#17

I've spent some time digging into this.

Looking at the firewaller, there is a potential race condition around managing remote relation data[1].
More time is required to setup the workflow to replicate the leak.

https://github.com/juju/juju/blob/76eba8c7a521e5dea6efdf81d872cef2c788001f/worker/firewaller/firewaller.go#L1518-L1548

Revision history for this message

Harry Pidcock (hpidcock) wrote on 2024-06-11:

#18

`data.isOffer = remoteAppResult.Result.IsConsumerProxy` is the only data race I can see there.

I looked into the strange data.Kill(); data.catacomb = catacomb.Catacomb{}; but it should be safe since the start function is only called by the runner after the Wait() call on the worker has returned and the Wait function on the catacomb only returns when all the added workers have also returned. The remote controller facade is closed in a defer on the remoteRelationData loop, so that should close the remote connection.

I looked into the callers of startRelation, and there it is only called once per relation. Even if it was called multiple times (which it cannot), the StartWorkers call merely "queues" the start up when the worker restarts.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.