juju controllers hit open file limit

Bug #2025160 reported by Ponnuvel Palaniyappan

Affects: Canonical Juju
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

Juju controllers use so many socket connections that they frequently hit the open-files limit of 64000.

```
2023-06-07 11:35:41 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 320ms
2023-06-07 11:35:41 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 640ms
2023-06-07 11:35:42 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2023-06-07 11:35:43 ERROR juju.worker.dependency engine.go:695 "instance-poller" manifold worker returned unexpected error: unexpected: Get "http://10.134.171.3/MAAS/api/2.0/machines/?agent_name=fb611b32-8f03-4ab6-8e68-398a7f50ca86&id=we7mgc&id=amd4rr&id=4tgxyw&id=mar6yg&id=k4h8pt&id=tfsgh6&id=pwnn8y&id=4gt4ka": dial tcp 10.134.171.3:80: socket: too many open files
2023-06-07 11:35:43 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2023-06-07 11:35:44 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2023-06-07 11:35:45 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2023-06-07 11:35:46 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2023-06-07 11:35:47 ERROR juju.apiserver.metricsender metricsender.go:114 Post "https://api.jujucharms.com/omnibus/v3/metrics": dial tcp: lookup api.jujucharms.com on 127.0.0.53:53: dial udp 127.0.0.53:53: socket: too many open files
github.com/juju/juju/apiserver/facades/agent/metricsender.(*HTTPSender).Send:31:
2023-06-07 11:35:47 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2023-06-07 11:35:48 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2023-06-07 11:35:48 WARNING juju.apiserver.metricsmanager metricsmanager.go:242 failed to send metrics for model-b2b523b4-18d9-4b62-8615-d2f3d96e3e51: Post "https://api.jujucharms.com/omnibus/v3/metrics": dial tcp: lookup api.jujucharms.com on 127.0.0.53:53: dial udp 127.0.0.53:53: socket: too many open files
2023-06-07 11:35:49 WARNING juju.worker.httpserver log.go:198 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
```

At the time this was observed the controllers were at 2.9.42. Since then they've been upgraded to 2.9.43. I doubt this matters - just noting it.

After restarting the controllers, the fd count started gradually going up again - in about 2 days the fd usage of one of the controllers reached ~15000 and was still growing.
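
(For reference, one way to check the fd usage against the limit on a controller machine - a sketch, assuming the machine agent is the only jujud process matched by the pgrep pattern:)

```
# Find the jujud machine agent pid (assumes a single machine agent on this controller).
JUJUD_PID=$(pgrep -f 'jujud machine' | head -n1)

# Current number of open file descriptors held by the process.
sudo ls /proc/$JUJUD_PID/fd | wc -l

# The open-files limits the process is actually running with.
sudo grep 'Max open files' /proc/$JUJUD_PID/limits
```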

This looks similar to https://bugs.launchpad.net/juju/+bug/1979957.

description: updated
description: updated
summary: - juju controllers git open file limit
+ juju controllers hit open file limit
Revision history for this message
Joseph Phillips (manadart) wrote :

Can you get the lsof output, similar to the one in the bug you've linked, so we can see the source/destination of the connections?
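
Something like this should capture it (a sketch; adjust the pid lookup to however you identify the jujud process):

```
# List the jujud machine agent's TCP connections with numeric endpoints.
JUJUD_PID=$(pgrep -f 'jujud machine' | head -n1)
sudo lsof -nP -p $JUJUD_PID -a -i TCP > lsof-jujud-tcp.txt

# Quick summary by connection state.
awk '{print $NF}' lsof-jujud-tcp.txt | sort | uniq -c | sort -rn
```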

Is this controller involved in CMRs like that one?

Changed in juju:
status: New → Incomplete
Revision history for this message
Ponnuvel Palaniyappan (pponnuvel) wrote :
Revision history for this message
Ponnuvel Palaniyappan (pponnuvel) wrote (last edit ):

And yes, there's CMR in use for the LMA stack (not COS).

The controller leadership has changed since the original issue was observed.

Here's the juju_metrics from a controller: https://pastebin.canonical.com/p/Rn9z9h6yyw/

Right now, it seems to be very low (process_open_fds is 2568). But the user reported it's slowly growing.

For 680 machines and 5600+ agents, it may very well be normal even if process_open_fds grows by another few thousand.
So I'll update if it reaches a suspiciously high level (say 20K).

One possibility is that the sockets aren't closed quickly enough but, given time, the count settles on its own.
So it may not be a "leak" in the usual sense; essentially this "spike" could be temporary.

However, it's still a concern when it happens because:

1. If it hits the 64000 limit, jujud has to be restarted.

2. Even if it doesn't hit 64000, the controller becomes very unresponsive when the number of open fds is too high.

So it'd be useful to figure out why the sockets aren't being closed, and to have some way to trace those open sockets so that we can get more info when it happens the next time.
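
In the meantime, here is a sketch of what could be captured periodically to see where the growth comes from (assumes ss is available on the controller; the loop and filenames are just illustrative):

```
# Snapshot the agent's sockets and fd count every 10 minutes.
JUJUD_PID=$(pgrep -f 'jujud machine' | head -n1)
while true; do
    ts=$(date +%Y%m%dT%H%M%S)
    sudo ss -tnp | grep "pid=$JUJUD_PID" > "ss-$ts.txt"
    sudo ls /proc/$JUJUD_PID/fd | wc -l > "fdcount-$ts.txt"
    sleep 600
done
```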

Revision history for this message
Joseph Phillips (manadart) wrote :

Can you grep for errors in the controller logs?
I mean on the machines rather than from debug-log.
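
For example, something along these lines on each controller machine (assuming the usual log location):

```
# Recent errors from the machine agent log on the controller itself.
sudo grep ' ERROR ' /var/log/juju/machine-*.log | tail -n 200
```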

Revision history for this message
Ponnuvel Palaniyappan (pponnuvel) wrote :

Attached the controller's log.

Revision history for this message
John A Meinel (jameinel) wrote :

I do believe there were bugs in older versions of Juju where certain types of failures would leave an open socket, which would then leak extra file descriptors over time.
We fixed the ones that we saw in the wild, but it is plausible that there is another.

Of the connections listed, a number are to our Mongo database, such as:
10.aaa.98:42788->10.bbb.215:37017 (ESTABLISHED)

but that only accounts for 566 of them.
The vast majority of the ones in the lsof are connections to the API port (17070): 14250 of them, such as:
10.aaa.98:17070->10.ccc.98:47918 (ESTABLISHED)

In fact, there were only 2 other TCP connections that weren't to port 17070:
10.aaa.98:42368->10.xxx.16:8774 (ESTABLISHED)
10.aaa.98:51798->10.xxx.17:9696 (ESTABLISHED)

Those don't make any particular sense to me, but I'm not particularly worried about them.

I do see a lot of connections that involve Ubuntu FAN addresses, eg:
252.98.0.1:56120->252.98.0.1:17070 (ESTABLISHED)

2152 of them are of the form 252...->252.98.0.1:17070,
another 2152 of them are of the form 252.98.0.1:17070->,
and the other 498 are of the form 252.98.0.1:*->252.144.0.1:17070 or 252.222.0.1:17070.

Those should be connections between the HA controllers over the FAN addresses. (IIRC .0.1 is the FAN address associated with the host machine.)

There are a lot more connections from this machine to 17070 (both to itself and to others) than I would expect, but it may not be wrong (3472 conns of the form .*->.*:17070), such as:
127.0.0.1:48788->127.0.0.1:17070 (ESTABLISHED)
10.aaa.98:40066->10.aaa.98:17070 (ESTABLISHED)
10.aaa.98:39090->10.bbb.215:17070 (ESTABLISHED)
10.aaa.98:36414->10.aaa.144:17070 (ESTABLISHED)

Note that I would expect some of those (potentially 1 per model), but 3000 of them seems a little surprising.
My best guess is that there is some sort of leak where a model's socket gets stuck open and another one gets created for it.
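
For reference, this kind of breakdown can be pulled from an lsof capture with something like the following (a sketch; lsof-jujud-tcp.txt is a placeholder for whatever the attachment was saved as):

```
# Count TCP connections by remote port.
# Assumes the endpoints field is second-to-last, e.g. "10.a.a.98:42788->10.b.b.215:37017 (ESTABLISHED)".
awk '{n = split($(NF-1), ep, ":"); print ep[n]}' lsof-jujud-tcp.txt | sort | uniq -c | sort -rn
```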

Revision history for this message
Ponnuvel Palaniyappan (pponnuvel) wrote (last edit ):

Thanks, jameinel, for taking a look. You seem to agree that there's definitely some issue, just not what it could be. As noted in the description, this wasn't the only instance where it happened.

The user has reported further increases since (there's currently one controller with 24K+ open fds); the one you looked at had only about 15K. I don't have many bright ideas on how to extract relevant info from the user environment (lsof, /proc/<jujud>/fd, juju_metrics, etc. tell us "there's a problem" but not much more). I am currently looking into Delve [0], but I don't know if the user would be willing to run it on a production deployment.

If you have any suggestions on what info would help narrow down the problem, that'd be useful too.
I looked into all the "agent introspection" tools and nothing seems to be particularly well suited to this issue.
Otherwise I'll check if Delve is an option.

[0] https://github.com/go-delve/delve
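
If Delve does turn out to be an option, the rough idea would be to attach to the running agent and look at its goroutines, something like the following (a sketch; needs dlv installed on the controller, and the process is paused while the debugger is attached):

```
# Attach Delve to the running machine agent (pauses the process while attached).
JUJUD_PID=$(pgrep -f 'jujud machine' | head -n1)
sudo dlv attach $JUJUD_PID

# Inside the dlv session:
#   (dlv) goroutines     # list goroutines to see what is holding connections open
#   (dlv) exit           # detach and let jujud resume
```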

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Revision history for this message
Ponnuvel Palaniyappan (pponnuvel) wrote :

FYI, this problem still persists, and only regular reboots help avoid hitting the FD limit.

Changed in juju:
status: Expired → New
Revision history for this message
Joseph Phillips (manadart) wrote :

Is this controller still on 2.9.43?

2.9.44 received a fix that prevents the firewaller restarts indicated by:

2023-03-23 16:40:24 ERROR juju.worker.dependency engine.go:693 "firewaller" manifold worker returned unexpected error: invalid destination for ingress rule: invalid protocol "ipv6-icmp", expected "tcp", "udp", or "icmp"

That's one variable that could be eliminated.

Can you also attach a fresh capture of the controller's log?
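
To confirm whether the firewaller restarts are still happening, something like this on the controller machines would tell us (assuming the usual log location):

```
# Look for the ingress-rule error that bounces the firewaller worker.
sudo grep 'invalid protocol "ipv6-icmp"' /var/log/juju/machine-*.log | tail -n 20
```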

Changed in juju:
status: New → Incomplete
Revision history for this message
Ponnuvel Palaniyappan (pponnuvel) wrote :

Thanks, Jo, for looking into this.

I'll get the controllers upgraded to 2.9.45 and get fresh logs.

Revision history for this message
Ponnuvel Palaniyappan (pponnuvel) wrote :

The environment has been updated to 2.9.45 and so far there have been no issues. But it's only been 3 days; I'll follow up in a couple of weeks' time if this recurs.

Revision history for this message
Benjamin Allot (ballot) wrote :

FWIW, we saw this on a 3.1.6 controller:

```
ubuntu@juju-f90a8b-controller-2:~$ /var/lib/juju/tools/machine-2/jujud --version
3.1.6-ubuntu-amd64
```

The logs were showing entries like this:
```
2024-01-03 10:15:36 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2024-01-03 10:15:37 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2024-01-03 10:15:38 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2024-01-03 10:15:39 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
2024-01-03 10:15:40 WARNING juju.worker.httpserver log.go:194 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s
```

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Junien F (axino)
Changed in juju:
status: Expired → New
Revision history for this message
Joseph Phillips (manadart) wrote :

Can we get an update on where this is being observed today? Happy to dig into any introspection we have for the Prodstack controllers.

Changed in juju:
status: New → Incomplete