Comment 6 for bug 2025160

John A Meinel (jameinel) wrote :

I believe older versions of Juju had bugs where certain types of failures would leave a socket open, which would then leak file descriptors over time.
We fixed the ones that we saw in the wild, but it is plausible that there is another.

Of the connections listed, a number are to our Mongo database, such as:
10.aaa.98:42788->10.bbb.215:37017 (ESTABLISHED)

but that only accounts for 566 of them.
The vast majority of the entries in the lsof output, 14250 of them, are connections to the API port (17070), such as:
10.aaa.98:17070->10.ccc.98:47918 (ESTABLISHED)

In fact, only 2 other TCP connections were not on port 17070:
10.aaa.98:42368->10.xxx.16:8774 (ESTABLISHED)
10.aaa.98:51798->10.xxx.17:9696 (ESTABLISHED)

Those don't make any particular sense to me, but I'm not worried about them.
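For anyone who wants to reproduce this kind of breakdown, here is a rough sketch of how the lsof lines could be bucketed by port. The function names and category labels are my own for illustration, not anything from Juju. The addresses are matched loosely because the ones quoted here are redacted.

```python
import re
from collections import Counter

# Matches the connection field lsof prints for a TCP socket, e.g.
# "10.aaa.98:42788->10.bbb.215:37017 (ESTABLISHED)".
# Addresses are matched loosely because the ones in this report are redacted.
CONN_RE = re.compile(r"(\S+):(\d+)->(\S+):(\d+)\s+\((\w+)\)")

MONGO_PORT = 37017  # Mongo port seen in the lsof output
API_PORT = 17070    # Juju API port


def classify(line):
    """Return a category for one lsof line, or None if it has no local->remote pair."""
    m = CONN_RE.search(line)
    if m is None:
        return None
    local_port, remote_port = int(m.group(2)), int(m.group(4))
    if remote_port == MONGO_PORT:
        return "mongo"
    if local_port == API_PORT:
        return "api-inbound"   # a client connected to this controller
    if remote_port == API_PORT:
        return "api-outbound"  # this machine connected to an API server
    return "other"


def tally(lines):
    """Count connections per category, skipping lines that do not parse."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

Something like `lsof -nP -a -p <pid> -i TCP` piped into `tally(sys.stdin)` should give the per-category counts, assuming the output has the same shape as the lines quoted in this comment.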

I do see a lot of connections that involve Ubuntu FAN addresses, e.g.:
252.98.0.1:56120->252.98.0.1:17070 (ESTABLISHED)

2152 of them are of the form 252...->252.98.0.1:17070
another 2152 of them are of the form:
252.98.0.1:17070->

and the other 498 are of the form
252.98.0.1:*->252.144.0.1:17070 or 252.222.0.1:17070

Those should be connections between the HA controllers on their FAN addresses. (IIRC .0.1 is the FAN address associated with the host machine.)

There are a lot more connections from this machine to port 17070 (both to itself and to other machines) than I would expect, though that may not be wrong. There are 3472 connections of the form:
.*->.*:17070
such as:
127.0.0.1:48788->127.0.0.1:17070 (ESTABLISHED)
10.aaa.98:40066->10.aaa.98:17070 (ESTABLISHED)
10.aaa.98:39090->10.bbb.215:17070 (ESTABLISHED)
10.aaa.98:36414->10.aaa.144:17070 (ESTABLISHED)

Note that I would expect some of those (potentially 1 per model), but over 3000 of them seems a little surprising.
My best guess is that there is some sort of leak where a model's socket gets stuck open and another one gets created for it.
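
One way to check that guess would be to count the outbound connections to 17070 per destination and compare each total against the number of models. A rough sketch follows. The function name is hypothetical, and the "1 per model" baseline is only my expectation, not something I have verified.

```python
import re
from collections import Counter

# Outbound connections to the API port, i.e. lines of the form
# ".*->.*:17070 (ESTABLISHED)". Captures the destination address.
OUTBOUND_API_RE = re.compile(r"\S+:\d+->(\S+):17070\s+\(ESTABLISHED\)")


def outbound_api_by_destination(lines):
    """Count established outbound connections to port 17070 per destination."""
    counts = Counter()
    for line in lines:
        m = OUTBOUND_API_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

If one destination has far more connections than there are models, that would point at sockets being replaced without the old ones being closed.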