juju API can leak connections
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Low | Unassigned |
Bug Description
This might only be a bug in our metrics gathering.
I was trying to do some scale testing, with a client that just connects and plays with leadership.
I started running into problems on only 1 of the 3 controllers, which was logging:
2019-02-07 06:00:18 WARNING juju.worker.
I kept killing processes, trying to reduce the number of connections on that controller. It seemed that after killing some processes, some of the other ones would wake up and keep trying to connect (apparently they hadn't been able to establish all the connections they wanted, or maybe my retry logic was wrong).
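To make that retry-logic suspicion concrete, here is a minimal Go sketch (not the actual test client; the address and the login step are placeholder assumptions) of how a dial-and-retry loop that forgets to close a partially-established connection would leave one more open socket on the controller per failed attempt:

package main

import (
	"crypto/tls"
	"fmt"
	"time"
)

// dialWithRetry keeps retrying until it gets a working connection or runs
// out of attempts. The bug being illustrated: on a failed login the
// already-established TCP connection is never closed, so each retry leaves
// one more open socket on the controller.
func dialWithRetry(addr string, attempts int) (*tls.Conn, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
		if err != nil {
			lastErr = err
			time.Sleep(time.Second)
			continue
		}
		if err := login(conn); err != nil {
			// Missing conn.Close() here is the leak: the controller keeps
			// this connection open (or in CLOSE_WAIT once this process dies).
			lastErr = err
			time.Sleep(time.Second)
			continue
		}
		return conn, nil
	}
	return nil, lastErr
}

// login stands in for whatever API login/leadership traffic the test client
// sends after connecting; it is purely illustrative.
func login(conn *tls.Conn) error {
	_, err := fmt.Fprintln(conn, "login")
	return err
}

func main() {
	conn, err := dialWithRetry("controller.example:17070", 10)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}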
However, once I did 'killall', 'juju_metrics' was wrong on the various machines:
machine/0:
juju_apiserver_
juju_apiserver_
machine/1:
juju_apiserver_
juju_apiserver_
machine/2:
juju_apiserver_
juju_apiserver_
So you can see that machines 0 and 2 have gone back to a sane number of connections, but machine 1 still thinks it has too many.
Which actually appears to be true:
lsof -i -p <JUJUPID> | vim -R -
shows 11833 lines (which is in the same ballpark as the connection count above).
I see quite a number of them with:
jujud 1476 root *923u IPv6 22609791 0t0 TCP juju-6ef06f-
"%s/CLOSE_WAIT//n" says 2670 lines have CLOSE_WAIT
I see only 69 with ESTABLISHED
I see a lot (9062) that just look like this:
jujud 1476 root 50u sock 0,9 0t0 22082587 protocol: TCPv6
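For reference, a small Go sketch of how these counts can be tallied automatically (this helper is hypothetical, not something shipped with juju); it just reads the lsof output on stdin and counts the three categories mentioned above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Reads `lsof -p <JUJUPID>` output on stdin and tallies the socket states
// discussed above. Usage:  lsof -p <JUJUPID> | go run countstates.go
func main() {
	counts := map[string]int{}
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.Contains(line, "ESTABLISHED"):
			counts["ESTABLISHED"]++
		case strings.Contains(line, "CLOSE_WAIT"):
			counts["CLOSE_WAIT"]++
		case strings.Contains(line, "protocol: TCPv6"):
			counts["protocol: TCPv6"]++
		}
	}
	for state, n := range counts {
		fmt.Printf("%-16s %d\n", state, n)
	}
}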
Googling for: lsof "protocol: TCPv6" turned up these as the top hits:
https:/
https:/
So maybe one of the hashicorp libraries is creating these sockets, and we're doing something wrong that causes us to leak them?
I'll grab an engine report and the machine's logs. One possibility is that this is what happens when we get a raft timeout.
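For background on why CLOSE_WAIT piles up: once the peer (here, one of the killed client processes) closes its end, the kernel keeps the server side in CLOSE_WAIT until the owning process calls Close(). A minimal Go sketch of that failure mode (illustrative only, not jujud's actual accept loop; the port is just for illustration):

package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":17070")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			// Drain until the client closes or is killed.
			io.Copy(io.Discard, c)
			// Missing c.Close() here: the socket stays in CLOSE_WAIT and
			// the file descriptor is never released, matching what lsof
			// shows on machine 1.
		}(conn)
	}
}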
netstat has a different view. It lists 2649 sockets, and 65 of them are under journal/stdout:
Active UNIX domain sockets (w/o servers)
...
unix 3 [ ] STREAM CONNECTED 22022469 /run/systemd/
unix 3 [ ] STREAM CONNECTED 22025835
unix 3 [ ] DGRAM 22022209
unix 3 [ ] STREAM CONNECTED 22027567
unix 3 [ ] DGRAM 22022210
unix 3 [ ] STREAM CONNECTED 22022016
unix 3 [ ] STREAM CONNECTED 22022934 /run/systemd/
...
The inodes look suspiciously similar.
This shows 69 ESTABLISHED and 2510 CLOSE_WAIT.