Canonical Juju

juju API can leak connections

Bug #1815003 reported by John A Meinel on 2019-02-07

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Triaged	Low	Unassigned

Bug Description

This might only be a bug in our metrics gathering.

I was trying to do some scale testing, with a client that just connects and plays with leadership.

I started running into problems on only 1 controller out of all 3 controllers that was getting:
2019-02-07 06:00:18 WARNING juju.worker.httpserver log.go:172 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

I kept killing processes, trying to reduce the number of connections on that controller. It seemed that after killing some processes, some of the other ones would wake up and keep trying to connect (apparently they hadn't been able to establish all the connections they wanted, or maybe my retry logic was wrong).

However, once I did 'killall', then 'juju_metrics' was wrong on the various machines:
machine/0:
juju_apiserver_connections{endpoint="api"} 7
juju_apiserver_connections{endpoint="logsink"} 2
machine/1:
juju_apiserver_connections{endpoint="api"} 11737
juju_apiserver_connections{endpoint="logsink"} 1
machine/2:
juju_apiserver_connections{endpoint="api"} 7
juju_apiserver_connections{endpoint="logsink"} 1

so you can see that machines 0 and 2 have gone back to a sane number of connections, but machine 1 still thinks it has too many.
Which actually appears to be true:
lsof -i -p <JUJUPID> | vim -R -

shows 11833 lines (which is ballpark the same as the above connection count).

I see quite a number of them with:
jujud 1476 root *923u IPv6 22609791 0t0 TCP juju-6ef06f-1.lxd:17070->bio:53564 (CLOSE_WAIT)

"%s/CLOSE_WAIT//n" says 2670 lines have CLOSE_WAIT
I see only 69 with ESTABLISHED

I see a lot (9062) that just look like this:
jujud 1476 root 50u sock 0,9 0t0 22082587 protocol: TCPv6

Doing a google of: lsof "protocol: TCPv6" had this as a top hit:
https://github.com/hashicorp/vault/issues/3244
https://github.com/hashicorp/consul/issues/2297

so maybe something about one of the hashicorp libraries is creating these sockets, and we're doing something wrong causing us to leak them?

I'll grab an engine report and the machine's logs. One possibility is it is what happens if we get a raft timeout.

Tags: