juju API can leak connections

Bug #1815003 reported by John A Meinel
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

This might only be a bug in our metrics gathering.

I was trying to do some scale testing with a client that just connects and plays with leadership.

I started running into problems on just 1 of the 3 controllers, which kept getting:
2019-02-07 06:00:18 WARNING juju.worker.httpserver log.go:172 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 1s

I kept killing processes, trying to reduce the number of connections on that controller. It seemed that after I killed some processes, some of the others would wake up and keep trying to connect (apparently they hadn't managed to establish all the connections they wanted, or maybe my retry logic was wrong).
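
For reference, the churn pattern was roughly the sketch below. This is a simplified stand-in using only the Go standard library, not the real test client (which spoke the Juju API and exercised leadership); the address and connection count are placeholders.

// connchurn: hypothetical stand-in for the scale-test client described above.
// It opens a pile of TLS connections to a controller's API port and holds them
// until the process is killed, which is the pattern that left the controller
// holding sockets it never cleaned up.
package main

import (
    "crypto/tls"
    "fmt"
    "time"
)

func main() {
    const addr = "10.0.0.1:17070" // placeholder controller address
    var conns []*tls.Conn
    for i := 0; i < 500; i++ { // connection count is arbitrary
        // Test controllers use self-signed certificates, hence InsecureSkipVerify.
        c, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
        if err != nil {
            fmt.Println("dial failed:", err)
            time.Sleep(time.Second) // crude retry; easy to get wrong, as noted above
            continue
        }
        conns = append(conns, c)
    }
    fmt.Printf("holding %d connections; killall me now\n", len(conns))
    select {} // hold everything open until the process is killed
}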

However, once I ran 'killall', 'juju_metrics' was still wrong on the various machines:
machine/0:
juju_apiserver_connections{endpoint="api"} 7
juju_apiserver_connections{endpoint="logsink"} 2
machine/1:
juju_apiserver_connections{endpoint="api"} 11737
juju_apiserver_connections{endpoint="logsink"} 1
machine/2:
juju_apiserver_connections{endpoint="api"} 7
juju_apiserver_connections{endpoint="logsink"} 1

So you can see that machines 0 and 2 have gone back to a sane number of connections, but machine 1 still thinks it has too many. That actually appears to be true:
lsof -i -p <JUJUPID> | vim -R -

shows 11833 lines (in the same ballpark as the connection count above).
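
For context on why the gauge can stay that high: a connection gauge like juju_apiserver_connections is normally incremented when a connection is accepted and decremented when it is torn down, so any connection whose server side is never closed keeps the count inflated. A minimal sketch of that accounting pattern (an assumption about the shape of the code, not the actual Juju apiserver):

// Illustrative sketch only; names, labels, and structure are assumptions.
package apiserver

import (
    "net"

    "github.com/prometheus/client_golang/prometheus"
)

// apiConnections mirrors the shape of the juju_apiserver_connections metric.
var apiConnections = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "juju_apiserver_connections",
        Help: "Number of active API connections.",
    },
    []string{"endpoint"},
)

func init() {
    prometheus.MustRegister(apiConnections)
}

// handle serves one accepted connection. If serve blocks forever and never
// notices the peer has gone away, handle never returns, so neither the
// deferred Dec nor the Close runs: the gauge stays high and the socket is
// left in CLOSE_WAIT.
func handle(conn net.Conn, endpoint string, serve func(net.Conn)) {
    apiConnections.WithLabelValues(endpoint).Inc()
    defer apiConnections.WithLabelValues(endpoint).Dec()
    defer conn.Close()

    serve(conn)
}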

I see quite a number of them with:
jujud 1476 root *923u IPv6 22609791 0t0 TCP juju-6ef06f-1.lxd:17070->bio:53564 (CLOSE_WAIT)

"%s/CLOSE_WAIT//n" says 2670 lines have CLOSE_WAIT
I see only 69 with ESTABLISHED
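
A quick note on what CLOSE_WAIT means here: the peer has already sent its FIN and the kernel is waiting for our process to close() its end, so both the socket and its file descriptor stay pinned until the application closes the net.Conn. A generic reproduction of that pattern (not the Juju apiserver code) looks like:

// closewait: demonstrates how a server accumulates CLOSE_WAIT sockets when a
// handler stops reading after the peer disconnects but never closes its side.
package main

import (
    "fmt"
    "net"
)

func main() {
    ln, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        panic(err)
    }
    fmt.Println("listening on", ln.Addr())

    for {
        conn, err := ln.Accept()
        if err != nil {
            panic(err)
        }
        go func(c net.Conn) {
            buf := make([]byte, 4096)
            for {
                if _, err := c.Read(buf); err != nil {
                    // Deliberate bug: on EOF we return without c.Close().
                    // The kernel has seen the peer's FIN, so the socket sits
                    // in CLOSE_WAIT -- exactly what lsof/netstat show above --
                    // and the file descriptor leaks until the process exits.
                    return
                }
            }
        }(conn)
    }
}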

I see a lot (9062) that just look like this:
jujud 1476 root 50u sock 0,9 0t0 22082587 protocol: TCPv6

Doing a google of: lsof "protocol: TCPv6" turned up these as the top hits:
https://github.com/hashicorp/vault/issues/3244
https://github.com/hashicorp/consul/issues/2297

so maybe one of the hashicorp libraries is creating these sockets, and we're doing something wrong that causes us to leak them?

I'll grab an engine report and the machine's logs. One possibility is that this is what happens when we hit a raft timeout.

John A Meinel (jameinel) wrote:

netstat has a different view. It lists 2649 sockets, and 65 of them are under:
Active UNIX domain sockets (w/o servers)
...
unix 3 [ ] STREAM CONNECTED 22022469 /run/systemd/journal/stdout
unix 3 [ ] STREAM CONNECTED 22025835
unix 3 [ ] DGRAM 22022209
unix 3 [ ] STREAM CONNECTED 22027567
unix 3 [ ] DGRAM 22022210
unix 3 [ ] STREAM CONNECTED 22022016
unix 3 [ ] STREAM CONNECTED 22022934 /run/systemd/journal/stdout
...

The inodes look suspiciously similar. netstat shows 69 ESTABLISHED and 2510 CLOSE_WAIT.

John A Meinel (jameinel) wrote:

All of the CLOSE_WAIT entries in netstat are for port 17070.

Tim Penhey (thumper) wrote:

I think TCP keepalives would help with this.
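
A sketch of what that could look like on a plain TCP listener: with keepalives enabled, a peer that vanished without a clean shutdown eventually fails a probe, the next Read on the server side returns an error, and the handler gets a chance to close its end instead of holding the descriptor forever. This is illustrative only, not the actual juju.worker.httpserver change; the 30s period is arbitrary.

package main

import (
    "log"
    "net"
    "time"
)

// keepAliveListener enables TCP keepalives on every accepted connection.
type keepAliveListener struct {
    net.Listener
    period time.Duration
}

func (l keepAliveListener) Accept() (net.Conn, error) {
    conn, err := l.Listener.Accept()
    if err != nil {
        return nil, err
    }
    if tc, ok := conn.(*net.TCPConn); ok {
        if err := tc.SetKeepAlive(true); err != nil {
            conn.Close()
            return nil, err
        }
        if err := tc.SetKeepAlivePeriod(l.period); err != nil {
            conn.Close()
            return nil, err
        }
    }
    return conn, nil
}

func main() {
    ln, err := net.Listen("tcp", ":17070") // port matches the bug; the rest is illustrative
    if err != nil {
        log.Fatal(err)
    }
    kal := keepAliveListener{Listener: ln, period: 30 * time.Second}
    for {
        conn, err := kal.Accept()
        if err != nil {
            log.Fatal(err)
        }
        go func(c net.Conn) {
            defer c.Close() // runs once Read errors out, including after a failed keepalive
            buf := make([]byte, 4096)
            for {
                if _, err := c.Read(buf); err != nil {
                    return
                }
            }
        }(conn)
    }
}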

Canonical Juju QA Bot (juju-qa-bot) wrote:

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot