Comment 6 for bug 1739380

John A Meinel (jameinel) wrote :

For the file handle issues, I dug through the log file from bug #1742604.

The controller was happily started here:
2018-01-11 02:25:53 INFO juju.cmd supercommand.go:63 running jujud [2.2.9 gc go1.8]

Things generally seem happy, with model workers coming up and connecting correctly.

Up until we get to here:

2018-01-11 02:26:13 INFO juju.apiserver.common password.go:105 setting password for "machine-1"
2018-01-11 02:26:19 INFO juju.apiserver.connection request_notifier.go:80 agent login: machine-1 for dd562d7d-512e-4241-83a1-3a64cd84439b
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46598->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46592->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:20 INFO juju.state.pool.txnwatcher txnwatcher.go:135 watcher loop failed: watcher iteration error: Closed explicitly

The TXN watcher suddenly gets a TCP read failure while trying to read from mongo, which causes its connection to be reset and flagged as "Closed explicitly".

Now, I don't know if that is mongo closing the connection, something we did on our end, or something else.

But after that, we get a *flurry* of:
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "txnlog", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "txnlog": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "leadership", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "leadership": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "leadership" in 1s
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "leadership", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "leadership": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "leadership" in 1s
2018-01-11 02:26:20 INFO juju.state.presence presence.go:186 watcher loop failed: Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "presence", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "presence": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "presence" in 1s
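
For context, the shape of what is happening here is a supervisor loop that restarts each worker after a fixed delay whenever it dies. A rough sketch (not juju's actual runner code, and with made-up names) looks like:

package sketch

import (
	"log"
	"time"
)

// Rough sketch of a supervisor loop, not juju's actual runner code.
// startWorker stands in for whatever builds and runs the worker; the 1s
// delay matches the "restarting ... in 1s" lines above.
func supervise(name string, startWorker func() error) {
	const restartDelay = time.Second
	for {
		err := startWorker()
		log.Printf("stopped %q, err: %v", name, err)
		log.Printf("restarting %q in %s", name, restartDelay)
		time.Sleep(restartDelay)
	}
}

Since all of these workers failed with the same error at the same instant, a single broken mongo connection shows up as this burst of restarts across "txnlog", "leadership", and "presence".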

That flurry of stops and restarts proceeds onward until we get to:
2018-01-11 02:26:21 INFO juju.worker runner.go:477 start "txnlog"
2018-01-11 02:26:24 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:38931->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:24 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:38909->10.25.2.109:37017: read: connection reset by peer
...
and
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
...
2018-01-11 02:26:26 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:35077->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:26 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:35069->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:26 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:35063->10.25.2.109:37017: read: connection reset by peer

So at this point we're spinning on:
2018-01-11 02:26:30 WARNING juju.mongo open.go:165 TLS handshake failed: EOF
2018-01-11 02:26:30 WARNING juju.mongo open.go:165 TLS handshake failed: EOF

I wonder if something like a Mongo re-election happened while we were coming up, leaving us spinning as we try to reconnect to mongo.
That makes me wonder whether we are leaking a file descriptor every time TLS fails to negotiate.
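
To illustrate the sort of leak I mean, here is a hypothetical sketch (not juju's actual mongo dialing code; dialMongo, addr and tlsConfig are made-up names). If the TCP dial succeeds but the TLS handshake fails, the caller still has to close the raw connection, otherwise every failed handshake costs a file descriptor:

package sketch

import (
	"crypto/tls"
	"net"
	"time"
)

// Hypothetical sketch, not juju's actual dialing code: dial mongo over TCP,
// then negotiate TLS on top of the raw connection.
func dialMongo(addr string, tlsConfig *tls.Config) (net.Conn, error) {
	rawConn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err // the dial itself failed, nothing to leak
	}
	tlsConn := tls.Client(rawConn, tlsConfig)
	if err := tlsConn.Handshake(); err != nil {
		// Without this Close, each "TLS handshake failed" warning would
		// leave the underlying socket (and its file descriptor) open.
		rawConn.Close()
		return nil, err
	}
	return tlsConn, nil
}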

After a bit more time, something similar seems to start happening again:
2018-01-11 02:38:34 INFO juju.worker runner.go:477 start "singular"
2018-01-11 02:38:34 WARNING juju.apiserver log.go:168 http: TLS handshake error from 185.125.191.18:49276: EOF
2018-01-11 02:38:35 WARNING juju.apiserver log.go:168 http: TLS handshake error from 91.189.90.53:36930: EOF
2018-01-11 02:38:35 WARNING juju.apiserver log.go:168 http: TLS handshake error from 91.189.90.53:36939: EOF
2018-01-11 02:38:35 WARNING juju.apiserver log.go:168 http: TLS handshake error from 162.213.33.61:51336: EOF

These actually appear to be EOF errors from clients trying to connect to us (things like 91.189.90.53 would be an external 'juju' client, most likely).
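
For what it's worth, that exact message is what Go's net/http server logs whenever a client opens a TCP connection and then drops it before completing the handshake. A trivial standalone sketch (the address is just for illustration) reproduces it against any TLS listener:

package main

import "net"

// Standalone sketch: connect to the API port and close straight away,
// without ever starting a TLS handshake. The server side then logs
// "http: TLS handshake error from <addr>: EOF".
func main() {
	conn, err := net.Dial("tcp", "10.25.2.111:17070")
	if err != nil {
		return
	}
	conn.Close()
}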

And then while that is going on, we end up with:
2018-01-11 02:39:13 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2018-01-11 02:39:14 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

That seems to be more a case of "we can no longer accept new connections to the controller, because we are out of file handles", but given that it is happening while we're getting those EOFs, I can suppose that:

a) we might be leaking file handles if we get a TLS handshake error.
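
To confirm that the controller really is running out of descriptors (and to watch the count climb), something along these lines would do; this is a diagnostic sketch, not part of juju, and you would swap /proc/self/fd for /proc/<pid>/fd of jujud:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// Diagnostic sketch, not part of juju: report how many file descriptors
// this process currently has open versus its RLIMIT_NOFILE soft limit.
func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("open fds: %d of %d allowed\n", len(fds), rl.Cur)
}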