"too many open files" during upgrade

Bug #1739380 reported by Junien Fridrick on 2017-12-20
This bug affects 2 people
Affects     Importance   Assigned to
juju        High         Unassigned
juju 2.3    High         Unassigned
juju 2.4    High         Unassigned

Bug Description

Hi,

We upgraded an HA controller (3 machines) from 2.2.6 to 2.2.8 today.

The upgrade apparently got blocked because a machine agent was hitting its open file descriptor limit (see logs at the bottom). These log messages repeated until we restarted the agent.

The max fd limit was 20k, and the open files were connections to mongodb.

Stats for this controller :
applications: 923
machines: 558
models: 175
units: 2449

Given these numbers, I don't believe we should ever have 20k TCP connections open to mongodb at once.

Once this machine agent was restarted, the upgrade unblocked and proceeded without issue.

Thanks

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 20ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 40ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:36216: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:50668: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:51260: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:33982: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:03 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

John A Meinel (jameinel) wrote :

Indeed, if there are 20k file handles open to Mongo, something very odd is going on internally. The mgo socket pool that is meant to keep a certain number of connections open is limited to 4000 connections.

I can imagine that there are ways to create more than one pool, but we *shouldn't* inside a single controller.
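
For context, a minimal sketch of how such a cap is typically applied with gopkg.in/mgo.v2 (the 4000 figure is taken from the comment above; the function name and dial details are illustrative, not Juju's actual code):

  package example

  import (
      "time"

      mgo "gopkg.in/mgo.v2"
  )

  // dialMongo is illustrative only: it caps the driver's socket pool.
  func dialMongo(addr string) (*mgo.Session, error) {
      session, err := mgo.DialWithTimeout(addr, 30*time.Second)
      if err != nil {
          return nil, err
      }
      // SetPoolLimit caps how many sockets this session's pool will keep in
      // use per server. If more than one pool is created inside a single
      // controller, each pool gets its own cap, which is one way the total
      // could exceed 4000.
      session.SetPoolLimit(4000)
      return session, nil
  }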

Changed in juju:
importance: Undecided → High
status: New → Triaged
Tim Penhey (thumper) on 2018-01-02
tags: added: upgrade-juju
Tim Penhey (thumper) wrote :

I did a quick search of the state package on the 2.2 branch. The following functions aren't closing the iterator (see the sketch of the close pattern after the list):

state/clouds.go:

func (st *State) Clouds()

state/cloudcredentials.go

func (st *State) CloudCredentials(user names.UserTag, cloudName string)

state/lease/client.go

func (client *client) readEntries // doesn't close iterator if leaseDoc.entry() returns an error

state/prune.go

The deleteInBatches function doesn't close the iterator, nor does either of its call sites.

state/upgrades.go

func updateLegacyLXDCredentialsOps(...)
func UpgradeNoProxyDefaults(st *State) error
func RemoveNilValueApplicationSettings(st *State) error
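
For reference, a minimal sketch of the fix pattern, assuming gopkg.in/mgo.v2 (the collection, document, and function names below are illustrative, not the actual Juju code):

  package example

  import (
      "fmt"

      mgo "gopkg.in/mgo.v2"
  )

  type cloudDoc struct {
      Name string `bson:"name"`
  }

  func listClouds(coll *mgo.Collection) ([]string, error) {
      iter := coll.Find(nil).Iter()
      var names []string
      var doc cloudDoc
      for iter.Next(&doc) {
          names = append(names, doc.Name)
      }
      // Close returns the iterator's socket to the pool. Skipping it on any
      // return path (for example an early error return, as in readEntries
      // above) leaks the underlying mongodb connection.
      if err := iter.Close(); err != nil {
          return nil, fmt.Errorf("reading clouds: %v", err)
      }
      return names, nil
  }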

Tim Penhey (thumper) wrote :

I'm guessing, quite wildly I must add, that during the upgrade process we get the credentials for every model for potential model upgrades.

However, all of these are worrying and will cause problems.

The pruner not closing its iterator will also leak over time, causing lost file handles.

Tim Penhey (thumper) on 2018-01-07
Changed in juju:
milestone: none → 2.4-beta1
Tim Penhey (thumper) wrote :

Unfortunately this also happened with 2.2.9, so it wasn't the iterators; it is something else.

John A Meinel (jameinel) wrote :

For the file handle issues, I dug through the log file from bug #1742604.

The controller was happily started here:
2018-01-11 02:25:53 INFO juju.cmd supercommand.go:63 running jujud [2.2.9 gc go1.8]

Things generally seem happy, with model workers coming up and connecting correctly.

Up until we get to here:

2018-01-11 02:26:13 INFO juju.apiserver.common password.go:105 setting password for "machine-1"
2018-01-11 02:26:19 INFO juju.apiserver.connection request_notifier.go:80 agent login: machine-1 for dd562d7d-512e-4241-83a1-3a64cd84439b
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46598->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46592->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:20 INFO juju.state.pool.txnwatcher txnwatcher.go:135 watcher loop failed: watcher iteration error: Closed explicitly

The TXN watcher suddenly gets a TCP read failure trying to read from mongo, which causes its connection to be reset and flagged as "Closed explicitly".

Now, I don't know if that is mongo closing the connection, something we did on our end, or something else.

But after that, we get a *flurry* of:
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "txnlog", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "txnlog": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "leadership", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "leadership": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "leadership" in 1s
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "leadership", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "leadership": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "leadership" in 1s
2018-01-11 02:26:20 INFO juju.state.presence presence.go:186 watcher loop failed: Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "presence", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "presence": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "presence" in 1s

Which then proceeds onward until we get to:
2018-01-11 02:26:21 INFO juju.worker runner.go:477 start "txnlog"
2018-01-11 02:26:24 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:38931->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:24 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:38909->10.25.2.109:37017: read: connection reset by peer
...
and
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed...


John A Meinel (jameinel) wrote :

As a summary comment, my first guess is that if we get an error during TLS negotiation, we are failing to close the underlying socket.

John A Meinel (jameinel) wrote :

Looking at the code to dial Mongo, I think it is correct:
  c, err := net.DialTimeout("tcp", addr, opts.Timeout)
  if err != nil {
      logger.Warningf("mongodb connection failed, will retry: %v", err)
      return nil, err
  }
  if tlsConfig != nil {
      cc := tls.Client(c, tlsConfig)
      if err := cc.Handshake(); err != nil {
          logger.Warningf("TLS handshake failed: %v", err)
          if err := c.Close(); err != nil {
              logger.Warningf("failed to close connection: %v", err)
          }
          return nil, err
      }
      c = cc
  }

The only question is whether, if DialTimeout returns an error, you would ever need to close the connection; I would assume that 'c' is nil at that point.
And the error we're seeing in the log file is:
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46598->10.25.2.109:37017: read: connection reset by peer

"handshake failed"

looks like it should be closing the file handle correctly.

The other error that we are seeing, though, is "TLS handshake error" coming from the http package, which is more the API server layer than the Mongo layer.

I haven't tracked down where those are actually coming from. What is really weird is that the message says it is coming from "log.go line 168", but we don't have a 'log.go' file in the Juju code base.

I do see several places that are creating an "apiserver" logger:
./juju/apiserver/apiserver.go:48:var logger = loggo.GetLogger("juju.apiserver")
./juju/cmd/jujud/agent/machine.go:1509: logger := loggo.GetLogger("juju.apiserver")

And the issue with the latter is that it is part of the observer factories, and thus is likely to be passed to something else.

I haven't found any obvious log.go files that are using a logger object on line 168.
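
One way such an attribution could arise (a sketch under assumptions; the adapter type and wiring below are hypothetical, not confirmed to be Juju's code): net/http writes its "Accept error" and "TLS handshake error" messages through Server.ErrorLog, and if that *log.Logger is built over an io.Writer adapter forwarding to loggo, loggo reports the adapter's own file and line as the call site, so every message shows the same location, e.g. a file named log.go:

  package example

  import (
      "log"
      "net/http"

      "github.com/juju/loggo"
  )

  // loggoWriter forwards the standard library logger's output to loggo.
  type loggoWriter struct {
      logger loggo.Logger
  }

  // Write is the call site loggo sees, so the file:line it reports is this
  // adapter's own file, not a location inside net/http.
  func (w *loggoWriter) Write(p []byte) (int, error) {
      w.logger.Warningf("%s", p)
      return len(p), nil
  }

  func newAPIServer(handler http.Handler) *http.Server {
      return &http.Server{
          Addr:    ":17070",
          Handler: handler,
          // net/http writes "Accept error" and "TLS handshake error"
          // messages through ErrorLog when it is set.
          ErrorLog: log.New(&loggoWriter{logger: loggo.GetLogger("juju.apiserver")}, "", 0),
      }
  }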

Changed in juju:
milestone: 2.4-beta1 → none
Tim Penhey (thumper) wrote :

We haven't heard of this problem for a while; it is possible that it has been fixed inadvertently.

no longer affects: juju/2.2
Changed in juju:
status: Triaged → Incomplete
Haw Loeung (hloeung) wrote :

Seeing this with a 2.3.8 to 2.4.3 upgrade: https://pastebin.canonical.com/p/nDJsp6gt66/

Changed in juju:
status: Incomplete → Confirmed
Changed in juju:
status: Confirmed → Triaged