"too many open files" during upgrade

Bug #1739380 reported by Junien Fridrick on 2017-12-20
This bug affects 2 people
Affects     Importance   Assigned to
juju        High         Unassigned
juju 2.3    High         Unassigned
juju 2.4    High         Unassigned

Bug Description

Hi,

We upgraded an HA controller (3 machines) from 2.2.6 to 2.2.8 today.

The upgrade apparently got blocked because a machine agent was hitting its open file descriptor limit (see logs at the bottom). These log messages repeated until we restarted the agent.

The max fd limit was 20k, and the open files were connections to mongodb.

Stats for this controller :
applications: 923
machines: 558
models: 175
units: 2449

Given these numbers, I don't believe we should ever have 20k TCP connections open to mongodb at once.

Once this machine agent was restarted, the upgrade unblocked and proceeded without issue.

Thanks

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 20ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 40ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:36216: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:50668: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:51260: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:33982: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:03 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

John A Meinel (jameinel) wrote :

Indeed, if there are 20k file handles open to Mongo, something very odd is going on internally. The mgo socket pool that is meant to keep a certain number of connections open is limited to 4000 connections.

I can imagine that there are ways to create more than one pool, but we *shouldn't* inside a single controller.
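
For context, a minimal sketch of how such a cap is typically applied with gopkg.in/mgo.v2 (the 4000 figure is taken from the comment above; the function name and dial details are illustrative, not Juju's actual code):

  package example

  import (
      "time"

      mgo "gopkg.in/mgo.v2"
  )

  // dialMongo is illustrative only: it caps the driver's socket pool.
  func dialMongo(addr string) (*mgo.Session, error) {
      session, err := mgo.DialWithTimeout(addr, 30*time.Second)
      if err != nil {
          return nil, err
      }
      // SetPoolLimit caps how many sockets this session's pool will keep in
      // use per server. If more than one pool is created inside a single
      // controller, each pool gets its own cap, which is one way the total
      // could exceed 4000.
      session.SetPoolLimit(4000)
      return session, nil
  }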

Changed in juju:
importance: Undecided → High
status: New → Triaged
Tim Penhey (thumper) on 2018-01-02
tags: added: upgrade-juju
Tim Penhey (thumper) wrote :

I did a quick search of the state package on the 2.2 branch. The following functions aren't closing the iterator (see the sketch of the close pattern after the list):

state/clouds.go:

func (st *State) Clouds()

state/cloudcredentials.go

func (st *State) CloudCredentials(user names.UserTag, cloudName string)

state/lease/client.go

func (client *client) readEntries // doesn't close iterator if leaseDoc.entry() returns an error

state/prune.go

The deleteInBatches function doesn't close the iterator, nor does either of its call sites.

state/upgrades.go

func updateLegacyLXDCredentialsOps(...)
func UpgradeNoProxyDefaults(st *State) error
func RemoveNilValueApplicationSettings(st *State) error
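
For reference, a minimal sketch of the fix pattern, assuming gopkg.in/mgo.v2 (the collection, document, and function names below are illustrative, not the actual Juju code):

  package example

  import (
      "fmt"

      mgo "gopkg.in/mgo.v2"
  )

  type cloudDoc struct {
      Name string `bson:"name"`
  }

  func listClouds(coll *mgo.Collection) ([]string, error) {
      iter := coll.Find(nil).Iter()
      var names []string
      var doc cloudDoc
      for iter.Next(&doc) {
          names = append(names, doc.Name)
      }
      // Close returns the iterator's socket to the pool. Skipping it on any
      // return path (for example an early error return, as in readEntries
      // above) leaks the underlying mongodb connection.
      if err := iter.Close(); err != nil {
          return nil, fmt.Errorf("reading clouds: %v", err)
      }
      return names, nil
  }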

Tim Penhey (thumper) wrote :

I'm guessing, quite wildly I must add, that during the upgrade process we get the credentials for every model for potential model upgrades.

However, all of these are worrying and will cause problems.

The pruner not closing its iterator will also leak over time, causing lost file handles.

Tim Penhey (thumper) on 2018-01-07
Changed in juju:
milestone: none → 2.4-beta1
Tim Penhey (thumper) wrote :

Unfortunately this also happened with 2.2.9, so it wasn't the iterators; it is something else.

John A Meinel (jameinel) wrote :

For the file handle issues, I dug through the log file from bug #1742604.

The controller was happily started here:
2018-01-11 02:25:53 INFO juju.cmd supercommand.go:63 running jujud [2.2.9 gc go1.8]

Things generally seem happy, with model workers coming up and connecting correctly.

Up until we get to here:

2018-01-11 02:26:13 INFO juju.apiserver.common password.go:105 setting password for "machine-1"
2018-01-11 02:26:19 INFO juju.apiserver.connection request_notifier.go:80 agent login: machine-1 for dd562d7d-512e-4241-83a1-3a64cd84439b
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46598->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46592->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:20 INFO juju.state.pool.txnwatcher txnwatcher.go:135 watcher loop failed: watcher iteration error: Closed explicitly

The TXN watcher suddenly gets a TCP read failure trying to read from mongo, which causes its connection to be reset and flagged as "Closed explicitly".

Now, I don't know if that is mongo closing the connection, something we did on our end, or something else.

But after that, we get a *flurry* of:
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "txnlog", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "txnlog": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "leadership", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "leadership": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "leadership" in 1s
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "leadership", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "leadership": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "leadership" in 1s
2018-01-11 02:26:20 INFO juju.state.presence presence.go:186 watcher loop failed: Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "presence", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "presence": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "presence" in 1s

Which then proceeds onward until we get to:
2018-01-11 02:26:21 INFO juju.worker runner.go:477 start "txnlog"
2018-01-11 02:26:24 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:38931->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:24 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:38909->10.25.2.109:37017: read: connection reset by peer
...
and
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed...


John A Meinel (jameinel) wrote :

As a summary comment, my first guess is that if we get an error during TLS negotiation, we are failing to close the underlying socket.

John A Meinel (jameinel) wrote :

Looking at the code to dial Mongo, I think it is correct:
  c, err := net.DialTimeout("tcp", addr, opts.Timeout)
  if err != nil {
      logger.Warningf("mongodb connection failed, will retry: %v", err)
      return nil, err
  }
  if tlsConfig != nil {
      cc := tls.Client(c, tlsConfig)
      if err := cc.Handshake(); err != nil {
          logger.Warningf("TLS handshake failed: %v", err)
          if err := c.Close(); err != nil {
              logger.Warningf("failed to close connection: %v", err)
          }
          return nil, err
      }
      c = cc
  }

The only question is whether, if DialTimeout returns an error, you would ever need to close the connection; I would assume that 'c' is nil at that point.
And the error we're seeing in the log file is:
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46598->10.25.2.109:37017: read: connection reset by peer

"handshake failed"

looks like it should be closing the file handle correctly.

The other error that we are seeing, though, is "TLS handshake error" coming from the http package, which is more the API server layer than the Mongo layer.

I haven't tracked down where those are actually coming from. What is really weird is that the message says it is coming from "log.go line 168", but we don't have a 'log.go' file in the Juju code base.

I do see several places that are creating an "apiserver" logger:
./juju/apiserver/apiserver.go:48:var logger = loggo.GetLogger("juju.apiserver")
./juju/cmd/jujud/agent/machine.go:1509: logger := loggo.GetLogger("juju.apiserver")

And the issue with the latter is that it is part of the observer factories, and thus is likely to be passed to something else.

I haven't found any obvious log.go files that are using a logger object on line 168.
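
One way such an attribution could arise (a sketch under assumptions; the adapter type and wiring below are hypothetical, not confirmed to be Juju's code): net/http writes its "Accept error" and "TLS handshake error" messages through Server.ErrorLog, and if that *log.Logger is built over an io.Writer adapter forwarding to loggo, loggo reports the adapter's own file and line as the call site, so every message shows the same location, e.g. a file named log.go:

  package example

  import (
      "log"
      "net/http"

      "github.com/juju/loggo"
  )

  // loggoWriter forwards the standard library logger's output to loggo.
  type loggoWriter struct {
      logger loggo.Logger
  }

  // Write is the call site loggo sees, so the file:line it reports is this
  // adapter's own file, not a location inside net/http.
  func (w *loggoWriter) Write(p []byte) (int, error) {
      w.logger.Warningf("%s", p)
      return len(p), nil
  }

  func newAPIServer(handler http.Handler) *http.Server {
      return &http.Server{
          Addr:    ":17070",
          Handler: handler,
          // net/http writes "Accept error" and "TLS handshake error"
          // messages through ErrorLog when it is set.
          ErrorLog: log.New(&loggoWriter{logger: loggo.GetLogger("juju.apiserver")}, "", 0),
      }
  }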

Changed in juju:
milestone: 2.4-beta1 → none
Tim Penhey (thumper) wrote :

We haven't heard of this problem for a while; it is possible that it has been fixed inadvertently.

no longer affects: juju/2.2
Changed in juju:
status: Triaged → Incomplete
Haw Loeung (hloeung) wrote :

Seeing this with a 2.3.8 to 2.4.3 upgrade: https://pastebin.canonical.com/p/nDJsp6gt66/

Changed in juju:
status: Incomplete → Confirmed
Changed in juju:
status: Confirmed → Triaged