juju is trying to reuse dead connections to mongodb

Bug #1734903 reported by Junien F
This bug affects 2 people
Affects          Status         Importance   Assigned to      Milestone
Canonical Juju   Fix Released   High         John A Meinel
2.2              Fix Released   High         John A Meinel
2.3              Fix Released   High         John A Meinel

Bug Description

Hi,

This is all on a juju 2.2.6 ha-enabled controller (3 units).

We discovered this morning that there were 29M documents in the "juju.txns" collection. On 2 out of 3 controllers, which hadn't been restarted for a while, juju was apparently trying, every second, to connect to mongo over a dead connection ("mongodb ip" below is the IP of the mongodb primary):

2017-11-28 11:18:48 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:48 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2017-11-28 11:18:49 INFO juju.worker runner.go:477 start "txnlog"
2017-11-28 11:18:49 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:49 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2017-11-28 11:18:50 INFO juju.worker runner.go:477 start "txnlog"
2017-11-28 11:18:50 INFO juju.state.watcher watcher.go:139 watcher loop failed: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 INFO juju.worker runner.go:483 stopped "txnlog", err: read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 ERROR juju.worker runner.go:392 exited "txnlog": read tcp <local ip>:59783-><mongodb ip>:37017: i/o timeout
2017-11-28 11:18:50 INFO juju.worker runner.go:467 restarting "txnlog" in 1s

As you can see, it keeps trying to re-use the same srcIP/srcPort/dstIP/dstPort quadruplet, i.e. the same TCP connection.

However, a "sudo ss -antp|grep 59783" returned 0 result, meaning, as I understand it, that the connection was dead.

This in turn prevented juju from purging txns, which led to the 29M records we now have (we're purging them with mgopurge as I write this).

juju should probably be smarter about re-using connections. Note that this situation could be due to a mongodb primary switch that happened a few days ago.
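
For illustration, here is a minimal sketch (not juju code; the helper name and the retry are mine) of the mgo-level behaviour I would expect: if a query fails, refresh the session so the retry dials a fresh socket instead of re-using the dead one.

package sketch

import (
	mgo "gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// countTxns is an illustrative helper: it counts documents in juju.txns
// and, if the first attempt fails (e.g. with an i/o timeout), refreshes
// the session so the retry establishes a new TCP connection rather than
// re-using the broken src/dst pair seen in the logs above.
func countTxns(session *mgo.Session) (int, error) {
	n, err := session.DB("juju").C("txns").Find(bson.M{}).Count()
	if err == nil {
		return n, nil
	}
	// Refresh discards the sockets cached by this session; the next
	// operation will dial a new connection to the replica set.
	session.Refresh()
	return session.DB("juju").C("txns").Find(bson.M{}).Count()
}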

Thanks

Revision history for this message
John A Meinel (jameinel) wrote :

To put some specifics on it, the internal TXN log watcher is started via:
state/workers.go:

	ws.StartWorker(txnLogWorker, func() (worker.Worker, error) {
		return watcher.New(st.getTxnLogCollection()), nil
	})

But if you look at getTxnLogCollection:

func (st *State) getTxnLogCollection() *mgo.Collection {
	return st.session.DB(jujuDB).C(txnLogC)
}

That uses the static "State.session" member, and I cannot find anywhere that calls .session.Refresh().

If we wanted to do something like:

func (st *State) getTxnLogCollection() *mgo.Collection {
	session := st.session.Copy()
	return session.DB(jujuDB).C(txnLogC)
}
We would still need to figure out the lifecycle of that copied session: who owns it, who needs to close it, and so on.

So it probably needs a bit of internal rework so that the worker itself copies the session and/or refreshes it when there is an error.
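
To make the lifecycle question concrete, here is a rough standalone sketch (the type and names are mine, not juju's) of a worker that owns a private copy of the session and is responsible for closing it when it stops:

package sketch

import (
	mgo "gopkg.in/mgo.v2"
)

// sessionWorker is an illustrative stand-in for a watcher-style worker. It
// copies the parent session so a dead socket on the parent cannot poison
// it, and because it owns the copy, it (not the caller) closes it on Stop.
type sessionWorker struct {
	session *mgo.Session
	coll    *mgo.Collection
	done    chan struct{}
}

func newSessionWorker(parent *mgo.Session, db, coll string) *sessionWorker {
	// Copy gives this worker its own socket from the pool; Refresh or
	// Close on the copy does not affect the parent session.
	session := parent.Copy()
	return &sessionWorker{
		session: session,
		coll:    session.DB(db).C(coll),
		done:    make(chan struct{}),
	}
}

// Stop is one answer to "who needs to close it": the worker closes its own
// copy when it stops, returning the socket to the pool.
func (w *sessionWorker) Stop() {
	close(w.done)
	w.session.Close()
}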

We're probably actually making this worse in 2.3 with the StatePool, which ensures that we'll be using the same State objects for longer and for more purposes. That's good in the sense that we don't redo lots of work, but bad in the sense that any fatal error ends up being fatal forever.

Changed in juju:
importance: Undecided → High
milestone: none → 2.3.1
status: New → Triaged
Revision history for this message
John A Meinel (jameinel) wrote :

(note that this should probably be considered if/when we work on sharing the underlying log watcher across all models.)

Revision history for this message
John A Meinel (jameinel) wrote :

a potential "cheap" fix for this would be to do something like:

err := st.session.Ping()
if err != nil {
  // failed to talk to the database, try refreshing our connection
  st.session.Refresh()
}
return st.session.DB(jujuDB).C(txnLogC)

It might even be reasonable to do a Refresh more often: if the txnlog watcher is restarting, we probably have reason to suspect the connection.
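
As a standalone illustration of that guard (the function name and signature here are mine, not what actually landed), the shape would be roughly:

package sketch

import (
	mgo "gopkg.in/mgo.v2"
)

// collection pings the session before handing back a collection handle; if
// the ping fails, the cached socket is suspect, so Refresh discards it and
// the next operation dials a new connection instead of retrying the same
// dead src/dst pair once a second.
func collection(session *mgo.Session, db, name string) *mgo.Collection {
	if err := session.Ping(); err != nil {
		session.Refresh()
	}
	return session.DB(db).C(name)
}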

Changed in juju:
milestone: 2.3.1 → none
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.3.2
Revision history for this message
John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/8216
is targeting 2.2 for this fix. We may decide to only land it in 2.3+

Changed in juju:
milestone: 2.3.2 → none
milestone: none → 2.4-beta1
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
Revision history for this message
Anastasia (anastasia-macmood) wrote :

The PR from comment #4 has been merged into 2.2, so the fix has since been released in 2.2.7 and 2.3.2.

I will also mark it as Fix Committed for 2.4-beta1, as I am sure it was merged into develop, which heads into 2.4.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released