txnpruner failing to prune on large controller

Bug #1771906 reported by Paul Collins on 2018-05-18
Affects     Importance  Assigned to
juju        High        John A Meinel
juju 2.3    High        John A Meinel
juju 2.4    High        John A Meinel

Bug Description

In Juju 2.3.9 and 2.4.2 we introduced two new 'juju controller-config' options:
  max-prune-txn-size: 1000000
  max-prune-passes: 100

We default to pruning 1M transactions at a time, and at most 100 passes per attempt. That gives us up to 100M transactions that we can prune per hour. That shouldn't prevent us from cleaning up any sort of backlog, but it should allow us to make more progress with each attempt. It is expected to take longer if we aren't pruning everything in one pass, but the normal load is about 200k transactions per hour, so 1M should operate the same under normal load. The config values will allow us to change the limits on a running controller, in case these are poor defaults.
Setting the max-prune-txn-size to 0 will remove the limit. A setting of max-prune-passes: 0 is just treated as a value of 1 pass.
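As a sketch of how these two settings interact (hypothetical helper names, not the actual juju/txn API), the per-attempt budget is simply the per-pass size times the number of passes, with the two zero cases handled as described above:

```go
package main

import "fmt"

// effectivePasses applies the documented semantics: max-prune-passes: 0
// is treated as a single pass.
func effectivePasses(maxPasses int) int {
	if maxPasses <= 0 {
		return 1
	}
	return maxPasses
}

// pruneBudget returns the maximum number of txns one hourly prune attempt
// may remove, or -1 for "unlimited" (max-prune-txn-size: 0 removes the limit).
func pruneBudget(maxTxnSize, maxPasses int) int {
	if maxTxnSize == 0 {
		return -1 // no per-pass limit means no overall cap
	}
	return maxTxnSize * effectivePasses(maxPasses)
}

func main() {
	fmt.Println(pruneBudget(1000000, 100)) // defaults: up to 100M txns per attempt
	fmt.Println(pruneBudget(1000000, 0))   // 0 passes behaves as 1 pass: 1M
	fmt.Println(pruneBudget(0, 100))       // unlimited
}
```

With the defaults this gives the 100M-per-hour ceiling mentioned above, while leaving normal 200k/hour load well inside a single pass.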

We've observed that txns are building up on our production Juju 2 controller.
Yesterday I completed an mgopurge of 56M txns, and rebuilt all of the mongodb replicas to compact the database from 38G to 24G, which took about five hours in all.

This controller is running 2.2.9, although since there are no significant pruning changes in 2.3.x (or so I'm told) I'm filing this bug now.

After about 13 hours of uptime, since approx 2018-05-17 07:44, we were up to 380,000 txns. I checked the controller agents' logs and the txnpruner logged as follows.

I take the "failed to retrieve final txns count" message to mean that even after one hour of uptime this controller is generating more transactions than the pruner can cope with.

juju:PRIMARY> db.txns.count()
388057
juju:PRIMARY>
bye
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ date -u
Thu May 17 20:41:58 UTC 2018

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
                 STARTED CMD
Thu May 17 07:44:03 2018 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
[...]
2018-05-17 07:44:24 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 07:44:24 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF
2018-05-17 09:44:03 ERROR juju.worker runner.go:381 fatal "txnpruner": EOF
2018-05-17 10:01:40 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner"
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
                 STARTED CMD
Thu May 17 07:44:02 2018 /var/lib/juju/tools/machine-1/jujud machine --data-dir /var/lib/juju --machine-id 1 --debug
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
[...]
2018-05-17 07:44:23 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
                 STARTED CMD
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo zgrep txnprune /var/log/juju/machine-2-2018-05-17T17-15-21.020.log.gz | tail -n 50
2018-04-22 12:01:52 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout
2018-04-23 19:41:53 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout
2018-04-23 19:41:53 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-04-23 19:41:56 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 20:23:13 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
2018-04-23 20:36:47 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-04-23 20:36:47 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-02 08:02:40 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout
2018-05-02 08:02:42 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout
2018-05-02 08:02:42 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-05-02 08:02:45 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-02 08:47:58 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-02 08:47:58 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-03 18:22:44 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF
2018-05-03 18:22:52 ERROR juju.worker runner.go:392 exited "txnpruner": EOF
2018-05-17 07:44:23 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 08:45:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout
2018-05-17 08:45:52 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout
2018-05-17 08:45:52 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-05-17 08:45:55 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:04 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
2018-05-17 10:01:17 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner"

Paul Collins (pjdc) wrote :

It seems txnpruner started up again later, although it's not clear to me how to tell if it's active:

(machine 0)

2018-05-17 10:01:40 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner"

(machine 1)

2018-05-18 03:28:09 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-18 03:28:09 INFO juju.worker runner.go:477 start "txnpruner"

(machine 2)

2018-05-17 10:01:17 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner"

Paul Collins (pjdc) wrote :

Okay, finally figured out what to grep for:

2018-05-17 08:44:52 INFO juju.txn prune.go:128 txns after last prune: 251800, txns now: 357278, pruning: too many new transactions
2018-05-17 11:01:17 INFO juju.txn prune.go:128 txns after last prune: 251800, txns now: 778357, pruning: too many new transactions
2018-05-17 11:03:09 INFO juju.txn prune.go:153 txn pruning complete after 1m52.040008885s. txns now: 274117, inspected 63 collections, 27332 docs (2077 cleaned)
2018-05-17 12:03:09 INFO juju.txn prune.go:128 txns after last prune: 274117, txns now: 531412, pruning: too many new transactions
2018-05-17 12:04:30 INFO juju.txn prune.go:153 txn pruning complete after 1m20.513413195s. txns now: 263121, inspected 63 collections, 20481 docs (6564 cleaned)
2018-05-17 13:04:30 INFO juju.txn prune.go:128 txns after last prune: 263121, txns now: 520497, pruning: too many new transactions
2018-05-17 13:05:36 INFO juju.txn prune.go:153 txn pruning complete after 1m5.471639408s. txns now: 262156, inspected 63 collections, 14467 docs (434 cleaned)
2018-05-17 14:05:36 INFO juju.txn prune.go:128 txns after last prune: 262156, txns now: 519253, pruning: too many new transactions
2018-05-17 14:06:31 INFO juju.txn prune.go:153 txn pruning complete after 55.140586434s. txns now: 261032, inspected 63 collections, 14479 docs (446 cleaned)
2018-05-17 15:06:31 INFO juju.txn prune.go:128 txns after last prune: 261032, txns now: 519604, pruning: too many new transactions
2018-05-17 15:07:35 INFO juju.txn prune.go:153 txn pruning complete after 1m3.926678727s. txns now: 263312, inspected 63 collections, 19079 docs (451 cleaned)
2018-05-17 16:07:35 INFO juju.txn prune.go:128 txns after last prune: 263312, txns now: 520881, pruning: too many new transactions
2018-05-17 16:08:28 INFO juju.txn prune.go:153 txn pruning complete after 52.362132882s. txns now: 261125, inspected 63 collections, 16821 docs (1175 cleaned)
2018-05-17 17:08:28 INFO juju.txn prune.go:128 txns after last prune: 261125, txns now: 518994, pruning: too many new transactions
2018-05-17 17:09:23 INFO juju.txn prune.go:153 txn pruning complete after 55.783720751s. txns now: 261729, inspected 63 collections, 14771 docs (539 cleaned)
2018-05-17 18:09:24 INFO juju.txn prune.go:128 txns after last prune: 261729, txns now: 519350, pruning: too many new transactions
2018-05-17 18:10:26 INFO juju.txn prune.go:153 txn pruning complete after 1m2.719207323s. txns now: 262231, inspected 63 collections, 14660 docs (464 cleaned)
2018-05-17 19:10:28 INFO juju.txn prune.go:128 txns after last prune: 262231, txns now: 520676, pruning: too many new transactions
2018-05-17 19:11:26 INFO juju.txn prune.go:153 txn pruning complete after 58.66493666s. txns now: 262614, inspected 63 collections, 15923 docs (492 cleaned)
2018-05-17 20:11:27 INFO juju.txn prune.go:128 txns after last prune: 262614, txns now: 520286, pruning: too many new transactions
2018-05-17 20:12:25 INFO juju.txn prune.go:153 txn pruning complete after 58.163812471s. txns now: 261906, inspected 63 collections, 15726 docs (706 cleaned)
2018-05-17 21:12:25 INFO juju.txn prune.go:128 txns after last prune: 261906, txns now: 519117, pruning: too many new transact...


Paul Collins (pjdc) wrote :

And I also ran a couple of external resume,prune mgopurges at:

2018-05-18 03:36:15 DEBUG pruning completed: removed 75055 txns
2018-05-18 04:06:52 DEBUG pruning completed: removed 130722 txns

John A Meinel (jameinel) wrote :

So you seem to have stabilized at about 260k transactions per hour. Our default settings don't remove the last 1hr worth of transactions, so you get 260k in hour 1, then 260k in hour 2, (total 520k) and then we prune back down to 260k.
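That oscillation can be sketched as a trivial model (hypothetical names; the real pruner's retention policy is more involved than "keep exactly the last hour"):

```go
package main

import "fmt"

// cycle models one hour of the described behaviour: carry over the txns
// retained after the previous prune, add an hour of new transactions,
// then prune everything except the most recent hour's worth.
func cycle(retained, ratePerHour int) (peak, afterPrune int) {
	peak = retained + ratePerHour // count just before pruning runs
	afterPrune = ratePerHour      // only the last hour survives the prune
	return peak, afterPrune
}

func main() {
	peak, after := cycle(260000, 260000)
	fmt.Println(peak, after) // matches the ~520k-before / ~260k-after pattern in the logs
}
```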

All that seems to be working correctly.

The concern is what caused it to stop keeping you at 260k, and allowed it to grow to 50M.

Paul Collins (pjdc) wrote :

However, the logging above indicates normal pruning behaviour, although it seems pretty conservative? Hence a few hundred K transactions hanging around at all times.

Paul Collins (pjdc) wrote :

I'll mark this incomplete since internal pruning is apparently working correctly at present. I'll reopen if it stops working again. However, based on checking the controller agent log files, it seems that previously it stopped working without logging anything.

Changed in juju:
status: New → Incomplete

John A Meinel (jameinel) wrote :

Lines like "failed to retrieve final txns count" mean that we failed to
issue a db.txns.count() command. I suppose that if a lot of content was
being added to the database while there were also 50M records, the time it
takes to get the count plus the churn meant that we were failing?

2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout

i/o timeout generally means we failed to talk to mongo.

What surprises me is that you see the "start ... pruner" and then only see
"failed to get count" after 7.5 hours. Was it running the rest of the time?
Most of our logging is done at DEBUG level, so I'm not too surprised that
you don't see a lot by default, though 7.5hrs to prune is certainly way
longer than it should be (you saw it was only about 1-2min when things were
running correctly).

We do have 2 relevant INFO level messages:

    if !required {
        logger.Infof("txns after last prune: %d, txns now: %d, not pruning: %s",
            lastTxnsCount, txnsCount, rationale)
        return nil
    }
    logger.Infof("txns after last prune: %d, txns now: %d, pruning: %s",
        lastTxnsCount, txnsCount, rationale)

However, I think default logging level for controllers is WARNING, so those
also wouldn't end up in the logs.

You could set something like

  juju model-config -m controller "logging-config=<root>=WARNING,juju.txn=INFO"

(We'd like to move to having INFO be the default universally, but we're not
sure if we're actually ready to switch that right now.)

John
=:->


Junien Fridrick (axino) wrote :

Hi,

This is happening on a 2.3.7 controller with ~30M docs in juju.txns.

juju logs the following:

$ sudo egrep 'txnpruner|prune.go|juju.txn' /var/log/juju/machine-?.log|tail
2018-06-27 09:45:34 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:33651->A.B.C.2:37017: i/o timeout
2018-06-27 10:45:37 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30356480, pruning: too many new transactions
2018-06-27 10:46:37 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:36240->A.B.C.2:37017: i/o timeout
2018-06-27 10:46:37 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:36240->A.B.C.2:37017: i/o timeout
2018-06-27 11:46:40 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30438897, pruning: too many new transactions
2018-06-27 11:47:40 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:41384->A.B.C.2:37017: i/o timeout
2018-06-27 11:47:40 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:41384->A.B.C.2:37017: i/o timeout
2018-06-27 12:47:43 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30521252, pruning: too many new transactions
2018-06-27 12:48:43 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:38816->A.B.C.2:37017: i/o timeout
2018-06-27 12:48:43 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:38816->A.B.C.2:37017: i/o timeout

It would appear that juju is timing out after 60s, but the aggregation takes 488s:

Jun 27 12:55:51 juju-foo-0 mongod.37017[2055]: [conn144066] command juju.txns command: aggregate { aggregate: "txns", pipeline: [ { $match: { _id: { $lt: ObjectId('5b33795f0000000000000000') }, s: { $gte: 5 } } }, { $project: { _id: 1 } }, { $out: "txns.prunetemp" } ], cursor: { batchSize: 1616 }, allowDiskUse: true } keyUpdates:0 writeConflicts:0 numYields:241900 reslen:115 locks:{ Global: { acquireCount: { r: 961941, w: 475690, W: 1 } }, Database: { acquireCount: { r: 243124, w: 475688, R: 1, W: 2 } }, Collection: { acquireCount: { r: 243124, w: 40 } }, Metadata: { acquireCount: { w: 30440288 } }, oplog: { acquireCount: { w: 475648 } } } protocol:op_query 488082ms

Changed in juju:
status: Incomplete → New
Junien Fridrick (axino) wrote :

Note that 99.999% of these 30M txns were in state 6. mgopurge -stages prune was able to get rid of them and get the controller to an OK state.

Tim Penhey (thumper) wrote :

We discussed changing the code so it doesn't try to materialize a list of all the transactions to prune, but instead gather about a million at a time max.
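A minimal sketch of that batching idea (hypothetical helper, not the actual juju/txn change): rather than materializing the full list of removable txn ids at once, split them into bounded chunks so each removal pass touches a limited number of documents and no single long-running query or long-lived cursor is needed.

```go
package main

import "fmt"

// chunk splits ids into batches of at most size elements. Each batch can
// then be deleted in its own short operation instead of one huge pass.
func chunk(ids []string, size int) [][]string {
	var batches [][]string
	for len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		batches = append(batches, ids[:n])
		ids = ids[n:]
	}
	return batches
}

func main() {
	ids := make([]string, 2500000) // e.g. 2.5M removable txns
	fmt.Println(len(chunk(ids, 1000000))) // three passes at ~1M per batch
}
```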

Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: mongodb pruning
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.4.2
John A Meinel (jameinel) wrote :
Changed in juju:
status: Triaged → In Progress
John A Meinel (jameinel) wrote :

If we want to incorporate this into Juju code, we'll also need a small patch: CleanAndPrune doesn't default to a per-pass batch size and auto retry. We *could* put that into juju/txn, but that kind of policy is something that I prefer to pull out into calling code.

John A Meinel (jameinel) on 2018-07-10
Changed in juju:
milestone: 2.4.2 → 2.5-beta1
Paul Collins (pjdc) wrote :

We just had another occurrence of this on our largest internal production controller.

We received alerts for high load on a unit that turned out to be the mongodb primary. We ran "mgopurge -stages resume,prune", which eventually removed approximately 27.8 million completed txns.

However, mgopurge crashed twice with "ERROR failed stage prune: Cursor not found, cursor id: ${bignum}" before a final run completed cleanly.

John A Meinel (jameinel) wrote :

It seems cursors default to having a 10 minute lifetime, though it seems to be driver dependent. Both Java and Pymongo expose some sort of "no_cursor_timeout". Or:
https://docs.mongodb.com/manual/reference/method/cursor.noCursorTimeout/#cursor.noCursorTimeout

I wouldn't think we have long periods of inactivity, but maybe it is possible that after creating the temp table, we are busy thinking about other things, and don't go back to hit one of our other queries very often?

Batching into smaller groups to run at a time probably works around this as well.


John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/9083 is targeting 2.3, and:

1) Defaults to a maximum per pass of 1M transactions, and 100 passes.
2) Allows tuning via 'juju controller-config' in case we find those defaults are too generous or too small.

John A Meinel (jameinel) on 2018-08-23
description: updated
John A Meinel (jameinel) on 2018-11-13
Changed in juju:
status: In Progress → Fix Committed
Junien Fridrick (axino) wrote :

Hi,

For what it's worth we're still seeing problems on a 2.4.5 controller :

$ sudo grep juju.txn /var/log/juju/machine-0.log
2018-11-19 13:35:27 INFO juju.txn prune.go:131 txns after last prune: 204355, txns now: 9511023, pruning: too many new transactions
2018-11-19 13:36:27 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 1.2.3.109:45032->1.2.3.111:37017: i/o timeout
2018-11-19 14:36:30 INFO juju.txn prune.go:131 txns after last prune: 204355, txns now: 9718097, pruning: too many new transactions
2018-11-19 14:37:30 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 1.2.3.109:54136->1.2.3.111:37017: i/o timeout
2018-11-19 15:37:33 INFO juju.txn prune.go:131 txns after last prune: 204355, txns now: 9925300, pruning: too many new transactions
2018-11-19 15:38:33 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 1.2.3.109:45780->1.2.3.111:37017: i/o timeout

John A Meinel (jameinel) wrote :

New code for how we prune transactions to start removing txns much sooner is landing here:
https://github.com/juju/juju/pull/9594

Changed in juju:
status: Fix Committed → Fix Released