Bug #1771906 “txnpruner failing to prune on large controller” : Bugs : Canonical Juju

Revision history for this message

Paul Collins (pjdc) wrote on 2018-05-18:

#1

It seems txnpruner started up again later, although it's not clear to me how to tell if it's active:

(machine 0)

2018-05-17 10:01:40 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner"

(machine 1)

2018-05-18 03:28:09 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-18 03:28:09 INFO juju.worker runner.go:477 start "txnpruner"

(machine 2)

2018-05-17 10:01:17 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner"

Revision history for this message

Paul Collins (pjdc) wrote on 2018-05-18:

#3

Download full text (5.3 KiB)

Okay, finally figure out what to grep for:

2018-05-17 08:44:52 INFO juju.txn prune.go:128 txns after last prune: 251800, txns now: 357278, pruning: too many new transactions
2018-05-17 11:01:17 INFO juju.txn prune.go:128 txns after last prune: 251800, txns now: 778357, pruning: too many new transactions
2018-05-17 11:03:09 INFO juju.txn prune.go:153 txn pruning complete after 1m52.040008885s. txns now: 274117, inspected 63 collections, 27332 docs (2077 cleaned)
2018-05-17 12:03:09 INFO juju.txn prune.go:128 txns after last prune: 274117, txns now: 531412, pruning: too many new transactions
2018-05-17 12:04:30 INFO juju.txn prune.go:153 txn pruning complete after 1m20.513413195s. txns now: 263121, inspected 63 collections, 20481 docs (6564 cleaned)
2018-05-17 13:04:30 INFO juju.txn prune.go:128 txns after last prune: 263121, txns now: 520497, pruning: too many new transactions
2018-05-17 13:05:36 INFO juju.txn prune.go:153 txn pruning complete after 1m5.471639408s. txns now: 262156, inspected 63 collections, 14467 docs (434 cleaned)
2018-05-17 14:05:36 INFO juju.txn prune.go:128 txns after last prune: 262156, txns now: 519253, pruning: too many new transactions
2018-05-17 14:06:31 INFO juju.txn prune.go:153 txn pruning complete after 55.140586434s. txns now: 261032, inspected 63 collections, 14479 docs (446 cleaned)
2018-05-17 15:06:31 INFO juju.txn prune.go:128 txns after last prune: 261032, txns now: 519604, pruning: too many new transactions
2018-05-17 15:07:35 INFO juju.txn prune.go:153 txn pruning complete after 1m3.926678727s. txns now: 263312, inspected 63 collections, 19079 docs (451 cleaned)
2018-05-17 16:07:35 INFO juju.txn prune.go:128 txns after last prune: 263312, txns now: 520881, pruning: too many new transactions
2018-05-17 16:08:28 INFO juju.txn prune.go:153 txn pruning complete after 52.362132882s. txns now: 261125, inspected 63 collections, 16821 docs (1175 cleaned)
2018-05-17 17:08:28 INFO juju.txn prune.go:128 txns after last prune: 261125, txns now: 518994, pruning: too many new transactions
2018-05-17 17:09:23 INFO juju.txn prune.go:153 txn pruning complete after 55.783720751s. txns now: 261729, inspected 63 collections, 14771 docs (539 cleaned)
2018-05-17 18:09:24 INFO juju.txn prune.go:128 txns after last prune: 261729, txns now: 519350, pruning: too many new transactions
2018-05-17 18:10:26 INFO juju.txn prune.go:153 txn pruning complete after 1m2.719207323s. txns now: 262231, inspected 63 collections, 14660 docs (464 cleaned)
2018-05-17 19:10:28 INFO juju.txn prune.go:128 txns after last prune: 262231, txns now: 520676, pruning: too many new transactions
2018-05-17 19:11:26 INFO juju.txn prune.go:153 txn pruning complete after 58.66493666s. txns now: 262614, inspected 63 collections, 15923 docs (492 cleaned)
2018-05-17 20:11:27 INFO juju.txn prune.go:128 txns after last prune: 262614, txns now: 520286, pruning: too many new transactions
2018-05-17 20:12:25 INFO juju.txn prune.go:153 txn pruning complete after 58.163812471s. txns now: 261906, inspected 63 collections, 15726 docs (706 cleaned)
2018-05-17 21:12:25 INFO juju.txn prune.go:128 txns after last prune: 261906, txns now: 519117, pruning: too many new transact...

Okay, finally figure out what to grep for:

2018-05-17 08:44:52 INFO juju.txn prune.go:128 txns after last prune: 251800, txns now: 357278, pruning: too many new transactions
2018-05-17 11:01:17 INFO juju.txn prune.go:128 txns after last prune: 251800, txns now: 778357, pruning: too many new transactions
2018-05-17 11:03:09 INFO juju.txn prune.go:153 txn pruning complete after 1m52.040008885s. txns now: 274117, inspected 63 collections, 27332 docs (2077 cleaned)
2018-05-17 12:03:09 INFO juju.txn prune.go:128 txns after last prune: 274117, txns now: 531412, pruning: too many new transactions
2018-05-17 12:04:30 INFO juju.txn prune.go:153 txn pruning complete after 1m20.513413195s. txns now: 263121, inspected 63 collections, 20481 docs (6564 cleaned)
2018-05-17 13:04:30 INFO juju.txn prune.go:128 txns after last prune: 263121, txns now: 520497, pruning: too many new transactions
2018-05-17 13:05:36 INFO juju.txn prune.go:153 txn pruning complete after 1m5.471639408s. txns now: 262156, inspected 63 collections, 14467 docs (434 cleaned)
2018-05-17 14:05:36 INFO juju.txn prune.go:128 txns after last prune: 262156, txns now: 519253, pruning: too many new transactions
2018-05-17 14:06:31 INFO juju.txn prune.go:153 txn pruning complete after 55.140586434s. txns now: 261032, inspected 63 collections, 14479 docs (446 cleaned)
2018-05-17 15:06:31 INFO juju.txn prune.go:128 txns after last prune: 261032, txns now: 519604, pruning: too many new transactions
2018-05-17 15:07:35 INFO juju.txn prune.go:153 txn pruning complete after 1m3.926678727s. txns now: 263312, inspected 63 collections, 19079 docs (451 cleaned)
2018-05-17 16:07:35 INFO juju.txn prune.go:128 txns after last prune: 263312, txns now: 520881, pruning: too many new transactions
2018-05-17 16:08:28 INFO juju.txn prune.go:153 txn pruning complete after 52.362132882s. txns now: 261125, inspected 63 collections, 16821 docs (1175 cleaned)
2018-05-17 17:08:28 INFO juju.txn prune.go:128 txns after last prune: 261125, txns now: 518994, pruning: too many new transactions
2018-05-17 17:09:23 INFO juju.txn prune.go:153 txn pruning complete after 55.783720751s. txns now: 261729, inspected 63 collections, 14771 docs (539 cleaned)
2018-05-17 18:09:24 INFO juju.txn prune.go:128 txns after last prune: 261729, txns now: 519350, pruning: too many new transactions
2018-05-17 18:10:26 INFO juju.txn prune.go:153 txn pruning complete after 1m2.719207323s. txns now: 262231, inspected 63 collections, 14660 docs (464 cleaned)
2018-05-17 19:10:28 INFO juju.txn prune.go:128 txns after last prune: 262231, txns now: 520676, pruning: too many new transactions
2018-05-17 19:11:26 INFO juju.txn prune.go:153 txn pruning complete after 58.66493666s. txns now: 262614, inspected 63 collections, 15923 docs (492 cleaned)
2018-05-17 20:11:27 INFO juju.txn prune.go:128 txns after last prune: 262614, txns now: 520286, pruning: too many new transactions
2018-05-17 20:12:25 INFO juju.txn prune.go:153 txn pruning complete after 58.163812471s. txns now: 261906, inspected 63 collections, 15726 docs (706 cleaned)
2018-05-17 21:12:25 INFO juju.txn prune.go:128 txns after last prune: 261906, txns now: 519117, pruning: too many new transactions
2018-05-17 21:13:26 INFO juju.txn prune.go:153 txn pruning complete after 1m0.888025344s. txns now: 261581, inspected 63 collections, 14761 docs (557 cleaned)
2018-05-17 22:13:26 INFO juju.txn prune.go:128 txns after last prune: 261581, txns now: 519084, pruning: too many new transactions
2018-05-17 22:14:25 INFO juju.txn prune.go:153 txn pruning complete after 59.219452113s. txns now: 261947, inspected 63 collections, 14576 docs (435 cleaned)
2018-05-17 23:14:25 INFO juju.txn prune.go:128 txns after last prune: 261947, txns now: 519450, pruning: too many new transactions
2018-05-17 23:15:24 INFO juju.txn prune.go:153 txn pruning complete after 59.303205079s. txns now: 261820, inspected 63 collections, 14511 docs (457 cleaned)
2018-05-18 00:15:24 INFO juju.txn prune.go:128 txns after last prune: 261820, txns now: 519367, pruning: too many new transactions
2018-05-18 00:16:18 INFO juju.txn prune.go:153 txn pruning complete after 53.51410596s. txns now: 261466, inspected 63 collections, 14509 docs (431 cleaned)
2018-05-18 01:16:18 INFO juju.txn prune.go:128 txns after last prune: 261466, txns now: 518683, pruning: too many new transactions
2018-05-18 01:17:16 INFO juju.txn prune.go:153 txn pruning complete after 57.198467444s. txns now: 261113, inspected 63 collections, 14540 docs (454 cleaned)
2018-05-18 02:17:16 INFO juju.txn prune.go:128 txns after last prune: 261113, txns now: 518789, pruning: too many new transactions
2018-05-18 02:18:21 INFO juju.txn prune.go:153 txn pruning complete after 1m4.740267702s. txns now: 262378, inspected 63 collections, 14475 docs (442 cleaned)
2018-05-18 03:18:21 INFO juju.txn prune.go:128 txns after last prune: 262378, txns now: 519477, pruning: too many new transactions
2018-05-18 03:19:29 INFO juju.txn prune.go:153 txn pruning complete after 1m7.800545051s. txns now: 262114, inspected 63 collections, 14498 docs (424 cleaned)
2018-05-18 04:19:29 INFO juju.txn prune.go:128 txns after last prune: 262114, txns now: 313918, pruning: transactions have grown significantly
2018-05-18 04:19:44 INFO juju.txn prune.go:153 txn pruning complete after 15.410154434s. txns now: 258670, inspected 63 collections, 13915 docs (124 cleaned)

Revision history for this message

Paul Collins (pjdc) wrote on 2018-05-18:

#4

And I also ran a couple of external resume,prune mgopurges at:

2018-05-18 03:36:15 DEBUG pruning completed: removed 75055 txns
2018-05-18 04:06:52 DEBUG pruning completed: removed 130722 txns

Revision history for this message

John A Meinel (jameinel) wrote on 2018-05-18:

#5

So you seem to have stabilized at about 260k transactions per hour. Our default settings don't remove the last 1hr worth of transactions, so you get 260k in hour 1, then 260k in hour 2, (total 520k) and then we prune back down to 260k.

All that seems to be working correctly.

The concern is what caused it to stop keeping you at 260k, and allowed it to grow to 50M.

Revision history for this message

Paul Collins (pjdc) wrote on 2018-05-18:

#6

However, the logging above indicates normal pruning behaviour, although it seems pretty conservative? Hence a few hundred K transactions hanging around at all times.

Revision history for this message

Paul Collins (pjdc) wrote on 2018-05-18:

#7

I'll mark this incomplete since internal pruning is apparently working correctly at present. I'll reopen if it stops working again. However, based on checking the controller agent log files, it seems that previously it stopped working without logging anything.

Changed in juju:
status:	New → Incomplete

Revision history for this message

John A Meinel (jameinel) wrote on 2018-05-18: Re: [Bug 1771906] Re: txnpruner failing to prune on large controller

#8

Lines like "failed to retrieve final txn count" means that we failed to
issue a db.txns.count() command.
Which I suppose if there was a lot of content being added to the database
and at the same time there were 50M records, the time it takes to get the
count + the churn meant that we were failing?

2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner",
err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed
to retrieve final txns count: read
tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout

i/o timeout generally means we failed to talk to mongo.

What surprises me is that you se ethe "start ... pruner" and then only see
"failed to get count" after 7.5 hours. Was it running the rest of the time?
Most of our logging is done at DEBUG level, so I'm not too surprised that
you don't see a lot by default, though 7.5hrs to prune certainly is way
longer than it should be (you saw it was only about 1-2min when things are
running correctly.)

We do have 2 relevant INFO level messages:
    if !required {
        logger.Infof("txns after last prune: %d, txns now: %d, not pruning:
%s",
            lastTxnsCount, txnsCount, rationale)
        return nil
    }
    logger.Infof("txns after last prune: %d, txns now: %d, pruning: %s",
        lastTxnsCount, txnsCount, rationale)

However, I think default logging level for controllers is WARNING, so those
also wouldn't end up in the logs.

You could set something like

juju model-config -m controller
"logging-config=<root>=WARNING,juju.txn=INFO"

(We'd like to move to having INFO be the default universally, but we're not
sure if we're actually ready to switch that right now.)

John
=:->

On Fri, May 18, 2018 at 8:47 AM, Paul Collins <email address hidden>
wrote:

> I'll mark this incomplete since internal pruning is apparently working
> correctly at present. I'll reopen if it stops working again. However,
> based on checking the controller agent log files, it seems that
> previously it stopped working without logging anything.
>
> ** Changed in: juju
> Status: New => Incomplete
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1771906
>
> Title:
> txnpruner failing to prune on large controller
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1771906/+subscriptions
>

Lines like "failed to retrieve final txn count" means that we failed to
issue a db.txns.count() command.
Which I suppose if there was a lot of content being added to the database
and at the same time there were 50M records, the time it takes to get the
count + the churn meant that we were failing?

2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner",
err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed
to retrieve final txns count: read
tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout

i/o timeout generally means we failed to talk to mongo.

What surprises me is that you se ethe "start ... pruner" and then only see
"failed to get count" after 7.5 hours. Was it running the rest of the time?
Most of our logging is done at DEBUG level, so I'm not too surprised that
you don't see a lot by default, though 7.5hrs to prune certainly is way
longer than it should be (you saw it was only about 1-2min when things are
running correctly.)

We do have 2 relevant INFO level messages:
    if !required {
        logger.Infof("txns after last prune: %d, txns now: %d, not pruning:
%s",
            lastTxnsCount, txnsCount, rationale)
        return nil
    }
    logger.Infof("txns after last prune: %d, txns now: %d, pruning: %s",
        lastTxnsCount, txnsCount, rationale)

However, I think default logging level for controllers is WARNING, so those
also wouldn't end up in the logs.

You could set something like

juju model-config -m controller
"logging-config=<root>=WARNING,juju.txn=INFO"

(We'd like to move to having INFO be the default universally, but we're not
sure if we're actually ready to switch that right now.)

John
=:->

On Fri, May 18, 2018 at 8:47 AM, Paul Collins <paul.collins@canonical.com>
wrote:

> I'll mark this incomplete since internal pruning is apparently working
> correctly at present. I'll reopen if it stops working again.  However,
> based on checking the controller agent log files, it seems that
> previously it stopped working without logging anything.
>
> ** Changed in: juju
>        Status: New => Incomplete
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1771906
>
> Title:
>   txnpruner failing to prune on large controller
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1771906/+subscriptions
>

Revision history for this message

Junien F (axino) wrote on 2018-06-27:

#9

Hi,

This is happening on a 2.3.7 controller with ~30M docs in juju.txns.

juju logs the following :

$ sudo egrep 'txnpruner|prune.go|juju.txn' /var/log/juju/machine-?.log|tail
2018-06-27 09:45:34 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:33651->A.B.C.2:37017: i/o timeout
2018-06-27 10:45:37 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30356480, pruning: too many new transactions
2018-06-27 10:46:37 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:36240->A.B.C.2:37017: i/o timeout
2018-06-27 10:46:37 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:36240->A.B.C.2:37017: i/o timeout
2018-06-27 11:46:40 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30438897, pruning: too many new transactions
2018-06-27 11:47:40 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:41384->A.B.C.2:37017: i/o timeout
2018-06-27 11:47:40 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:41384->A.B.C.2:37017: i/o timeout
2018-06-27 12:47:43 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30521252, pruning: too many new transactions
2018-06-27 12:48:43 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:38816->A.B.C.2:37017: i/o timeout
2018-06-27 12:48:43 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:38816->A.B.C.2:37017: i/o timeout

It would appear that juju is timing out after 60s, but the aggregation takes 488s :

Jun 27 12:55:51 juju-foo-0 mongod.37017[2055]: [conn144066] command juju.txns command: aggregate { aggregate: "txns", pipeline: [ { $match: { _id: { $lt: ObjectId('5b33795f0000000000000000') }, s: { $gte: 5 } } }, { $project: { _id: 1 } }, { $out: "txns.prunetemp" } ], cursor: { batchSize: 1616 }, allowDiskUse: true } keyUpdates:0 writeConflicts:0 numYields:241900 reslen:115 locks:{ Global: { acquireCount: { r: 961941, w: 475690, W: 1 } }, Database: { acquireCount: { r: 243124, w: 475688, R: 1, W: 2 } }, Collection: { acquireCount: { r: 243124, w: 40 } }, Metadata: { acquireCount: { w: 30440288 } }, oplog: { acquireCount: { w: 475648 } } } protocol:op_query 488082ms

Hi,

This is happening on a 2.3.7 controller with ~30M docs in juju.txns.

juju logs the following :

$ sudo egrep 'txnpruner|prune.go|juju.txn' /var/log/juju/machine-?.log|tail
2018-06-27 09:45:34 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:33651->A.B.C.2:37017: i/o timeout
2018-06-27 10:45:37 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30356480, pruning: too many new transactions
2018-06-27 10:46:37 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:36240->A.B.C.2:37017: i/o timeout
2018-06-27 10:46:37 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:36240->A.B.C.2:37017: i/o timeout
2018-06-27 11:46:40 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30438897, pruning: too many new transactions
2018-06-27 11:47:40 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:41384->A.B.C.2:37017: i/o timeout
2018-06-27 11:47:40 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:41384->A.B.C.2:37017: i/o timeout
2018-06-27 12:47:43 INFO juju.txn prune.go:127 txns after last prune: 80758, txns now: 30521252, pruning: too many new transactions
2018-06-27 12:48:43 WARNING juju.txn oracle.go:220 cleanup of "txns.prunetemp" failed: read tcp A.B.C.4:38816->A.B.C.2:37017: i/o timeout
2018-06-27 12:48:43 ERROR juju.worker.dependency engine.go:551 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp A.B.C.4:38816->A.B.C.2:37017: i/o timeout

It would appear that juju is timing out after 60s, but the aggregation takes 488s :

Jun 27 12:55:51 juju-foo-0 mongod.37017[2055]: [conn144066] command juju.txns command: aggregate { aggregate: "txns", pipeline: [ { $match: { _id: { $lt: ObjectId('5b33795f0000000000000000') }, s: { $gte: 5 } } }, { $project: { _id: 1 } }, { $out: "txns.prunetemp" } ], cursor: { batchSize: 1616 }, allowDiskUse: true } keyUpdates:0 writeConflicts:0 numYields:241900 reslen:115 locks:{ Global: { acquireCount: { r: 961941, w: 475690, W: 1 } }, Database: { acquireCount: { r: 243124, w: 475688, R: 1, W: 2 } }, Collection: { acquireCount: { r: 243124, w: 40 } }, Metadata: { acquireCount: { w: 30440288 } }, oplog: { acquireCount: { w: 475648 } } } protocol:op_query 488082ms

Changed in juju:
status:	Incomplete → New

Revision history for this message

Junien F (axino) wrote on 2018-06-27:

#10

Note that 99.999% of these 30M txns were in state 6. mgopurge -stages prune was able to get rid of them and get the controller to an OK state.

Revision history for this message

Tim Penhey (thumper) wrote on 2018-07-10:

#11

We discussed changing the code so it doesn't try to materialize a list of all the transactions to prune, but instead gather about a million at a time max.

Changed in juju:
status:	New → Triaged
importance:	Undecided → High
tags:	added: mongodb pruning
Changed in juju:
assignee:	nobody → John A Meinel (jameinel)
milestone:	none → 2.4.2

Revision history for this message

John A Meinel (jameinel) wrote on 2018-07-10:

#12

https://github.com/juju/txn/pull/42

Revision history for this message

John A Meinel (jameinel) wrote on 2018-07-10:

#13

https://github.com/juju/mgopurge/pull/25

Changed in juju:
status:	Triaged → In Progress

Revision history for this message

John A Meinel (jameinel) wrote on 2018-07-10:

#14

if we want to incorporate this into Juju code, we'll also need a small patch. CleanAndPrune doesn't default to a per-pass batch size and auto retry. We *could* put that into juju/txn, but that kind of policy is something that I prefer to pull out into calling code.

John A Meinel (jameinel) on 2018-07-10

Changed in juju:
milestone:	2.4.2 → 2.5-beta1

Revision history for this message

Paul Collins (pjdc) wrote on 2018-07-23:

#15

We just had another occurrence of this on our largest internal production controller.

We received alerts for high load on a unit that turned out to be the mongodb primary. We ran "mgopurge -stages resume,prune", which eventually removed approximately 27.8 million completed txns.

However, mgopurge crashed twice with "ERROR failed stage prune: Cursor not found, cursor id: ${bignum}" before a final run completed cleanly.

Revision history for this message

John A Meinel (jameinel) wrote on 2018-08-13:

#16

It seems cursors default to having a 10 minute lifetime, though it seems to
be driver dependent. Both Java and Pymongo expose some sort of
"no_cursor_timeout".
Or:
https://docs.mongodb.com/manual/reference/method/cursor.noCursorTimeout/#cursor.noCursorTimeout

I wouldn't think we have long periods of inactivity, but maybe it is
possible that after creating the temp table, we are busy thinking about
other things, and don't go back to hit one of our other queries very often?

Batching into smaller groups to run at a time probably works around this as
well.

On Mon, Jul 23, 2018 at 8:19 AM, Paul Collins <email address hidden>
wrote:

> We just had another occurrence of this on our largest internal
> production controller.
>
> We received alerts for high load on a unit that turned out to be the
> mongodb primary. We ran "mgopurge -stages resume,prune", which
> eventually removed approximately 27.8 million completed txns.
>
> However, mgopurge crashed twice with "ERROR failed stage prune: Cursor
> not found, cursor id: ${bignum}" before a final run completed cleanly.
>
> --
> You received this bug notification because you are a bug assignee.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1771906
>
> Title:
> txnpruner failing to prune on large controller
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1771906/+subscriptions
>

Revision history for this message

John A Meinel (jameinel) wrote on 2018-08-20:

#17

https://github.com/juju/juju/pull/9083 is targeting 2.3 which

1) Defaults to a maximum-per-pass of 1M transactions, and 100 passes.
2) Allows 'juju controller-config' in case we find those defaults are too generous or too small.

John A Meinel (jameinel) on 2018-08-23

description:

updated

John A Meinel (jameinel) on 2018-11-13

Changed in juju:
status:	In Progress → Fix Committed

Revision history for this message

Junien F (axino) wrote on 2018-11-19:

#18

Hi,

For what it's worth we're still seeing problems on a 2.4.5 controller :

$ sudo grep juju.txn /var/log/juju/machine-0.log
2018-11-19 13:35:27 INFO juju.txn prune.go:131 txns after last prune: 204355, txns now: 9511023, pruning: too many new transactions
2018-11-19 13:36:27 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 1.2.3.109:45032->1.2.3.111:37017: i/o timeout
2018-11-19 14:36:30 INFO juju.txn prune.go:131 txns after last prune: 204355, txns now: 9718097, pruning: too many new transactions
2018-11-19 14:37:30 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 1.2.3.109:54136->1.2.3.111:37017: i/o timeout
2018-11-19 15:37:33 INFO juju.txn prune.go:131 txns after last prune: 204355, txns now: 9925300, pruning: too many new transactions
2018-11-19 15:38:33 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 1.2.3.109:45780->1.2.3.111:37017: i/o timeout

Revision history for this message

John A Meinel (jameinel) wrote on 2018-12-20:

#19

New code for how we prune transactions to start removing txns much sooner is landing here:
https://github.com/juju/juju/pull/9594

Anastasia (anastasia-macmood) on 2019-03-22

Changed in juju:
status:	Fix Committed → Fix Released

Canonical Juju

txnpruner failing to prune on large controller

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches

	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	High	John A Meinel	Canonical Juju 2.5-beta1
2.3	Fix Released	High	John A Meinel	Canonical Juju 2.3.9
2.4	Fix Released	High	John A Meinel	Canonical Juju 2.4.2