txnpruner failing to prune on large controller
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
High
|
John A Meinel | ||
2.3 |
Fix Released
|
High
|
John A Meinel | ||
2.4 |
Fix Released
|
High
|
John A Meinel |
Bug Description
In Juju 2.3.9 and 2.4.2 we introduced two new 'juju controller-config' options:
max-prune-
max-prune-passes: 100
We default to pruning 1M transactions at a time, and at most 100 passes per attempt. That gives us up to 100M transactions that we can prune per hour. That shouldn't prevent us from cleaning up any sort of backlog, but it should allow us to make more progress with each attempt. It is expected to take longer if we aren't pruning everything in one pass, but the normal load is about 200k transactions per hour, so 1M should operate the same under normal load. The config values will allow us to change the limits on a running controller, in case these are poor defaults.
Setting the max-prune-txn-size to 0 will remove the limit. A setting of max-prune-passes: 0 is just treated as a value of 1 pass.
We've observed that txns are building up on our production Juju 2 controller.
Yesterday I completed an mgopurge of 56M txns, and rebuilt all of the mongodb replics to compact the database from 38G to 24G, which took about five hours in all.
This controller is running 2.2.9, although since there are no significant pruning changes in 2.3.x (or so I'm told) I'm filing this bug now.
After about 13 hours of uptime, since approx 2018-05-17 07:44, we were up to 380,000 txns. I checked the controller agents logs and the txnpruner logged as follows.
I take the "failed to retrieve final txns count" message to mean that even after one hour of uptime this controller is generating more transactions than the pruner can cope with.
juju:PRIMARY> db.txns.count()
388057
juju:PRIMARY>
bye
ubuntu@
Thu May 17 20:41:58 UTC 2018
ubuntu@
Thu May 17 07:44:03 2018 /var/lib/
ubuntu@
[...]
2018-05-17 07:44:24 INFO juju.worker.
2018-05-17 07:44:24 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF
2018-05-17 09:44:03 ERROR juju.worker runner.go:381 fatal "txnpruner": EOF
2018-05-17 10:01:40 INFO juju.worker.
2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner"
ubuntu@
ubuntu@
Thu May 17 07:44:02 2018 /var/lib/
ubuntu@
[...]
2018-05-17 07:44:23 INFO juju.worker.
2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
ubuntu@
ubuntu@
ubuntu@
ubuntu@
2018-04-22 12:01:52 INFO juju.worker.
2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 10.25.2.
2018-04-23 19:41:53 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.
2018-04-23 19:41:53 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-04-23 19:41:56 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 20:23:13 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
2018-04-23 20:36:47 INFO juju.worker.
2018-04-23 20:36:47 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-02 08:02:40 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 127.0.0.
2018-05-02 08:02:42 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.
2018-05-02 08:02:42 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-05-02 08:02:45 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-02 08:47:58 INFO juju.worker.
2018-05-02 08:47:58 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-03 18:22:44 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF
2018-05-03 18:22:52 ERROR juju.worker runner.go:392 exited "txnpruner": EOF
2018-05-17 07:44:23 INFO juju.worker.
2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 08:45:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 127.0.0.
2018-05-17 08:45:52 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.
2018-05-17 08:45:52 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-05-17 08:45:55 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:04 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
2018-05-17 10:01:17 INFO juju.worker.
2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner"
Changed in juju: | |
milestone: | 2.4.2 → 2.5-beta1 |
description: | updated |
Changed in juju: | |
status: | In Progress → Fix Committed |
Changed in juju: | |
status: | Fix Committed → Fix Released |
It seems txnpruner started up again later, although it's not clear to me how to tell if it's active:
(machine 0)
2018-05-17 10:01:40 INFO juju.worker. singular singular.go:104 standby "txnpruner"
2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner"
(machine 1)
2018-05-18 03:28:09 INFO juju.worker. singular singular.go:104 standby "txnpruner"
2018-05-18 03:28:09 INFO juju.worker runner.go:477 start "txnpruner"
(machine 2)
2018-05-17 10:01:17 INFO juju.worker. singular singular.go:101 starting "txnpruner"
2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner"