2018-08-23 11:38:11 |
John A Meinel |
description |
We've observed that txns are building up on our production Juju 2 controller.
Yesterday I completed an mgopurge of 56M txns, and rebuilt all of the mongodb replics to compact the database from 38G to 24G, which took about five hours in all.
This controller is running 2.2.9, although since there are no significant pruning changes in 2.3.x (or so I'm told) I'm filing this bug now.
After about 13 hours of uptime, since approx 2018-05-17 07:44, we were up to 380,000 txns. I checked the controller agents logs and the txnpruner logged as follows.
I take the "failed to retrieve final txns count" message to mean that even after one hour of uptime this controller is generating more transactions than the pruner can cope with.
juju:PRIMARY> db.txns.count()
388057
juju:PRIMARY>
bye
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ date -u
Thu May 17 20:41:58 UTC 2018
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
STARTED CMD
Thu May 17 07:44:03 2018 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
[...]
2018-05-17 07:44:24 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 07:44:24 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF
2018-05-17 09:44:03 ERROR juju.worker runner.go:381 fatal "txnpruner": EOF
2018-05-17 10:01:40 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner"
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
STARTED CMD
Thu May 17 07:44:02 2018 /var/lib/juju/tools/machine-1/jujud machine --data-dir /var/lib/juju --machine-id 1 --debug
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
[...]
2018-05-17 07:44:23 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
STARTED CMD
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo zgrep txnprune /var/log/juju/machine-2-2018-05-17T17-15-21.020.log.gz | tail -n 50
2018-04-22 12:01:52 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout
2018-04-23 19:41:53 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout
2018-04-23 19:41:53 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-04-23 19:41:56 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 20:23:13 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
2018-04-23 20:36:47 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-04-23 20:36:47 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-02 08:02:40 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout
2018-05-02 08:02:42 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout
2018-05-02 08:02:42 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-05-02 08:02:45 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-02 08:47:58 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-02 08:47:58 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-03 18:22:44 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF
2018-05-03 18:22:52 ERROR juju.worker runner.go:392 exited "txnpruner": EOF
2018-05-17 07:44:23 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 08:45:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout
2018-05-17 08:45:52 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout
2018-05-17 08:45:52 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-05-17 08:45:55 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:04 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
2018-05-17 10:01:17 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner" |
In Juju 2.3.9 and 2.4.2 we introduced two new 'juju controller-config' options:
max-prune-txn-size: 1000000
max-prune-passes: 100
We default to pruning 1M transactions at a time, and at most 100 passes per attempt. That gives us up to 100M transactions that we can prune per hour. That shouldn't prevent us from cleaning up any sort of backlog, but it should allow us to make more progress with each attempt. It is expected to take longer if we aren't pruning everything in one pass, but the normal load is about 200k transactions per hour, so 1M should operate the same under normal load. The config values will allow us to change the limits on a running controller, in case these are poor defaults.
Setting the max-prune-txn-size to 0 will remove the limit. A setting of max-prune-passes: 0 is just treated as a value of 1 pass.
We've observed that txns are building up on our production Juju 2 controller.
Yesterday I completed an mgopurge of 56M txns, and rebuilt all of the mongodb replics to compact the database from 38G to 24G, which took about five hours in all.
This controller is running 2.2.9, although since there are no significant pruning changes in 2.3.x (or so I'm told) I'm filing this bug now.
After about 13 hours of uptime, since approx 2018-05-17 07:44, we were up to 380,000 txns. I checked the controller agents logs and the txnpruner logged as follows.
I take the "failed to retrieve final txns count" message to mean that even after one hour of uptime this controller is generating more transactions than the pruner can cope with.
juju:PRIMARY> db.txns.count()
388057
juju:PRIMARY>
bye
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ date -u
Thu May 17 20:41:58 UTC 2018
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
STARTED CMD
Thu May 17 07:44:03 2018 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
[...]
2018-05-17 07:44:24 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 07:44:24 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF
2018-05-17 09:44:03 ERROR juju.worker runner.go:381 fatal "txnpruner": EOF
2018-05-17 10:01:40 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner"
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
STARTED CMD
Thu May 17 07:44:02 2018 /var/lib/juju/tools/machine-1/jujud machine --data-dir /var/lib/juju --machine-id 1 --debug
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
[...]
2018-05-17 07:44:23 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)
STARTED CMD
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo zgrep txnprune /var/log/juju/machine-2-2018-05-17T17-15-21.020.log.gz | tail -n 50
2018-04-22 12:01:52 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout
2018-04-23 19:41:53 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout
2018-04-23 19:41:53 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-04-23 19:41:56 INFO juju.worker runner.go:477 start "txnpruner"
2018-04-23 20:23:13 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
2018-04-23 20:36:47 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-04-23 20:36:47 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-02 08:02:40 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout
2018-05-02 08:02:42 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout
2018-05-02 08:02:42 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-05-02 08:02:45 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-02 08:47:58 INFO juju.worker.singular singular.go:104 standby "txnpruner"
2018-05-02 08:47:58 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-03 18:22:44 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF
2018-05-03 18:22:52 ERROR juju.worker runner.go:392 exited "txnpruner": EOF
2018-05-17 07:44:23 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 08:45:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read
tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout
2018-05-17 08:45:52 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout
2018-05-17 08:45:52 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s
2018-05-17 08:45:55 INFO juju.worker runner.go:477 start "txnpruner"
2018-05-17 09:44:04 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil>
2018-05-17 10:01:17 INFO juju.worker.singular singular.go:101 starting "txnpruner"
2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner" |
|