Activity log for bug #1771906

Date Who What changed Old value New value Message
2018-05-18 01:07:14 Paul Collins bug added bug
2018-05-18 01:08:12 Paul Collins bug added subscriber The Canonical Sysadmins
2018-05-18 04:47:24 Paul Collins juju: status New Incomplete
2018-06-27 13:27:24 Junien F juju: status Incomplete New
2018-07-10 03:48:28 Tim Penhey juju: status New Triaged
2018-07-10 03:48:31 Tim Penhey juju: importance Undecided High
2018-07-10 03:49:01 Tim Penhey tags mongodb pruning
2018-07-10 03:49:34 Tim Penhey juju: assignee John A Meinel (jameinel)
2018-07-10 03:50:17 Tim Penhey juju: milestone 2.4.2
2018-07-10 10:02:25 John A Meinel juju: status Triaged In Progress
2018-07-10 10:58:27 John A Meinel nominated for series juju/2.4
2018-07-10 10:58:27 John A Meinel bug task added juju/2.4
2018-07-10 10:58:32 John A Meinel juju/2.4: milestone 2.4.2
2018-07-10 10:58:34 John A Meinel juju/2.4: assignee John A Meinel (jameinel)
2018-07-10 10:58:35 John A Meinel juju/2.4: importance Undecided High
2018-07-10 10:58:37 John A Meinel juju: milestone 2.4.2 2.5-beta1
2018-07-10 10:58:40 John A Meinel juju/2.4: status New In Progress
2018-08-20 12:37:20 John A Meinel nominated for series juju/2.3
2018-08-20 12:37:20 John A Meinel bug task added juju/2.3
2018-08-20 12:37:26 John A Meinel juju/2.3: status New In Progress
2018-08-20 12:37:31 John A Meinel juju/2.3: importance Undecided High
2018-08-20 12:37:33 John A Meinel juju/2.3: assignee John A Meinel (jameinel)
2018-08-20 12:37:35 John A Meinel juju/2.3: milestone 2.3.9
2018-08-20 18:46:55 Richard Harding juju/2.4: milestone 2.4.2 2.4.3
2018-08-23 11:26:42 John A Meinel juju/2.3: status In Progress Fix Committed
2018-08-23 11:26:44 John A Meinel juju/2.4: milestone 2.4.3 2.4.2
2018-08-23 11:26:48 John A Meinel juju/2.4: status In Progress Fix Committed
2018-08-23 11:38:11 John A Meinel description We've observed that txns are building up on our production Juju 2 controller. Yesterday I completed an mgopurge of 56M txns, and rebuilt all of the mongodb replics to compact the database from 38G to 24G, which took about five hours in all. This controller is running 2.2.9, although since there are no significant pruning changes in 2.3.x (or so I'm told) I'm filing this bug now. After about 13 hours of uptime, since approx 2018-05-17 07:44, we were up to 380,000 txns. I checked the controller agents logs and the txnpruner logged as follows. I take the "failed to retrieve final txns count" message to mean that even after one hour of uptime this controller is generating more transactions than the pruner can cope with. juju:PRIMARY> db.txns.count() 388057 juju:PRIMARY> bye ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ date -u Thu May 17 20:41:58 UTC 2018 ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id) STARTED CMD Thu May 17 07:44:03 2018 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50 [...] 2018-05-17 07:44:24 INFO juju.worker.singular singular.go:104 standby "txnpruner" 2018-05-17 07:44:24 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF 2018-05-17 09:44:03 ERROR juju.worker runner.go:381 fatal "txnpruner": EOF 2018-05-17 10:01:40 INFO juju.worker.singular singular.go:104 standby "txnpruner" 2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner" ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id) STARTED CMD Thu May 17 07:44:02 2018 /var/lib/juju/tools/machine-1/jujud machine --data-dir /var/lib/juju --machine-id 1 --debug ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50 [...] 2018-05-17 07:44:23 INFO juju.worker.singular singular.go:104 standby "txnpruner" 2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil> ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id) STARTED CMD ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50 ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo zgrep txnprune /var/log/juju/machine-2-2018-05-17T17-15-21.020.log.gz | tail -n 50 2018-04-22 12:01:52 INFO juju.worker.singular singular.go:101 starting "txnpruner" 2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner" 2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout 2018-04-23 19:41:53 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout 2018-04-23 19:41:53 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s 2018-04-23 19:41:56 INFO juju.worker runner.go:477 start "txnpruner" 2018-04-23 20:23:13 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil> 2018-04-23 20:36:47 INFO juju.worker.singular singular.go:101 starting "txnpruner" 2018-04-23 20:36:47 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-02 08:02:40 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout 2018-05-02 08:02:42 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout 2018-05-02 08:02:42 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s 2018-05-02 08:02:45 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-02 08:47:58 INFO juju.worker.singular singular.go:104 standby "txnpruner" 2018-05-02 08:47:58 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-03 18:22:44 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF 2018-05-03 18:22:52 ERROR juju.worker runner.go:392 exited "txnpruner": EOF 2018-05-17 07:44:23 INFO juju.worker.singular singular.go:101 starting "txnpruner" 2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-17 08:45:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout 2018-05-17 08:45:52 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout 2018-05-17 08:45:52 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s 2018-05-17 08:45:55 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-17 09:44:04 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil> 2018-05-17 10:01:17 INFO juju.worker.singular singular.go:101 starting "txnpruner" 2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner" In Juju 2.3.9 and 2.4.2 we introduced two new 'juju controller-config' options: max-prune-txn-size: 1000000 max-prune-passes: 100 We default to pruning 1M transactions at a time, and at most 100 passes per attempt. That gives us up to 100M transactions that we can prune per hour. That shouldn't prevent us from cleaning up any sort of backlog, but it should allow us to make more progress with each attempt. It is expected to take longer if we aren't pruning everything in one pass, but the normal load is about 200k transactions per hour, so 1M should operate the same under normal load. The config values will allow us to change the limits on a running controller, in case these are poor defaults. Setting the max-prune-txn-size to 0 will remove the limit. A setting of max-prune-passes: 0 is just treated as a value of 1 pass. We've observed that txns are building up on our production Juju 2 controller. Yesterday I completed an mgopurge of 56M txns, and rebuilt all of the mongodb replics to compact the database from 38G to 24G, which took about five hours in all. This controller is running 2.2.9, although since there are no significant pruning changes in 2.3.x (or so I'm told) I'm filing this bug now. After about 13 hours of uptime, since approx 2018-05-17 07:44, we were up to 380,000 txns. I checked the controller agents logs and the txnpruner logged as follows. I take the "failed to retrieve final txns count" message to mean that even after one hour of uptime this controller is generating more transactions than the pruner can cope with. juju:PRIMARY> db.txns.count() 388057 juju:PRIMARY> bye ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ date -u Thu May 17 20:41:58 UTC 2018 ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)                  STARTED CMD Thu May 17 07:44:03 2018 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50 [...] 2018-05-17 07:44:24 INFO juju.worker.singular singular.go:104 standby "txnpruner" 2018-05-17 07:44:24 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF 2018-05-17 09:44:03 ERROR juju.worker runner.go:381 fatal "txnpruner": EOF 2018-05-17 10:01:40 INFO juju.worker.singular singular.go:104 standby "txnpruner" 2018-05-17 10:01:40 INFO juju.worker runner.go:477 start "txnpruner" ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)                  STARTED CMD Thu May 17 07:44:02 2018 /var/lib/juju/tools/machine-1/jujud machine --data-dir /var/lib/juju --machine-id 1 --debug ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50 [...] 2018-05-17 07:44:23 INFO juju.worker.singular singular.go:104 standby "txnpruner" 2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-17 09:44:03 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil> ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ ps -o lstart,cmd $(pgrep -f -- --machine-id)                  STARTED CMD ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo grep txnprune /var/log/juju/machine-?.log | tail -n 50 ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo zgrep txnprune /var/log/juju/machine-2-2018-05-17T17-15-21.020.log.gz | tail -n 50 2018-04-22 12:01:52 INFO juju.worker.singular singular.go:101 starting "txnpruner" 2018-04-22 12:01:52 INFO juju.worker runner.go:477 start "txnpruner" 2018-04-23 19:41:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout 2018-04-23 19:41:53 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.110:57904->10.25.2.110:37017: i/o timeout 2018-04-23 19:41:53 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s 2018-04-23 19:41:56 INFO juju.worker runner.go:477 start "txnpruner" 2018-04-23 20:23:13 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil> 2018-04-23 20:36:47 INFO juju.worker.singular singular.go:101 starting "txnpruner" 2018-04-23 20:36:47 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-02 08:02:40 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout 2018-05-02 08:02:42 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:34474->127.0.0.1:37017: i/o timeout 2018-05-02 08:02:42 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s 2018-05-02 08:02:45 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-02 08:47:58 INFO juju.worker.singular singular.go:104 standby "txnpruner" 2018-05-02 08:47:58 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-03 18:22:44 INFO juju.worker runner.go:483 stopped "txnpruner", err: EOF 2018-05-03 18:22:52 ERROR juju.worker runner.go:392 exited "txnpruner": EOF 2018-05-17 07:44:23 INFO juju.worker.singular singular.go:101 starting "txnpruner" 2018-05-17 07:44:23 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-17 08:45:52 INFO juju.worker runner.go:483 stopped "txnpruner", err: worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout 2018-05-17 08:45:52 ERROR juju.worker runner.go:392 exited "txnpruner": worker "txnpruner" exited: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 127.0.0.1:47484->127.0.0.1:37017: i/o timeout 2018-05-17 08:45:52 INFO juju.worker runner.go:467 restarting "txnpruner" in 3s 2018-05-17 08:45:55 INFO juju.worker runner.go:477 start "txnpruner" 2018-05-17 09:44:04 INFO juju.worker runner.go:483 stopped "txnpruner", err: <nil> 2018-05-17 10:01:17 INFO juju.worker.singular singular.go:101 starting "txnpruner" 2018-05-17 10:01:17 INFO juju.worker runner.go:477 start "txnpruner"
2018-08-27 16:11:43 Canonical Juju QA Bot juju/2.4: status Fix Committed Fix Released
2018-09-18 15:12:09 Canonical Juju QA Bot juju/2.3: status Fix Committed Fix Released
2018-11-13 13:36:35 John A Meinel juju: status In Progress Fix Committed
2018-11-19 16:22:16 Junien F juju/2.4: status Fix Released New
2019-03-22 01:27:16 Anastasia juju: status Fix Committed Fix Released
2019-03-22 01:28:27 Anastasia juju/2.4: status New Fix Released