txnpruner apparently not running on 2.4.3 controller

Bug #1794223 reported by Paul Collins
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Our large production Juju 2 controller has apparently been failing to prune txns. I used mgopurge to deal with them:

2018-09-25 01:56:16 DEBUG pruning completed: removed 43466583 txns
2018-09-25 01:56:18 INFO clean and prune cleaned 32412 docs in 71 collections
  removed 43466583 transactions and 11345 stash documents
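
For context, the transactions that mgopurge removed here are the completed ones in the controller's txns collection. Below is a minimal sketch of how one could count such removable transactions using mgo/txn's state codes (aborted = 5, applied = 6); the dial address, database name and lack of authentication are assumptions for illustration, not details taken from this report.

    package main

    import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        // Assumed connection details; a real controller needs the
        // credentials and TLS settings from the agent configuration.
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // mgo/txn marks finished transactions as aborted (5) or applied (6);
        // these are the documents a prune is allowed to remove.
        completed, err := session.DB("juju").C("txns").
            Find(bson.M{"s": bson.M{"$in": []int{5, 6}}}).Count()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("completed txns:", completed)
    }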

I then poked around and found that the txnpruner had stopped working a few days ago, apart from one dubiously successful run:

2018-09-18 00:25:27 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 12893141, pruning: too many new transactions
2018-09-18 00:26:27 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:46930->10.25.2.109:37017: i/o timeout
2018-09-18 00:26:27 ERROR juju.worker.dependency engine.go:587 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.111:46930->10.25.2.109:37017: i/o timeout
2018-09-18 01:26:30 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 13087808, pruning: too many new transactions
2018-09-18 01:27:30 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:43552->10.25.2.109:37017: i/o timeout
2018-09-18 01:27:30 ERROR juju.worker.dependency engine.go:587 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.111:43552->10.25.2.109:37017: i/o timeout
2018-09-18 02:27:33 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 13282793, pruning: too many new transactions
2018-09-18 02:28:33 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:57914->10.25.2.109:37017: i/o timeout
2018-09-18 02:28:33 ERROR juju.worker.dependency engine.go:587 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.111:57914->10.25.2.109:37017: i/o timeout
[...]
2018-09-18 23:22:23 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 17237575, pruning: too many new transactions
2018-09-18 23:23:23 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:46500->10.25.2.109:37017: i/o timeout
2018-09-19 01:22:53 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 17569913, pruning: too many new transactions
2018-09-19 01:23:53 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:34856->10.25.2.110:37017: i/o timeout
2018-09-19 02:23:56 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 17766989, pruning: too many new transactions
2018-09-19 02:31:18 INFO juju.txn prune.go:176 txn batch pruned in 7m22.09937523s. txns now: 16790490, inspected 70 collections, 59333 docs (785 cleaned)
2018-09-19 02:32:18 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:35074->10.25.2.110:37017: i/o timeout
2018-09-19 03:32:22 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 16987610, pruning: too many new transactions
2018-09-19 03:33:22 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:58356->10.25.2.110:37017: i/o timeout
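
The "too many new transactions" message at prune.go:131 indicates the pruner decides whether to run by comparing the current txns count against the count left after the previous prune. A rough illustration of that kind of growth check follows; the function name, factor and floor are made up for the example and are not Juju's actual constants.

    package main

    import "fmt"

    // shouldPrune is an illustrative growth check, not Juju's actual logic:
    // prune once the txns collection has grown well beyond the count that
    // remained after the previous prune.
    func shouldPrune(lastPruneCount, currentCount int) bool {
        const growthFactor = 2.0  // assumed: prune when txns have more than doubled
        const minNewTxns = 100000 // assumed: ignore small absolute growth

        if currentCount-lastPruneCount < minNewTxns {
            return false
        }
        return float64(currentCount) >= float64(lastPruneCount)*growthFactor
    }

    func main() {
        // Numbers from the first log line above.
        fmt.Println(shouldPrune(196941, 12893141)) // true: "too many new transactions"
    }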

However, the txnpruner has not started running at all. I tried to check juju_engine_report; it hangs on machines 0 and 1, and on machine 2 it reports:

  transaction-pruner:
    error: '"is-primary-controller-flag" not set: dependency not available'
    inputs:
    - clock
    - state
    - is-primary-controller-flag
    - migration-fortress
    - migration-inactive-flag
    resource-log:
    - name: migration-inactive-flag
      type: '*engine.Flag'
    - name: migration-fortress
      type: '*fortress.Guest'
    - name: is-primary-controller-flag
      type: '*engine.Flag'
    state: stopped
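
That error is expected on a controller that is not the MongoDB primary: the transaction-pruner manifold depends on is-primary-controller-flag and only starts on the machine where that flag is set. The sketch below shows the general shape of such a flag-gated start function; the types and helper names are simplified stand-ins, not Juju's real worker/dependency API.

    package main

    import "errors"

    // errMissing stands in for the engine's "dependency not available" error
    // seen in the report above.
    var errMissing = errors.New("dependency not available")

    // Flag is a simplified stand-in for engine.Flag: a resource that is
    // either set or not set.
    type Flag interface {
        Check() bool
    }

    // startTransactionPruner refuses to start unless is-primary-controller-flag
    // is available and set, which is why the worker shows as "stopped" on
    // controllers that do not hold the MongoDB primary.
    func startTransactionPruner(getResource func(name string) (Flag, error)) error {
        flag, err := getResource("is-primary-controller-flag")
        if err != nil || !flag.Check() {
            return errMissing
        }
        // ...start the txn pruning loop here...
        return nil
    }

    func main() {}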

I also checked juju_goroutines on machines 0 and 1 for mentions of pruning; it seems that some pruners are running on machine 1, but none on machine 0:

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ juju_goroutines | grep -i prune
Querying @jujud-machine-0 introspection socket: /debug/pprof/goroutine?debug=1
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ juju_goroutines | grep -i prune
Querying @jujud-machine-1 introspection socket: /debug/pprof/goroutine?debug=1
# 0x1e326dc github.com/juju/juju/worker/pruner.(*PrunerWorker).Wait+0x2c /workspace/_build/src/github.com/juju/juju/worker/pruner/worker.go:40
# 0x1e326dc github.com/juju/juju/worker/pruner.(*PrunerWorker).Wait+0x2c /workspace/_build/src/github.com/juju/juju/worker/pruner/worker.go:40
# 0x1e32958 github.com/juju/juju/worker/pruner.(*PrunerWorker).Work+0x218 /workspace/_build/src/github.com/juju/juju/worker/pruner/worker.go:75
# 0x1e33478 github.com/juju/juju/worker/actionpruner.(*Worker).loop+0x38 /workspace/_build/src/github.com/juju/juju/worker/actionpruner/worker.go:29
# 0x1e33969 github.com/juju/juju/worker/actionpruner.(*Worker).(github.com/juju/juju/worker/actionpruner.loop)-fm+0x29 /workspace/_build/src/github.com/juju/juju/worker/actionpruner/worker.go:46
# 0x1e32958 github.com/juju/juju/worker/pruner.(*PrunerWorker).Work+0x218 /workspace/_build/src/github.com/juju/juju/worker/pruner/worker.go:75
# 0x1eba0c8 github.com/juju/juju/worker/statushistorypruner.(*Worker).loop+0x38 /workspace/_build/src/github.com/juju/juju/worker/statushistorypruner/worker.go:30
# 0x1eba5b9 github.com/juju/juju/worker/statushistorypruner.(*Worker).(github.com/juju/juju/worker/statushistorypruner.loop)-fm+0x29 /workspace/_build/src/github.com/juju/juju/worker/statushistorypruner/worker.go:47
# 0xc0b5bf github.com/juju/juju/state/presence.(*Pruner).findActiveSeqs+0x8f /workspace/_build/src/github.com/juju/juju/state/presence/pruner.go:158
# 0xc0ae5a github.com/juju/juju/state/presence.(*Pruner).removeUnusedBeings+0x6a /workspace/_build/src/github.com/juju/juju/state/presence/pruner.go:103
# 0xc0b975 github.com/juju/juju/state/presence.(*Pruner).Prune+0x55 /workspace/_build/src/github.com/juju/juju/state/presence/pruner.go:187
# 0xc02833 github.com/juju/juju/state/presence.(*Watcher).checkShouldPrune+0x203 /workspace/_build/src/github.com/juju/juju/state/presence/presence.go:378
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$
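
For reference, juju_goroutines simply queries the agent's introspection socket shown in the output above. A rough Go equivalent of `juju_goroutines | grep -i prune` follows; the abstract socket name comes from the output, while everything else (including the assumption that the socket speaks plain HTTP) is illustrative.

    package main

    import (
        "bufio"
        "context"
        "fmt"
        "log"
        "net"
        "net/http"
        "strings"
    )

    func main() {
        // Dial the agent's abstract unix-domain introspection socket; the
        // leading "@" is how Go's net package addresses the abstract namespace.
        client := &http.Client{
            Transport: &http.Transport{
                DialContext: func(_ context.Context, _, _ string) (net.Conn, error) {
                    return net.Dial("unix", "@jujud-machine-0")
                },
            },
        }

        resp, err := client.Get("http://localhost/debug/pprof/goroutine?debug=1")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Keep only the stack lines that mention pruning, like `grep -i prune`.
        scanner := bufio.NewScanner(resp.Body)
        for scanner.Scan() {
            if line := scanner.Text(); strings.Contains(strings.ToLower(line), "prune") {
                fmt.Println(line)
            }
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }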

Machine agent uptimes:
root 26734 53.3 11.0 3142772 1811352 ? Sl Sep21 2508:13 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
root 9935 170 70.2 13919392 11539156 ? Rl Sep18 15283:34 /var/lib/juju/tools/machine-1/jujud machine --data-dir /var/lib/juju --machine-id 1 --debug
root 17006 74.3 26.3 8675108 4322348 ? Sl Sep18 7375:51 /var/lib/juju/tools/machine-2/jujud machine --data-dir /var/lib/juju --machine-id 2 --debug

Richard Harding (rharding) wrote :

Thanks. The pruners are controlled as part of the workers running on the controllers. Have you been able to restart the juju agent on controller 0 to see if the pruners fire up? I assume something went wrong that caused them to not be available.

Changed in juju:
status: New → Incomplete
Paul Collins (pjdc) wrote :

The controllers have all been restarted in the meantime, and although transaction-pruner is running, it is failing to prune due to MongoDB timeouts. But that's a separate bug. I don't think we can do much more with this.

Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
John A Meinel (jameinel)
Changed in juju:
status: Expired → Triaged
importance: Undecided → High
John A Meinel (jameinel) wrote :

I do believe they are still being affected by this with 2.4.5. We still haven't worked out exactly why it is failing to keep up.
