txnpruner apparently not running on 2.4.3 controller

Bug #1794223 reported by Paul Collins
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Our large production Juju 2 controller has apparently been failing to prune txns. I used mgopurge to deal with them:

2018-09-25 01:56:16 DEBUG pruning completed: removed 43466583 txns
2018-09-25 01:56:18 INFO clean and prune cleaned 32412 docs in 71 collections
  removed 43466583 transactions and 11345 stash documents
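
For context, the transactions that mgopurge removed here are the completed ones in the controller's txns collection. Below is a minimal sketch of how one could count such removable transactions using mgo/txn's state codes (aborted = 5, applied = 6); the dial address, database name and lack of authentication are assumptions for illustration, not details taken from this report.

    package main

    import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        // Assumed connection details; a real controller needs the
        // credentials and TLS settings from the agent configuration.
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // mgo/txn marks finished transactions as aborted (5) or applied (6);
        // these are the documents a prune is allowed to remove.
        completed, err := session.DB("juju").C("txns").
            Find(bson.M{"s": bson.M{"$in": []int{5, 6}}}).Count()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("completed txns:", completed)
    }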

I then poked around and found that the txnpruner had stopped working a few days ago, apart from one dubiously successful run:

2018-09-18 00:25:27 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 12893141, pruning: too many new transactions
2018-09-18 00:26:27 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:46930->10.25.2.109:37017: i/o timeout
2018-09-18 00:26:27 ERROR juju.worker.dependency engine.go:587 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.111:46930->10.25.2.109:37017: i/o timeout
2018-09-18 01:26:30 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 13087808, pruning: too many new transactions
2018-09-18 01:27:30 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:43552->10.25.2.109:37017: i/o timeout
2018-09-18 01:27:30 ERROR juju.worker.dependency engine.go:587 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.111:43552->10.25.2.109:37017: i/o timeout
2018-09-18 02:27:33 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 13282793, pruning: too many new transactions
2018-09-18 02:28:33 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:57914->10.25.2.109:37017: i/o timeout
2018-09-18 02:28:33 ERROR juju.worker.dependency engine.go:587 "transaction-pruner" manifold worker returned unexpected error: pruning failed, txnpruner stopping: failed to retrieve final txns count: read tcp 10.25.2.111:57914->10.25.2.109:37017: i/o timeout
[...]
2018-09-18 23:22:23 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 17237575, pruning: too many new transactions
2018-09-18 23:23:23 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:46500->10.25.2.109:37017: i/o timeout
2018-09-19 01:22:53 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 17569913, pruning: too many new transactions
2018-09-19 01:23:53 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:34856->10.25.2.110:37017: i/o timeout
2018-09-19 02:23:56 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 17766989, pruning: too many new transactions
2018-09-19 02:31:18 INFO juju.txn prune.go:176 txn batch pruned in 7m22.09937523s. txns now: 16790490, inspected 70 collections, 59333 docs (785 cleaned)
2018-09-19 02:32:18 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:35074->10.25.2.110:37017: i/o timeout
2018-09-19 03:32:22 INFO juju.txn prune.go:131 txns after last prune: 196941, txns now: 16987610, pruning: too many new transactions
2018-09-19 03:33:22 WARNING juju.txn oracle.go:228 cleanup of "txns.prunetemp" failed: read tcp 10.25.2.111:58356->10.25.2.110:37017: i/o timeout
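
The "too many new transactions" message at prune.go:131 indicates the pruner decides whether to run by comparing the current txns count against the count left after the previous prune. A rough illustration of that kind of growth check follows; the function name, factor and floor are made up for the example and are not Juju's actual constants.

    package main

    import "fmt"

    // shouldPrune is an illustrative growth check, not Juju's actual logic:
    // prune once the txns collection has grown well beyond the count that
    // remained after the previous prune.
    func shouldPrune(lastPruneCount, currentCount int) bool {
        const growthFactor = 2.0  // assumed: prune when txns have more than doubled
        const minNewTxns = 100000 // assumed: ignore small absolute growth

        if currentCount-lastPruneCount < minNewTxns {
            return false
        }
        return float64(currentCount) >= float64(lastPruneCount)*growthFactor
    }

    func main() {
        // Numbers from the first log line above.
        fmt.Println(shouldPrune(196941, 12893141)) // true: "too many new transactions"
    }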

However, the txnpruner has not started running at all. I tried to check juju_engine_report; it hangs on machines 0 and 1, and on machine 2 it reports:

  transaction-pruner:
    error: '"is-primary-controller-flag" not set: dependency not available'
    inputs:
    - clock
    - state
    - is-primary-controller-flag
    - migration-fortress
    - migration-inactive-flag
    resource-log:
    - name: migration-inactive-flag
      type: '*engine.Flag'
    - name: migration-fortress
      type: '*fortress.Guest'
    - name: is-primary-controller-flag
      type: '*engine.Flag'
    state: stopped
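
That error is expected on a controller that is not the MongoDB primary: the transaction-pruner manifold depends on is-primary-controller-flag and only starts on the machine where that flag is set. The sketch below shows the general shape of such a flag-gated start function; the types and helper names are simplified stand-ins, not Juju's real worker/dependency API.

    package main

    import "errors"

    // errMissing stands in for the engine's "dependency not available" error
    // seen in the report above.
    var errMissing = errors.New("dependency not available")

    // Flag is a simplified stand-in for engine.Flag: a resource that is
    // either set or not set.
    type Flag interface {
        Check() bool
    }

    // startTransactionPruner refuses to start unless is-primary-controller-flag
    // is available and set, which is why the worker shows as "stopped" on
    // controllers that do not hold the MongoDB primary.
    func startTransactionPruner(getResource func(name string) (Flag, error)) error {
        flag, err := getResource("is-primary-controller-flag")
        if err != nil || !flag.Check() {
            return errMissing
        }
        // ...start the txn pruning loop here...
        return nil
    }

    func main() {}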

I also checked juju_goroutines on machines 0 and 1 for mentions of pruning; it seems that some pruners are running on machine 1, but none on machine 0:

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$ juju_goroutines | grep -i prune
Querying @jujud-machine-0 introspection socket: /debug/pprof/goroutine?debug=1
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0:~$

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ juju_goroutines | grep -i prune
Querying @jujud-machine-1 introspection socket: /debug/pprof/goroutine?debug=1
# 0x1e326dc github.com/juju/juju/worker/pruner.(*PrunerWorker).Wait+0x2c /workspace/_build/src/github.com/juju/juju/worker/pruner/worker.go:40
# 0x1e326dc github.com/juju/juju/worker/pruner.(*PrunerWorker).Wait+0x2c /workspace/_build/src/github.com/juju/juju/worker/pruner/worker.go:40
# 0x1e32958 github.com/juju/juju/worker/pruner.(*PrunerWorker).Work+0x218 /workspace/_build/src/github.com/juju/juju/worker/pruner/worker.go:75
# 0x1e33478 github.com/juju/juju/worker/actionpruner.(*Worker).loop+0x38 /workspace/_build/src/github.com/juju/juju/worker/actionpruner/worker.go:29
# 0x1e33969 github.com/juju/juju/worker/actionpruner.(*Worker).(github.com/juju/juju/worker/actionpruner.loop)-fm+0x29 /workspace/_build/src/github.com/juju/juju/worker/actionpruner/worker.go:46
# 0x1e32958 github.com/juju/juju/worker/pruner.(*PrunerWorker).Work+0x218 /workspace/_build/src/github.com/juju/juju/worker/pruner/worker.go:75
# 0x1eba0c8 github.com/juju/juju/worker/statushistorypruner.(*Worker).loop+0x38 /workspace/_build/src/github.com/juju/juju/worker/statushistorypruner/worker.go:30
# 0x1eba5b9 github.com/juju/juju/worker/statushistorypruner.(*Worker).(github.com/juju/juju/worker/statushistorypruner.loop)-fm+0x29 /workspace/_build/src/github.com/juju/juju/worker/statushistorypruner/worker.go:47
# 0xc0b5bf github.com/juju/juju/state/presence.(*Pruner).findActiveSeqs+0x8f /workspace/_build/src/github.com/juju/juju/state/presence/pruner.go:158
# 0xc0ae5a github.com/juju/juju/state/presence.(*Pruner).removeUnusedBeings+0x6a /workspace/_build/src/github.com/juju/juju/state/presence/pruner.go:103
# 0xc0b975 github.com/juju/juju/state/presence.(*Pruner).Prune+0x55 /workspace/_build/src/github.com/juju/juju/state/presence/pruner.go:187
# 0xc02833 github.com/juju/juju/state/presence.(*Watcher).checkShouldPrune+0x203 /workspace/_build/src/github.com/juju/juju/state/presence/presence.go:378
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$
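
For reference, juju_goroutines simply queries the agent's introspection socket shown in the output above. A rough Go equivalent of `juju_goroutines | grep -i prune` follows; the abstract socket name comes from the output, while everything else (including the assumption that the socket speaks plain HTTP) is illustrative.

    package main

    import (
        "bufio"
        "context"
        "fmt"
        "log"
        "net"
        "net/http"
        "strings"
    )

    func main() {
        // Dial the agent's abstract unix-domain introspection socket; the
        // leading "@" is how Go's net package addresses the abstract namespace.
        client := &http.Client{
            Transport: &http.Transport{
                DialContext: func(_ context.Context, _, _ string) (net.Conn, error) {
                    return net.Dial("unix", "@jujud-machine-0")
                },
            },
        }

        resp, err := client.Get("http://localhost/debug/pprof/goroutine?debug=1")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Keep only the stack lines that mention pruning, like `grep -i prune`.
        scanner := bufio.NewScanner(resp.Body)
        for scanner.Scan() {
            if line := scanner.Text(); strings.Contains(strings.ToLower(line), "prune") {
                fmt.Println(line)
            }
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }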

Machine agent uptimes:
root 26734 53.3 11.0 3142772 1811352 ? Sl Sep21 2508:13 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
root 9935 170 70.2 13919392 11539156 ? Rl Sep18 15283:34 /var/lib/juju/tools/machine-1/jujud machine --data-dir /var/lib/juju --machine-id 1 --debug
root 17006 74.3 26.3 8675108 4322348 ? Sl Sep18 7375:51 /var/lib/juju/tools/machine-2/jujud machine --data-dir /var/lib/juju --machine-id 2 --debug

Richard Harding (rharding) wrote :

Thanks. The pruners are controlled as part of the workers running on the controllers. Have you been able to restart the juju agent on controller 0 to see if the pruners fire up? I assume something went wrong that caused them to not be available.

Changed in juju:
status: New → Incomplete
Paul Collins (pjdc) wrote :

The controllers have all been restarted in the meantime, and although transaction-pruner is running, it is failing to prune due to MongoDB timeouts. But that's a separate bug. I don't think we can do much more with this.

Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
John A Meinel (jameinel)
Changed in juju:
status: Expired → Triaged
importance: Undecided → High
John A Meinel (jameinel) wrote :

I do believe they are still being affected by this with 2.4.5. We still haven't worked out exactly why it is failing to keep up.
