Canonical Juju

status-history-pruner fails under load

Bug #1696509 reported by John A Meinel on 2017-06-07

This bug affects 1 person

Affects         Status        Importance  Assigned to         Milestone
Canonical Juju  Fix Released  High        Christian Muirhead  Canonical Juju 2.2.2

Bug Description

2017-06-07 17:23:42 ERROR juju.worker.dependency engine.go:547 "status-history-pruner" manifold worker returned unexpected error: read tcp 127.0.0.1:52104->127.0.0.1:37017: i/o timeout

This seemed to be accompanied by a mongotop entry of:
ns total read write 2017-06-07T17:41:36Z
juju.statuseshistory 10697138ms 0ms 10697138ms

Note also, because of bugs like https://bugs.launchpad.net/juju/+bug/1696491 we end up with extra data in statuseshistory. (unit.Destroy() fails to remove statuseshistory because it is looking up docs to delete by the wrong key name.)

We need to make sure that our various pruning tasks aren't trying to take too large a bite at a time, and can make forward progress when it's most important.

John A Meinel (jameinel) wrote on 2017-06-07:

It may be that we're trying to delete too much stuff at one time.
IIRC from txn pruning, it can take ~1min to delete 100,000 (200k?) txn records from juju.txns. On the current database, they are dealing with 18M transactions.
I tried to run a query to delete 17M of them.
At 100k/min that would take about 2.8 hours (170 minutes) to complete.
Maybe we just need to break it up into smaller chunks at a time?
In the txn pruner we went with something like 1000 records at a time, and then dump how far we've gotten every 15s.
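
The batching idea above can be sketched roughly as follows. This is a minimal illustration, not the Juju implementation: `prune_in_batches`, `find_ids`, `delete_ids` and `estimated_minutes` are hypothetical names, and the two callbacks stand in for whatever database queries a real pruner would issue. The batch size of 1000 and the 15s progress interval are taken from the comment above.

```python
import time
from typing import Callable, List


def prune_in_batches(
    find_ids: Callable[[int], List[int]],
    delete_ids: Callable[[List[int]], int],
    batch_size: int = 1000,
    log_interval: float = 15.0,
    log: Callable[[str], None] = print,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Delete matching records in small batches, logging progress periodically.

    find_ids(n) returns up to n ids that are still eligible for pruning.
    delete_ids(ids) removes them and returns how many were deleted.
    Both are hypothetical callbacks standing in for real database calls.
    """
    total = 0
    last_log = clock()
    while True:
        ids = find_ids(batch_size)
        if not ids:
            break
        deleted = delete_ids(ids)
        if deleted == 0:
            # Nothing was removed; stop rather than loop forever.
            break
        total += deleted
        now = clock()
        if now - last_log >= log_interval:
            log(f"pruned {total} records so far")
            last_log = now
    log(f"pruning complete: {total} records removed")
    return total


def estimated_minutes(records: int, records_per_minute: int) -> float:
    """Rough duration of a delete at a given sustained rate."""
    return records / records_per_minute
```

With the figures quoted above, `estimated_minutes(17_000_000, 100_000)` gives 170 minutes, which matches the ~2.8 hour estimate. Each batch is a short operation, so a timeout or restart loses at most one batch of work instead of the whole delete.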

Christian Muirhead (2-xtian) on 2017-06-08

Changed in juju:
assignee:	nobody → Christian Muirhead (2-xtian)
status:	Triaged → In Progress

John A Meinel (jameinel) wrote on 2017-06-08: Re: [Bug 1696509] Re: status-history-pruner fails under load

In txnpruner we have to track all the IDs that match. I don't know if there
is some way to batch items to remove without having to specify all of them.

Maybe do a rough count, compare the oldest updated time and the target
updated time against the total count, and then break it up from there?

e.g. threshold = 2 days ago, oldest = 5 days ago, newest = now, total count = 1M.
1M / 5 days = 200k/day
threshold - oldest = 3 days
100k = 0.5 days
issue a delete for everything older than
oldest +0.5day ... log progress
oldest +1.0day ... log progress
oldest +1.5day ... log progress
...
threshold ... log progress

etc
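
The arithmetic above could be turned into a cutoff plan along these lines. This is only a sketch under the same assumption the example makes, namely that records are spread roughly evenly between the oldest and newest timestamps. `plan_cutoffs` and its parameters are hypothetical names, not anything from Juju. It divides the window between oldest and threshold into equal steps, each expected to cover at most about `batch_target` records.

```python
from datetime import datetime, timedelta
from typing import List


def plan_cutoffs(
    oldest: datetime,
    newest: datetime,
    threshold: datetime,
    total_count: int,
    batch_target: int = 100_000,
) -> List[datetime]:
    """Return increasing cutoff times, the last one equal to threshold.

    A caller would issue one "delete everything older than cutoff"
    per entry and log progress after each.
    """
    if threshold <= oldest or total_count <= 0:
        return []  # nothing is old enough to prune

    one_us = timedelta(microseconds=1)
    span_us = (newest - oldest) // one_us        # whole history, in microseconds
    prune_us = (threshold - oldest) // one_us    # part of it to be pruned
    if span_us <= 0:
        return [threshold]

    # Expected records to prune = total_count * prune_us / span_us.
    # Steps = ceil(expected / batch_target), done in integer arithmetic
    # to avoid floating-point rounding at exact boundaries.
    numerator = total_count * prune_us
    denominator = span_us * batch_target
    steps = max(1, -(-numerator // denominator))

    window = threshold - oldest
    return [oldest + window * i / steps for i in range(1, steps + 1)]
```

For the numbers in the example (1M records over 5 days, threshold 2 days ago, 100k per delete) this yields six cutoffs spaced half a day apart, ending at the threshold. If the data is heavily skewed in time, individual steps may cover far more or fewer records than the target.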

John
=:->

Christian Reis (kiko) on 2017-06-13

tags:

added: adrastea

Tim Penhey (thumper) on 2017-06-22

Changed in juju:
milestone:	2.2.1 → 2.2.2

Christian Muirhead (2-xtian) wrote on 2017-07-07:

PR here: https://github.com/juju/juju/pull/7501

Ian Booth (wallyworld) on 2017-07-10

Changed in juju:
status:	In Progress → Fix Committed

Canonical Juju QA Bot (juju-qa-bot) on 2017-07-13

Changed in juju:
status:	Fix Committed → Fix Released
