status-history-pruner fails under load

Bug #1696509 reported by John A Meinel
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Christian Muirhead

Bug Description

2017-06-07 17:23:42 ERROR juju.worker.dependency engine.go:547 "status-history-pruner" manifold worker returned unexpected error: read tcp 127.0.0.1:52104->127.0.0.1:37017: i/o timeout

This seemed to be accompanied by a mongotop entry of:
                      ns         total    read         write    2017-06-07T17:41:36Z
    juju.statuseshistory    10697138ms     0ms    10697138ms

Note also, because of bugs like https://bugs.launchpad.net/juju/+bug/1696491 we end up with extra data in statuseshistory. (unit.Destroy() fails to remove statuseshistory because it is looking up docs to delete by the wrong key name.)

We need to make sure that our various pruning tasks don't try to take too large a bite at a time, and that they can make forward progress when it's most important.

John A Meinel (jameinel) wrote :

It may be that we're trying to delete too much stuff at one time.
IIRC from txn pruning, it can take ~1min to delete 100,000 (200k?) txn records from juju.txns. On the current database, they are dealing with 18M transactions.
I tried to run a query to delete 17M of them.
At 100k/min that would be 2.8hrs to complete.
Maybe we just need to break it up into smaller chunks at a time?
In the txn pruner we went with something like 1000 records at a time, and then dump how far we've gotten every 15s.
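
A minimal sketch of that batching idea, assuming an in-memory dict stands in for the collection. The function name, parameters, and log messages are hypothetical; this is not the Juju or txn pruner code. It deletes at most `batch_size` matching records per pass and reports progress once `log_interval` seconds have elapsed since the last report.

```python
import time


def prune_in_batches(records, cutoff, batch_size=1000, log_interval=15.0,
                     clock=time.monotonic, log=print):
    """Delete entries of `records` (id -> updated time) older than `cutoff`.

    `records` is a plain dict standing in for a database collection
    (an assumption for illustration). Returns the number of entries deleted.
    """
    deleted = 0
    last_log = clock()
    while True:
        # Collect one bounded batch of ids that are older than the cutoff.
        batch = []
        for rec_id, updated in records.items():
            if updated < cutoff:
                batch.append(rec_id)
                if len(batch) == batch_size:
                    break
        if not batch:
            break  # nothing left to prune
        # Remove the batch only after iteration over the dict has ended.
        for rec_id in batch:
            del records[rec_id]
        deleted += len(batch)
        # Report progress at most once per log_interval seconds.
        now = clock()
        if now - last_log >= log_interval:
            log(f"pruned {deleted} records so far")
            last_log = now
    log(f"pruning finished: {deleted} records deleted")
    return deleted
```

Because each pass touches a bounded number of records, an interrupted run still leaves the earlier batches deleted, so the next run has less work to do.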

Changed in juju:
assignee: nobody → Christian Muirhead (2-xtian)
status: Triaged → In Progress
John A Meinel (jameinel) wrote : Re: [Bug 1696509] Re: status-history-pruner fails under load

In txnpruner we have to track all the IDs that match. I don't know if there
is some way to batch items to remove without having to specify all of them.

Maybe do a rough count, then compare the oldest updated time, the target
updated time, and the total count, and break it up from there?

e.g., threshold = 2 days ago, oldest = 5 days ago, newest = now, total count = 1M.
1M / 5 days = 200k/day
threshold - oldest = 3 days
100k = 0.5 days
issue a delete for everything older than
oldest +0.5day ... log progress
oldest +1.0day ... log progress
oldest +1.5day ... log progress
...
threshold ... log progress

etc
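
The arithmetic above can be sketched as follows, assuming records are spread roughly evenly between the oldest and newest timestamps. The function and parameter names are hypothetical and this is only an illustration of the estimate, not the fix that was committed. It returns the successive "delete everything older than" cutoffs, ending at the threshold.

```python
from datetime import datetime, timedelta


def prune_cutoffs(oldest, newest, threshold, total_count, batch_target=100_000):
    """Return successive cutoff times, each covering roughly `batch_target` records.

    Assumes records are evenly distributed between `oldest` and `newest`.
    """
    if total_count <= 0 or threshold <= oldest:
        return []  # nothing is old enough to prune
    # Time span expected to hold about batch_target records.
    step = (newest - oldest) * batch_target / total_count
    if step <= timedelta(0):
        return [threshold]  # degenerate span: fall back to a single delete
    cutoffs = []
    cutoff = oldest + step
    while cutoff < threshold:
        cutoffs.append(cutoff)
        cutoff += step
    cutoffs.append(threshold)  # the final delete always stops at the threshold
    return cutoffs


# The numbers from the example: 1M records over 5 days, threshold 2 days ago.
now = datetime(2017, 6, 8, 12, 0)
for c in prune_cutoffs(now - timedelta(days=5), now, now - timedelta(days=2), 1_000_000):
    print("delete everything older than", c.isoformat())
```

With these inputs the step works out to 0.5 days, matching the estimate above. If the data is heavily skewed toward one end of the time range, individual batches could be much larger or smaller than the target.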

John
=:->

On Jun 8, 2017 09:41, "Christian Muirhead" <email address hidden>
wrote:

> ** Changed in: juju
> Assignee: (unassigned) => Christian Muirhead (2-xtian)
>
> ** Changed in: juju
> Status: Triaged => In Progress

Christian Reis (kiko)
tags: added: adrastea
Tim Penhey (thumper)
Changed in juju:
milestone: 2.2.1 → 2.2.2
Christian Muirhead (2-xtian) wrote :
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released