please record concurrent query count on servers

Bug #651766 reported by Robert Collins
Affects: Launchpad itself
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone:

Bug Description

We routinely see OOPS reports where a query appears to take 1000 times longer than normal.

This (for read queries) may be caused by a few things:
1 - mvcc data structure contention
2 - queuing for a timeslice on the DB
3 - queuing for a timeslice on the appserver, with that wait being misallocated to the query.
4 - network glitches.

We're working on 3 by testing single-threaded appservers.

1 seems to be particularly hard to debug.

For 2, a reasonable data source would be samples of the number of concurrent queries being processed by the DB server, taken at a frequency of 2*t, where t is the frequency of the smallest blowout we're interested in. 0.5 times a second (that is, a blowout lasting about two seconds) seems a reasonable t, so if we can sample the concurrent query count once a second (and record the time the result arrives, because the sampling query could itself get queued), we should be able to detect whether we're running into queuing storms from time to time.
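A minimal sketch of such a sampler, in Python, might look like the following. It is only an illustration under stated assumptions: `sample_concurrency`, `Sample` and `ACTIVE_QUERY_SQL` are hypothetical names, the SQL assumes a PostgreSQL version whose `pg_stat_activity` view has `state` and `pid` columns (9.2 or later), and the caller supplies `count_active`, a function that runs that query on one DB server and returns the count.

```python
import time
from typing import Callable, List, NamedTuple

# Assumption: PostgreSQL 9.2+ (pg_stat_activity has "state" and "pid").
# The sampler's own backend is excluded so it does not count itself.
ACTIVE_QUERY_SQL = (
    "SELECT count(*) FROM pg_stat_activity "
    "WHERE state = 'active' AND pid <> pg_backend_pid()"
)


class Sample(NamedTuple):
    requested_at: float   # when the sampling query was issued
    arrived_at: float     # when its result came back
    active_queries: int   # concurrent queries reported by the server


def sample_concurrency(
    count_active: Callable[[], int],
    samples: int,
    interval: float = 1.0,
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Poll count_active() once per interval, timing each poll.

    count_active is a hypothetical caller-supplied function that would
    run ACTIVE_QUERY_SQL against one DB server and return the count.
    """
    results = []
    next_due = clock()
    for _ in range(samples):
        requested_at = clock()
        count = count_active()
        arrived_at = clock()  # a late arrival hints the sampler was queued too
        results.append(Sample(requested_at, arrived_at, count))
        # Schedule against a fixed timeline so slow polls do not add drift.
        next_due += interval
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)
    return results
```

A large gap between `requested_at` and `arrived_at` on a sample is itself a signal, since it suggests the sampling query was caught in the same queue it is trying to measure.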

This will need to run on all DB servers.

I think this is a fairly crucial bit of data gathering; it would be great to have some key stats from it in the dbr (per server) too:
 - max, avg, stddev, 99th percentile.
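
As a rough illustration of those per-server stats, the sketch below summarises a list of sampled concurrent query counts. The function name `summarize` is hypothetical, and two choices are assumptions rather than anything specified here: the standard deviation is the population form, and the 99th percentile uses the nearest-rank method.

```python
import statistics
from typing import Dict, Sequence


def summarize(counts: Sequence[int]) -> Dict[str, float]:
    """Return max, avg, stddev and 99th percentile for one server's samples."""
    if not counts:
        raise ValueError("no samples to summarize")
    ordered = sorted(counts)
    # Nearest-rank percentile: rank = ceil(0.99 * n), in integer arithmetic
    # to avoid floating-point rounding on the rank itself.
    rank = -(-99 * len(ordered) // 100)
    return {
        "max": float(ordered[-1]),
        "avg": statistics.fmean(ordered),
        "stddev": statistics.pstdev(ordered),  # population standard deviation
        "p99": float(ordered[rank - 1]),
    }
```

With one sample per second, a day of data per server is 86,400 values, so computing these in memory for a daily report should be cheap.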

Gary Poster (gary)
tags: added: oops-infrastructure
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High