please record concurrent query count on servers

Bug #651766 reported by Robert Collins on 2010-09-30

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Launchpad itself	Triaged	High	Unassigned

Bug Description

We routinely see OOPSes where a query appears to take 1000 times longer than normal.

This (for read queries) may be caused by a few things:
1 - MVCC data structure contention.
2 - queuing for a timeslice on the DB.
3 - queuing for a timeslice on the appserver, with the wait misallocated to the query.
4 - network glitches.

We're working on 3 by testing single-threaded appservers.

1 seems to be particularly hard to debug.

For 2, a reasonable data source would be samples of the concurrent queries being processed by the DB server, taken at a frequency of 2*t, where t is the frequency of the smallest blowout we're interested in. 0.5 times a second seems a reasonable t, so if we can sample the concurrent query count once a second (and record the time the result arrives, because the sampling query could itself get queued), we should be able to detect whether we're running into queuing storms from time to time.
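A minimal sketch of such a sampler, in Python. Everything named here is hypothetical: `count_fn` stands in for whatever call returns the current concurrent query count (on PostgreSQL this could, for instance, be a count over the `pg_stat_activity` view), and `Sample` and `sample_concurrent_queries` are illustrative names only. It records both the request time and the arrival time of each sample, so a queued sampling query shows up as a gap between the two.

```python
import time
from typing import Callable, List, NamedTuple


class Sample(NamedTuple):
    requested_at: float  # when the sample was requested
    arrived_at: float    # when the result came back
    count: int           # concurrent queries reported by the server


def sample_concurrent_queries(
    count_fn: Callable[[], int],          # hypothetical: returns the current count
    interval: float = 1.0,                # seconds between samples (2*t with t = 0.5/s)
    samples: int = 60,
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Collect `samples` readings, aiming for one every `interval` seconds."""
    results = []
    next_due = clock()
    for _ in range(samples):
        requested_at = clock()
        count = count_fn()
        arrived_at = clock()
        results.append(Sample(requested_at, arrived_at, count))
        # Schedule against a fixed timeline so a slow sample does not
        # push every later sample back by the same amount.
        next_due += interval
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)
    return results
```

A large `arrived_at - requested_at` on a sample is itself a hint that the server was queuing at that moment.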

This will need to run on all DB servers.

I think this is a fairly crucial bit of data gathering; it would be great to have some key stats from it in the dbr (per server) too:
- max, avg, stddev, 99th percentile.
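As a rough illustration of those per-server stats, the sketch below summarizes a list of sampled counts. The function name `summarize` and the result keys are hypothetical, and two choices are assumptions rather than anything specified here: the standard deviation is the population form, and the 99th percentile uses the nearest-rank method.

```python
import statistics
from typing import Dict, Sequence


def summarize(counts: Sequence[int]) -> Dict[str, float]:
    """Return max, avg, stddev and 99th percentile of the sampled counts."""
    if not counts:
        raise ValueError("no samples to summarize")
    ordered = sorted(counts)
    n = len(ordered)
    # Nearest-rank percentile: rank = ceil(0.99 * n), computed with
    # integer arithmetic to avoid floating-point rounding surprises.
    rank = (99 * n + 99) // 100
    return {
        "max": ordered[-1],
        "avg": statistics.fmean(ordered),
        "stddev": statistics.pstdev(ordered),  # population standard deviation
        "p99": ordered[rank - 1],
    }
```

With only a few samples the 99th percentile collapses to the maximum, so it only becomes informative once a server has accumulated a few hundred readings.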

Gary Poster (gary) on 2010-10-05

tags:	added: oops-infrastructure
Changed in launchpad-foundations:
status:	New → Triaged
importance:	Undecided → High
