please record concurrent query count on servers
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad itself | Triaged | High | Unassigned |
Bug Description
We routinely see OOPSes where a query appears to take 1000 times longer than normal.
For read queries, this may be caused by a few things:
1 - mvcc data structure contention
2 - queuing for a timeslice on the DB
3 - time spent queuing for a timeslice on the appserver being misattributed to the query
4 - network glitches.
We're working on 3 by testing single-threaded appservers.
1 seems to be particularly hard to debug.
For 2, a reasonable data source would be samples of the number of concurrent queries being processed by the DB server, taken at a frequency of 2*t, where t is the frequency corresponding to the smallest blowout we're interested in. 0.5 times a second (that is, blowouts of about 2 seconds) seems a reasonable t, so if we can sample the concurrent query count once a second (and record the time the result arrives, because the sampling query could itself get queued), we should be able to detect whether we're running into queuing storms from time to time.
This will need to run on all DB servers.
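A minimal sketch of the sampler described above, in Python. The names `Sample`, `take_samples` and `count_queries` are hypothetical; `count_queries` stands in for whatever probe returns the number of concurrent queries on one DB server (for example a count over PostgreSQL's `pg_stat_activity` view, which is an assumption about the deployment and not something this report specifies). Both the send time and the arrival time are recorded, so a probe that was itself queued shows up as a large gap between the two.

```python
import time
from typing import Callable, List, NamedTuple


class Sample(NamedTuple):
    sent_at: float     # when the probe was issued
    arrived_at: float  # when its result came back
    count: int         # concurrent queries reported by the server


def take_samples(
    count_queries: Callable[[], int],  # hypothetical probe, supplied by the caller
    n_samples: int,
    interval: float = 1.0,             # once a second, as proposed above
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Call count_queries() every `interval` seconds, n_samples times."""
    samples = []
    next_due = clock()
    for _ in range(n_samples):
        sent_at = clock()
        count = count_queries()
        arrived_at = clock()
        samples.append(Sample(sent_at, arrived_at, count))
        # Schedule against a fixed grid so a slow probe does not
        # push every later sample back by the same amount.
        next_due += interval
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)
    return samples
```

The clock and sleep functions are parameters only so the loop can be exercised without waiting in real time; one such loop would run against each DB server.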
I think this is a fairly crucial bit of data gathering; would be great to have some key stats from it in the dbr (per server) too:
- max, avg, stddev, 99th percentile.
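The per-server summary could be computed along these lines. This is a sketch only: `summarize` is a hypothetical name, the population standard deviation and the nearest-rank definition of the 99th percentile are choices made here for illustration, and nothing below reflects how the dbr is actually generated.

```python
import statistics
from typing import Dict, Sequence


def summarize(counts: Sequence[int]) -> Dict[str, float]:
    """Return max, avg, stddev and 99th percentile of sampled query counts."""
    if not counts:
        raise ValueError("no samples to summarize")
    ordered = sorted(counts)
    n = len(ordered)
    # Nearest-rank percentile: the smallest sample with at least 99% of
    # all samples at or below it. Integer ceiling avoids float rounding.
    rank = -(-99 * n // 100)
    return {
        "max": ordered[-1],
        "avg": statistics.fmean(ordered),
        "stddev": statistics.pstdev(ordered),  # population stddev (assumption)
        "p99": ordered[rank - 1],
    }
```

With one sample per second, a day of data is 86,400 counts per server, so sorting the whole list for the percentile is cheap enough for a daily report.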
tags: added: oops-infrastructure
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High