branch path mapping has no safeguard against exceeding the critical threshold
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Launchpad itself | Fix Released | Critical | Robert Collins | |
Bug Description
Problem statement
=================
We have a nagios check as follows:
/usr/lib/
This means: throw a critical error if the lookup takes longer than 0.0125 seconds. We have seen intermittent failures of this check today for the last 60 minutes or so (although it appears to be okay now). Sometimes we were seeing lookups > 0.0125s, and sometimes we were seeing outright timeouts (in other words, taking longer than 10 seconds).
The reason 0.0125s is the critical threshold is that beyond it we will see a backlog queue, as apache serialises all requests [based on 2009 traffic volumes].
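The check command above is truncated. Purely as illustration, a minimal sketch of the threshold logic it describes might look like the following; the host, port, and lookup protocol are hypothetical stand-ins, not the real check:

```python
#!/usr/bin/env python
# Hypothetical sketch of the nagios check described above: time one branch
# path lookup and map the result onto nagios exit codes (0 = OK,
# 2 = CRITICAL). Host, port, and wire protocol are illustrative only.
import socket
import sys
import time

CRITICAL_THRESHOLD = 0.0125  # seconds; beyond this apache starts to backlog
HARD_TIMEOUT = 10.0          # seconds; treat anything slower as a timeout

def time_lookup(host, port, path):
    """Send one lookup request and return how long the answer took."""
    start = time.time()
    sock = socket.create_connection((host, port), timeout=HARD_TIMEOUT)
    try:
        sock.sendall(path.encode('utf-8') + b'\n')
        sock.recv(4096)  # wait for the rewritten path to come back
    finally:
        sock.close()
    return time.time() - start

if __name__ == '__main__':
    try:
        duration = time_lookup('localhost', 8022, '/~user/project/branch')
    except socket.timeout:
        print('CRITICAL: lookup timed out after %.1fs' % HARD_TIMEOUT)
        sys.exit(2)
    if duration > CRITICAL_THRESHOLD:
        print('CRITICAL: lookup took %.4fs (threshold %.4fs)'
              % (duration, CRITICAL_THRESHOLD))
        sys.exit(2)
    print('OK: lookup took %.4fs' % duration)
    sys.exit(0)
```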
Suggested fix
=============
Change the mapper to have a 500ms DB query timeout so that we never block for extended periods. If we then still see regularly higher-than-normal response times, change nagios to measure the 99th percentile rather than the instantaneous duration.
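As an illustration of the suggested safeguard (a sketch, not the fix that actually landed in lib/lp/codehosting/rewrite.py), the lookup could set PostgreSQL's statement_timeout so the server aborts any branch query that runs past 500ms; the connection handling, table, and column names here are assumptions:

```python
# Sketch of the suggested safeguard: cap each branch-lookup query at 500ms
# via PostgreSQL's statement_timeout, so one slow query can never stall the
# rewrite map for an extended period. Schema names are hypothetical.
import psycopg2

QUERY_TIMEOUT_MS = 500

def lookup_branch_id(conn, branch_path):
    """Resolve a branch path to its id, or None if the query is too slow."""
    cur = conn.cursor()
    try:
        # statement_timeout makes the server abort any statement that runs
        # longer than the given number of milliseconds.
        cur.execute("SET statement_timeout = %d" % QUERY_TIMEOUT_MS)
        cur.execute(
            "SELECT id FROM branch WHERE unique_name = %s", (branch_path,))
        row = cur.fetchone()
        return row[0] if row else None
    except psycopg2.extensions.QueryCanceledError:
        # The query exceeded the timeout; fail fast rather than blocking
        # every queued rewrite request behind this one.
        conn.rollback()
        return None
    finally:
        cur.close()
```

Failing fast and returning None lets the mapper keep answering (even if with a miss) instead of holding all serialised apache requests hostage to one slow query.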
Test URL that should rewrite: http://
Related branches
- Gavin Panella (community): Approve
- Diff: 63 lines (+27/-17), 1 file modified: lib/lp/codehosting/rewrite.py (+27/-17)
tags: added: canonical-losa-lp
tags: added: pg83
Changed in launchpad:
importance: High → Medium
importance: Medium → Critical
summary: Branch lookups can take too long → Branch path mapping exceeding threshold regularly
tags: removed: pg83
summary: Branch path mapping exceeding threshold regularly → branch path mapping has no safeguard against exceeding the critical threshold
description: updated
Changed in launchpad:
assignee: nobody → Robert Collins (lifeless)
description: updated
tags: added: qa-ok; removed: qa-needstesting
Changed in launchpad:
status: Fix Committed → Fix Released
I did some log analysis and found that about 600 of 600000 requests took longer than the magic 0.0125s value; the longest took 0.3s. Restricting to requests that actually hit the database filters this down to 587 of about 50000. There doesn't seem to be significant clumping in time of the slow requests.
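Analysis along these lines can be reproduced with a short script; a sketch assuming a log with one request duration (in seconds) per line, which is a hypothetical format, not the real log:

```python
# Sketch of the log analysis above, assuming a file containing one request
# duration in seconds per line. The log format is a stand-in.
import sys

THRESHOLD = 0.0125  # the magic critical value from the nagios check

def summarise(path):
    durations = sorted(float(line) for line in open(path) if line.strip())
    slow = [d for d in durations if d > THRESHOLD]
    # Crude 99th percentile: the value 99% of the way through the sorted list.
    p99 = durations[int(len(durations) * 0.99)]
    print('%d of %d requests over %.4fs; slowest %.4fs; p99 %.4fs'
          % (len(slow), len(durations), THRESHOLD, max(durations), p99))

if __name__ == '__main__':
    summarise(sys.argv[1])
```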
What the request logs don't record, though, is how long the request waited before the rewrite script got to it. I don't think there's really a way we can get at that information, but one thing we can record is how long the script waits between requests -- if the script spends essentially zero time waiting between requests, there is a good chance there's a significant backlog.
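A minimal sketch of that heuristic, assuming the rewrite script is an apache external RewriteMap reading one path per line on stdin (the logging details are illustrative):

```python
# Sketch of the backlog heuristic suggested above: time how long readline()
# blocks waiting for the next request. If that gap is consistently near
# zero, requests are arriving faster than we answer them and a backlog is
# likely building up in apache.
import logging
import sys
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('rewrite-map')

def rewrite(path):
    """Placeholder for the real branch path lookup."""
    return path

def main():
    while True:
        wait_start = time.time()
        line = sys.stdin.readline()  # blocks until apache sends a request
        if not line:
            break  # apache closed the pipe; shut down
        idle = time.time() - wait_start
        # Near-zero idle time between requests suggests a backlog upstream.
        log.info('waited %.4fs for this request', idle)
        sys.stdout.write(rewrite(line.rstrip('\n')) + '\n')
        sys.stdout.flush()  # a RewriteMap program must answer one line at a time

if __name__ == '__main__':
    main()
```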