branch path mapping has no safeguard against exceeding the critical threshold

Bug #433888 reported by Tom Haddon
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Robert Collins

Bug Description

Problem statement
=================
We have a nagios check as follows:

/usr/lib/nagios/plugins/check_http -H bazaar.launchpad.net -u /~launchpad-pqm/launchpad/devel/.bzr/repository/pack-names --critical=0.0125 -e " 200 OK"

This means throw a critical error if the lookup takes longer than 0.0125 seconds. We have seen intermittent failures of this check today for the last 60 minutes or so (although it appears to be okay now). Sometimes we were seeing lookups > 0.0125, and sometimes we were seeing outright timeouts (in other words, taking longer than 10 seconds).

The reason that 0.0125 is critical is that beyond that we will see a backlog queue as apache serialises all requests [based on 2009 traffic volumes].

Suggested fix
=============

Change the mapper to have a 500ms DB query timeout so that we never block for extended periods. If we had regular higher-than-norm responses, change nagios to measure 99th percentile not instantaneous duration.
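To illustrate the second half of that suggestion, here is a minimal sketch of alerting on the 99th percentile of recent lookup times rather than on a single measurement. It uses the nearest-rank definition of a percentile; the function names and the sample data are made up for illustration, and only the 0.0125 threshold comes from the check above.

```python
import math

CRITICAL_SECONDS = 0.0125  # threshold from the nagios check above


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_critical(samples, pct=99):
    """Alert only if the chosen percentile of the window exceeds the threshold."""
    return percentile(samples, pct) > CRITICAL_SECONDS


# Illustrative data: mostly fast lookups with a few slow outliers.
occasional = [0.004] * 995 + [0.3] * 5   # 0.5% slow
frequent = [0.004] * 980 + [0.3] * 20    # 2% slow

print(is_critical(occasional), is_critical(frequent))
```

With a window like this, a handful of isolated slow lookups would not page anyone, while a sustained rise in slow lookups still would.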

Test url that should rewrite - http://bazaar.qastaging.launchpad.net/~launchpad-pqm/launchpad/devel/.bzr/repository/pack-names

Tom Haddon (mthaddon) wrote :
Michael Hudson-Doyle (mwhudson) wrote :

I did some log analysis and found that about 600 of 600000 requests took longer than the magic 0.0125 value; the longest took 0.3s. Concentrating just on requests that hit the database narrows this to 587 of about 50000. There doesn't seem to be significant clumping in time of the slow requests.
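As a quick check of what those counts mean as rates (the counts are the approximate figures quoted above; the helper name is made up):

```python
def slow_fraction(slow, total):
    """Fraction of requests that exceeded the threshold."""
    return slow / total


all_requests = slow_fraction(600, 600_000)   # every request in the logs
db_requests = slow_fraction(587, 50_000)     # only requests that hit the database

print(f"{all_requests:.2%} of all requests, {db_requests:.2%} of database-backed requests")
```

So roughly one request in a thousand overall is slow, but a little over one in a hundred of the requests that actually reach the database.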

What this information doesn't record is how long the request waited before the rewrite script got to it. I don't think there's really a way we can get at this information, but one thing we can record is how long the script waits between requests: if it's spending basically no time waiting between requests, there is a good chance there's a significant backlog.
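A rough sketch of that measurement, assuming the rewrite script reads one request per line from a stream (as an Apache prg: rewrite map program does on stdin). The `serve` function and its parameters are hypothetical, not the actual Launchpad script; the clock is injectable only so the demo can use scripted readings.

```python
import io
import time


def serve(stream, handle, clock=time.monotonic):
    """Read one request per line, recording how long we idled before each arrived."""
    idle_times = []
    ready_at = clock()                  # moment we became ready for the next request
    for line in stream:                 # blocks here while no request is pending
        arrived_at = clock()
        idle_times.append(arrived_at - ready_at)
        handle(line.rstrip("\n"))
        ready_at = clock()
    return idle_times


# Demo with scripted clock readings: 2 s idle before "a", no idle before "b".
ticks = iter([0.0, 2.0, 2.5, 2.5, 3.0])
handled = []
idle = serve(io.StringIO("a\nb\n"), handled.append, clock=lambda: next(ticks))
print(idle)
```

A long run of idle times near zero would suggest requests are already queued when the script asks for the next one, which is the backlog signal described above.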

Tim Penhey (thumper) wrote :

Michael,

Do we have a plan to fix this? Or is it in the weird/too hard basket for now?

Changed in launchpad-code:
status: New → Incomplete
Michael Hudson-Doyle (mwhudson) wrote :

I think it's in the weird/hard basket :/

Tim Penhey (thumper) wrote :

Marking as low as it seems to be only very occasionally affecting us. Tom, if it becomes more of a problem please let us know.

Changed in launchpad-code:
importance: Undecided → Low
status: Incomplete → Triaged
Tom Haddon (mthaddon) wrote :

I'd really prefer we didn't mark this as "Low". If we do, that implies we don't really care about it. We do; it's just that it's hard/weird and occasional.

Tim Penhey (thumper) wrote : Re: [Bug 433888] Re: Branch lookups can take too long

On Thu, 22 Oct 2009 08:37:05 Tom Haddon wrote:
> I'd really rather prefer we didn't mark this as "Low". If we do that
> implies we don't really care about it. We do, it's just that it's
> hard/weird and occasional.
  importance medium

There you go :)

Changed in launchpad-code:
importance: Low → Medium
Michael Hudson-Doyle (mwhudson) wrote : Re: Branch lookups can take too long

I think a realistic fix is probably to move away from the :prg: rewrite map interface to one that doesn't imply serialization of the translations. The current code has a particularly horrible failure mode:

Suppose the cache is nicely full, and we have a queue of 50 requests coming in: the first is not in the cache but all the others are. For whatever reason, looking up the first branch takes 10 seconds, so the cache is now entirely stale and all the requests that could have been serviced from the cache are not. If we could continue to service requests while waiting for the slow query to complete, we would be able to go on using the cache.
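The scenario above can be put into numbers with a toy model. This is only a sketch: the 10-second lookup and the queue of 50 come from the example, while the 1 ms cost of a cache hit and the idealised "concurrent" case (no contention at all) are assumptions.

```python
def serial_finish_times(service_times):
    """Each request starts only when the previous one has finished."""
    finished, now = [], 0.0
    for cost in service_times:
        now += cost
        finished.append(now)
    return finished


def concurrent_finish_times(service_times):
    """Idealised: every request arrives at t=0 and is served independently."""
    return list(service_times)


SLOW, CACHED = 10.0, 0.001   # 10 s from the example; 1 ms per cache hit is assumed
queue = [SLOW] + [CACHED] * 49

serial = serial_finish_times(queue)
concurrent = concurrent_finish_times(queue)

print(f"serial: last request done after {serial[-1]:.3f}s")
print(f"concurrent: cached requests done after {max(concurrent[1:]):.3f}s")
```

In the serialised model every one of the 49 cacheable requests waits at least 10 seconds behind the slow one; in the idealised concurrent model they are unaffected by it.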

I don't know what would be a good tech to do this with. Twisted springs to mind: we could either proxy onto apache to serve the branch data (like we do today) or we could just serve the branch data with Twisted (we'd need a very new Twisted for this). There would be obvious performance concerns with doing this, but maybe we could experiment by directing some fraction of the requests to a new Twisted frontend and monitoring closely.

Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Steve McInerney (spm) wrote :

To update this one a bit: we've all been seeing this sporadically for quite some time now. Not just losas, but the gsas as well.
It's never consistent on day/time; we may get a couple a day, then nothing for several days.

Given the somewhat critical nature of this functionality, we're pretty reluctant to adjust the nagios check to higher levels. But the number of "false" alarms is a pain.

It's unclear what impact this has on end users - we don't get (public) complaints that I'm aware of; but...

Robert Collins (lifeless) wrote :

This is causing timeouts from time to time, so it falls under the timeout policy; updated accordingly. From my perspective, a 10 second lookup edge case is itself a problem: one way to mitigate this would be a 500ms timeout on the DB request.

tags: added: timeout
Changed in launchpad-code:
importance: Medium → High
tags: added: pg83
Changed in launchpad:
importance: High → Medium
importance: Medium → Critical
summary: - Branch lookups can take too long
+ Branch path mapping exceeding threshold regularly
tags: removed: pg83
Robert Collins (lifeless) wrote : Re: Branch path mapping exceeding threshold regularly

I agree with Michael's assessment; the serialised nature is a problem. That said, we can address the 10 second thing by using a 500ms timeout for the branch mapping task. We should be able to make that do something sensible - a url to a 500 page or something.
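One way the timeout-with-fallback idea could look, as a sketch only and not the fix that was actually landed: run the lookup in a worker thread and give up waiting after the timeout, returning a fallback URL instead. `FALLBACK_URL`, the lookup functions and the paths are all hypothetical; the demo uses a 0.1 s timeout instead of 500ms so it runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as LookupTimeout

FALLBACK_URL = "/error/500"   # hypothetical target for "a url to a 500 page"


def map_with_timeout(lookup, path, executor, timeout=0.5):
    """Return lookup(path), or FALLBACK_URL if it takes longer than `timeout` seconds."""
    future = executor.submit(lookup, path)
    try:
        return future.result(timeout=timeout)
    except LookupTimeout:
        return FALLBACK_URL


def fast_lookup(path):
    # Stand-in for a lookup answered promptly (for example from the cache).
    return "/branches/" + path


def slow_lookup(path):
    # Stand-in for a database query that takes too long.
    time.sleep(0.3)
    return "/branches/" + path


with ThreadPoolExecutor(max_workers=2) as pool:
    fast = map_with_timeout(fast_lookup, "devel", pool, timeout=0.1)
    slow = map_with_timeout(slow_lookup, "devel", pool, timeout=0.1)

print(fast, slow)
```

Note that this only stops the caller from waiting; the slow lookup itself keeps running in its thread. Enforcing the limit in the database as well (PostgreSQL has a `statement_timeout` setting that aborts statements running longer than the given time) would be needed to actually cancel the query.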

summary: - Branch path mapping exceeding threshold regularly
+ branch path mapping has no safeguard against exceeding the critical
+ threshold
description: updated
Changed in launchpad:
assignee: nobody → Robert Collins (lifeless)
Launchpad QA Bot (lpqabot) wrote :
Changed in launchpad:
milestone: none → 11.05
tags: added: qa-needstesting
Changed in launchpad:
status: Triaged → Fix Committed
description: updated
William Grant (wgrant)
tags: added: qa-ok
removed: qa-needstesting
Changed in launchpad:
status: Fix Committed → Fix Released