branch path mapping has no safeguard against exceeding the critical threshold
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Launchpad itself | Fix Released | Critical | Robert Collins | |
Bug Description
Problem statement
=================
We have a nagios check as follows:
/usr/lib/
This means: throw a critical error if the lookup takes longer than 0.0125 seconds. We have seen intermittent failures of this check today for the last 60 minutes or so (although it appears to be okay now). Sometimes we were seeing lookups > 0.0125s, and sometimes we were seeing outright timeouts (in other words, taking longer than 10 seconds).
The reason 0.0125s is the critical threshold is that beyond it we will see a backlog queue, as apache serialises all requests [based on 2009 traffic volumes].
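The check command above is truncated. Purely as illustration, a minimal sketch of the threshold logic it describes might look like the following; the host, port, and lookup protocol are hypothetical stand-ins, not the real check:

```python
#!/usr/bin/env python
# Hypothetical sketch of the nagios check described above: time one branch
# path lookup and map the result onto nagios exit codes (0 = OK,
# 2 = CRITICAL). Host, port, and wire protocol are illustrative only.
import socket
import sys
import time

CRITICAL_THRESHOLD = 0.0125  # seconds; beyond this apache starts to backlog
HARD_TIMEOUT = 10.0          # seconds; treat anything slower as a timeout

def time_lookup(host, port, path):
    """Send one lookup request and return how long the answer took."""
    start = time.time()
    sock = socket.create_connection((host, port), timeout=HARD_TIMEOUT)
    try:
        sock.sendall(path.encode('utf-8') + b'\n')
        sock.recv(4096)  # wait for the rewritten path to come back
    finally:
        sock.close()
    return time.time() - start

if __name__ == '__main__':
    try:
        duration = time_lookup('localhost', 8022, '/~user/project/branch')
    except socket.timeout:
        print('CRITICAL: lookup timed out after %.1fs' % HARD_TIMEOUT)
        sys.exit(2)
    if duration > CRITICAL_THRESHOLD:
        print('CRITICAL: lookup took %.4fs (threshold %.4fs)'
              % (duration, CRITICAL_THRESHOLD))
        sys.exit(2)
    print('OK: lookup took %.4fs' % duration)
    sys.exit(0)
```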
Suggested fix
=============
Change the mapper to have a 500ms DB query timeout so that we never block for extended periods. If we then still see regularly higher-than-normal response times, change nagios to measure the 99th percentile rather than the instantaneous duration.
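As an illustration of the suggested safeguard (a sketch, not the fix that actually landed in lib/lp/codehosting/rewrite.py), the lookup could set PostgreSQL's statement_timeout so the server aborts any branch query that runs past 500ms; the connection handling, table, and column names here are assumptions:

```python
# Sketch of the suggested safeguard: cap each branch-lookup query at 500ms
# via PostgreSQL's statement_timeout, so one slow query can never stall the
# rewrite map for an extended period. Schema names are hypothetical.
import psycopg2

QUERY_TIMEOUT_MS = 500

def lookup_branch_id(conn, branch_path):
    """Resolve a branch path to its id, or None if the query is too slow."""
    cur = conn.cursor()
    try:
        # statement_timeout makes the server abort any statement that runs
        # longer than the given number of milliseconds.
        cur.execute("SET statement_timeout = %d" % QUERY_TIMEOUT_MS)
        cur.execute(
            "SELECT id FROM branch WHERE unique_name = %s", (branch_path,))
        row = cur.fetchone()
        return row[0] if row else None
    except psycopg2.extensions.QueryCanceledError:
        # The query exceeded the timeout; fail fast rather than blocking
        # every queued rewrite request behind this one.
        conn.rollback()
        return None
    finally:
        cur.close()
```

Failing fast and returning None lets the mapper keep answering (even if with a miss) instead of holding all serialised apache requests hostage to one slow query.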
Test URL that should rewrite: http://
Related branches
- Gavin Panella (community): Approve
- Diff: 63 lines (+27/-17), 1 file modified: lib/lp/codehosting/rewrite.py (+27/-17)
tags: added: canonical-losa-lp
tags: added: pg83
Changed in launchpad:
importance: High → Medium
importance: Medium → Critical
summary: Branch lookups can take too long → Branch path mapping exceeding threshold regularly
tags: removed: pg83
summary: Branch path mapping exceeding threshold regularly → branch path mapping has no safeguard against exceeding the critical threshold
description: updated
Changed in launchpad:
assignee: nobody → Robert Collins (lifeless)
description: updated
tags: added: qa-ok; removed: qa-needstesting
Changed in launchpad:
status: Fix Committed → Fix Released
I did some log analysis and found that about 600 of 600000 requests took longer than the magic 0.0125s value; the longest took 0.3s. Restricting to requests that actually hit the database filters this down to 587 of about 50000. There doesn't seem to be significant clumping in time of the slow requests.
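Analysis along these lines can be reproduced with a short script; a sketch assuming a log with one request duration (in seconds) per line, which is a hypothetical format, not the real log:

```python
# Sketch of the log analysis above, assuming a file containing one request
# duration in seconds per line. The log format is a stand-in.
import sys

THRESHOLD = 0.0125  # the magic critical value from the nagios check

def summarise(path):
    durations = sorted(float(line) for line in open(path) if line.strip())
    slow = [d for d in durations if d > THRESHOLD]
    # Crude 99th percentile: the value 99% of the way through the sorted list.
    p99 = durations[int(len(durations) * 0.99)]
    print('%d of %d requests over %.4fs; slowest %.4fs; p99 %.4fs'
          % (len(slow), len(durations), THRESHOLD, max(durations), p99))

if __name__ == '__main__':
    summarise(sys.argv[1])
```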
What the request logs don't record, though, is how long the request waited before the rewrite script got to it. I don't think there's really a way we can get at that information, but one thing we can record is how long the script waits between requests -- if the script spends essentially zero time waiting between requests, there is a good chance there's a significant backlog.
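A minimal sketch of that heuristic, assuming the rewrite script is an apache external RewriteMap reading one path per line on stdin (the logging details are illustrative):

```python
# Sketch of the backlog heuristic suggested above: time how long readline()
# blocks waiting for the next request. If that gap is consistently near
# zero, requests are arriving faster than we answer them and a backlog is
# likely building up in apache.
import logging
import sys
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('rewrite-map')

def rewrite(path):
    """Placeholder for the real branch path lookup."""
    return path

def main():
    while True:
        wait_start = time.time()
        line = sys.stdin.readline()  # blocks until apache sends a request
        if not line:
            break  # apache closed the pipe; shut down
        idle = time.time() - wait_start
        # Near-zero idle time between requests suggests a backlog upstream.
        log.info('waited %.4fs for this request', idle)
        sys.stdout.write(rewrite(line.rstrip('\n')) + '\n')
        sys.stdout.flush()  # a RewriteMap program must answer one line at a time

if __name__ == '__main__':
    main()
```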