Codebrowse appears to hang because it uses too much RAM

Bug #518689 reported by Steve McInerney
This bug affects 3 people
Affects            Status   Importance  Assigned to  Milestone
Launchpad itself   Triaged  High        Unassigned
loggerhead         Triaged  High        Unassigned
loggerhead-breezy  Triaged  High        Unassigned

Bug Description

Filing as a security vulnerability, as this is currently a rather easy way to DoS codebrowse.

We had an unusual one today. A user hitting this URL:
bazaar.launchpad.net/
~ubuntu-branches/ubuntu/lucid/libgweather/lucid/annotate/head:/data/Locations.xml.in
(DON'T CLICK IT!!! It's bad!)
was able to stop codebrowse from responding to further requests. It didn't happen immediately, but built up over a few seconds.

e.g. running -debug against codebrowse after one of these events:
https://pastebin.canonical.com/27525/
i.e. roughly 2 minutes to return. Ouch.

During one such event I grabbed the following sequence from the codebrowse debug log:

INF [20100208-04:42:34.053] [47374725336304] paste.httpserver.ThreadPool: kill_hung_threads status: 30 threads (30 working, 0 idle, 0 starting) ave time 146.80sec, max time 361.49sec, killed 1 workers
INF [20100208-04:45:28.592] [47374725336304] paste.httpserver.ThreadPool: kill_hung_threads status: 35 threads (34 working, 1 idle, 0 starting) ave time 158.53sec, max time 437.38sec, killed 9 workers
INF [20100208-04:46:50.923] [47374725336304] paste.httpserver.ThreadPool: kill_hung_threads status: 40 threads (37 working, 3 idle, 0 starting) ave time 147.08sec, max time 509.52sec, killed 5 workers

Which looks bad.
During these events we observed that the codebrowse process on guava would hit 100% CPU and stay there.

Revision history for this message
Tim Penhey (thumper) wrote :

Perhaps we can get our contractor to look at this.

affects: launchpad-code → loggerhead
Changed in launchpad-code:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Robert Collins (lifeless) wrote :

Marking this public, as using loggerhead is an easy way to DoS it too :P - discussed with Tim first, in case you're wondering.

Max, this bug probably just shows the impact of slow threads on loggerhead in production:
the theory we have is that the paste thread pool logic is
a) failing to kill 'hung' threads
b) spawning more threads
c) they then hang
d) don't get killed
e) boom

I think it would be good to do two things:
*) Determine why hung threads aren't being killed (it could be that they hold a mutex or something - in general we cannot expect killing a random thread to be a very reliable way to recover from it hanging).
*) Fix paste so it can enforce a hard upper limit on threads, so we don't get the unlimited increase we saw here. I'm opening a paste upstream bug on this.
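As a rough illustration of the second point - a hedged sketch using the standard library's ThreadPoolExecutor, not the actual paste patch being proposed, and with a made-up cap value - a hard worker limit means a burst of slow requests queues up instead of spawning ever more threads:

    from concurrent.futures import ThreadPoolExecutor
    import time

    MAX_WORKERS = 20  # hypothetical cap; the real value would need tuning

    def handle_request(path):
        # Stand-in for rendering a page; a slow request ties up one worker at most.
        time.sleep(0.1)
        return "rendered %s" % path

    executor = ThreadPoolExecutor(max_workers=MAX_WORKERS)

    # Even if 200 requests arrive at once, only MAX_WORKERS threads ever run;
    # the rest wait in the executor's queue instead of becoming new threads.
    futures = [executor.submit(handle_request, "/branch/%d" % i) for i in range(200)]
    results = [f.result() for f in futures]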

security vulnerability: yes → no
visibility: private → public
Revision history for this message
Robert Collins (lifeless) wrote :

http://trac.pythonpaste.org/pythonpaste/ticket/416 seems to suggest that threads are not being killed. Even though we have very high 'time to kill' limits, that would explain the thundering herd we saw.

Revision history for this message
Robert Collins (lifeless) wrote :

I've created http://trac.pythonpaste.org/pythonpaste/ticket/423 proposing an upper limit to the thread pool; it should be a pretty small patch - Max, perhaps you'd be up for that?

Changed in loggerhead:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Hey lifeless. Codebrowse overrides the thread-killing logic in paste, so I don't think you can rely on this being a Paste problem. There are always a lot of threads during loggerhead hang problems, but upon analysis, the problem is usually that one thread is holding up the others, not simply that there are many threads.

In order to debug, I need to see a full log from loggerhead up to 30 minutes before the hang, and up until loggerhead is restarted to clear the hang. Ideally, I also need a python/gdb traceback of all threads during the time that the process is hung.
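For the "traceback of all threads" part, one common approach (a minimal sketch under my own assumptions, not anything loggerhead ships) is to hook a signal so a hung process can be made to dump every thread's Python stack on demand:

    import signal
    import sys
    import threading
    import traceback

    def dump_all_thread_stacks(out=sys.stderr):
        # sys._current_frames() maps thread id -> that thread's current frame.
        names = dict((t.ident, t.name) for t in threading.enumerate())
        for thread_id, frame in sys._current_frames().items():
            out.write("Thread %s (%s):\n" % (thread_id, names.get(thread_id, "?")))
            out.write("".join(traceback.format_stack(frame)))
            out.write("\n")

    # Wire it to SIGQUIT so `kill -QUIT <pid>` prints stacks without stopping the process.
    signal.signal(signal.SIGQUIT, lambda signum, frame: dump_all_thread_stacks())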

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Also, an upper limit for the thread pool would not solve the problem--tasks would simply queue up, and the queued tasks would appear to "hang" just like the overloaded tasks now.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 518689] Re: thread spiral of doom on codebrowse

So, I don't really care where the issue is :). I do think that an unlimited concurrency factor is a defect, because we're seeing the machine driven deep into swap, which exacerbates what is occurring by adding VM thrashing to the issue. Solving that is no guarantee that loggerhead won't hang, but it would mean that when it's in crisis, we're not *also* fighting a machine trying to do 200 render actions at once: core dumps will be smaller, thread dumps will be more readable.

Separately from things going wrong, what about when things go right and we get spidered by something not honoring robots.txt? We shouldn't become as arbitrarily concurrent as the robot; rather, we *should* be queueing requests and not accepting them [as in, leaving them in the accept backlog] until we are roughly ready to deal with them.

So, an upper thread limit won't *stop hangs* but it will *mitigate hangs* and reduce issues in overload situations.
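A minimal sketch of the accept-backlog idea described above (my own assumptions throughout - port, cap and handler are all invented, and this is not loggerhead or paste code): only accept() a connection when a worker slot is actually free, so excess clients wait in the kernel's listen queue rather than becoming new threads.

    import socket
    import threading

    MAX_WORKERS = 10  # hypothetical cap
    worker_slots = threading.BoundedSemaphore(MAX_WORKERS)

    def handle(conn):
        try:
            conn.recv(4096)  # ...parse the request and render a response here...
            conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")
        finally:
            conn.close()
            worker_slots.release()

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 8080))
    listener.listen(128)  # the kernel backlog absorbs bursts we aren't ready for

    while True:
        worker_slots.acquire()           # block until a worker slot is free
        conn, _addr = listener.accept()  # only now take a connection off the queue
        threading.Thread(target=handle, args=(conn,)).start()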

Revision history for this message
Max Kanat-Alexander (mkanat) wrote : Re: thread spiral of doom on codebrowse

Well, if you have evidence that the machine is actually swapping during these hangs, then yes, adding a concurrency limit is a good idea. But I'm still going to investigate that URL a bit more and see what loggerhead does while rendering it.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

If we're getting more requests than we can handle, we have a serious problem, but there are much better things to do than try to process them all in parallel -- leaving them in the accept backlog or cheaply telling them to go away (HTTP 503) would both be better.
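The "cheaply tell them to go away" option could look something like this WSGI middleware (a hedged sketch of my own, with an assumed cap value, not anything codebrowse actually runs):

    import threading

    class OverloadGuard(object):
        """Answer 503 once max_in_flight requests are already being handled."""

        def __init__(self, app, max_in_flight=20):  # cap value is an assumption
            self.app = app
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def __call__(self, environ, start_response):
            if not self._slots.acquire(False):  # full: refuse without rendering
                start_response('503 Service Unavailable',
                               [('Content-Type', 'text/plain'),
                                ('Retry-After', '60')])
                return [b'Overloaded, please retry later.\n']
            try:
                # Note: only the application call is guarded here; streaming of
                # the response body happens after the slot is released.
                return self.app(environ, start_response)
            finally:
                self._slots.release()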

Note that I'm not completely sure that the description of the bug is totally valid any more -- it's not really about a specific URL, it's about a general failure mode.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 518689] Re: thread spiral of doom on codebrowse

It may be about both a general failure mode and a specific bug. So let's dig a little more, then split it into two.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Looking into this, the URL in the description causes loggerhead to leak about 100MB of RAM per hit.

As for limiting concurrency, if that's still necessary after this is fixed, we should file a separate bug for it.

summary: - thread spiral of doom on codebrowse
+ Codebrowse appears to hang because it uses too much RAM
Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

I want to use meliae to debug the problem, but meliae fails (either because of my system or something about loggerhead), so I've filed bug 586122 and will have to wait on that to be resolved before continuing to debug.
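For reference, the kind of dump meliae would produce once bug 586122 is sorted out - a minimal sketch of standard meliae usage; the output path is my own choice:

    from meliae import scanner

    # Write every live object in the process to a JSON-lines file for offline analysis.
    scanner.dump_all_objects('/tmp/loggerhead-memory.json')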

Revision history for this message
Robert Collins (lifeless) wrote :

Ok. I'll file a new bug (bug 586141) about thread caps; Thumper, mwhudson and I agree that being uncapped is a high-risk setup for a web service - it's close enough to unanimous for me :)

Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Changed in loggerhead:
assignee: nobody → Max Kanat-Alexander (mkanat)
status: Confirmed → In Progress
Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay, I have meliae working, although I have a few questions for jam the next time we're both on IRC.

Right now it looks like there are some util.Container instances that possibly should not be hanging around. I'm going to be investigating further.

The leak doesn't seem to be as bad on trunk as it was when this bug was first opened--I suspect the bzr-history-db work handled a lot of the problem.
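Analysing such a dump offline typically looks something like this (a sketch assuming the usual meliae loader API and a hypothetical dump file name; this is how one would check whether util.Container instances dominate the heap):

    from meliae import loader

    om = loader.load('/tmp/loggerhead-memory.json')
    om.compute_parents()  # link each object back to its referrers
    om.summarize()        # per-type counts and total bytes, largest first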

Changed in loggerhead:
assignee: Max Kanat-Alexander (mkanat) → nobody
status: In Progress → Triaged
William Grant (wgrant)
no longer affects: ubuntu
Jelmer Vernooij (jelmer)
Changed in loggerhead-breezy:
status: New → Triaged
importance: Undecided → High