loggerhead (codebrowse) slows down to unresponsiveness under heavy load on one branch

Bug #118625 reported by Michael Hudson-Doyle
34
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Launchpad itself
High
Unassigned
loggerhead
High
Max Kanat-Alexander

Bug Description

This happens fairly frequently, say a few times a week.

I'd like everyone who restarts loggerhead to comment on this bug when they do so.

Related branches

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I've just restarted codebrowse, here were the notes I took while investigating. I've seen at least one hang that was basically identical.

Hang appears to have been at, or after 03:00:46 2007-06-04 UTC.

gdb revealed three threads:

thread 1: in time.sleep()
thread 2: also in time.sleep()
thread 3: in threading._RLock.wait()

[Switching to thread 1 (Thread 1075341632 (LWP 4779))]#0 0x401195b7 in select () from /lib/tls/libc.so.6
(gdb) pystack
/srv/codebrowse.launchpad.net/turbogears/lib/python2.4/site-packages/CherryPy-2.2.1-py2.4.egg/cherrypy/_cpengine.py (111): block
/srv/codebrowse.launchpad.net/turbogears/lib/python2.4/site-packages/CherryPy-2.2.1-py2.4.egg/cherrypy/_cpserver.py (111): _start
/srv/codebrowse.launchpad.net/turbogears/lib/python2.4/site-packages/CherryPy-2.2.1-py2.4.egg/cherrypy/_cpengine.py (74): start
/srv/codebrowse.launchpad.net/turbogears/lib/python2.4/site-packages/CherryPy-2.2.1-py2.4.egg/cherrypy/_cpserver.py (33): start
/srv/codebrowse.launchpad.net/turbogears/lib/python2.4/site-packages/TurboGears-1.0.1-py2.4.egg/turbogears/startup.py (295): start_server
./start-loggerhead.py (295): ?

[Switching to thread 2 (Thread 1090423728 (LWP 4781))]#0 0x401195b7 in select () from /lib/tls/libc.so.6
(gdb) pystack
/srv/codebrowse.launchpad.net/turbogears/lib/python2.4/site-packages/TurboGears-1.0.1-py2.4.egg/turbogears/scheduler.py (181): _run
/usr/lib/python2.4/threading.py (420): run
/usr/lib/python2.4/threading.py (420): __bootstrap

[Switching to thread 3 (Thread 1088326576 (LWP 4780))]#0 0x400277eb in sem_wait@GLIBC_2.0 () from /lib/tls/libpthread.so.0
(gdb) pystack
/usr/lib/python2.4/threading.py (195): wait
/usr/lib/python2.4/Queue.py (101): get
/srv/codebrowse.launchpad.net/turbogears/lib/python2.4/site-packages/CherryPy-2.2.1-py2.4.egg/cherrypy/filters/cachefilter.py (46): expireCache
/usr/lib/python2.4/threading.py (420): run
/usr/lib/python2.4/threading.py (420): __bootstrap

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Happened again this morning, at or around 20070606-09:17:55.209.

Same symptoms.

Changed in launchpad-bazaar:
status: Unconfirmed → Confirmed
Revision history for this message
Tom Haddon (mthaddon) wrote :

Happened again 20070607 23:40 UTC - high load average and needed to kill and restart the loggerhead process. Looked like archmirror stuff was using resources again, although it may be that codebrowse was too.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Happened again 2007-06-29 22:40 UTC. Restarted.

Revision history for this message
Tom Haddon (mthaddon) wrote :

And again 2007-07-05 17:20 UTC. Restarted.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Happened again 2007-07-07 13:45 UTC (noticed then, at least). Restarted.

Changed in launchpad-bazaar:
importance: Undecided → Medium
Revision history for this message
Tom Haddon (mthaddon) wrote :

This is becoming unworkable. Have had to restart codebrowse 4 times today as it's been using > 80% system memory, as well as not responding.

Changed in launchpad-bazaar:
importance: Medium → Critical
Revision history for this message
Tom Haddon (mthaddon) wrote :

We need to sort out the permissions and get the logfiles being written to a sane directory. We then need to roll these changes out to production (including the new code layout with a full RF tree) and then setup a webstats instance for codebrowse - we can then see if there's a correlation between traffic and spikes in memory usage/hangs.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So an update: we cherrypicked a revision that fixed the main source of extreme memory usage, but memory usage is still high, if not so ridiculous as before.

Unfortunately, my cardboard and sticky tape monitoring system seems to have stopped working, so it's hard to quantify how bad the situation is now. I'll put it back together, but I think it's safe to downgrade this from Critical to High (Tom is free to disagree on this point).

Changed in launchpad-bazaar:
importance: Critical → High
Revision history for this message
Paul Hummer (rockstar) wrote :

We started work on finding some of the issues with Loggerhead during the Epic, I think we have a good idea what needs to be fixed.

Changed in launchpad-bazaar:
assignee: nobody → rockstar
status: Confirmed → Triaged
Paul Hummer (rockstar)
Changed in launchpad-bazaar:
assignee: rockstar → nobody
Revision history for this message
Tom Haddon (mthaddon) wrote :

This is becoming more and more of a problem again. There was something of a respite when it was less of a problem, but we're now back to three or four restarts a day:

https://wiki.canonical.com/InformationInfrastructure/OSA/LPIncidentLog

It's not really possible to post to this bug every time there's an outage, but hopefully this gives an idea of how serious this is.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Sadly, I don't have time this month (at least) to work on this. In the extremely unlikely event that someone else finds the time, here are some rambles:

My main hypothesis is that the problem is still around generating the revision graph cache. In particular, it's possible to end up with many threads trying to generate the revision graph cache for the same branch, which is an obvious waste and can push us in the direction of death spirals of doom. So:

1) Try to confirm this by looking at the logs (there should be enough information to tell at any given time how many threads are building revision graph caches for which branches)

2) Arrange things so that if a thread needs the revision cache for a particular branch and there is already another thread producing the cache for that branch, the first-mentioned thread just waits for the other one rather than redoing it's work. This probably isn't super hard, but will involve some not-fun thread programming I expect.

3) Change the cache-building code to update an out of date cache rather than build it from scratch. I think John Arbash Meinel has some code that does this, or at least would be a start.

Changed in loggerhead:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Some more thoughts: it's probably easy enough to scale loggerhead horizontally by running it on several machines behind a load balancer. But then they will not share the revision graph caches they build. Swapping out the sqlite cache for something that can talk over the network -- couchdb springs to mind as being entirely suitable, although any key-value store will do -- would make sense for this, and probably isn't very hard.

For idea 2) in comment #12, a trick that would probably work is to insert a special value into the cache (like None) when any thread starts to compute a graph; a thread that finds this value would take it as an indication that another thread is computing the value and wait, rather than computing the graph itself. It would probably have to poll, unless we can do something fancy with triggers.

Revision history for this message
Karl Fogel (kfogel) wrote :

Paul, do you have any documents on the analysis from Epic? Or is it just basically what Michael Hudson said in comment #12?

Revision history for this message
Diogo Matsubara (matsubara) wrote :

Any news about this? This continues to be an outstanding item on LP production meetings and LOSAs IncidentLog. A status update would be nice to let the LOSAs know that this is being worked on or some work is planned for the near future.

Revision history for this message
Jonathan Lange (jml) wrote : Re: [Bug 118625] Re: codebrowse sometimes hangs

On Wed, Nov 11, 2009 at 5:16 AM, Diogo Matsubara
<email address hidden> wrote:
> Any news about this? This continues to be an outstanding item on LP
> production meetings and LOSAs IncidentLog. A status update would be nice
> to let the LOSAs know that this is being worked on or some work is
> planned for the near future.
>

Yes, we're hiring a contractor to deal with it. Ask Francis for more details.

jml

Changed in loggerhead:
assignee: nobody → Michael Hudson (mwhudson)
assignee: Michael Hudson (mwhudson) → Max Kanat-Alexander (mkanat)
Revision history for this message
Robert Collins (lifeless) wrote :

On Thu, 2009-09-17 at 05:07 +0000, Michael Hudson wrote:
> Some more thoughts: it's probably easy enough to scale loggerhead
> horizontally by running it on several machines behind a load balancer.

Sure they will; load balancers can group by url or url prefix. If they
are decent (and squid is, and haproxy should be from what I hear).

-Rob

Revision history for this message
Max Kanat-Alexander (mkanat) wrote : Re: codebrowse sometimes hangs

Okay, I'm looking at a log that mwhudson sent me, and here's what I see so far:

About two minutes before the hang, loggerhead starts being really slow. In the log I have, the first incident is:

INF [20091118-05:40:15.776] [1118071120] loggerhead.~wgrant/launchpad/fix-soyuz-build-and-queue-titles: Getting information for RevisionUI: 137.22760105133057 secs

That thread renders and then does nothing else. That's not true for every slow thread, though. Some of them render, then continue to function. So that's not a clue.

The slowness continues. It gets a little faster at:

INF [20091118-05:42:04.440] [1084594512] loggerhead.~munin-custom-plugins/munin/munin-plugins-tomcat: Getting information for ChangeLogUI: 3.7736661434173584 secs

That is about one minute before the hang. However, the weird part is that I then see that thread make several "Getting information" requests without rendering the previous request.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

To assist with debugging this problem, I've created a launchpad-loggerhead log parser that shows what threads were doing before loggerhead was restarted. The branch is at:

bzr://bzr.everythingsolved.com/loggerparser/trunk

at the moment.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

In the log that I have, the first slowdowns are always "built revision graph cache", which tends to indicate that mwhudson is probably on the right track. And indeed, here are the first slowdowns:

1118071120 Start: 05:21:14 End: 05:40:19 Duration: 1144 seconds
    Getting information for RevisionUI (137 secs)
    built revision graph cache (56 secs) Started: 05:37:57

1111767376 Start: 05:21:14 End: 05:40:37 Duration: 1162 seconds
    Getting information for RevisionUI (147 secs)
    built revision graph cache (130 secs) Started: 05:37:58

1120172368 Start: 05:21:14 End: 05:40:47 Duration: 1172 seconds
    Getting information for RevisionUI (169 secs)
    built revision graph cache (111 secs) Started: 05:37:58

1107564880 Start: 05:21:14 End: 05:42:36 Duration: 1281 seconds
    Getting information for AtomUI (25 secs)
    Getting information for AtomUI (31 secs)
    Getting information for RevisionUI (179 secs)
    built revision graph cache (147 secs) Started: 05:37:57

1126476112 Start: 05:21:14 End: 05:42:36 Duration: 1281 seconds
    Getting information for ChangeLogUI (4 secs)
    Getting information for ChangeLogUI (7 secs)
    Getting information for RevisionUI (201 secs)
    built revision graph cache (155 secs) Started: 05:37:58

You can see that they all start roughly simultaneously to build their revision caches. That's a somewhat random time, by the way, since the server started at 05:21:13 in this case.

And after that, no threads ever render, because things are going so slowly. So at the very least, building revision graphs is a performance problem if nothing else. :-)

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Hmm, strike "no threads ever render", my log analyzer had a bug. Also, it appears that the last actions of each thread are almost always something like "branch_url: http://bazaar.launchpad.net/00/04/32/d3" and then nothing.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay, I think that comment 20 might indeed be the secret here, specifically the slowndown issue. I can produce some pretty extreme slowdowns on my local machine (an AnnotateUI that usually takes 2 seconds takes 60) by making every single request build a revision graph and then hitting loggerhead with 15 simultaneous users.

If I don't force loggerhead to rebuild the revision graph on every request, it stays relatively fast even when being hit with 15 simultaneous users, though the initial "rebuild the revision graph" process takes 15 seconds because it's being done by so many threads at once.

So this may in fact not be a hang, but simply an extreme slowdown by having many threads all attempt to build a revision graph at once for very large branches, and then due to the way Paste works, no new threads are spawned (they would be so slow anyway, who cares) and you get a task backup and a "hang".

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

The "first slowdown" from the above comment #20 is a place where loggerhead attempts to build the revision cache four times simultaneously for the same branch (namely, a branch of launchpad itself). I wouldn't be surprised if you could reproduce this simply by opening four or five revisions in a tab for a branch that didn't have a revision graph, on codebrowse.

I can confirm this behavior as being constant every time there's a hang, looking at various logs that I was sent by spm: Simultaneously make several requests that will cause one large branch's revision graph to be built (mysql, launchpad itself, bzr itself) in several threads simultaneously, and loggerhead will slow to a crawl. It doesn't actually seem to *hang*, it just gets so slow that it seems like it's hung.

Revision history for this message
Stuart Bishop (stub) wrote :

Given there are a finite number of cores on the servers (and issues with the Python GIL), there seems little point in allowing several revision caches to build simultaneously. Perhaps all that is needed is a few semaphores to block threads when things get busy (with timeouts returning a nice 503 error of course if a lock can't be granted in a sane timeframe).

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

@stub: That was pretty much my thought as well. At first I'm just going to make sure that loggerhead only builds one revision cache at a time for any given branch, but if that doesn't fix it, then I'll make it only build one revision cache at a time globally (but I think that could seriously degrade performance on a major installation like codebrowse, so it's not the ideal solution).

And yeah, due to the way Python threads work, at least on my local machine I only saw loggerhead use one core of my CPU no matter how loaded it was. (I've also proposed elsewhere that there be some rearchitecture to improve loggerhead's scalability, we'll see how that proposal goes.)

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 118625] Re: codebrowse sometimes hangs

On Thu, 2009-11-19 at 21:59 +0000, Max Kanat-Alexander wrote:
> @stub: That was pretty much my thought as well. At first I'm just going
> to make sure that loggerhead only builds one revision cache at a time
> for any given branch, but if that doesn't fix it, then I'll make it only
> build one revision cache at a time globally (but I think that could
> seriously degrade performance on a major installation like codebrowse,
> so it's not the ideal solution).
>
> And yeah, due to the way Python threads work, at least on my local
> machine I only saw loggerhead use one core of my CPU no matter how
> loaded it was. (I've also proposed elsewhere that there be some
> rearchitecture to improve loggerhead's scalability, we'll see how that
> proposal goes.)

So, bzr's core is starting to release the GIL at various places -
building multiple caches will start to get concurrency with python
threads. (Maybe not awesome concurrency, but some).

-Rob

Revision history for this message
Stuart Bishop (stub) wrote :

On Fri, Nov 20, 2009 at 5:13 AM, Robert Collins <email address hidden> wrote:

>> And yeah, due to the way Python threads work, at least on my local
>> machine I only saw loggerhead use one core of my CPU no matter how
>> loaded it was. (I've also proposed elsewhere that there be some
>> rearchitecture to improve loggerhead's scalability, we'll see how that
>> proposal goes.)
>
> So, bzr's core is starting to release the GIL at various places -
> building multiple caches will start to get concurrency with python
> threads. (Maybe not awesome concurrency, but some).

But we would still want to limit the number of concurrently building revision caches based on the number of cores available and the amount of time blocked on IO. As long as this is configurable, the admins can tweak it to maximize our cpu utilization and throughput.

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/

Revision history for this message
Max Kanat-Alexander (mkanat) wrote : Re: codebrowse sometimes hangs

Mmm, well, currently loggerhead never uses more than one core, so that wouldn't be something to be concerned about at first. (At least, I couldn't make it use more than one core locally no matter how much I loaded it.)The bug here seems to be caused specifically by building caches for the same branch simultaneously. If we solve that and still see an issue, then we can look at further limitations on cache building.

Changed in loggerhead:
status: Triaged → In Progress
Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Hmm, okay, so mwhudson reports that there are still hangs, though they may be less frequent. From the log that I was sent, it looks like things are generally performing better (I see at least one thread that processed its max of 100 requests and then ended, so that's good) but there's still significant slowdowns before the server was restarted.

Here's what I see:

The earliest slowdown is around 21:53, like this:

* Somebody simultaneously requests several revisions of http://bazaar.launchpad.net/%7Egary/launchpad/fix-db-devel-sourcecode/
* It takes 11 seconds to build the revision graph, but thanks to the patch only one of the threads builds it.
* All told, each thread thinks takes a total of about 15 seconds to get the information for the RevisionUI, which means that each thread spent 4 seconds processing the data after the cache was generated.

Overall, that's slow, but it's not hanging slow, and after that brief slowdown things are speedy again for a while.

In fact, the original problem that I saw here is indeed fixed, and there is now some other problem. What's really weird with this log is there appears to be a real hang, not just a slowdown. Loggerhead stops logging for eight minutes before it's restarted, so the log is basically little help at all, other than to see what it was doing before everything was flushed. Further investigation underway.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay, so the next time it hangs, I will need an actual GDB backtrace (preferably one that's already been run through the magic backtrace-to-python-functions translator) to see what the threads are doing, which should lead to a resolution, since this is indeed a real hang now and not just a slowdown.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

(The bug turned out to be only in loggerhead itself, so I'm marking this Invalid in the Launchbad Bazaar Integration.)

Changed in launchpad-code:
status: Triaged → Invalid
Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay, so I talked to mwhudson today, and he said that codebrowse hasn't frozen since January 19, which is seven days ago. Also, from the log I analyzed, that hang appears to be very different from the ones that we were seeing in this bug, so I'm going to close this bug and open a new one for the new-type hangs.

Changed in loggerhead:
status: In Progress → Fix Committed
summary: - codebrowse sometimes hangs
+ loggerhead (codebrowse) slows down to unresponsiveness under heavy load
+ on one branch
Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Changed in loggerhead:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers