Timeout on +translate page

Bug #302798 reported by Ursula Junque
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Данило Шеган
Milestone: 2.2.2

Bug Description

+translate pages are timing out a lot lately. According to jtv, this may be related to the DB server on staging running out of space.

OOPSes: OOPS-1060S297, OOPS-1060S299, OOPS-1060S301, OOPS-1060S304, OOPS-1060S186, OOPS-1060S187, OOPS-1060S190

Update: this timeout happened 27 times on edge on Friday (12-05-08), for example OOPS-1070EC112, OOPS-1070EC113 and OOPS-1070ED94. I'm pointing this out because the number of timeouts is larger than the daily average we used to have.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

I think two sentences got conflated there. I believe these timeouts mean that the server is now fetching things from disk again that in the past months were constantly cached in memory.

There are optimizations that we discarded to avoid complicating our schema optimization a year ago. We could reinstate some of those if the problem becomes serious before message sharing is completed.

Ursula Junque (ursinha)
description: updated
description: updated
Revision history for this message
Данило Шеган (danilo) wrote :

So, the problems from staging are now appearing on edge because edge is using the slave store which sits on the same DB server as the DB for staging (thus, basically, for the same reasons). Main production database is not experiencing any problems yet, but probably will as soon as replication is enabled there.

There is not much we can do immediately (or maybe there is, but it would require a lot of investigation, which defeats the purpose of "immediately"), except concentrate on finishing the message-sharing implementation, which should improve the overall scalability of Launchpad translations.

Revision history for this message
Henning Eggers (henninge) wrote :

This is a performance issue that can be solved by either tweaking replication or adding more hardware.

Changed in rosetta:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Stuart Bishop (stub) wrote : Re: [Bug 302798] Re: Timeout on +translate page

On Thu, Nov 27, 2008 at 8:55 PM, Jeroen T. Vermeulen <email address hidden> wrote:
> I think two sentences got conflated there. I believe these timeouts
> mean that the server is now fetching things from disk again that in the
> past months were constantly cached in memory.
>
> There are optimizations that we discarded to avoid complicating our
> schema optimization a year ago. We could reinstate some of those if the
> problem becomes serious before message sharing is completed.

We still have enough RAM to cache all the on-disk files. It is a
different issue. I only have vague guesses at the moment about what could
cause the perceived behavior. Possibly the global shared area is too
large. Possibly we need more frequent checkpoints.
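
A minimal diagnostic sketch for checking those two guesses, assuming a psycopg2 connection to the database in question and PostgreSQL's standard pg_stat_bgwriter view (8.3 and later); the DSN below is a placeholder:

    # Check the shared buffer size and how checkpoints are being triggered.
    import psycopg2

    conn = psycopg2.connect("dbname=launchpad host=db-slave.example.com")  # placeholder DSN
    cur = conn.cursor()

    # How large is the shared buffer area?
    cur.execute("SHOW shared_buffers")
    print("shared_buffers: %s" % cur.fetchone()[0])

    # Timed vs. requested checkpoints: a high "requested" count means
    # checkpoints are forced by WAL volume rather than by the timer, which
    # is one sign that checkpoint frequency is out of tune.
    cur.execute("SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter")
    timed, requested = cur.fetchone()
    print("checkpoints: timed=%d requested=%d" % (timed, requested))

    cur.close()
    conn.close()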

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

There's more going on than slow queries. About two-thirds of these timeouts spend less time on SQL than they do on other things—more often than not a lot less!

So this page may be acting as a canary—keeling over first when there's really a systemic problem.

There's not a huge amount of "business logic" to process for this page, so there's a good chance that the real problem is hidden somewhere at a low level: memory pressure on the app servers, waiting for connections, missing C speedup routines to replace Python ones, plain old load...?
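
A rough sketch of the arithmetic behind that observation, assuming the logged SQL spans and the total request time have already been pulled out of an OOPS report (the numbers below are made up, and the real OOPS format may differ):

    # Flag requests where most of the wall-clock time is not accounted
    # for by logged SQL.
    def non_sql_ratio(total_ms, sql_spans):
        """Fraction of the request spent outside logged SQL statements."""
        sql_ms = sum(end - start for start, end in sql_spans)
        return (total_ms - sql_ms) / float(total_ms)

    total_ms = 9000                                      # request hit the timeout
    sql_spans = [(120, 340), (360, 1900), (2000, 2450)]  # (start_ms, end_ms) per query

    ratio = non_sql_ratio(total_ms, sql_spans)
    if ratio > 0.5:
        print("%.0f%% of the request is non-SQL time" % (ratio * 100))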

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

It turns out the oops reports may be wrong about the non-SQL time. The cases where most time is logged as non-SQL time don't actually log the query that the timeout error is for.

Some possibilities I'm looking at:
 * Two-thirds of the oops reports end up not logging that final, long query for some reason and this is probably screwing up the SQL-time computation. Reported as bug 310818.
 * The time is spent somewhere where we can't see it. Stuart suggested as one possibility: C/Pyrex code holding Python's Global Interpreter Lock for too long and effectively serializing the app server. Something to discuss with Gustavo or Gary.
 * I'm assured that the databases are still running out of memory, in which case this is not a repetition of our old I/O timeouts. But could the app servers be running out of memory because of aggressive caching? Setting up the Jaunty translations must have increased the working set.

If working-set size is the root cause, there are some short-term things we can do:
 * We're going to close the translation UIs for obsolete Ubuntu releases. A branch for this is in PQM right now. This won't do much in itself, but it also means we can delete the translation messages for these obsolete series. That accounts for about 30% of the data set.
 * The main cost of rendering these pages is in fetching suggestions. We can set an age limit on those suggestions if that helps the query plan (see the sketch after this list). It's mainly the test code that suffers.
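
A minimal sketch of the age-limit idea; the table and column names are hypothetical stand-ins rather than Launchpad's actual schema, and the two-year cutoff is just an example:

    # Restrict suggestion lookups to recent submissions so the planner has a
    # much smaller set to scan.  Placeholder schema, psycopg2-style parameters.
    SUGGESTIONS_WITH_AGE_LIMIT = """
        SELECT suggestion.id, suggestion.translation_text
        FROM suggestion
        WHERE suggestion.msgid = %(msgid)s
          AND suggestion.date_created > CURRENT_DATE - INTERVAL '2 years'
        ORDER BY suggestion.date_created DESC
    """

    def fetch_suggestions(cursor, msgid):
        cursor.execute(SUGGESTIONS_WITH_AGE_LIMIT, {"msgid": msgid})
        return cursor.fetchall()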

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Okay, I just realized that I used "running out of memory" with two opposite meanings. The master database is running without needing to fetch data from disk. I thought the app servers might be going into swap, but they aren't.

But things may be different on the slave database. The memory graph looks peaceful and constant, but most memory usage is filesystem cache. The daily staging updates may be washing slave data out of the FS cache.
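
One way to test that hypothesis is to log the slave's page-cache size around the staging update window. A minimal sketch that reads only the standard /proc/meminfo fields (the sampling interval is arbitrary):

    # Sample how much RAM is filesystem cache vs. genuinely free.
    import time

    def meminfo():
        fields = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                fields[key] = int(value.split()[0])  # values are in kB
        return fields

    while True:
        info = meminfo()
        print("%s cached=%d kB free=%d kB" % (
            time.strftime("%Y-%m-%d %H:%M:%S"), info["Cached"], info["MemFree"]))
        time.sleep(300)  # one sample every five minutes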

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

The periods when the timeouts happen are definitely correlated to the staging updates, and they don't happen at all when the staging database is not updated. The updates do just enough I/O to flush all but the hottest data out of the filesystem cache.
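
For reference, a sketch of how such a correlation check might look, with made-up timestamps standing in for the real OOPS times and staging-restore windows:

    # Does each timeout OOPS fall inside (or near) a staging update window?
    from datetime import datetime, timedelta

    def within_window(ts, windows, slack=timedelta(minutes=30)):
        return any(start - slack <= ts <= end + slack for start, end in windows)

    staging_updates = [
        (datetime(2008, 12, 19, 2, 0), datetime(2008, 12, 19, 5, 30)),
    ]
    timeout_oopses = [datetime(2008, 12, 19, 3, 12), datetime(2008, 12, 19, 14, 40)]

    for ts in timeout_oopses:
        status = "during" if within_window(ts, staging_updates) else "outside"
        print("%s: %s staging update" % (ts, status))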

To stop the timeouts from happening over the holidays, staging database updates will be halted. We plan to move old data out of the way after the holidays, which should also bring some short-term relief.

Revision history for this message
Diogo Matsubara (matsubara) wrote :

Translations guys are tackling this one for the performance week, so bumping the importance to High. Assigning to danilo so he can re-assign to whoever fixes the bug.

Changed in rosetta:
assignee: nobody → danilo
importance: Low → High
milestone: none → 2.2.2
milestone: 2.2.2 → none
Revision history for this message
Данило Шеган (danilo) wrote :

Marking this as fix released for 2.2.2: we should be out of the top-10 timeout list for a while, and there are more optimizations we can do.

Changed in rosetta:
milestone: none → 2.2.2
status: Triaged → Fix Released