Timeout on +translate page

Bug #302798 reported by Ursula Junque
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Данило Шеган
Milestone: 2.2.2

Bug Description

+translate pages are timing out a lot lately. According to jtv, this may be related to the DB server on staging running out of space.

OOPSes: OOPS-1060S297, OOPS-1060S299, OOPS-1060S301, OOPS-1060S304, OOPS-1060S186, OOPS-1060S187, OOPS-1060S190

Update: this timeout happened 27 times on edge on Friday (12-05-08), for example OOPS-1070EC112, OOPS-1070EC113 and OOPS-1070ED94. I'm pointing this out because the number of timeouts is larger than the daily average we used to have.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

I think two sentences got conflated there. I believe these timeouts mean that the server is now fetching things from disk again that in the past months were constantly cached in memory.

There are optimizations that we discarded to avoid complicating our schema optimization a year ago. We could reinstate some of those if the problem becomes serious before message sharing is completed.

Ursula Junque (ursinha)
description: updated
description: updated
Revision history for this message
Данило Шеган (danilo) wrote :

So, the problems from staging are now appearing on edge because edge is using the slave store which sits on the same DB server as the DB for staging (thus, basically, for the same reasons). Main production database is not experiencing any problems yet, but probably will as soon as replication is enabled there.

There is not much we can do immediately (or maybe there is, but it would require a lot of investigation, which defeats the purpose of "immediately"), except concentrate on finishing the message-sharing implementation, which should improve the overall scalability of Launchpad translations.

Revision history for this message
Henning Eggers (henninge) wrote :

This is a performance issue that can be solved by either tweaking replication or adding more hardware.

Changed in rosetta:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Stuart Bishop (stub) wrote : Re: [Bug 302798] Re: Timeout on +translate page

On Thu, Nov 27, 2008 at 8:55 PM, Jeroen T. Vermeulen <email address hidden> wrote:
> I think two sentences got conflated there. I believe these timeouts
> mean that the server is now fetching things from disk again that in the
> past months were constantly cached in memory.
>
> There are optimizations that we discarded to avoid complicating our
> schema optimization a year ago. We could reinstate some of those if the
> problem becomes serious before message sharing is completed.

We still have enough RAM to cache all the on-disk files. It is a
different issue. I only have vague guesses at the moment about what could
cause the perceived behavior. Possibly the global shared area is too
large. Possibly we need more frequent checkpoints.
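
A minimal diagnostic sketch for checking those two guesses, assuming a psycopg2 connection to the database in question and PostgreSQL's standard pg_stat_bgwriter view (8.3 and later); the DSN below is a placeholder:

    # Check the shared buffer size and how checkpoints are being triggered.
    import psycopg2

    conn = psycopg2.connect("dbname=launchpad host=db-slave.example.com")  # placeholder DSN
    cur = conn.cursor()

    # How large is the shared buffer area?
    cur.execute("SHOW shared_buffers")
    print("shared_buffers: %s" % cur.fetchone()[0])

    # Timed vs. requested checkpoints: a high "requested" count means
    # checkpoints are forced by WAL volume rather than by the timer, which
    # is one sign that checkpoint frequency is out of tune.
    cur.execute("SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter")
    timed, requested = cur.fetchone()
    print("checkpoints: timed=%d requested=%d" % (timed, requested))

    cur.close()
    conn.close()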

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

There's more going on than slow queries. About two-thirds of these timeouts spend less time on SQL than they do on other things—more often than not a lot less!

So this page may be acting as a canary—keeling over first when there's really a systemic problem.

There's not a huge amount of "business logic" to process for this page, so there's a good chance that the real problem is hidden somewhere at a low level: memory pressure on the app servers, waiting for connections, missing C speedup routines to replace Python ones, plain old load...?
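
A rough sketch of the arithmetic behind that observation, assuming the logged SQL spans and the total request time have already been pulled out of an OOPS report (the numbers below are made up, and the real OOPS format may differ):

    # Flag requests where most of the wall-clock time is not accounted
    # for by logged SQL.
    def non_sql_ratio(total_ms, sql_spans):
        """Fraction of the request spent outside logged SQL statements."""
        sql_ms = sum(end - start for start, end in sql_spans)
        return (total_ms - sql_ms) / float(total_ms)

    total_ms = 9000                                      # request hit the timeout
    sql_spans = [(120, 340), (360, 1900), (2000, 2450)]  # (start_ms, end_ms) per query

    ratio = non_sql_ratio(total_ms, sql_spans)
    if ratio > 0.5:
        print("%.0f%% of the request is non-SQL time" % (ratio * 100))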

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

It turns out the oops reports may be wrong about the non-SQL time. The cases where most time is logged as non-SQL time don't actually log the query that the timeout error is for.

Some possibilities I'm looking at:
 * Two-thirds of the oops reports end up not logging that final, long query for some reason and this is probably screwing up the SQL-time computation. Reported as bug 310818.
 * The time is spent somewhere where we can't see it. Stuart suggested as one possibility: C/Pyrex code holding Python's Global Interpreter Lock for too long and effectively serializing the app server. Something to discuss with Gustavo or Gary.
 * I'm assured that the databases are still running out of memory, in which case this is not a repetition of our old I/O timeouts. But could the app servers be running out of memory because of aggressive caching? Setting up the Jaunty translations must have increased the working set.

If working-set size is the root cause, there are some short-term things we can do:
 * We're going to close the translation UIs for obsolete Ubuntu releases. A branch for this is in PQM right now. This won't do much in itself, but it also means we can delete the translation messages for these obsolete series. That accounts for about 30% of the data set.
 * The main cost of rendering these pages is in fetching suggestions. We can set an age limit on those suggestions if that helps the query plan (see the sketch after this list). It's mainly the test code that suffers.
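
A minimal sketch of the age-limit idea; the table and column names are hypothetical stand-ins rather than Launchpad's actual schema, and the two-year cutoff is just an example:

    # Restrict suggestion lookups to recent submissions so the planner has a
    # much smaller set to scan.  Placeholder schema, psycopg2-style parameters.
    SUGGESTIONS_WITH_AGE_LIMIT = """
        SELECT suggestion.id, suggestion.translation_text
        FROM suggestion
        WHERE suggestion.msgid = %(msgid)s
          AND suggestion.date_created > CURRENT_DATE - INTERVAL '2 years'
        ORDER BY suggestion.date_created DESC
    """

    def fetch_suggestions(cursor, msgid):
        cursor.execute(SUGGESTIONS_WITH_AGE_LIMIT, {"msgid": msgid})
        return cursor.fetchall()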

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Okay, I just realized that I used "running out of memory" with two opposite meanings. The master database is running without needing to fetch data from disk. I thought the app servers might be going into swap, but they aren't.

But things may be different on the slave database. The memory graph looks peaceful and constant, but most memory usage is filesystem cache. The daily staging updates may be washing slave data out of the FS cache.
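
One way to test that hypothesis is to log the slave's page-cache size around the staging update window. A minimal sketch that reads only the standard /proc/meminfo fields (the sampling interval is arbitrary):

    # Sample how much RAM is filesystem cache vs. genuinely free.
    import time

    def meminfo():
        fields = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                fields[key] = int(value.split()[0])  # values are in kB
        return fields

    while True:
        info = meminfo()
        print("%s cached=%d kB free=%d kB" % (
            time.strftime("%Y-%m-%d %H:%M:%S"), info["Cached"], info["MemFree"]))
        time.sleep(300)  # one sample every five minutes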

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

The periods when the timeouts happen are definitely correlated to the staging updates, and they don't happen at all when the staging database is not updated. The updates do just enough I/O to flush all but the hottest data out of the filesystem cache.
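
For reference, a sketch of how such a correlation check might look, with made-up timestamps standing in for the real OOPS times and staging-restore windows:

    # Does each timeout OOPS fall inside (or near) a staging update window?
    from datetime import datetime, timedelta

    def within_window(ts, windows, slack=timedelta(minutes=30)):
        return any(start - slack <= ts <= end + slack for start, end in windows)

    staging_updates = [
        (datetime(2008, 12, 19, 2, 0), datetime(2008, 12, 19, 5, 30)),
    ]
    timeout_oopses = [datetime(2008, 12, 19, 3, 12), datetime(2008, 12, 19, 14, 40)]

    for ts in timeout_oopses:
        status = "during" if within_window(ts, staging_updates) else "outside"
        print("%s: %s staging update" % (ts, status))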

To stop the timeouts from happening over the holidays, staging database updates will be halted. We plan to move old data out of the way after the holidays, which should also bring some short-term relief.

Revision history for this message
Diogo Matsubara (matsubara) wrote :

Translations guys are tackling this one for the performance week, so bumping the importance to High. Assigning to danilo so he can re-assign to whoever fixes the bug.

Changed in rosetta:
assignee: nobody → danilo
importance: Low → High
milestone: none → 2.2.2
milestone: 2.2.2 → none
Revision history for this message
Данило Шеган (danilo) wrote :

Marking this as fix released for 2.2.2: we should be out of the top-10 timeout list for a while, and there are more optimizations we can do.

Changed in rosetta:
milestone: none → 2.2.2
status: Triaged → Fix Released