Recife migration script unusably slow

Bug #682933 reported by Jeroen T. Vermeulen on 2010-11-30
This bug affects 1 person
Affects: Launchpad itself
Assigned to: Jeroen T. Vermeulen

Bug Description

The migrate-current-flag script on staging fails to complete even its first update (83 TMs) after more than an hour of runtime. We urgently need a dramatic speedup.

The problem seems to be that all our indexes are partial. The ideal index for this script would probably be one on TranslationMessage(potmsgset, language, potemplate) where is_current_upstream is true. We do have indexes covering all of that, but they are split into separate partial indexes: one where potemplate is null and one where potemplate is not null. A single index on (potemplate, potmsgset, language) would probably have served just as well as the two partial indexes, without leaving us with the current problem.

As it is, we'll have to try and speed up the script without index changes.
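
To illustrate the general effect described above, here is a minimal sketch using SQLite as a stand-in (Launchpad runs on PostgreSQL, whose planner may behave differently). The table layout, index names, and query are assumptions made up for this example, not the real schema or the script's actual query. The is_current_upstream condition is left out of the indexes for brevity. The sketch shows that a lookup which does not constrain potemplate cannot use indexes that are split on "potemplate IS NULL" and "potemplate IS NOT NULL", while one non-partial index serves it.

```python
import sqlite3

# Hypothetical, simplified stand-in for the TranslationMessage table.
SCHEMA = """
CREATE TABLE translationmessage (
    id INTEGER PRIMARY KEY,
    potmsgset INTEGER NOT NULL,
    language INTEGER NOT NULL,
    potemplate INTEGER,
    is_current_upstream INTEGER NOT NULL DEFAULT 0
);
"""

# Two partial indexes, split on whether potemplate is null (names invented).
PARTIAL_INDEXES = """
CREATE INDEX tm_potemplate_null
    ON translationmessage (potmsgset, language)
    WHERE potemplate IS NULL;
CREATE INDEX tm_potemplate_notnull
    ON translationmessage (potmsgset, language, potemplate)
    WHERE potemplate IS NOT NULL;
"""

# One non-partial index over the same columns (name invented).
FULL_INDEX = """
CREATE INDEX tm_full
    ON translationmessage (potmsgset, language, potemplate);
"""

# Assumed query shape: it says nothing about potemplate.
QUERY = """
SELECT id FROM translationmessage
WHERE potmsgset = 1 AND language = 2 AND is_current_upstream = 1
"""


def query_plan(index_ddl):
    """Return SQLite's query plan for QUERY given the index definitions."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA + index_ddl)
        rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
        # The human-readable plan text is the fourth column of each row.
        return " | ".join(row[3] for row in rows)
    finally:
        conn.close()


if __name__ == "__main__":
    print("partial indexes:", query_plan(PARTIAL_INDEXES))
    print("full index:     ", query_plan(FULL_INDEX))
```

With only the partial indexes, SQLite has to scan the table, because the query does not imply either index's WHERE clause. Adding an explicit potemplate condition to the query is one way to make such partial indexes usable again without changing any indexes.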

Jeroen T. Vermeulen (jtv) wrote :

The real culprit may be Storm bug 682989.

Jeroen T. Vermeulen (jtv) wrote :

Got a test run prepared for Tom to execute in a few minutes.

tags: added: recife
Changed in rosetta:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Jeroen T. Vermeulen (jtv)
milestone: none → 10.12
Changed in rosetta:
importance: Critical → High
tags: added: upstream-translations-sharing
removed: recife
Jeroen T. Vermeulen (jtv) wrote :

Working around the Storm bug did fix things, but the script is still not as fast as we'd like: it kicked off at a rate that suggested it would complete in a bit over 7 hours, but then fell asleep to accommodate (AIUI) replication lag.

I expected a lot of the time to go into finding current translationmessages that need to be deactivated to "make room" for the newly activated ones, but that looks to be only a fraction of the time spent. It's not clear to me where the time does go, unless it's the updating and replication itself, in which case there's not much more we can do.

It's ok if the first run of the script takes, e.g., the entire weekend. Each subsequent run will have less data to process, and that's exactly what we need to aim for. Since it's DBLoopTuner based, we do need to make sure that no slaves are being rebuilt at the time, because that will completely stall the script.
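
As a rough illustration of why a lag-aware loop stalls while a slave is being rebuilt, here is a minimal sketch of the general pattern. It is not Launchpad's DBLoopTuner: the function name, parameters, and thresholds are all hypothetical, and the lag measurement is passed in as a callable rather than read from a real database.

```python
import time


def run_tuned_loop(process_chunk, get_lag_seconds, *,
                   max_lag=30.0, pause=5.0, sleep=time.sleep):
    """Process chunks until none remain, waiting while replication lags.

    process_chunk:   callable that handles one batch and returns the number
                     of rows it processed (0 means nothing is left).
    get_lag_seconds: callable returning the current replication lag.
    All names and defaults here are illustrative assumptions.
    """
    total = 0
    while True:
        # Hold off for as long as the slaves are too far behind. If the lag
        # never drops (for example during a slave rebuild), this inner loop
        # never exits and the whole run stalls.
        while get_lag_seconds() > max_lag:
            sleep(pause)
        done = process_chunk()
        if done == 0:
            return total
        total += done
```

Because every chunk is gated on the lag check, the total runtime depends as much on replication keeping up as on the speed of the updates themselves.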

Do note that TranslationMessage constraints are slow to check, so that might be why updating is slow.

The first run can basically take whatever time is available before the rollout. If we start the script on Friday, that means 4 full days, which should be enough. Then, along with the rollout, we can do another, much shorter run while LP is read-only (we should time it before the rollout to make sure it runs in, e.g., less than 5 minutes, which I expect it will).

Jeroen T. Vermeulen (jtv) wrote :

Thanks for the explanation. I was mostly disappointed at performance (after the fix) because of your references to "a few minutes." Now I understand that that would be just an incremental "patch-up" run after a prior migration of the bulk of the data.

tags: added: qa-needstesting
Changed in rosetta:
status: In Progress → Fix Committed
tags: added: qa-ok
removed: qa-needstesting
Curtis Hovey (sinzui) on 2010-12-08
Changed in rosetta:
status: Fix Committed → Fix Released