Comment 2 for bug 373269

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Some notes from some research I just did into this:

For decoupling the statistics updater from the translations/template updates, it's easier to pass on "POFiles that need updating" (where the statistics updater just updates those) than "POFiles or POTemplates that have been touched" (where the statistics updater first has to figure out which POFiles need updating).

But which POFiles need updating? Picking templates based on their names is sloppy: template names can change without affecting sharing. So I looked into the cost of finding sharing templates based on actual POTMsgSet sharing.

Here's a query that detects POFiles that need updating after an update to a translation of a given template. For templates, just leave out the language clause.

SELECT DISTINCT POFile.id
FROM POFile
JOIN TranslationTemplateItem AS tti1 ON tti1.potemplate = POFile.potemplate
JOIN TranslationTemplateItem AS tti2 ON tti2.potmsgset = tti1.potmsgset
WHERE
    POFile.language = %(language)s AND
    tti1.sequence > 0 AND
    tti2.potemplate = %(template)s

In the largest template I could find (64K TTIs for 64K potmsgsets, but not shared) this took 1 second for a translation and about 13 seconds for a template. Of the 13-second query, some 8 seconds went into sorting for the DISTINCT.

I looked for other hardest cases based on:
 * Most sharing templates. About the maximum is 9, with a center of gravity at 7. Query performance was not particularly sensitive to this.
 * Most potmsgsets in a pool of sharing templates.
 * Most TTIs in a pool of sharing templates.

The second-hardest case I could find (27K TTIs and 14 POTMsgSets across a pool of 4 templates) took 10 seconds for a full template update, and half a second for just its most well-translated language.

The third of the hard cases I found shares 10K TTIs and 2.3K POTMsgSets across 7 templates. This took about 3.4 seconds for a full template update, and 130—280 ms for its most well-translated language.

The sorting and DISTINCT consistently took roughly half of the time for the longer queries (the template changes). The same sort/unique was much more expensive when querying the full POFile objects, making the queries 3×—10× slower.

We could add a "statistics dirty" flag to POFile that we'd set when the POFile needs its statistics recomputed. Then we'd do:

EXPLAIN ANALYZE
UPDATE POFile
SET dirty = TRUE
FROM TranslationTemplateItem tti1, TranslationTemplateItem tti2
WHERE
    tti1.potemplate = POFile.potemplate AND
    tti2.potmsgset = tti1.potmsgset AND
    tti1.sequence > 0 AND
    tti2.potemplate = %(template)s AND Language=%(language)s AND POFile.dirty IS FALSE

The update costs about 20 seconds for the biggest template, which is acceptable during template import.

For only the biggest translation of the biggest template (as we'd do while translating) it took 1.2 seconds, or 0.9 seconds with an additional index on POFile(statistics_dirty, language). Who could resist creating a pofile__statistics_dirty__language__idx?

The nice thing about a flag is that it scales well: these updates get faster as they have fewer records to update. If the updater falls behind during intensive imports for a single template, that will save redundant statistics updates as well.