Message sharing and POFile statistics

Bug #373269 reported by Jeroen T. Vermeulen
This bug affects 5 people

Affects (Status, Importance, Assigned to):
 * Launchpad itself: Triaged, High, Unassigned
 * Ubuntu Translations: New, Undecided, Unassigned

Bug Description

Since message sharing has rolled out, the POFile statistics update script performs a lot more corrections. So the "live updates" of the statistics must be missing a case.

I would guess that this is because of the way sharing affects multiple POFiles for the same language across sharing POTemplates. If you translate a previously untranslated message, and the new translation is shared, then all those POFiles will have their statistics affected. Other nasty little cases will pop up with diverging messages, suggestions, and so on.

This could be a pain to fix; it's potentially complex and we don't want it slowing things down. We may have to rely on a faster, more frequent offline mechanism.

Revision history for this message
Данило Шеган (danilo) wrote :

I believe we should decouple statistics update from "save" calls as well (when done from web UI; this bug happens with imports as well), as you suggest. That should be a good solution for overall web UI performance as well (where every save emits an updateStatistics call on a POFile).

Changed in rosetta:
importance: Undecided → High
status: New → Triaged
tags: added: message-sharing
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Some notes from some research I just did into this:

For decoupling the statistics updater from the translations/template updates, it's easier to pass on "POFiles that need updating" (where the statistics updater just updates those) than "POFiles or POTemplates that have been touched" (where the statistics updater first has to figure out which POFiles need updating).

But which POFiles need updating? Picking templates based on their names is sloppy: template names can change without affecting sharing. So I looked into the cost of finding sharing templates based on actual POTMsgSet sharing.

Here's a query that detects POFiles that need updating after an update to a translation of a given template. For templates, just leave out the language clause.

SELECT DISTINCT POFile.id
FROM POFile
JOIN TranslationTemplateItem AS tti1 ON tti1.potemplate = POFile.potemplate
JOIN TranslationTemplateItem AS tti2 ON tti2.potmsgset = tti1.potmsgset
WHERE
    POFile.language = %(language)s AND
    tti1.sequence > 0 AND
    tti2.potemplate = %(template)s
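The self-join can be exercised on a toy schema. Here is a minimal sqlite3 sketch (table and column names follow the query above; the sample data is made up) showing that translating a message in one template picks up the POFiles of every sharing template for that language:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE POFile (id INTEGER PRIMARY KEY, potemplate INTEGER, language TEXT);
CREATE TABLE TranslationTemplateItem (
    potemplate INTEGER, potmsgset INTEGER, sequence INTEGER);
-- Templates 10 and 20 share potmsgset 1; POFiles for 'nl' and 'fr'.
INSERT INTO POFile VALUES (1, 10, 'nl'), (2, 20, 'nl'), (3, 20, 'fr');
INSERT INTO TranslationTemplateItem VALUES
    (10, 1, 1),  -- template 10 uses message 1
    (20, 1, 1);  -- template 20 shares message 1
""")

# The query from the comment, with sqlite named parameters.
rows = conn.execute("""
    SELECT DISTINCT POFile.id
    FROM POFile
    JOIN TranslationTemplateItem AS tti1 ON tti1.potemplate = POFile.potemplate
    JOIN TranslationTemplateItem AS tti2 ON tti2.potmsgset = tti1.potmsgset
    WHERE POFile.language = :language
      AND tti1.sequence > 0
      AND tti2.potemplate = :template
""", {"language": "nl", "template": 10}).fetchall()

# Both Dutch POFiles need updating; the French one is untouched.
print(sorted(r[0] for r in rows))  # → [1, 2]
```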

In the largest template I could find (64K TTIs for 64K potmsgsets, but not shared) this took 1 second for a translation and about 13 seconds for a template. Of the 13-second query, some 8 seconds went into sorting for the DISTINCT.

I looked for other hardest cases based on:
 * Most sharing templates. The maximum is about 9, with the center of gravity at 7. Query performance was not particularly sensitive to this.
 * Most potmsgsets in a pool of sharing templates.
 * Most TTIs in a pool of sharing templates.

The second-hardest case I could find (27K TTIs and 14K POTMsgSets across a pool of 4 templates) took 10 seconds for a full template update, and half a second for just its most well-translated language.

The third of the hard cases I found shares 10K TTIs and 2.3K POTMsgSets across 7 templates. This took about 3.4 seconds for a full template update, and 130–280 ms for its most well-translated language.

The sorting and DISTINCT consistently took roughly half of the time for the longer queries (the template changes). The same sort/unique was much more expensive when querying the full POFile objects, making the queries 3–10 times slower.

We could add a "statistics dirty" flag to POFile that we'd set when the POFile needs its statistics recomputed. Then we'd do:

EXPLAIN ANALYZE
UPDATE POFile
SET dirty = TRUE
FROM TranslationTemplateItem tti1, TranslationTemplateItem tti2
WHERE
    tti1.potemplate = POFile.potemplate AND
    tti2.potmsgset = tti1.potmsgset AND
    tti1.sequence > 0 AND
    tti2.potemplate = %(template)s AND
    POFile.language = %(language)s AND
    POFile.dirty IS FALSE

The update costs about 20 seconds for the biggest template, which is acceptable during template import.

For only the biggest translation of the biggest template (as we'd do while translating) it took 1.2 seconds, or 0.9 seconds with an additional index on POFile(statistics_dirty, language). Who could resist creating a pofile__statistics_dirty__language__idx?

The nice thing about a flag is that it scales well: these updates get faster as they have fewer records to update. If the updater falls behind during intensive imports for a singl...
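The flag-based flow can be sketched end to end. This is a minimal sqlite3 illustration (the `dirty` column and the scrubber loop are stand-ins for the proposed mechanism, not Launchpad's actual code): a translation change marks the sharing POFiles dirty, and an offline pass picks them up and clears the flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE POFile (id INTEGER PRIMARY KEY, potemplate INTEGER,
                     language TEXT, dirty INTEGER DEFAULT 0);
CREATE TABLE TranslationTemplateItem (
    potemplate INTEGER, potmsgset INTEGER, sequence INTEGER);
INSERT INTO POFile (id, potemplate, language) VALUES
    (1, 10, 'nl'), (2, 20, 'nl'), (3, 20, 'fr');
INSERT INTO TranslationTemplateItem VALUES (10, 1, 1), (20, 1, 1);
""")

def mark_dirty(template, language):
    """Flag POFiles affected by a translation change, as in the UPDATE above."""
    conn.execute("""
        UPDATE POFile SET dirty = 1
        WHERE language = :language AND dirty = 0 AND potemplate IN (
            SELECT tti1.potemplate
            FROM TranslationTemplateItem tti1
            JOIN TranslationTemplateItem tti2
                ON tti2.potmsgset = tti1.potmsgset
            WHERE tti1.sequence > 0 AND tti2.potemplate = :template)
    """, {"template": template, "language": language})

def scrub():
    """Offline pass: recompute statistics for dirty POFiles, clear the flag."""
    dirty = [r[0] for r in conn.execute(
        "SELECT id FROM POFile WHERE dirty = 1 ORDER BY id")]
    for pofile_id in dirty:
        # ... recompute this POFile's statistics here ...
        conn.execute("UPDATE POFile SET dirty = 0 WHERE id = ?", (pofile_id,))
    return dirty

mark_dirty(template=10, language='nl')
print(scrub())  # → [1, 2]; a second scrub() finds nothing to do
```

The scaling property described above falls out naturally: a busy scrubber that runs often finds few dirty rows per pass, and an idle one simply catches up.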


Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

We currently issue separate queries, with subqueries, for each of the statistics properties. We should see whether we get better results using outer joins and counting multiple properties at once, as in http://paste.ubuntu.com/453278/

Revision history for this message
Robert Collins (lifeless) wrote :

Heh, just found this - it is perhaps fixed now, with benji's recent work. Or perhaps not.

Revision history for this message
Chris Graham (chris-ocportal) wrote :

This is worse than I originally thought when I found this issue myself. It's a really important issue for this reason...

When exporting po files, ones incorrectly thought to have no translated strings are not included in the export. You need to manually do save ops for each file to force it to carry through.

Let's say we have 10 translations, and 80 language files - every time we do a new version we'd need to do 800 saves to repair export validity.

Changed in launchpad:
importance: High → Critical
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

For the historical record: we do have offline scrubbing of the POFile statistics now. The problem is that it's slow. I'm not sure Benji's work addressed cases where it's not really clear that a POFile needs its statistics updated, and we've long kept the really thorough script runs disabled.

There's a lot of data churn in the way we refresh POFile statistics, so in addition to complex queries, disk seeks may be behind the poor performance. Message sharing has reduced the number of TranslationMessages, but may also have made access locality worse.

In the ScrubPOFileTranslator Garbo job I addressed this by iterating over POFiles in a different order. The outer loop orders POFiles by POTemplate.name (to cluster sharing templates and POTMsgSets together) and, for templates with the same name, by language (to cluster shared TranslationMessages together). This will provide much better cache reuse.

Revision history for this message
Chris Graham (chris-ocportal) wrote :

From a (possibly naive) outsider's view, it seems that the architecture is over-complex. I guess that is due to you having an enormous amount of data, and a need to do regular heavy lifting whenever source code repositories are updated, all for a file format that is rather transaction- and query-unfriendly.

i.e. a combination of:
 - hosting a lot of projects
 - needing to regularly hook into SCMs
 - database-unfriendly file format

It seems to me (again, I could be naive - forgive me) that storing stuff as po's is a bad idea; you should just put it in a regular relational database and convert it to po on demand. That would allow easy querying, efficient indexing, etc., without having to keep all this stuff in sync. I would imagine po imports are relatively rare (computationally speaking), so the conversion would not be too costly. Maybe pos change more often than that, or maybe there is a good reason to stick to a native po model, or maybe you're heavily invested here, but I just wanted to give my outsider's perspective.

My other thought is that, instead of a background process running over huge amounts of data, you could do a foreground process triggered when a logged-in user views the index of po's. Not every time, and not necessarily in real time, but maybe trigger it at that point if it hasn't happened already since the last pot upload.
That way, you can replace the need to do routine heavy lifting across your entire architecture with the need to keep po listings reasonably fresh for (I guess) the minority of users actually using them (I am just guessing here that there are a lot of po's in your architecture that rarely get touched).

Last thing: I wonder if your background task(s) aren't so smart. Maybe you are reparsing whole po files when you could do a checksum check first to see if they've actually changed? Maybe there are other ways to avoid full reparses (file modification times, file sizes).
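The checksum idea is simple to sketch. A hedged Python example (the helper names and the `seen_digests` store are hypothetical, and as a later comment explains, Launchpad keeps translations in a database rather than files, so this would apply only where files are actually parsed):

```python
import hashlib

def file_digest(path):
    """SHA-256 of a file's bytes, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_reparse(path, seen_digests):
    """Skip the expensive reparse when the file's content is unchanged.

    seen_digests maps path -> digest from the previous run (hypothetical
    persistent store; a dict stands in for it here).
    """
    digest = file_digest(path)
    if seen_digests.get(path) == digest:
        return False  # unchanged: nothing to do
    seen_digests[path] = digest
    return True
```

Modification times and sizes are cheaper still, but a content digest is the only check that is immune to touch-without-change and clock skew.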

Just my 3.149 cents. Again, I know nothing about your code - I am just coming in as an outsider.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

We do not store translations data as files. It's all in the database, and fresh files get generated for every export. The msgids from a PO template are represented in a table called POTMsgSet, and a POTMsgSet can be shared by different versions of the same template. Thus individual string translations are, subject to various rules and restrictions, automatically shared between release series of a project, between Ubuntu releases, and where applicable between an upstream project and its Ubuntu packages.

However, the fact that we use a database does not automatically make things easy and efficient. We're dealing with a fair amount of data, with complex relationships. One particular complication for the statistics is that we allow (though we try not to encourage) different versions of one and the same template to use different translations for msgids that they share. Another source of complications is that we support a range of formats that we convert to during export: not all formats have a data structure in which a translation simply maps English strings to translated strings.

Curtis Hovey (sinzui)
tags: added: regression
Airkm (airkm)
information type: Public → Private
William Grant (wgrant)
information type: Private → Public
Changed in launchpad:
importance: Critical → High