The "Upload files for Template" page is written confusingly

Bug #264122 reported by Andrew Sayers
This bug affects 1 person
Affects: Launchpad itself
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

This is the third in a series of three reports based on question #43756, suggesting ways to improve the translation interface.

The text on the "Upload files for Template" page is confusing to users. An example of this page is:

https://translations.launchpad.net/remote-help-assistant/trunk/+pots/remote-help-assistant/+upload

The text currently reads:

Here you can upload either a single PO template (.pot) or a tar file containing a PO template and a set of PO files (.tar, .tar.gz or .tar.bz2). The files you upload will be imported into Launchpad shortly.

This page doesn't make it clear that the file(s) will go through an automatic review, how long "shortly" is, how this page differs from the main upload page, or what happens when strings in the old template don't appear in the new one. The first sentence is also needlessly different from the one on the main upload page. A better message would be:

<p>Upload either a single file (.pot / .po) or a tar file containing a PO template and a set of PO files (.tar, .tar.gz, .tgz or .tar.bz2). The files you upload will be automatically reviewed before being imported into Launchpad. The automatic review should take a few hours. In the unlikely event that the automatic review fails, an admin will review your upload and import it in a few days.</p>

<p>Once your upload has been reviewed, Launchpad will update the translation pages for this template. Strings that don't appear in the new template will be removed from the translation pages, so if you have updated some strings - such as to correct spelling mistakes - then you should <a href="+export">download everything associated with your current template as a collection of PO files</a> before uploading your new template. You can open PO files in a text editor and read through them to help write new translations. Alternatively, you can upload an old translation template to bring back its associated translations.</p>

<p>If you would rather leave this template unchanged, and add another template to your project instead, <a href="../../+translations-upload">click here</a>.</p>

Note that the second paragraph addresses a problem I've encountered, where some translations were hidden when I rewrote strings while moving them into a Glade file. If Launchpad handles that sort of thing transparently in normal cases, replace the second paragraph above with this:

<p>Once your upload has been reviewed, Launchpad will update the translation pages for this template. Launchpad will usually spot updated strings and move the translations across, but can miss strings that have changed too much. If Launchpad fails to move some of your translations across, you should upload your old template to get translations back, <a href="+export">download everything associated with the old template</a> as a collection of PO files, then upload the new template again. You can open PO files in a text editor and read through them to help manually import translations.</p>

description: updated
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

We can't throw too much text at the user, so the question is: does this cover the biggest problems we can address with a few words on this page? I'm not convinced that explaining the basics of PO templates is worth it—especially since any surprises come _after_ the user has skipped over this text.

We also can no longer afford to make these explanations completely specific to gettext.

Revision history for this message
Andrew Sayers (andrew-bugs-launchpad-net) wrote :

In my opinion, the most important thing is the suggestion that users download the old templates, as I've got into trouble before by not doing so. How about moving the text about automatic vs. manual review to the "translation status" page?

Incidentally, much of the text on this page is an attempt to document issues in LP rather than fix them - the text could be much shorter if the issues were fixed. If there are resources available to fix the bugs alluded to, I can post reports discussing other solutions.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

There are some things that we can improve, given the time. But I don't see any actual bugs alluded to here, and the issues won't go away completely! For example, changing an English string in a template means removing one and adding another—but that's just basic gettext operation. There's no way for software to decide whether an old string and a new one are "the same one, just differently worded" or very different messages that may or may not happen to be in the same place. The best we can do is recognize some specific patterns and/or require more information from the user.
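To make that concrete with a made-up example: a translator has already translated a string, and the developer then fixes a typo in it. gettext sees the removal of one message and the addition of another, so the existing translation is simply left behind:

    # Old template, with a translation already attached:
    #: src/main.c:42
    msgid "Colour of the backgruond"
    msgstr "Farbe des Hintergrunds"

    # New template: nothing matches the old msgid exactly, so the
    # translation above is no longer attached to anything.
    #: src/main.c:42
    msgid "Colour of the background"
    msgstr ""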

The same goes for auto-approval: manual review is needed when the application does not get enough information to decide what to do with a file. It's already possible for the uploader to provide more of that information (you can upload to a specific template or translation) but they usually don't, and the system is such that it's usually not necessary. We can do many things to improve the system's guess at what the uploader intends, but it's never going to be perfect for all cases.

Revision history for this message
Andrew Sayers (andrew-bugs-launchpad-net) wrote :

I guess I wasn't being clear about the underlying issue. Here's the tale of woe that led me to this report:

I had uploaded a translation template, which was then translated into a couple of other languages by other people. Then I rearranged the program, changed a lot of text (sometimes correcting typos, sometimes rewriting completely) and uploaded a new template. When the new template went through, all the old translated text vanished unless there was an identical match in the new template. Given the amount of work we'd put into finding the correct translations for technical terms, it was pretty demoralising for us to see our work yanked away - it was also quite embarrassing for me personally that I'd destroyed people's work.

I hadn't kept a note of which revision of the program I'd used to generate the original template, so even when I realised that it might be possible to get texts back by uploading old templates, my only choice was to keep uploading older versions until I found one that had all the right translations in it. Because of the amount of time reviews take, that meant stopping translation work for the best part of a week. When I finally got all the translations I needed, I was able to reinstate most of the translations myself (a semicolon is a semicolon in any language), and proper translators were able to trivially rewrite the remainder of the old text.

In my opinion, the biggest bug in that situation was that LP doesn't give you any way to view translations for old templates - if we could have just seen the old translations, we could have copy/pasted 90% of it in a few hours. I agree that there's no perfect way for a computer to detect whether two strings have similar meanings, but if two similar strings occur at the same point in the same file, it seems sensible to me to use translations of the first as an initial suggestion, albeit marked for review.
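As a sketch of what I mean (hypothetical names, using Python's difflib for the similarity test):

    import difflib

    def suggest_translation(new_msgid, old_catalog, cutoff=0.6):
        """Suggest a translation for a new English string by finding the
        most similar msgid in the old template's catalog (a dict mapping
        msgid -> translation). Any hit is a suggestion only, and should
        be flagged for human review."""
        matches = difflib.get_close_matches(
            new_msgid, list(old_catalog), n=1, cutoff=cutoff)
        if matches:
            return old_catalog[matches[0]]
        return None

Restricting the candidates to strings from the same file (or the same source location) would keep the comparison cheap and cut down on false positives.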

My ideal solution would be for LP to maintain translations in a Bazaar branch, so that people could use the standard tools to merge/revert/etc. old translations. That's way more work than I'm willing to ask of others, though, which is why I've limited myself to asking you to document the problems well enough that users can work around them.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

I like the ideas of showing old strings somehow (besides them being included in the export) and trying to make suggestions look at possible changes in the original strings. The latter is something we couldn't do before because finding suggestions was already too performance-intensive, but that's improved a lot now.

We are considering various interactions with bzr for the more distant future, but it's not likely to be a bi-directional exchange, since that would open up cans of worms like dealing with text conflicts and diffs that may not respect message boundaries. More likely are auto-import of templates from bzr, and auto-export of translations to bzr.

I've registered the idea of showing suggestions based on a guess that a string may have changed as a blueprint: https://blueprints.launchpad.net/rosetta/+spec/suggestions-for-changed-msgids

Revision history for this message
Andrew Sayers (andrew-bugs-launchpad-net) wrote :

I actually have quite a bit of experience with efficient searches in large bodies of text - would it be worth me sitting down with someone and talking about strategies for improving performance?

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

It's not so much a matter of searching text, as it is of adding database queries. So I think the main thing is to make sure we don't resort to doing this in cases where it's not going to help.

Revision history for this message
Andrew Sayers (andrew-bugs-launchpad-net) wrote :

If you're saying that translations are done with a MySQL database doing "select foreign.text from foreign natural join native where native.text = 'foo'", then I'd definitely like to talk to someone - you should be able to get better performance by using a specialised structure like a suffix tree, or better functionality by integerising each word and searching for strings of integers.
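A minimal sketch of the integerising idea (hypothetical, in Python):

    from collections import defaultdict

    class IntegerisedIndex:
        """Assign each distinct word a small integer, store messages as
        tuples of integers, and look up exact matches through a dict."""

        def __init__(self):
            self.word_ids = {}                 # word -> integer id
            self.messages = defaultdict(list)  # word-id tuple -> message ids

        def add(self, message_id, text):
            key = tuple(self.word_ids.setdefault(w, len(self.word_ids))
                        for w in text.split())
            self.messages[key].append(message_id)

        def lookup(self, text):
            key = []
            for word in text.split():
                word_id = self.word_ids.get(word)
                if word_id is None:
                    return []  # unseen word: no exact match is possible
                key.append(word_id)
            return self.messages.get(tuple(key), [])

Comparing tuples of small integers is cheaper than comparing the strings themselves, and the word table grows with the vocabulary rather than with the number of messages.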

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Not MySQL, at any rate - but let's not get ahead of ourselves. :-) We're talking about pretty large-scale stuff, with full-text indexing (fti), MD5 hashes and so on, and we may well end up finding that the best solution does not rely on text search at all.

Revision history for this message
Andrew Sayers (andrew-bugs-launchpad-net) wrote :

I work in corpus linguistics, so I'm used to dealing with gigabytes of text, but I'm also used to much longer documents than it would be useful to search by MD5 hash :)

If you don't mind my asking, doesn't a full-text index leave you with a gigantic index file and constant random reads from the disk? If so, do you have any data on whether the new solid-state disks improve performance there?

I've had a quick look through the list of Rosetta blueprints, which makes me more curious about the possibilities for cross-pollination. Is there a standard list of use cases that I could look at? It would be useful to know what the "best" solution is in your domain, as distinct from my preconceptions.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Right, these are lots of small strings, not large texts. I don't know much about the index sizes and such, nor the hardware we use, except "try to keep your indexes cached." :-) And yes, that means we're careful not to overuse fti. It also helps to restrict searches so that you don't get large numbers of rows to search in the first place.

Revision history for this message
Andrew Sayers (andrew-bugs-launchpad-net) wrote :

I don't suppose you could point me in the direction of someone who handles those nuts-and-bolts issues? As I say, it sounds like we could learn a lot from each other.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

We don't really have a single person for that. Several of us have experience dealing with these issues though.

The problem is... finding time! Time we spend discussing the details now is time we can't spend on more pressing issues. Also, we find that if we go into detail too early, all too often too much has changed by the time we come to implement: we forget things, the people who were in the discussion are busy, circumstances change, or we come up with a shortcut that makes the whole thing unnecessary.

Revision history for this message
Andrew Sayers (andrew-bugs-launchpad-net) wrote :

I don't suppose there's a practical way to get hold of (a few gigabytes of) Ubuntu's translation database? I'm working for Birmingham University in the UK, at the Centre for Corpus Research, so a large parallel corpus under a permissive license would probably interest some of the linguists around here. It would also give me a chance to understand your particular issues without taking up too much of your time, and could lead to research that turns up something useful for you.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

We've strayed pretty far from what should be in a bug ticket, so I'm closing this one and replacing it with bug 281165.

If you have an Ubuntu system, one way you could get the translations is by looking for the .mo files installed with language packs and "decompiling" them back to PO files.
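For instance, something along these lines would decompile every German catalog it finds (the language-pack path is what I'd expect on a typical Ubuntu system, so adjust as needed):

    import glob
    import subprocess

    # msgunfmt, from GNU gettext, turns a compiled binary .mo catalog
    # back into an editable .po file.
    for mo_path in glob.glob("/usr/share/locale-langpack/de/LC_MESSAGES/*.mo"):
        po_name = mo_path.rsplit("/", 1)[-1].replace(".mo", ".po")
        subprocess.run(["msgunfmt", mo_path, "-o", po_name], check=True)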

Having the text won't help much with performance analysis, however. For that you'd need to know lots of things ranging from the hardware configuration to database setup to usage patterns, and all of that is not exactly something I have on a sheet of paper somewhere that I could hand you. But first off you'd have to know exactly what it is you're trying to do with the text. At the moment it sounds like you're trying to re-invent the message-matching logic already built into gettext.
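That matching is what msgmerge does, for instance: it carries exact matches across and marks near matches as "fuzzy" for a human to review. A typical invocation (file names made up):

    import subprocess

    # Merge an existing translation against a new template. Unchanged
    # messages keep their translations; similar ones become fuzzy entries.
    subprocess.run(
        ["msgmerge", "--update", "de.po", "new-template.pot"],
        check=True,
    )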

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Replacing with more specific bug 281165

Changed in rosetta:
status: New → Invalid
Revision history for this message
Данило Шеган (danilo) wrote :

Actually, to get most of the data we have you can fetch our complete language pack source tarballs (get a "base" pack from https://translations.launchpad.net/ubuntu/intrepid/+language-packs). We also have a lot of data for previous Ubuntu releases, and a lot of data that is only considered as translation suggestions.

Since all of the translations we have now come either from upstream projects' translations or from Launchpad (which are BSD-licensed), there is no legal problem with us sharing all the data, and we'd be happy to do so. The only problem at this time is actually doing the work to share it all (there are some legal issues about sharing our database model, but that is going to be resolved in the coming months when Launchpad is made free software).

What you are actually interested in is https://blueprints.launchpad.net/rosetta/+spec/rosetta-fuzzy-merge (that one is pretty incomplete, but we already have some ideas on how to approach the problem). If you are interested in collaborating on the topic, we'd be very happy to talk about it.

Revision history for this message
Andrew Sayers (andrew-bugs-launchpad-net) wrote :

Sorry about letting this report wander; I'll try to wind it down. I'll look at the translation packs later in the week, which should tell me what I need to know in order to discuss things usefully - I'm more interested in the quantity of data and the distribution of words than in the database model itself. If there's significant interest from linguists, I might ask you for old/orphaned text later on.

I'd be interested in collaborating on fuzzy merging, although as a user I'd be more interested in an approach that could suggest translations of specific technical terms, as well as possible matches for whole strings. That sort of thing shouldn't be too hard to do by analysing frequencies of different words. Where would be a better place to talk about this stuff?
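As a very rough sketch of the frequency idea (hypothetical names, in Python): count which target-language words co-occur with a given English term across already-translated message pairs, and offer the most frequent co-occurrers as candidate translations of that term.

    from collections import Counter

    def term_candidates(pairs, term, top_n=5):
        """pairs: iterable of (english_text, translated_text) messages.
        Returns the target-language words that most often appear in
        translations of messages containing the given English term."""
        counts = Counter()
        for source, target in pairs:
            if term in source.split():
                counts.update(set(target.split()))
        return counts.most_common(top_n)

This is naive - no stemming or stop-word filtering - but on a corpus the size of Ubuntu's translations, even simple co-occurrence counts should surface a lot of consistent technical vocabulary.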

tags: added: ui