gtxml: error with tags in Mallard-based PO files

Bug #1052399 reported by Daniel Mustieles
This bug affects 1 person
Affects: PyG3T
Status: In Progress
Importance: Undecided
Assigned to: Ask Hjorth Larsen
Milestone: (none)

Bug Description

When analyzing some PO files based on the Mallard format, gtxml doesn't detect errors in tags. For example, see this bug:

https://bugzilla.gnome.org/show_bug.cgi?id=683905

There were some errors in the Spanish and French translations, which broke the compilation of the package, but when I used gtxml to review the files, it found no errors.

Mallard is widely used in GNOME, and we would be really happy if we could check Mallard-based PO files with gtxml.

Could you please review it? If I can help you, please let me know.

Best regards

Tags: gtxml
Revision history for this message
Ask Hjorth Larsen (askhl) wrote :

Hi Daniel

Let's see. From the commit log somewhere in the other bug[1]

 msgstr ""
"Cette Licence couvre tout manuel ou tout autre travail écrit contenant une "
"notice de copyright autorisant la redistribution selon les termes de cette "
-"Licence. Le mot <:quote-1/> se réfère ci-après à un tel manuel ou travail. "
+"Licence. Le mot <_:quote-1/> se réfère ci-après à un tel manuel ou travail. "
"Toute personne en est par définition concessionnaire et est référencée ci-"
"après par le terme <_:quote-2/>."

So the problem is that "<:quote-1/>" is wrong and "<_:quote-1/>" is correct.

The first one is not illegal XML, but clearly we want to identify it as a mistake in this case. Right now we just check that the msgstr is legal XML, which means there could still be mistakes. We have the following options:

 * We can check that the exact XML structure (all tags/nestings/etc.) is identical between msgid and msgstr. This is very aggressive as there may be legitimate reasons to, say, interchange words with markup in sentences. Thus it will lead to error reports for things that do not contain real errors.
 * We can check that no XML tags are used in the msgstr if they are not also used in the msgid. This might be a worthy compromise.
 * We can run the entire Mallard compilation, which is the ultimate authority, but this is really a bit more than we would like to do.
 * We could somehow know all legal tags that can be used in Mallard documents. This will probably be a bit more nasty.
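
As a rough illustration of the second option, here is a minimal Python sketch. It is not gtxml's actual code: the names tag_names and unknown_tags and the regular expression are assumptions made for this example, and the msgid is a hypothetical English string (only the msgstr fragments come from the commit quoted above). A regular expression is used instead of an XML parser so that an odd name such as ":quote-1" is reported rather than rejected.

```python
import re

# Captures the element name of opening and self-closing tags, for example
# "<quote>", "<xref linkend=...>" or "<_:quote-1/>". Closing tags, comments
# and processing instructions are skipped.
TAG_NAME = re.compile(r"<([^\s/>!?][^\s/>]*)")


def tag_names(text):
    """Return the set of element names that appear in text."""
    return set(TAG_NAME.findall(text))


def unknown_tags(msgid, msgstr):
    """Return element names used in msgstr that never occur in msgid."""
    return tag_names(msgstr) - tag_names(msgid)


# Hypothetical msgid; the msgstr fragments are taken from the French file.
msgid = "The word <_:quote-1/> refers to it, and the term is <_:quote-2/>."
msgstr = (
    "Le mot <:quote-1/> se réfère ci-après à un tel manuel ou travail. "
    "après par le terme <_:quote-2/>."
)

print(unknown_tags(msgid, msgstr))  # {':quote-1'}
```

Note that this check only looks in one direction: a tag that is present in the msgid but dropped from the msgstr is not reported.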

Comments, ideas, etc.?

[1] http://git.gnome.org/browse/gtk-doc/commit/?id=8c0b13bbc32c81d20c92a25d4f8c3688291fcf8b

Daniel Mustieles (daniel-mustieles) wrote :

Well... all the options you propose are difficult and/or nasty for the user. In some cases, translators change the order of the XML tags to adapt the string to their native language, which is correct. In these cases, checking the XML structure would be a bad idea, since the tool would generate several false positives.

What about creating a «temporal dictionary» for each string, containing the tags used in that string, and checking whether those tags are also present in the translated string? For example, if the original string has a <_:quote-1/> tag but the translated string has <:quote-1/>, it is probably an error, and the user should be warned. OK, maybe this method is a bit aggressive, since it can report false positives, but it may be better than checking the whole XML structure. Note that, in the end, the translated string must have the same tags as the original one, so if the script detects a missing or incorrect tag, it should warn about it.

What do you think about this idea? Would it be very difficult to implement?

Shaun McCance (shaunm-gnome) wrote :

Note that anything starting with "_:" is an itstool placeholder element for content that's (possibly) in another msgid (or possibly untranslated). It's sort of like "%s" in a C format string, and it's not specific to Mallard. One thing you could probably do, if you know the PO file is from itstool, is make sure the msgstr has the same placeholders as the msgid.

Daniel Mustieles (daniel-mustieles) wrote :

Yes, that's exactly what I meant when I said "temporal dictionary". Just checking that the translated string has the same placeholders as the original one (but not necessarily in the same position, since that may change depending on the translation) would be enough to detect missing placeholders and/or mistakes (e.g. a missing "_") in tags.
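
A position-independent placeholder comparison along these lines could look like the following Python sketch. The function names and the regular expression are assumptions for illustration, not gtxml code, and both example strings are hypothetical.

```python
import re

# itstool placeholders are elements whose name starts with "_:", such as
# <_:quote-1/>. Only those are collected here; ordinary tags are ignored.
PLACEHOLDER = re.compile(r"<(_:[^\s/>]+)\s*/?>")


def placeholders(text):
    """Return the set of placeholder names found in text."""
    return set(PLACEHOLDER.findall(text))


def compare_placeholders(msgid, msgstr):
    """Return (missing, unexpected) placeholder names, ignoring their order."""
    expected = placeholders(msgid)
    found = placeholders(msgstr)
    return expected - found, found - expected


# Hypothetical strings: the translation swaps the order (which is fine)
# and drops the "_" from one placeholder (which is the mistake).
msgid = "Copy <_:quote-1/> before <_:quote-2/>."
msgstr = "Antes de <_:quote-2/>, copie <:quote-1/>."

missing, unexpected = compare_placeholders(msgid, msgstr)
print(missing)     # {'_:quote-1'}
print(unexpected)  # set()
```

Because sets are compared, reordering placeholders never triggers a report. A placeholder that has lost its "_" shows up as missing rather than as unexpected, since it no longer matches the "_:" pattern at all.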

Many thanks Shaun for clarifying it (and nice to read you again!)

Changed in pyg3t:
status: New → In Progress
assignee: nobody → Ask Hjorth Larsen (askhl)
Ask Hjorth Larsen (askhl) wrote :

I've been trying a couple of possibilities. I think the changes to gtxml are ready-ish to be committed, but I'd like your opinions on whether these are real errors or not and perhaps modify the behaviour a bit.

Okay:

Either we check, for each tag in the msgstr, whether that tag is included in the corresponding msgid. This is somewhat strict and will complain if someone wants <application>Something</application> in cases where that tag does not appear in the English version.

Else we check each tag against a collection which could be extracted from all msgids.

The two checks produce vastly different amounts of output (as tested on the Spanish documentation). The strict one yields 2000 lines while the other one yields 100 lines. Here are the two files:

  http://www.student.dtu.dk/~ashj/opendir/gtxml-output.txt
  http://www.student.dtu.dk/~ashj/opendir/gtxml-output-nodatabase.txt

Some examples:

./es/ghex-help.master.es.po, line 1339: Unrecognized element "placeholder-1" found in msgstr
------------------------------------------------------------------------------
#: C/legal.xml:28(legalnotice/para)
msgid ""
"DOCUMENT AND MODIFIED VERSIONS OF THE DOCUMENT ARE PROVIDED UNDER THE TERMS "
"OF THE GNU FREE DOCUMENTATION LICENSE WITH THE FURTHER UNDERSTANDING THAT: "
"<_:orderedlist-1/>"
msgstr ""
"EL DOCUMENTO Y LAS VERSIONES MODIFICADAS DEL MISMO SE PROPORCIONAN CON "
"SUJECIÓN A LOS TÉRMINOS DE LA GFDL, QUEDANDO BIEN ENTENDIDO, ADEMÁS, "
"QUE: <placeholder-1/>"

The "placeholder-1" tag is incorrect. The "compare-to-msgid" method will discover this, but the "database"-method will only discover it if the tag "placeholder-1" is not in fact used anywhere at all.
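
To make the difference between the two methods concrete, here is a small Python sketch. It is only an approximation of the idea, not the code in gtxml: all function names are made up for this example, the msgid and msgstr are shortened from the entry above, and other_msgid is a hypothetical string added to show when the database method stays silent.

```python
import re

TAG_NAME = re.compile(r"<([^\s/>!?][^\s/>]*)")


def tag_names(text):
    """Return the set of element names of opening/self-closing tags."""
    return set(TAG_NAME.findall(text))


def build_database(msgids):
    """Collect every element name used in any msgid."""
    database = set()
    for msgid in msgids:
        database |= tag_names(msgid)
    return database


def check_strict(msgid, msgstr):
    """Compare-to-msgid method: msgstr tags must occur in its own msgid."""
    return tag_names(msgstr) - tag_names(msgid)


def check_database(msgstr, database):
    """Database method: msgstr tags must occur in some msgid."""
    return tag_names(msgstr) - database


msgid = "... WITH THE FURTHER UNDERSTANDING THAT: <_:orderedlist-1/>"
msgstr = "... QUEDANDO BIEN ENTENDIDO, ADEMÁS, QUE: <placeholder-1/>"
other_msgid = "See <placeholder-1/> for details."  # hypothetical

print(check_strict(msgid, msgstr))                                   # {'placeholder-1'}
print(check_database(msgstr, build_database([msgid])))               # {'placeholder-1'}
print(check_database(msgstr, build_database([msgid, other_msgid])))  # set()
```

The last call shows the weakness described above: as soon as any msgid anywhere uses "placeholder-1", the database method accepts it.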

Here is another one:

./es/filters~blur.master.es.po, line 729: Unrecognized element "citation" found in msgstr
------------------------------------------------------------------------------
#: src/filters/blur/introduction.xml:110(para)
msgid ""
"You can find a nice explanation of the Abraham Lincoln effect at <xref "
"linkend=\"bibliography-online-bach\"/>. You will see the Salvador Dali's "
"painting <quote>Gala Contemplating the Mediterranean Sea</quote> turning to "
"an Abraham Lincoln's portrait when looking at it from a distance."
msgstr ""
"Puede ver una interesante explicación, en inglés, del efecto Abraham "
"Lincoln en <citation>Bach04</citation>."

Is "citation" actually an illegal tag or has it been chosen for some good reason?

Here is a case of a correct-looking tag ("guiicon") not being found anywhere in any msgid. But it looks as if it has been used on purpose:

./es/gnote-help.master.es.po, line 620: Unrecognized element "guiicon" found in msgstr
------------------------------------------------------------------------------
#: C/gnote-addin-timestamp.page:22(page/p)
msgid ""
"The Tools button is represented by the <media type=\"image\" "
"src=\"figures/gnote-tools.png\" mime=\"image/png\" style=\"right\"> </media> "
"icon. When you click the Tools icon on the toolbar present on your note, a "
"menu will appear."
msgstr ""
"El botón <guibutton>Herramientas</guibutton> se representa con el icono "
"<media type=\"image\" src=\"figures/gnote-tools.png\" mime=\"image/png\" "
"st...

Ask Hjorth Larsen (askhl) wrote :

We could make it so that the "_:" syntax pointed out by Shaun is checked in the strict way (the placeholder must exist in the msgid), while the remaining tags must be either in the msgid or in the database.

The database can be generated by running gtxml in a special mode like: gtxml --dump-tags *.po > tags.txt
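
The combined rule could be sketched in Python roughly as follows. This is an illustration under assumptions, not the implementation: the function names are invented, the database is simply passed in as a set (the file format written by --dump-tags is not specified here), and all example strings and database contents are hypothetical.

```python
import re

TAG_NAME = re.compile(r"<([^\s/>!?][^\s/>]*)")


def tag_names(text):
    """Return the set of element names of opening/self-closing tags."""
    return set(TAG_NAME.findall(text))


def check_entry(msgid, msgstr, database):
    """Return a sorted list of msgstr tags rejected by the combined rule.

    Placeholders ("_:" names) must occur in the msgid itself.
    Any other tag may occur in the msgid or in the database.
    """
    allowed_here = tag_names(msgid)
    rejected = []
    for name in tag_names(msgstr):
        if name in allowed_here:
            continue
        if name.startswith("_:") or name not in database:
            rejected.append(name)
    return sorted(rejected)


# Hypothetical entry and database, for illustration only.
msgid = "Press the <media/> icon. <_:note-1/>"
msgstr = "Pulse el icono <guibutton>X</guibutton> <media/>. <_:note-2/>"
database = {"media", "guibutton", "_:note-2"}

print(check_entry(msgid, msgstr, database))  # ['_:note-2']
print(check_entry(msgid, msgstr, set()))     # ['_:note-2', 'guibutton']
```

With the database, "guibutton" is accepted because some other msgid uses it, while "_:note-2" is still rejected even though it is in the database, because placeholders are tied to their own msgid.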

Daniel Mustieles (daniel-mustieles) wrote :

I've seen the reports you've generated, and the number of incorrect tags in the PO files is really impressive. I guess we should fix them, since the tags in the msgid should be the same as in the msgstr, am I right?

I would do the following check: if there is a tag in the msgid that doesn't appear in the msgstr, I would raise an error message; and the same for placeholders that begin with "_".

About the tag database... I'm not sure how it works now. Does the database created with --dump-tags contain all the tags from the PO files, regardless of the string they appear in, or is there some relation between tags and strings?

If the strict check also detects the placeholders, it's OK for me, considering that the tags in the msgid should be the same as the ones in the msgstr.

Shaun, what do you think about it?

Many thanks for taking care of this :)

Ask Hjorth Larsen (askhl) wrote :

Daniel, what is roughly the rate of false positives in the strict one? Are there many, a few, zero?

On some occasions I have written a translation which mentions the program name (with <app>...</app>) less than in the original string, as that's what comes most naturally in Danish. Thus a one-to-one correspondence of tags would probably be too aggressive.

I'm mostly surprised (also in other languages) about the number of cases where one legal tag is replaced by another apparently legal tag. I suppose this happens when the original string used the wrong one and was later changed, making the existing translation fuzzy but with a difficult-to-spot difference in tags.

Ask Hjorth Larsen (askhl) wrote :

I pushed the recent changes to trunk in case any readers are interested in trying.

Check tags strictly:
  gtxml --tags *.po

Dump all tags from msgids into file:
  gtxml --dump-tags *.po > database.txt

Check tags allowing everything from the database:
  gtxml --tags-from database.txt --tags *.po

(it's a bit sketchy that --tags is also required in the last one, I will at the very least change that before release)

Daniel Mustieles (daniel-mustieles) wrote :

But the question is... are those tags mandatory? If all the tags are mandatory, the check should be completely aggressive, to verify that all the tags in the translated string are the same as in the original one.

On the other hand, even if having exactly the same tags in the msgstr and msgid is not mandatory, we should keep the same tags in both fields, since some of them may become deprecated, causing .page files not to be shown properly.

So, answering your question: in the case of the report for the Spanish documentation, all the errors detected look valid to me, and fixing them should be mandatory. If there is no need to have the same tags in both fields, well, we should still fix them, but without haste ;)

Ask Hjorth Larsen (askhl) wrote :

(Disclaimer: I have no idea what I'm talking about)

Tags are not mandatory. After all this is just a documentation utility. What it does is to generate the document from whatever strings it loads, and the output will be fine whichever tags it happens to contain (quite independently of the original msgids). Unless of course some tags are invalid in which case the output is messed up.

So we have the choice: Shall we complain only about tags that we know to be wrong, or should we complain every time the translator has forgotten/chosen to use an unforeseen (by us) tag? The programme should probably provide both options, but someone who knows about Mallard could perhaps provide some "exact constraints" to help.

I'll send some (large) error reports to the i18n list and we'll probably get some more opinions.

Daniel Mustieles (daniel-mustieles) wrote :

Well, I also don't know how it works, so my opinion may not be valid :-(

If tags are mandatory, of course we should check and fix them, but if they aren't, we might still be missing some functionality. But apart from that question (in the worst case, the whole documentation page will look like plain text, without any markup), I see another problem here: how do we explain it to translators? I always teach translators to keep an eye on tags when translating, because if they don't, there may be problems when compiling, etc. If they realize that some tags are not mandatory, they will stop including those tags in the translated strings, and so on.

Since we are not Mallard gurus and neither of us knows how it works under the hood, maybe Shaun could shed some light on it.
