Lithuanian text recognition: wrong recognition of "ų" as an "ę"

Bug #388926 reported by Donatas Glodenis
Affects              Status       Importance  Assigned to  Milestone
Cuneiform for Linux  In Progress  Undecided   Unassigned
Baltix               New          Undecided   Unassigned

Bug Description

Using cuneiform 0.7 on Ubuntu 9.04

When OCR-ing a Lithuanian text with the switch "-l lit", a large number of the letters "ų", which usually occur at the end of a word, get recognized as "ę".

If someone pointed me to the source file I have to check, I am pretty certain that the solution is simple, as the mistake is very simple. However, I cannot find the file: the closest matches, datafiles/*lit.dat, are binary and I cannot edit those...

Donatas Glodenis (dgvirtual) wrote :
Jussi Pakkanen (jpakkane) wrote :

Unfortunately there is no documentation on how the code actually works. Source diving and debugger hunting are your only choices.

Donatas Glodenis (dgvirtual) wrote :

Well, the problem is, there seem to be no source files for languages - the files I mentioned in the source package are actually binary...

abrek (abrek) wrote :

These files are actually data files. It is possible that no source code exists for them, that is, they are already in their original form. Whether or not this is true, we simply don't know. As Jussi said, not much is known about how the code actually works.

Ben Jackson (ben.jackson) wrote :

Assuming there is a Lithuanian dictionary (or you create a 'user dictionary' which I am almost done adding support for) then I believe the key to making this work is to create a suitable datafiles/rec9lit.dat entry which tells spelart.c that "ų" and "ę" are sometimes confused for each other. This is the same list that knows that 'rn' looks like 'm' and 'vv' looks like 'w'. The rec9lit.dat is just a copy of the default English one.

Here are some notes as I investigate.

0) I don't understand the source image: It seems to be a screenshot of a web browser showing the *bad* output? It's not the thing to OCR, is it??

1) rec6lit.dat defines the Lithuanian alphabet (the char to BYTE mapping, essentially). (6 is alphabet files, lit is the abbreviation for Lithuanian)

2) based on the contents of rec6lit.dat and *no* knowledge of Lithuanian at all my conclusion is that the charset of that file is cp1257. (that's consistent with mentions of 1257 in the code) (this picture was useful: http://www.borgendale.com/codepage/cp1257.gif )

3) ...in fact, all of the internal string representations of BYTE seem to be cp1257

4) (there's a bug in InitializeAlphabet where it uses a global instead of the passed in arg, which was breaking my dictionary builder! does not need to be fixed directly for this problem, though)
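
To illustrate notes 2 and 3, here is a minimal Python sketch (not part of cuneiform) showing that cp1257 stores every Lithuanian letter in a single byte, so "ų" and "ę" are simply two different BYTE values. It relies only on Python's built-in cp1257 codec; the names `LITHUANIAN_LETTERS` and `cp1257_table` are made up for this example.

```python
# Sketch only: shows that cp1257 maps each Lithuanian letter to one byte.
# The names below are illustrative and do not come from the cuneiform source.
LITHUANIAN_LETTERS = "ąčęėįšųūž"


def cp1257_table(letters):
    """Return {letter: byte value} using Python's built-in cp1257 codec."""
    table = {}
    for ch in letters:
        encoded = ch.encode("cp1257")
        if len(encoded) != 1:
            raise ValueError(f"{ch!r} is not a single byte in cp1257")
        table[ch] = encoded[0]
    return table


table = cp1257_table(LITHUANIAN_LETTERS)
for ch, value in table.items():
    print(f"{ch} -> 0x{value:02X}")
```

Printing the table next to a hex dump of rec6lit.dat is one way to check the charset guess for yourself.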

Ok, I have successfully made a modified rec9lit.dat and attached it (to the bug). It tells the spelling code about your pair of letters. This will cause it to try both variations against the stock dictionary and any user dictionaries. I can see it is trying both even for your jpg (which has the wrong letter, if I understand correctly). I don't know if the dictionary that comes with cuneiform knows the words you are having trouble with. If not, you will need my user dictionary support as well. I'm still waiting for email about that to appear on the list :(
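
The idea of trying both variations against a dictionary can be sketched in Python as follows. This is not the logic in spelart.c, only an illustration of the technique: the confusion table, the toy word set and the function names are all assumptions made for the example.

```python
# Illustrative confusion table; the real one lives in rec9lit.dat / spelart.c.
CONFUSIONS = {"ę": ["ų"], "ų": ["ę"], "rn": ["m"], "m": ["rn"]}


def variants(word):
    """Return every spelling reachable by swapping confusable substrings."""
    if not word:
        return {""}
    results = set()
    # Option 1: keep the first character unchanged.
    for tail in variants(word[1:]):
        results.add(word[0] + tail)
    # Option 2: replace a confusable prefix with each alternative.
    for src, alts in CONFUSIONS.items():
        if word.startswith(src):
            for tail in variants(word[len(src):]):
                for alt in alts:
                    results.add(alt + tail)
    return results


def correct(word, dictionary):
    """Prefer the recognized word; otherwise pick a variant the dictionary knows."""
    if word in dictionary:
        return word
    candidates = sorted(v for v in variants(word) if v in dictionary)
    return candidates[0] if candidates else word


toy_dictionary = {"namų", "vaikų"}  # toy word set, not cuneiform's dictionary
print(correct("namę", toy_dictionary))
print(correct("vaikę", toy_dictionary))
```

A word that has no variant in the dictionary is returned unchanged, which matches the point above: the pair only helps if the dictionary knows the correct word.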

Yury V. Zaytsev (zyv) wrote :

Wow, Ben, this is just to say that I am incredibly impressed by your results. Too bad I am of no use here... :-(

Donatas Glodenis (dgvirtual) wrote :

Hey Ben, the patched file changed it all - I tried to ocr the page I have previously included in the attachment (yes, it was ocr'ed text in word processor with mistakes highlighted, not the page to ocr) and it corrected all the instances of the wrong recognition of "ų" as "ę".

What hex editor did you use to modify the binary file? I tried to use KDE Okteta, and I could *replace* symbols, but not *add* new ones... Anyway, there are quotation marks with each pair in the file rec9lit.dat; in some cases there is only one quotation mark, and in other cases two: mrn""rnm"nnrm""dcl""cld"ce"ec"li"
How do I know whether single or double quotation marks apply?

A couple more questions: Are there sources anywhere for the Lithuanian dictionary? Or could someone convert it to a text format? I have negotiated a 300 000 word dictionary with one institution in Lithuania to be used with Tesseract OCR, and I think I could do the same for Cuneiform (that dictionary would be free to use, but not open source, and distributed only in binary format). This dictionary would cover > 80% of all words occurring in Lithuanian texts... I could try to experiment with it on Cuneiform and report the results.

Another note: the cp1257 encoding (you guessed it correctly) is the Microsoft default for Windows in Lithuanian, but it is not even an ISO standard. Could we perhaps use UTF-8 encoding instead?

Thank you Ben for taking interest in this

Donatas Glodenis (dgvirtual) wrote :

The modified binary file rec9lit.dat, attached to this bug report, completely solves the problem.

Changed in cuneiform-linux:
status: New → In Progress
Ben Jackson (ben.jackson) wrote :

About hex editing: Luckily this file is a very simple format which was easy to reverse-engineer from reading the source. Also, there are several hardcoded entries in the table in spelart.c (plus some #ifdef'd out ones which are now in the rec9.dat file) so I could tell a lot about what was in the file. The quotes you see are not really quotes, they're just binary flags which happen to be ASCII quotes.

To edit the file I used perl to read the first record (a special 14 byte header) and add 1 to the count of entries. Then I copied out the pre-existing entries and then printed a new one which I constructed by hand (perl can 'pack' to make binary output easily). I will try to add some comments to my script and add it to my bzr branch.
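
The same read-header, bump-count, append-record approach might look like this in Python. Only the 14-byte header size comes from the description above; the position and width of the count field and the 8-byte record layout are HYPOTHETICAL placeholders, so this sketch will not work on a real rec9lit.dat without adjusting them to the actual format.

```python
import struct

HEADER_SIZE = 14  # from the description above
# HYPOTHETICAL layout, for illustration only: the entry count is assumed to be
# a little-endian 16-bit integer at offset 0, and records are assumed to be a
# fixed 8 bytes each. The real file format may differ.
COUNT_FORMAT = "<H"
RECORD_SIZE = 8


def append_record(data: bytes, record: bytes) -> bytes:
    """Return a copy of `data` with `record` appended and the count bumped."""
    if len(record) != RECORD_SIZE:
        raise ValueError("record has the wrong size for the assumed layout")
    header, body = data[:HEADER_SIZE], data[HEADER_SIZE:]
    (count,) = struct.unpack_from(COUNT_FORMAT, header, 0)
    count_size = struct.calcsize(COUNT_FORMAT)
    new_header = struct.pack(COUNT_FORMAT, count + 1) + header[count_size:]
    return new_header + body + record


# Build a fake two-record file in memory rather than touching a real one.
fake_header = struct.pack(COUNT_FORMAT, 2) + bytes(HEADER_SIZE - 2)
fake_file = fake_header + b"rn\0\0m\0\0\0" + b"vv\0\0w\0\0\0"

# struct pads each 4-byte field with NUL bytes.
new_record = struct.pack("<4s4s", "ų".encode("cp1257"), "ę".encode("cp1257"))
updated = append_record(fake_file, new_record)
print(struct.unpack_from(COUNT_FORMAT, updated, 0)[0], len(updated))
```

Working on an in-memory copy and writing the result to a new file keeps the original data file intact if the layout guess turns out to be wrong.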

About sources to dat files: I have no idea if sources exist or if the project will ever have them. I am only working with what's in the repository (and only since last week!). They could probably be reverse-engineered so we could have editable versions in the repository which get "compiled" at build time.

About dictionaries in general: There are two kinds of dictionaries: the builtin ones (which must be pretty good for Lithuanian or otherwise my patch would not work) and "user dictionaries". There are clear functions to call to write new user dictionaries, and I am working on support for that. The format appears to be different than the "builtin" dictionaries. You could use several user dictionaries to do what you want (probably not one, there are size limits on user dictionaries which might not fit 300,000 words).

About character encoding: The source has several functions for doing *output* encoding, including UTF8. However, internally all strings are based on 8-bit BYTE arrays. So each language must be compressed into some 8-bit encoding like cp1257. Every language has an alphabet (you can easily decipher the rec6*.dat files) but unfortunately the charset is not specified (it may be cp1257 for all of them). Your bug report made me think about how to support that in my dictionary creation program. I didn't see a good way to reuse existing code to allow UTF8 input so I will probably require you to provide input in the charset already used by cuneiform.
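
If your word list is in UTF-8, converting it to the 8-bit charset beforehand is straightforward. The following Python sketch assumes cp1257 is the right target for Lithuanian (as guessed earlier in this thread) and uses a made-up function name; it is not part of the dictionary tool.

```python
def to_cuneiform_bytes(words, charset="cp1257"):
    """Encode words into an 8-bit charset, collecting those that do not fit."""
    encoded, rejected = [], []
    for word in words:
        try:
            encoded.append(word.encode(charset))
        except UnicodeEncodeError:
            # The word contains a character the 8-bit charset cannot represent.
            rejected.append(word)
    return encoded, rejected


# Stand-in for the contents of a UTF-8 word list file.
utf8_input = "namų\nvaikų\n日本\n".encode("utf-8")
words = utf8_input.decode("utf-8").split()

encoded, rejected = to_cuneiform_bytes(words)
print(len(encoded), "words encoded;", len(rejected), "rejected")
```

Collecting rejected words instead of silently dropping or replacing characters makes it easier to spot entries that would otherwise end up misspelled in the dictionary.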
