Comment 7 for bug 388926

Donatas Glodenis (dgvirtual) wrote :

Hey Ben, the patched file fixed everything. I ran OCR on the page from my earlier attachment (yes, that attachment was the OCR'ed text in a word processor with the mistakes highlighted, not the page to be OCR'ed), and it corrected every instance of "ų" being wrongly recognized as "ę".

What hex editor did you use to modify the binary file? I tried KDE Okteta, and I could *replace* symbols but not *add* new ones... Anyway, the pairs in the file rec9lit.dat are separated by quotation marks; in some places there is a single quotation mark and in others two in a row: mrn""rnm"nnrm""dcl""cld"ce"ec"li"
How do I know whether a single or a double quotation mark applies?
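
If the hex editor only allows overwriting, bytes can also be inserted with a few lines of Python. This is only a sketch: the function names and the offset are hypothetical, and it assumes that rec9lit.dat tolerates having its length changed, which is not confirmed here. If the format stores offsets or lengths elsewhere, inserting bytes could break it, so work on a copy.

```python
def insert_bytes(data: bytes, offset: int, new: bytes) -> bytes:
    """Return a copy of data with new inserted at offset (nothing is overwritten)."""
    if not 0 <= offset <= len(data):
        raise ValueError("offset is outside the data")
    return data[:offset] + new + data[offset:]


def patch_file(src: str, dst: str, offset: int, new: bytes) -> None:
    """Read src, insert new at offset, and write the result to dst.

    The original file is left untouched so it can be restored.
    """
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(insert_bytes(data, offset, new))
```

For example, patch_file("rec9lit.dat", "rec9lit.patched.dat", offset, new_bytes) would write a patched copy, where offset and new_bytes have to be worked out by inspecting the file first.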

A couple more questions: Are the sources for the Lithuanian dictionary available anywhere? Or could someone convert it to a text format? I have negotiated a 300,000-word dictionary with an institution in Lithuania for use with Tesseract OCR, and I think I could do the same for Cuneiform (that dictionary would be free to use, but not open source, and distributed only in binary format). This dictionary would cover more than 80% of all words occurring in Lithuanian texts... I could experiment with it on Cuneiform and report the results.

Another note: the cp1257 encoding (you guessed it correctly) is the Microsoft default for Windows in Lithuanian, but it is not even an ISO standard. Could we perhaps use UTF-8 encoding instead?
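
To illustrate the difference between the two encodings, here is a minimal Python sketch that converts cp1257 bytes to UTF-8. The function name and the sample string are made up for illustration, and nothing here reflects how Cuneiform handles encodings internally.

```python
def cp1257_to_utf8(raw: bytes) -> bytes:
    """Decode bytes as Windows-1257 (cp1257) and re-encode them as UTF-8."""
    return raw.decode("cp1257").encode("utf-8")


# The nine Lithuanian letters with diacritics, used as sample text.
sample = "ąčęėįšųūž"

# cp1257 is a single-byte encoding: one byte per letter.
legacy = sample.encode("cp1257")

# In UTF-8 each of these letters takes two bytes.
converted = cp1257_to_utf8(legacy)

print(len(legacy), len(converted))
print(converted.decode("utf-8") == sample)
```

The conversion is lossless for Lithuanian text, since every character in cp1257 also exists in Unicode.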

Thank you, Ben, for taking an interest in this.