Cuneiform for Linux

Bug #388926
Comment #9

Ben Jackson (ben.jackson) wrote on 2009-07-08:

About hex editing: Luckily this file has a very simple format, which was easy to reverse-engineer by reading the source. Also, there are several hardcoded entries in the table in spelart.c (plus some #ifdef'd-out ones which are now in the rec9.dat file), so I could tell a lot about what was in the file. The quotes you see are not really quotes; they're just binary flags whose values happen to be ASCII quotes.

To edit the file I used Perl to read the first record (a special 14-byte header) and add 1 to the count of entries. Then I copied out the pre-existing entries and printed a new one which I constructed by hand (Perl's 'pack' makes binary output easy). I will try to add some comments to my script and add it to my bzr branch.
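For anyone who wants to try the same edit, here is a rough Python sketch of the steps above (my own script is Perl and is not shown here). Only the 14-byte header size comes from the description; where the count lives inside the header and how wide it is are assumptions made for illustration (a little-endian 16-bit value at offset 0), and `append_entry` is a hypothetical helper name. The new entry has to be built by hand as raw bytes, since the record layout is not described here.

```python
import struct

HEADER_SIZE = 14  # from the description: the first record is a 14-byte header

# ASSUMPTION (not confirmed anywhere above): the entry count is stored as a
# little-endian unsigned 16-bit integer at the start of the header.
COUNT_FORMAT = "<H"
COUNT_OFFSET = 0


def append_entry(data: bytes, new_entry: bytes) -> bytes:
    """Return a copy of `data` with the count bumped by 1 and `new_entry` appended."""
    if len(data) < HEADER_SIZE:
        raise ValueError("file is shorter than the 14-byte header")

    header = bytearray(data[:HEADER_SIZE])
    existing_entries = data[HEADER_SIZE:]

    # Read the current count and write back count + 1 in place.
    (count,) = struct.unpack_from(COUNT_FORMAT, header, COUNT_OFFSET)
    struct.pack_into(COUNT_FORMAT, header, COUNT_OFFSET, count + 1)

    # Copy the old entries through unchanged, then add the hand-built one.
    return bytes(header) + existing_entries + new_entry
```

You would read the dat file in binary mode, pass its contents through `append_entry`, and write the result to a new file rather than overwriting the original.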

About sources to dat files: I have no idea if sources exist or if the project will ever have them. I am only working with what's in the repository (and only since last week!). The dat files could probably be reverse-engineered so we could have editable versions in the repository which get "compiled" at build time.

About dictionaries in general: There are two kinds of dictionaries: the builtin ones (which must be pretty good for Lithuanian, or otherwise my patch would not work) and "user dictionaries". There are clear functions to call to write new user dictionaries, and I am working on support for that. The format appears to be different from that of the "builtin" dictionaries. You could use several user dictionaries to do what you want (probably not just one: there are size limits on user dictionaries, and 300,000 words might not fit).
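To give an idea of what "several user dictionaries" would mean in practice, here is a small Python sketch that splits one big word list into pieces. I have not pinned down the actual size limit or even its unit, so `max_words` is a placeholder you would have to fill in, and `split_words` is a hypothetical helper, not a cuneiform function.

```python
from typing import Iterable, Iterator, List


def split_words(words: Iterable[str], max_words: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `max_words` words each.

    ASSUMPTION: the limit is expressed as a word count. The real user
    dictionary limit may be a byte size instead.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")

    chunk: List[str] = []
    for word in words:
        chunk.append(word)
        if len(chunk) == max_words:
            yield chunk
            chunk = []
    if chunk:  # leftover words that did not fill a whole chunk
        yield chunk
```

Each chunk would then be written out as its own user dictionary.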

About character encoding: The source has several functions for doing *output* encoding, including UTF-8. However, internally all strings are based on 8-bit BYTE arrays, so each language must be squeezed into some 8-bit encoding like cp1257. Every language has an alphabet (you can easily decipher the rec6*.dat files), but unfortunately the charset is not specified (it may be cp1257 for all of them). Your bug report made me think about how to support that in my dictionary creation program. I didn't see a good way to reuse existing code to allow UTF-8 input, so I will probably require you to provide input in the charset already used by cuneiform.
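If your word list is in UTF-8, you can convert it yourself before handing it over. Here is a minimal Python sketch, assuming the target really is cp1257 (which, as said above, is only a guess); `encode_words` is a hypothetical helper. Words containing characters that do not exist in the 8-bit charset are set aside rather than silently mangled.

```python
from typing import Iterable, List, Tuple


def encode_words(
    words: Iterable[str], charset: str = "cp1257"
) -> Tuple[List[bytes], List[str]]:
    """Encode each word into the 8-bit `charset`.

    Returns (encoded, rejected): `encoded` holds one byte string per word
    that fits the charset, `rejected` holds the words that do not.
    ASSUMPTION: cp1257 is the charset cuneiform uses for this language.
    """
    encoded: List[bytes] = []
    rejected: List[str] = []
    for word in words:
        try:
            encoded.append(word.encode(charset))
        except UnicodeEncodeError:
            # At least one character has no code point in the 8-bit charset.
            rejected.append(word)
    return encoded, rejected
```

The rejected list is worth checking by hand, since it usually points at typos or words from another language.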
