Add keyboard layout for Scottish Gaelic (gd)

Bug #1367210 reported by GunChleoc
Affects | Status | Importance | Assigned to | Milestone
ubuntu-keyboard | Fix Released | Medium | GunChleoc |
ubuntu-keyboard (Ubuntu) | Fix Released | Undecided | Unassigned |

Bug Description

I am working on creating a keyboard layout for Scottish Gaelic, and I have a question:

Looking at the other languages, the database for predictive texting seems to be filled from a sample text (e.g. Buddenbrooks for de, Les trois mousquetaires for fr). We actually have a lexical database at our disposal that we already used for predictive texting in the Adaptxt keyboard for Android. How do you recommend we proceed for Ubuntu? Should we turn the database data into a plain text file? How is the database for Ubuntu Keyboard then generated?


GunChleoc (gunchleoc)
Changed in ubuntu-keyboard:
assignee: nobody → GunChleoc (gunchleoc)
Michael Sheldon (michael-sheldon) wrote :

Hi! It's awesome that you're developing a Scottish Gaelic keyboard for Ubuntu, thanks!

The predictive text data is stored in SQLite databases, each containing three n-gram tables (1-, 2- and 3-grams) named "_1_gram", "_2_gram" and "_3_gram". Each table has one column per word in the n-gram, plus a count of how many times that n-gram was encountered.

So the _1_gram table is of the structure:

word | count

_2_gram has the structure:

word_1 | word | count

and _3_gram has the structure:

word_2 | word_1 | word | count

The word columns are all text and the count column is an integer. Somewhat confusingly, the highest-numbered "word_" column holds the first word of the n-gram and "word" always holds the last one. So an example from the _3_gram table would be:

seemed | to | him | 27

Meaning that in training it has seen the phrase "seemed to him" 27 times.
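
For reference, the layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the in-memory database and the helper name create_ngram_tables are made up for the example, and any constraints or indexes the real keyboard databases may carry are not shown because they are not described here.

```python
import sqlite3


def create_ngram_tables(conn):
    """Create the three n-gram tables with the column layout described above.

    Assumption: plain TEXT/INTEGER columns with no constraints or indexes.
    """
    conn.execute('CREATE TABLE IF NOT EXISTS _1_gram (word TEXT, "count" INTEGER)')
    conn.execute(
        'CREATE TABLE IF NOT EXISTS _2_gram (word_1 TEXT, word TEXT, "count" INTEGER)'
    )
    conn.execute(
        'CREATE TABLE IF NOT EXISTS _3_gram '
        '(word_2 TEXT, word_1 TEXT, word TEXT, "count" INTEGER)'
    )


# In-memory database, used here only for demonstration.
conn = sqlite3.connect(":memory:")
create_ngram_tables(conn)

# The example row: "seemed to him" seen 27 times.
# word_2 is the first word of the phrase, word is the last.
conn.execute("INSERT INTO _3_gram VALUES (?, ?, ?, ?)", ("seemed", "to", "him", 27))

row = conn.execute(
    'SELECT "count" FROM _3_gram WHERE word_2 = ? AND word_1 = ? AND word = ?',
    ("seemed", "to", "him"),
).fetchone()
print(row[0])  # prints 27
```

Passing the words as query parameters (the ? placeholders) rather than pasting them into the SQL string means that words containing apostrophes need no manual escaping.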

We use the text2ngram utility provided by the presage project (http://presage.sourceforge.net/) to generate these databases from ebooks (which isn't ideal, since ebooks don't fully represent more conversational writing styles), but you might find it easier to convert your database directly, depending on how it's formatted.

Any further questions just let me know :)

Cheers,
Mike.

GunChleoc (gunchleoc) wrote :

I tried running text2ngram to convert the data to sqlite format. No luck though:

text2ngram -o testxx -f sqlite teacsa.txt
Parsing teacsa.txt...
0---10---20---30---40---50---60---70---80---90--100
###################################################
Writing out to sqlite format file testxx...
0---10---20---30---40---50---60---70---80---90--100
[DatabaseConnector] Error executing SQL: 'INSERT INTO _1_gram VALUES('', 1130);' on database: 'testxx' : unrecognized token: "'"
terminate called after throwing an instance of 'SqliteDatabaseConnector::SqliteDatabaseConnectorException'
  what(): unrecognized token: "'"
Aborted (core dumped)

I tried escaping all the ' as \' and deleting the blank line at the end of the file. The database contains only one empty table, so the problem occurs right at the start.

GunChleoc (gunchleoc) wrote :

P.S. the text file looks like this:

sa bhad ,
sa bhad ,
sa bhad ,
sa bhad ,
sa bhad ,
sa Bheurla ,
sa Bheurla ,
sa Bheurla ,
sa Bheurla ,
sa Bheurla ,
sa cheud ,
sa cheud ,
sa cheud ,
sa cheud ,
sa cheud ,

etc.

Bill Filler (bfiller)
Changed in ubuntu-keyboard:
status: New → Confirmed
importance: Undecided → Medium
Michael Sheldon (michael-sheldon) wrote :

Sorry for the delay in getting back to you on this; it seems I forgot to subscribe to this bug. It looks like there's a bug in presage itself with respect to apostrophes. I've filed a bug for this here: https://bugs.launchpad.net/ubuntu/+source/presage/+bug/1384800 and created a patch to our presage package which will fix it: https://code.launchpad.net/~michael-sheldon/ubuntu/utopic/presage/fix-apostrophes

Could you add your input text file (teacsa.txt in the command above) as an attachment to this bug so I can verify that the patch fixes your issue?

GunChleoc (gunchleoc) wrote :

Here it is - thanks for looking into this!

This is the full text file we would use to create the ngrams

Michael Sheldon (michael-sheldon) wrote :

Found the problem: the file contains a mix of Unix and DOS line endings. If you run it through dos2unix first, it should work fine. You'll probably also want to remove the backslashes, as the current version of presage splits on apostrophes regardless of whether they are preceded by a backslash. Once this lands: https://code.launchpad.net/~michael-sheldon/ubuntu/vivid/presage/fix-apostrophes/+merge/240565 apostrophes will be handled correctly without you needing to do any escaping.
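
If dos2unix isn't to hand, the same clean-up can be sketched in a few lines of Python. This is a minimal sketch and not part of presage: the function name clean_corpus is made up for illustration, and it assumes the only changes wanted are converting CRLF line endings to LF and dropping a backslash that directly precedes an apostrophe.

```python
def clean_corpus(data: bytes) -> bytes:
    """Normalise a training text before handing it to text2ngram.

    Assumptions: DOS line endings are CRLF pairs, and the only
    backslashes to drop are those directly before an apostrophe.
    """
    data = data.replace(b"\r\n", b"\n")  # DOS -> Unix line endings
    data = data.replace(b"\\'", b"'")    # undo the manual \' escaping
    return data


# Small demonstration on an in-memory sample with mixed line endings.
sample = b"sa bhad ,\r\nsa Bheurla ,\nsa cheud ,\r\n"
cleaned = clean_corpus(sample)
print(cleaned.count(b"\r"))  # prints 0
```

To apply it to a real file, read the file in binary mode, pass the bytes through clean_corpus, and write the result to a new file so the original is kept.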

GunChleoc (gunchleoc) wrote :

Thanks for your help. I managed to get some 1-grams now.

I also failed to get compound nouns (which include hyphens) or the a-, h- and n- prefixes (hyphens again), but it's a start. I guess I will have to wait for upstream to add the custom tokens or learn how to feed data into SQLite myself.
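
One way to feed the data into SQLite directly, keeping hyphenated words intact, might look like the sketch below. It is an assumption-laden illustration rather than what text2ngram does: the token pattern, the helper names (tokenize, count_ngrams, write_ngrams) and the choice to count n-grams within each line only are all made up for this example. The table and column names follow the layout Mike described earlier in this thread.

```python
import re
import sqlite3
import unicodedata
from collections import Counter

# Assumption: a token is a run of letters that may contain internal
# hyphens or apostrophes (e.g. "h-uile", "taigh-òsta").
# Trailing apostrophes and punctuation such as "," are dropped.
TOKEN = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def tokenize(line):
    # NFC keeps accented letters such as "ò" as single code points.
    return TOKEN.findall(unicodedata.normalize("NFC", line))


def count_ngrams(lines, n):
    """Count n-grams within each line; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def columns(n):
    # n=3 -> word_2, word_1, word (the first word of the phrase comes first)
    return [f"word_{i}" for i in range(n - 1, 0, -1)] + ["word"]


def write_ngrams(conn, lines, max_n=3):
    lines = list(lines)
    for n in range(1, max_n + 1):
        table = f"_{n}_gram"
        col_defs = ", ".join(f"{c} TEXT" for c in columns(n))
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS {table} ({col_defs}, "count" INTEGER)'
        )
        placeholders = ", ".join("?" for _ in range(n + 1))
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [gram + (c,) for gram, c in count_ngrams(lines, n).items()],
        )
    conn.commit()


# Demonstration on a few made-up lines in an in-memory database.
conn = sqlite3.connect(":memory:")
sample = ["sa bhad ,", "sa bhad ,", "a h-uile taigh-òsta"]
write_ngrams(conn, sample)
print(tokenize("a h-uile taigh-òsta"))  # ['a', 'h-uile', 'taigh-òsta']
```

Counting per line suits a file where every line is a separate phrase, as in the excerpt above; for running text, the tokens of the whole file could be counted as one sequence instead.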

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ubuntu-keyboard - 0.99.trunk.phablet2+15.10.20150529-0ubuntu1

---------------
ubuntu-keyboard (0.99.trunk.phablet2+15.10.20150529-0ubuntu1) wily; urgency=medium

  [ Bernard Banko ]
  * Add new keyboard layouts for Romanian, Scottish Gaelic, Greek,
    Norwegian, Ukrainian, Slovenian and Icelandic and fix incorrect
    layouts in Swedish and Croatian. (LP: #1367210, #1395402, #1436045,
    #1452719, #1363376, #1440722, #1452723, #1454206, #1440959)

  [ CI Train Bot ]
  * New rebuild forced.

  [ GunChleoc ]
  * Add new keyboard layouts for Romanian, Scottish Gaelic, Greek,
    Norwegian, Ukrainian, Slovenian and Icelandic and fix incorrect
    layouts in Swedish and Croatian. (LP: #1367210, #1395402, #1436045,
    #1452719, #1363376, #1440722, #1452723, #1454206, #1440959)

  [ Michael Sheldon ]
  * Add new keyboard layouts for Romanian, Scottish Gaelic, Greek,
    Norwegian, Ukrainian, Slovenian and Icelandic and fix incorrect
    layouts in Swedish and Croatian. (LP: #1367210, #1395402, #1436045,
    #1452719, #1363376, #1440722, #1452723, #1454206, #1440959)
  * Add test for the keyboard remaining dismissed when scrolling in
    Oxide.
  * Allow the keyboard's height reporting to be disabled when the system
    is in windowed mode. (LP: #1457116)

ubuntu-keyboard (0.99.trunk.phablet2+15.04.20150514-0ubuntu1) vivid; urgency=medium

  [ CI Train Bot ]
  * Resync trunk. added: po/ro.po

  [ Leo Arias ]
  * Fixed the static errors reported by flake8. Added the check to the
    debian build tests. Added python3-flake8 as a build dependency. (LP:
    #1444170)
  * Use the base class from the toolkit in autopilot tests.

 -- CI Train Bot <email address hidden> Fri, 29 May 2015 12:24:02 +0000

Changed in ubuntu-keyboard (Ubuntu):
status: New → Fix Released
Changed in ubuntu-keyboard:
status: Confirmed → Fix Released