Find more modern text for Italian word prediction

Bug #1591149 reported by Alberto Mardegan
This bug affects 2 people
Affects: ubuntu-keyboard (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

The text used for word prediction in Italian [1] is IMHO not very suitable for predicting words typed into computers nowadays (especially on phones), for a few reasons:

- it's very old -- from 1868; particles like "cotesto", "pel", "pei" are not used anymore
- it's mostly written in the "passato remoto" past tense, which is not that common in modern speech
- it talks about history, with ample use of long and rarely used words.
- dialogues are entirely missing

I think it should be replaced with a modern text, one that includes dialogue. Does this have to be a single book, or can we assemble a few different texts together (I'm thinking mostly of excerpts from newspapers, blogs and short novels)?

[1] http://bazaar.launchpad.net/~phablet-team/ubuntu-keyboard/trunk/view/head:/plugins/it/src/la_francia_dal_primo_impero.txt

Andrea Bernabei (faenil) wrote :

Another Italian here o/

Wow, I had no idea the predictions were coming from a .txt file. I always thought they came from Nuance's XT9 engine and its vocabularies, but it seems we're not using XT9 after all...

I only use text predictions to avoid press-holding keys to get to accented letters, so I can't really provide my feedback here. I think we all agree that using a document from 150 years ago may lead to the prediction engine offering "obsolete" words, though :)

PS: about accented letters, I find it really annoying that when you press "e" you don't get "è" in the suggested words...but I should probably report a separate bug for that :)

Alberto Mardegan (mardy) wrote :

It looks like it's not easy to find contemporary texts with a permissive license. The site liberliber.it has many free contemporary books, but they are all distributed under CC-BY-NC-SA, and the NC (Non Commercial) part is problematic here.
To play it safe, we'd better use books from Project Gutenberg; I would go for novels, because they are likely to contain both dialogue and narrative text. Here's a list:
http://www.gutenberg.org/wiki/IT_Romanzi_%28Biblioteca%29

They are all a bit oldish, but we can pick something from the beginning of last century, at least.

A few options:

- Il perduto amore, 1921
  http://www.gutenberg.org/ebooks/41281

- I sogni dell'anarchico, 1922
  http://www.gutenberg.org/cache/epub/25175/pg25175.txt

- I divoratori, 1922
  http://www.gutenberg.org/ebooks/34983.txt.utf-8

In other categories, other suitable books:

- Fuochi di bivacco, 1913
  http://www.gutenberg.org/files/49223/49223-0.txt
  (I like this because it's mostly in the present tense)

- La favorita del Mahdi, 1911
  http://www.gutenberg.org/cache/epub/25180/pg25180.txt
  (by Emilio Salgari!)

Any opinions on which one we should pick?

Whatever the choice, I think I'll give the chosen book a pass with sed and replace s/egli/lui/, s/ella/lei/, s/de'/dei/, and similar ones.
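
The substitution pass described above could be sketched roughly as follows. This is only an illustration, not the script actually used: the function name `modernize` and the `REPLACEMENTS` table are hypothetical, and the table contains just the three pairs mentioned in the comment. One point worth noting is that a bare `s/ella/lei/` would also rewrite "della" or "nella", and `s/egli/lui/` would hit "quegli", so the sketch only matches whole words.

```python
import re

# Hypothetical table of archaic -> modern forms; only the three pairs
# named in the comment above. Extend as needed.
REPLACEMENTS = {
    "egli": "lui",
    "ella": "lei",
    "de'": "dei",
}

# Match a key only when it is not glued to other word characters,
# so "della", "nella" and "quegli" are left alone.
_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in REPLACEMENTS) + r")(?!\w)",
    re.IGNORECASE,
)


def _match_case(original, replacement):
    """Capitalise the replacement if the original word was capitalised."""
    if original[0].isupper():
        return replacement[0].upper() + replacement[1:]
    return replacement


def modernize(text):
    """Replace whole-word archaic forms with their modern equivalents."""
    def substitute(match):
        word = match.group(1)
        return _match_case(word, REPLACEMENTS[word.lower()])

    return _PATTERN.sub(substitute, text)


if __name__ == "__main__":
    print(modernize("Egli disse che ella era nella casa de' suoi amici."))
```

A purely lexical pass like this cannot check grammar, so the output would still need a quick manual review.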

Stefano Verzegnassi (verzegnassi-stefano) wrote :

Yet another Italian! :)

One thing I noticed when typing on the keyboard with the Italian dictionary is that it does not recognise loanwords from English (and nowadays we use them a lot in everyday Italian).

(random) e.g. "Mi è arrivato dello SPAM in posta" -> "Mi è arrivato dello SPAMPANAVA in posta"

I don't think any old novel could help with this.

Could some resource from Wikipedia be integrated with any of the novels above?
As far as I can see, Wikipedia/Wikizionario/* content is CC BY-SA licensed.

As an alternative, I found a dictionary from the Boot2Gecko project, which seems complete enough (and includes English loanwords).
https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/imes/latin/dictionaries/it_wordlist.xml
A good thing is that dirty words are already marked, so they can be easily filtered out.
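
Filtering out the marked words could look something like the sketch below. The XML layout is an assumption, not something checked against the linked file: it supposes `<w>` elements carrying a `flags` attribute, with a value such as `offensive` on the words to drop. The `SAMPLE` data, the function name `clean_words` and the `bad_flag` default are all hypothetical and would need adjusting to the real format.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the ASSUMED layout: <w> elements with a "flags"
# attribute. The real it_wordlist.xml may differ.
SAMPLE = """<wordlist locale="it">
  <w f="200" flags="">spam</w>
  <w f="150" flags="">email</w>
  <w f="90" flags="offensive">parolaccia</w>
</wordlist>"""


def clean_words(xml_text, bad_flag="offensive"):
    """Return the words whose flags do not include bad_flag."""
    root = ET.fromstring(xml_text)
    words = []
    for entry in root.iter("w"):
        flags = entry.get("flags", "").split(",")
        if bad_flag in flags:
            continue  # skip words marked as unsuitable
        if entry.text:
            words.append(entry.text.strip())
    return words


if __name__ == "__main__":
    print(clean_words(SAMPLE))
```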

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu-keyboard (Ubuntu):
status: New → Confirmed