stemming language setting problematic

Bug #157183 reported by era
This bug affects 3 people
Affects           Status     Importance  Assigned to  Milestone
Tracker           Expired    Wishlist
tracker (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

Binary package hint: tracker

The Stemming option under System - Preferences - Indexing preferences - General tab offers a short list of languages, but you have to pick exactly one. This is not a realistic scenario in many locales; for example, I routinely handle documents in three languages (English; my home language, Swedish; and the majority language of the country where I live, Finnish) and so does everyone in my family, including soon enough my daughter, who just started school.

Not only is the absence of stemming for, say, Finnish, problematic, but being forced to choose English or Swedish stemming for Finnish documents is likely to produce a large amount of false-positive stems, making searches for Finnish words return what seem like completely haphazard matches in many cases -- enough to make it useless at least in some scenarios.

What happens if later, you change this setting? Does it throw away or redo all the stemming it has done so far?

What happens if your primary locale preferences indicate a language which is not on the list; would that be a workaround for disabling stemming?

I do realize that coming up with a good fix for this is hard. At a minimum, indexing without any stemming should be possible. Further out in wishlist territory, it would be nice if at some point the indexer could try to establish the language of each document (ignoring for now the can of worms that is multilingual documents -- don't let any philologists hear about this) and use an appropriate stemmer only if the language can be established with reasonable certainty. (Debian has a package "mguesser" for stand-alone language identification, which is also available as a library which ships with the mnogosearch search engine; google for TextCat for some more suggestions. Or ask me again and be prepared for a veritable flood of bookmarks on the topic.)
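
To illustrate the "use a stemmer only if the language can be established with reasonable certainty" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the tiny stopword lists, and the thresholds are hypothetical, and this is not how Tracker, mguesser, or TextCat work internally. It only shows the fallback logic: guess a language, and index without stemming when the guess is weak.

```python
# Hypothetical sketch: tiny stopword-based language guesser.
# Word lists, thresholds, and names are illustrative assumptions only;
# they are not taken from Tracker, mguesser, or TextCat.
import re

STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "sv": {"och", "det", "att", "är", "som", "en", "på", "inte"},
    "fi": {"ja", "on", "ei", "se", "että", "oli", "mutta", "tämä"},
}


def guess_language(text, min_share=0.6, min_hits=3):
    """Return a language code, or None if no language clearly dominates."""
    words = re.findall(r"\w+", text.lower())
    # Count how many words of the text appear in each language's stopword list.
    hits = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    total = sum(hits.values())
    best = max(hits, key=hits.get)
    # Too little evidence, or evidence split between languages: give up.
    if hits[best] < min_hits or hits[best] / total < min_share:
        return None
    return best


def choose_stemmer(text):
    """Pick a stemmer name for a document; "none" means index unstemmed."""
    lang = guess_language(text)
    return lang if lang is not None else "none"
```

With this sketch, a document whose stopwords are split between languages, or one too short to judge, falls back to "none" instead of being stemmed with the wrong language.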

Changed in tracker:
status: Unknown → In Progress
era (era) wrote :

Thanks for the pointer to the upstream bug; I added some comments there.

Simone Tolotti (simontol) wrote :

Subscribed.
Quote: "making searches for Finnish words return what seem like completely haphazard matches in many cases -- enough to make it useless at least in some scenarios."
Same here for Italian. (I think the Italian stemming is not well done.)
Also, 70-80% of the documents in my home directory are in English, so Italian stemming is not very useful for them.

Changed in tracker:
status: New → Confirmed
Changed in tracker:
importance: Unknown → Wishlist
status: In Progress → Expired