Stemming for French language.

Bug #314281 reported by musically_ut
This bug affects 1 person
Affects   Status          Importance   Assigned to      Milestone
MDic      Fix Committed   Low          Mehrdad Momeny

Bug Description

Some French words are not found in the dictionary because they are not reduced to their root forms (e.g. dansez, tombons). Hence, a rudimentary stemming strategy for French is proposed, in line with the strategy already used for English. The fix is neither complete nor exhaustive, but it should get something started.

Patch attached.

Thanks!

~
musically_ut

musically_ut (musically-ut) wrote :
Mehrdad Momeny (mehrdad-momeny) wrote :

Thanks

Changed in mdic:
assignee: nobody → mehrdad-momeny
importance: Undecided → Low
status: New → Fix Committed
musically_ut (musically-ut) wrote :

The suggested fix is, of course, not perfect, and it will eventually result in more bugs. So here are a few shortcomings:

1. Misspelled words in English might end up having a meaning, e.g. cancez -> cancer. Hence, the set of changes is language-specific now. I don't know how the languages of dictionaries are being managed.

2. The "e-accent aigu" is NOT matched perfectly. I am not very familiar with Unicode string matching, so I did not implement it the 'proper' way.
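
For the second point, one possible approach is to normalize the word to Unicode NFC before comparing, so that a precomposed "é" and an "e" followed by a combining acute accent are treated alike. The following is only a sketch in Python, not MDic code; the function names are hypothetical and the single "é -> er" rule is an assumption made for illustration.

```python
import unicodedata

E_ACUTE = "\u00e9"  # precomposed "é" (LATIN SMALL LETTER E WITH ACUTE)


def ends_with_e_acute(word):
    """True if the word ends in "é", however the accent is encoded."""
    # NFC composes "e" + U+0301 (combining acute) into the single code point U+00E9.
    return unicodedata.normalize("NFC", word).endswith(E_ACUTE)


def stem_e_acute(word):
    """Hypothetical rule: replace a trailing "é" with "er", else return the word in NFC."""
    normalized = unicodedata.normalize("NFC", word)
    if normalized.endswith(E_ACUTE):
        return normalized[:-1] + "er"
    return normalized
```

With this, both "dans\u00e9" and "danse\u0301" would reduce to "danser", while a plain "danse" is left alone.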

If you work around any of these, do tell me. At first sight, they look like fairly straightforward fixes.

Thanks.

~
musically_ut

Majid Ramezanpour (thinkgnu) wrote :

Actually, we are working on a new series of MDic that will be more compatible with KDE, so we will use KDELibs in it instead of Qt.

In this series we need a better class design, and we also want to make it more plugin-based.

One of our plans for the stemming strategy is to make it plugin-based, so it will be easier for users to write plugins with stemming strategies for their own languages.

So if you have any ideas about designing it in a better way, we would highly appreciate them.

musically_ut (musically-ut) wrote :

I can see a simple way of dealing with simple morphological rules: plain text files as plugins. For example, if Perl-like regular expression matching is allowed (though the current rules can be implemented using only prefix/suffix matching), I would write out the replacement rules as follows:

english.rep:
---------------------------------------------------------
/ies$/ -> 'y'
/n't$/ -> ''
/ness$/ -> ''
/'s$/ -> ''
/ied$/ -> 'y'
/^un/ -> ''
/ment$/ -> ''
/ing$/ -> '', 'e'
/ning$/ -> 'n'
/s$/ -> ''
/ly$/ -> ''
/ed$/ -> '', 'e'
/es$/ -> '', 'e'
/er$/ -> ''
/ily$/ -> 'y'
---------------------------------------------------------

french.rep:
---------------------------------------------------------
/ons$/ -> 'er', 're', 'ir'
/ez$/ -> 'er', 're', 'ir'
/ent$/ -> 'er', 're', 'ir'
/es$/ -> 'er'
/s$/ -> '', 're'
/e$/ -> 'er'
/u$/ -> 're'
/é$/ -> 'er'
/i$/ -> 'ir'
// -> 're'
---------------------------------------------------------
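
To make the idea concrete, here is a rough sketch in Python (not MDic's C++/Qt code) of how such a file could be read and applied. Everything in it is an assumption for illustration: the names `parse_rules`, `candidates` and `lookup` are hypothetical, only a subset of the french.rep rules is used, and the empty `//` rule is left out because its intended semantics are not spelled out here.

```python
import re

# Subset of the french.rep rules above (assumption: the empty // rule is omitted).
FRENCH_REP = """
/ons$/ -> 'er', 're', 'ir'
/ez$/ -> 'er', 're', 'ir'
/s$/ -> '', 're'
"""

# One rule per line:  /pattern/ -> 'replacement', 'replacement', ...
RULE_LINE = re.compile(r"^/(.*)/\s*->\s*(.*)$")


def parse_rules(text):
    """Turn the text of a .rep file into a list of (compiled regex, replacements)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and dashed separator lines.
        if not line or line.startswith("-"):
            continue
        match = RULE_LINE.match(line)
        if match is None:
            continue  # ignore anything that is not a rule
        pattern, rhs = match.groups()
        replacements = re.findall(r"'([^']*)'", rhs)
        rules.append((re.compile(pattern), replacements))
    return rules


def candidates(word, rules):
    """Return the word itself followed by every rewritten form, in rule order."""
    out = [word]
    for regex, replacements in rules:
        if not regex.search(word):
            continue
        for rep in replacements:
            # A function replacement keeps `rep` literal (no backslash escapes).
            cand = regex.sub(lambda _m, rep=rep: rep, word, count=1)
            if cand not in out:
                out.append(cand)
    return out


def lookup(word, rules, dictionary):
    """Return the first candidate that is present in `dictionary`, or None."""
    for cand in candidates(word, rules):
        if cand in dictionary:
            return cand
    return None


FRENCH_RULES = parse_rules(FRENCH_REP)
```

With a dictionary containing "danser" and "tomber", `lookup("dansez", FRENCH_RULES, {"danser", "tomber"})` would return "danser". Trying the unmodified word first means exact matches always win over stemmed guesses.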

How does that sound?

~
musically_ut

PS: And, while writing that, I stumbled across another bug, so the corrected diff is attached.
PPS: Unfortunately, as KDE4 had scanty proxy support, I have shifted almost completely to Gnome. :)

Mehrdad Momeny (mehrdad-momeny) wrote :

It seems fine, thanks. We need more research on this topic. ;)

Sorry, I was disconnected for 3 days.
