MDic

Stemming for French language.

Bug #314281 reported by musically_ut on 2009-01-06

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
MDic	Fix Committed	Low	Mehrdad Momeny	

Bug Description

Some French words are not found in the dictionary because they are not reduced to their root forms (e.g. dansez, tombons). Hence, a rudimentary stemming strategy for French is proposed, in line with the strategy already used for English. The fix is neither complete nor exhaustive, but it should get something started.

Patch attached.

Thanks!

~
musically_ut

musically_ut (musically-ut) wrote on 2009-01-06:

Changes: minor fix in README, additions to dbman.cpp (6.6 KiB, text/plain)

Mehrdad Momeny (mehrdad-momeny) wrote on 2009-01-06:

Thanks

Changed in mdic:
assignee:	nobody → mehrdad-momeny
importance:	Undecided → Low
status:	New → Fix Committed

musically_ut (musically-ut) wrote on 2009-01-07:

The suggested fix is, of course, not perfect, and it will eventually lead to more bugs. Here are a few shortcomings:

1. Misspelled English words might end up having a meaning, e.g. cancez -> cancer. The rules are language-specific, so they should only be applied to dictionaries of the matching language, but I don't know how the languages of the dictionaries are being managed.

2. The "e-accent aigu" (é) is NOT matched perfectly. I am not very familiar with Unicode string matching, so I did not implement it the 'proper' way.
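On the second point, one common reason an accented letter fails to match is that "é" can be stored either as a single code point or as a plain "e" followed by a combining accent. The sketch below is in Python purely for illustration (it is not the attached patch, and the helper name ends_with_e_acute is hypothetical); it normalizes the word before testing the suffix.

```python
import unicodedata


def ends_with_e_acute(word: str) -> bool:
    """Return True if the word ends in 'é', however it is encoded."""
    # NFC composes 'e' + U+0301 (combining acute accent) into U+00E9,
    # so both spellings compare equal after normalization.
    return unicodedata.normalize("NFC", word).endswith("\u00e9")


composed = "dans\u00e9"      # 'é' as one code point
decomposed = "danse\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)          # the raw strings differ
print(ends_with_e_acute(composed))
print(ends_with_e_acute(decomposed))
```

The same idea carries over to other string libraries that offer Unicode normalization: normalize both the word and the pattern to one form before comparing.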

If you work around any of these, do tell me. At first sight, they look like fairly straightforward fixes.

Thanks.

~
musically_ut

Majid Ramezanpour (thinkgnu) wrote on 2009-01-07:

Actually, we are working on a new series of MDic that will be more compatible with KDE, so we will use KDELibs in it instead of Qt.

In this series we need a better class design, and we also want to make it more plugin-based.

One of our plans for the stemming strategy is to make it plugin-based, so that it will be easier for users to write stemming plugins for their own languages.

So if you have any ideas about designing it in a better way, we would highly appreciate them.
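As one possible starting point for such a plugin-based design, a stemmer could be registered per language and consulted only when the plain lookup fails. This is a minimal sketch in Python for illustration only; the names STEMMERS, register_stemmer, french_candidates and lookup are hypothetical and do not come from MDic, and the French rule shown is a toy subset.

```python
from typing import Callable, Dict, List, Optional, Set

# A stemmer maps an inflected word to a list of candidate root forms.
Stemmer = Callable[[str], List[str]]

# Hypothetical registry: language code -> stemmer plugin.
STEMMERS: Dict[str, Stemmer] = {}


def register_stemmer(language: str) -> Callable[[Stemmer], Stemmer]:
    """Decorator that registers a stemmer for one language."""
    def decorator(func: Stemmer) -> Stemmer:
        STEMMERS[language] = func
        return func
    return decorator


@register_stemmer("fr")
def french_candidates(word: str) -> List[str]:
    # Toy rule set: only the -ez and -ons endings are handled here.
    for suffix in ("ez", "ons"):
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            return [base + ending for ending in ("er", "re", "ir")]
    return []


def lookup(word: str, language: str, dictionary: Set[str]) -> Optional[str]:
    """Return the dictionary entry for word, stemming only if needed."""
    if word in dictionary:
        return word
    stemmer = STEMMERS.get(language)
    if stemmer is None:
        return None  # no plugin for this language: do not guess
    for candidate in stemmer(word):
        if candidate in dictionary:
            return candidate
    return None
```

Because the stemmer is chosen by the dictionary's language, French rules are never applied to an English dictionary, so a typo such as "cancez" is not silently resolved to "cancer".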

musically_ut (musically-ut) wrote on 2009-01-07:

Corrected diff. (6.6 KiB, text/plain)

I can see a simple way of dealing with simple morphological rules: plain text files as plugins. For example, if one allows Perl-like regular expression matching (though the current rules can be implemented using only prefix/suffix matching), I would write out the replacement rules as follows:

english.rep:
---------------------------------------------------------
/ies$/ -> 'y'
/n't$/ -> ''
/ness$/ -> ''
/'s$/ -> ''
/ied$/ -> 'y'
/^un/ -> ''
/ment$/ -> ''
/ing$/ -> '', 'e'
/ning$/ -> 'n'
/s$/ -> ''
/ly$/ -> ''
/ed$/ -> '', 'e'
/es$/ -> '', 'e'
/er$/ -> ''
/ily$/ -> 'y'
---------------------------------------------------------

french.rep:
---------------------------------------------------------
/ons$/ -> 'er', 're', 'ir'
/ez$/ -> 'er', 're', 'ir'
/ent$/ -> 'er', 're', 'ir'
/es$/ -> 'er'
/s$/ -> '', 're'
/e$/ -> 'er'
/u$/ -> 're'
/é$/ -> 'er'
/i$/ -> 'ir'
// -> 're'
---------------------------------------------------------

How does that sound?
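To make the proposal concrete, here is a minimal sketch of how rule files in this format might be parsed and applied. It is written in Python for illustration only and is not part of the attached diff; parse_rules, candidates and FRENCH_SAMPLE are hypothetical names. One assumption is labeled in the code: a rule with an empty pattern is skipped, because its intended meaning is not spelled out in the rule file.

```python
import re
from typing import List, Tuple

# One parsed rule: a compiled pattern and its possible replacements.
Rule = Tuple[re.Pattern, List[str]]

# Matches lines of the form:  /pattern/ -> 'a', 'b'
RULE_LINE = re.compile(r"^/(.*)/\s*->\s*(.*)$")

# A few rules copied from french.rep above.
FRENCH_SAMPLE = """
/ons$/ -> 'er', 're', 'ir'
/ez$/ -> 'er', 're', 'ir'
/s$/ -> '', 're'
"""


def parse_rules(text: str) -> List[Rule]:
    """Parse rule lines; anything that is not a rule is ignored."""
    rules: List[Rule] = []
    for line in text.splitlines():
        match = RULE_LINE.match(line.strip())
        if not match:
            continue  # separators, file names, blank lines
        pattern, right_hand_side = match.groups()
        if not pattern:
            continue  # ASSUMPTION: empty patterns are skipped, not guessed
        replacements = re.findall(r"'([^']*)'", right_hand_side)
        rules.append((re.compile(pattern), replacements))
    return rules


def candidates(word: str, rules: List[Rule]) -> List[str]:
    """Return every candidate root form produced by the matching rules."""
    found: List[str] = []
    for pattern, replacements in rules:
        if not pattern.search(word):
            continue
        for replacement in replacements:
            # A function replacement avoids backslash escapes being interpreted.
            candidate = pattern.sub(lambda _m, r=replacement: r, word, count=1)
            if candidate and candidate not in found:
                found.append(candidate)
    return found


print(candidates("dansez", parse_rules(FRENCH_SAMPLE)))
# → ['danser', 'dansre', 'dansir']
```

Each candidate would then be looked up in the dictionary, and only the ones that exist (here "danser") would be shown to the user.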

~
musically_ut

PS: And, while writing that, I stumbled across another bug, so the corrected diff is attached.
PPS: Unfortunately, as KDE4 had scanty proxy support, I have shifted almost completely to Gnome. :)

Mehrdad Momeny (mehrdad-momeny) wrote on 2009-01-10:

It seems fine, thanks. We need more research on this topic. ;)

Sorry, I was disconnected for 3 days.
