Comment 8 for bug 1187433

Mike Rylander (mrylander) wrote :

I'm very confident that the phrase difference you note in the last paragraph is not related to initialization, but is just a 2.4+ issue. It is, I'm pretty certain, a subclass of ye olde apostrophes-considered-harmful. That, in turn, stems at least partially from a change attempting to address apostrophe-joined leading articles (French "l'", say). The older NACO normalizer would simply remove the apostrophe, collapsing the "l" onto the front of the next word, which is obviously not good in the case of French articles. This was changed with the new search_normalize procedure, whose major behavioral difference from naco_normalize is that it replaces apostrophes with spaces, splitting the "l" from the following word.
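To make the two behaviors concrete, here is a rough Python sketch. The function names are hypothetical stand-ins that mimic only the apostrophe step described above; the real naco_normalize and search_normalize procedures do considerably more, and this is not their implementation.

```python
# Hypothetical stand-ins: they model ONLY the apostrophe handling,
# not the full naco_normalize / search_normalize procedures.

def strip_apostrophes(text: str) -> str:
    """Older NACO-style behavior: drop the apostrophe entirely."""
    return text.replace("'", "")


def space_apostrophes(text: str) -> str:
    """search_normalize-style behavior: turn the apostrophe into a space."""
    return text.replace("'", " ")


for phrase in ("l'homme", "don't"):
    # Split on whitespace to show the resulting search tokens.
    print(phrase, strip_apostrophes(phrase).split(), space_apostrophes(phrase).split())
```

Removing the apostrophe glues the French article onto its noun, while replacing it with a space splits English contractions and possessives into fragments, which is the trade-off at issue here.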

I wonder if, now that we have multilingual stemming support and sites could add (or we could supply by default) a D-weighted, always-used French stemmer, we should switch back to naco_normalize, at least for the C-weight slot. The test for efficacy would be to run the French stemmer against an appropriate French phrase with a leading article and see whether the article and the word are tokenized into separate lexemes. As a test:

=# SELECT to_tsvector('french','J''aime la glace');

   to_tsvector
------------------
 'aim':2 'glac':4
(1 row)

Looking good so far. Now, this would require that we pass the original (or, at least, not apostrophe-molested) string to the D-weight slot (conventionally reserved for alternate language indexing), where we've configured the French stemmer to live. That would mean that rows in config.metabib_field_index_norm_map would need to be associated with specific weight classes (in this case specifically, at least not with D). That would be a new thing, but it would round out the configuration space pretty well in terms of granularity, and would allow us to fiddle with how, or more correctly WHEN, we deal with apostrophes (in this case).

Thoughts?