Comment 10 for bug 1187433

Dan Scott (denials) wrote :

It may be "just" a 2.4+ issue, but it's still significant, and quite a regression from what just worked in 2.3.

Re: SELECT to_tsvector('french', 'J''aime la glace');

Yes, of course those are the results we would get if we used PostgreSQL's French stemming text search configuration with stopwords against the unmolested phrase (in this case, unmolested by either naco_normalize or search_normalize).

Here's what happens if we pass the text through search_normalize() before passing it on to the stopword French stemming text search configuration:

SELECT to_tsvector('french', search_normalize('J''aime la glace'));
   to_tsvector
------------------
 'aim':2 'glac':4
(1 row)

No surprise that the outcome is still what we want; it will match whether people search for "aime glace" or "j'aime glace", because the "j" was separated out and the French dictionary recognized it as a stopword and gobbled it up.
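A quick check bears that out (assuming the user's query terms get run through the same French configuration):

SELECT to_tsvector('french', search_normalize('J''aime la glace'))
       @@ plainto_tsquery('french', 'aime glace');    -- t
SELECT to_tsvector('french', search_normalize('J''aime la glace'))
       @@ plainto_tsquery('french', 'j aime glace');  -- t ('j' is stopworded away)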

Here's what happens if we pass the text through naco_normalize() before passing it on to the same French stemming text search configuration with stopwords:

SELECT to_tsvector('french', naco_normalize('J''aime la glace'));
    to_tsvector
-------------------
 'glac':3 'jaim':1
(1 row)

Oh, there's that pesky squashed "j", which would break searches for "aime glace". I can't say that I see how naco_normalize() actually helps us in any way here.
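The same check against the naco_normalize()'d vector shows the breakage:

SELECT to_tsvector('french', naco_normalize('J''aime la glace'))
       @@ plainto_tsquery('french', 'aime glace');    -- f: 'jaim' will never match 'aim'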

As far as I can tell, though, QueryParser currently has no good way of telling whether to apply English or French stemming to the incoming search query. If I understand you correctly, you're suggesting that we enable sites to configure a secondary language (such as French or Spanish), and then index & match _all_ incoming queries as both a normalized primary language (for weight C) and a secondary language (for weight D) -- enabling stopwords for D, maybe (or maybe not) enabling stopwords for C, but not enabling stopwords for A?
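For concreteness, here's roughly what I picture that dual-language index_vector looking like -- just a sketch, with 'simple' standing in for whatever unstemmed configuration feeds weight A, and with the real work happening in the ingest triggers rather than an ad hoc SELECT:

SELECT setweight(to_tsvector('simple',  search_normalize('J''aime la glace')), 'A')
    || setweight(to_tsvector('english', search_normalize('J''aime la glace')), 'C')
    || setweight(to_tsvector('french',  search_normalize('J''aime la glace')), 'D');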

In which case, given that the severed lexemes are more likely to be stopworded away ('j' and 'l' in French), why go back to naco_normalize()? There doesn't seem to be any advantage to jamming the affixes onto the root words in English, other than being able to blame NACO for search design decisions.

The suggestion to make use of weight D for a secondary language _would_ enable Evergreen to offer some level of support for a secondary language, I guess, but it's a hard cap at a single extra language. Unless, of course, we throw a third language into weight B...

You said "conventionally reserved for alternate language indexing"--I'm really interested in learning more about conventions with respect to PostgreSQL full text search. Do you have a reference for that?

Anyway, we've wandered pretty far away from the core problem, which is that searches containing apostrophes have suffered a pretty severe regression in 2.4. Your patch resolves the regular query issue, and I'll happily apply that, but I don't think any of what we're talking about for phrase search here is appropriate for 2.4; it's more along the lines of another significant design change requiring another reingest of bibs, which would be more appropriate for 2.5.

If we are talking 2.5 reingest-y moves, another less-complex option (in terms of just using what PostgreSQL gives us) would be to supply a minimal English stopwords list that just contained, say, "i", "a", "d", "j", "l", and "s" to deal with the most common suffixes and prefixes that get applied in English, French, and Spanish, and apply that to weight C. Phrases can still be matched against weight A, and we would improve support for the most common Romance languages (and arguably English, too).
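If we went that route, the PostgreSQL side would look something like this (a sketch only; the dictionary and configuration names are made up, and it assumes a minimal_english.stop file containing just those letters has been installed under $SHAREDIR/tsearch_data/):

CREATE TEXT SEARCH DICTIONARY minimal_english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = minimal_english
);

CREATE TEXT SEARCH CONFIGURATION minimal_english ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION minimal_english
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
    WITH minimal_english_stem;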

Here's the difference between 2.3 and 2.4 when it comes to the indexing of the apostrophe'd title:

-- 2.3:
SELECT * FROM metabib.title_field_entry WHERE source = 729818 AND field = 4;
   id    | source | field |    value     |       index_vector
---------+--------+-------+--------------+--------------------------
 7479122 | 729818 |     4 | Men's health | 'health':3 'men':1 's':2

-- 2.4:
SELECT * FROM metabib.title_field_entry WHERE source = 729818 AND field = 4;
   id    | source | field |    value     |             index_vector
---------+--------+-------+--------------+---------------------------------------
 7479122 | 729818 |     4 | men s health | 'health':3A,6C 'men':1A,4C 's':2A,5C

So in 2.4, we've already normalized the "value" field. IIRC, we used to do a case-insensitive regex search against the value field for phrase searches in 2.3; presumably we've started doing something more complex in 2.4.

But then we can see that in the WHERE clause for a 2.4 phrase search, we do this:

x911b370_title.id IS NOT NULL AND x911b370_title.value ~* $_24133$[[:<:]]Men\'s\ health[[:>:]]$_24133$

That is... we're still doing a case-insensitive regex search against the value field for phrase searches in 2.4, but we're matching the completely unnormalized phrase as typed in by the user against the now-normalized "men s health". Which won't work. And it won't work whether we use naco_normalize or search_normalize to normalize the value field.
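To illustrate (assuming search_normalize() turns "Men's health" into "men s health" the same way the ingest did):

SELECT 'men s health' ~* $$[[:<:]]Men's health[[:>:]]$$;
-- f: the apostrophe in the query phrase can't match the space in the normalized value

SELECT 'men s health' ~* ('[[:<:]]' || search_normalize('Men''s health') || '[[:>:]]');
-- t: normalize the phrase the same way and the regex matches again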

Looks like this problem is turning up because in 2.4 we decided to start normalizing the content in the value field instead of letting the index_vector take care of the normalizing, and this is part of the fallout. I assume that weight A was meant to be used for phrase searches, but maybe the late revert of some of that code left weight A as a vestigial result; now that phrase searches have gone back to matching against the (now normalized) value field, we're kind of screwed for 2.4?