apostrophe search issues in 2.4

Bug #1187433 reported by Ben Shum on 2013-06-04
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Evergreen
High
Unassigned
2.4
Undecided
Unassigned
2.5
Undecided
Unassigned

Bug Description

Evergreen master / 2.4

When doing a search like "mens health" (without the apostrophe for "men's"), the search will fail to retrieve records that include the apostrophe term. What we would expect is that it retrieves these hits.

dbs in IRC noted that it should apply stemming and change the search from "mens health" to "men health" (stemmed men from mens) which should have retrieved some results. But it's possible that with 2.4 and new QP that behavior has changed.

Confirmed as misbehaving in Bibliomation (master), MVLC (master), and SCLENDS (2.4.0) catalogs.

Jason Stephenson (jstephenson) wrote :

I can confirm the behavior in master.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Dan Scott (denials) wrote :

Confirmed on our 2.4 test server, too; we're still running 2.3 in production and can confirm that this was _not_ a problem in 2.3.

I suspect there's a problem with the introduction of weighting in the index vector. All of the following is from our test server:

Here's the metabib.title_field_entry row for "Mens creatix: an essay":

   id | source | field | value | index_vector
---------+--------+-------+------------------------+--------------------------------------------------------------
 3255651 | 83837 | 6 | mens creatrix an essay | 'an':3A,7C 'creatrix':2A,6C 'essay':4A,8C 'men':5C 'mens':1A
(1 row)

Note that "mens" gets stemmed to "men" in the weighting "C" but appears only as "mens" in weighting A. For this example, searches for both "men creatrix" and "mens creatrix" work, but "men's creatrix" returns no results.

The text search configuration, given "Men's creatrix", stems "Men's" down to "men" (and adds a separate word "s"):

SELECT ts_debug('english_nostop', 'Men''s creatrix');
                                     ts_debug
-----------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",Men,{english_nostop},english_nostop,{men})
 (blank,"Space symbols",',{},,)
 (asciiword,"Word, all ASCII",s,{english_nostop},english_nostop,{s})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",creatrix,{english_nostop},english_nostop,{creatrix})
(5 rows)

Weighting was new in 2.4, and with the rollback of part of the search changes, I'm wondering whether that affected the approach.

Alternately, perhaps QP is mangling the query such that the english_nostop text search configuration is not getting run against the incoming query?

Dan Scott (denials) wrote :

Here's the core query that QP is passing to search.query_parser_fts() for a title search for "mens health"

WITH x105058a8_title_xq AS (SELECT
      to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9405$mens$_9405$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))&&
      to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9405$health$_9405$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) AS tsq,
      to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9405$mens$_9405$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
      to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9405$health$_9405$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) AS tsq_rank )
SELECT m.source AS id,
        ARRAY[m.source] AS records,
        1.0/((AVG(
          (1)
          + (COALESCE(ts_rank_cd('{0.1, 0.2, 0.4, 1}', x105058a8_title.index_vector, x105058a8_title.tsq_rank, 14) * x105058a8_title.weight, 0.0))
        )+1 * COALESCE( NULLIF( FIRST(mrd.attrs @> hstore('item_lang', $_9405$eng$_9405$)), FALSE )::INT * 5, 1)))::NUMERIC AS rel,
        1.0/((AVG(
          (1)
          + (COALESCE(ts_rank_cd('{0.1, 0.2, 0.4, 1}', x105058a8_title.index_vector, x105058a8_title.tsq_rank, 14) * x105058a8_title.weight, 0.0))
        )+1 * COALESCE( NULLIF( FIRST(mrd.attrs @> hstore('item_lang', $_9405$eng$_9405$)), FALSE )::INT * 5, 1)))::NUMERIC AS rank,
        FIRST(mrd.attrs->'date1') AS tie_break
  FROM metabib.metarecord_source_map m

        LEFT JOIN (
          SELECT fe.*, fe_weight.weight, x105058a8_title_xq.tsq, x105058a8_title_xq.tsq_rank /* search */
            FROM metabib.title_field_entry AS fe
              JOIN config.metabib_field AS fe_weight ON (fe_weight.id = fe.field)
            JOIN x105058a8_title_xq ON (fe.index_vector @@ x105058a8_title_xq.tsq)
        ) AS x105058a8_title ON (m.source = x105058a8_title.source)
        INNER JOIN metabib.record_attr mrd ON m.source = mrd.id

  WHERE 1=1
        AND (
          TRUE
          AND (x105058a8_title.id IS NOT NULL)
        )
  GROUP BY 1

Dan Scott (denials) wrote :

Replace "simple" with "english_nostop" in the to_tsquery() calls to get stemming support, and search starts returning expected results for "mens health". Stemming FTW.

Dan Scott (denials) wrote :

And then back in Driver/Pg/QueryParser.pm, around line 1249:

    # Assume we want exact if none otherwise provided.
    # Because we can reasonably expect this to exist
    $ts_configs = ['simple'] unless (scalar @$ts_configs);

Guess what happens if we swap in 'english_nostop' instead of 'simple'? A search for "mens health" returns hits against "men's health" titles. Ah yes, that reasonable expectation appears to have been violated!

So, the real bug appears to be that we're not getting $ts_configs where we expect to be. Or we're testing for its existence incorrectly, or something like that.

Mike Rylander (mrylander) wrote :

In IRC I claimed, boldly:

15:39:18 <eeevil> found it...
15:39:27 <dbs> yay
15:39:32 <eeevil> we're not loading the class->ts mapping
15:39:47 <eeevil> the calling code in open-ils.search calls the fts wrapper
15:40:04 <eeevil> the wrapper has an initialize block instead of using _initialize($parser)
15:40:49 <eeevil> once /any/ initialization has occurred, we never reinitialize the parser singleton (well, effective singleton)
15:41:10 <eeevil> and the wrapper's block does not do the class and field to ts config mapping
15:42:16 <eeevil> dbs: I'll toss a patch on the bug for you to test, if you want

So, here's the mentioned patch. Eyeballs and testing, anyone?

Dan Scott (denials) wrote :

Looks like it works for my limited tests so far. Now I get 45 results for a search of "men's health", and 83 results for a search of "mens health" (probably because the apostrophe version includes "s" as one of the lexemes, whereas the "mens" version just gets stemmed down to "men").

Interestingly, a phrase search for "men's health" returns zero hits, despite there being titles that should match that phrase. But I would need to go back and test our system without this patch to determine whether this is introduced by the patch, or is previously existing behaviour. I can confirm that our 2.3 system with the same data returns hits for this phrase search.

Mike Rylander (mrylander) wrote :

I'me very confident that the phrase difference you note in the last paragraph is not related to initialization, but just a 2.4+ issue. That is, I'm pretty certain, a subclass of ye olde apostrophes-considered-harmful. That, in turn, stems at least partially from a change attempting to address apostrophe-joined leading articles (French "l'", say). The older NACO normalizer would simply remove the apostrophy, collapsing the "l" to the front of the next word, which is obviously non-good in the case of French articles. This was changed with the new search_normalize procedure, who's major behavioral difference from naco_normalize was to replace apostrophes with spaces, splitting the "l" from the following word.

I wonder if, now that we have multilingual stemming support and sites could add (or we could supply by default) a D-weighted, always-used french stemmer, we should switch back to naco_normalize, at least for the C-weight slot. The test for efficacy would be to use the french stemmer against an appropriate French word with a leading article and see if the article and word are tokenized to separate lexems. As a test:

=# SELECT to_tsvector('french','J'aime la glace');

   to_tsvector
------------------
 'aim':2 'glac':4
(1 row)

Looking good so far. Now, this would require that we pass the original (or, at least, not apostrophe-molested) string to the D-weight slot (conventionally reserved for alternate language indexing) where we've configured the French stemmer to live. That would mean that rows on config.metabib_field_index_norm_map would need to be associated with specific weight classes (in this case specifically, at least not with D), which would be a new thing, but would round the configuration space pretty well in terms of granularity, and would allow us to fiddle with how, or more correctly WHEN, we deal with apostrophes (in this case).

Thoughts?

Mike Rylander (mrylander) wrote :
tags: added: pullrequest
Dan Scott (denials) wrote :
Download full text (5.5 KiB)

It may be "just" a 2.4+ issue, but it's still a significant issue, and quite the regression from what just worked in 2.3.

Re: SELECT to_tsvector('french','J'aime la glace');

Yes, of course those are the results we would get if we used PostgreSQL's French stemming text search configuration with stopwords against the unmolested phrase (in this case, unmolested by either naco_normalize or search_normalize).

Here's what happens if we pass the text through search_normalize() before passing it on to the stopword French stemming text search configuration:

SELECT to_tsvector('french', search_normalize('J''aime la glace'));
   to_tsvector
------------------
 'aim':2 'glac':4
(1 row)

No surprise that the outcome is still what we want; it will match if people search for "aime glace" or "j'aime glace" alike, because the "j" was separated out and the french dictionary recognized it as a stopword and gobbled it up.

Here's what happens if we pass the text through naco_normalize() before passing it on to the stopword French stemming algorithm:

SELECT to_tsvector('french', naco_normalize('J''aime la glace'));
    to_tsvector
-------------------
 'glac':3 'jaim':1
(1 row)

Oh, there's that pesky squashed "j", which would break searches for "aime glace". I can't say that I see how naco_normalize() actually helps us in any way here.

As far as I can tell, though, QueryParser currently has no good way of telling whether to apply English or French stemming algorithms against the incoming search query. If I understand you correctly, you're suggesting that we enable sites to configure a secondary language (such as French, or Spanish), and then index & match _all_ incoming queries as both a normalized primary language (for weight C) and a secondary language (for weight D) -- and enable stopwords for D, and maybe (or maybe not) stopwords for C, but not for A, as well?

In which case, given that the severed lexemes are more likely to be stopworded away ('j' and 'l' in French), why go back to naco_normalize()? There doesn't seem to be any advantage to jamming the affixes onto the root words in English, other that being able to blame NACO for search design decisions.

The suggestion to make use of weight D for a secondary language _would_ enable Evergreen to offer some level of support for a secondary language, I guess, but it's a hard cap at a second language. Unless, of course, we throw a third language into weight B...

You said "conventionally reserved for alternate language indexing"--I'm really interested in learning more about conventions with respect to PostgreSQL full text search. Do you have a reference for that?

Anyway, we've wandered pretty far away from the core problem, which is that searches containing apostrophes have suffered a pretty severe regression in 2.4. Your patch resolves the regular query issue, and I'll happily apply that, but I don't think any of what we're talking about for phrase search here is appropriate for 2.4; it's more along the line of another significant design change requiring another reingest of bibs that would be more appropriate for 2.5.

If we are talking 2.5 reingest-y moves, another less-complex option (in ter...

Read more...

Mike Rylander (mrylander) wrote :
Download full text (5.3 KiB)

Some out of order copy/paste for context below ... I'll try to make it readable, but this web UI...

> It may be "just" a 2.4+ issue, but it's still a significant issue, and quite the regression from what just worked in 2.3.

My "just" was used to highlight the exclusion of the patch I posted as being involved in this change, not as a relative value or priority judgement of the issue.

> Looks like this problem is turning up because in 2.4 we decided to start normalizing the content in the value field, instead of letting the index_vector take care of the normalizing, and this is part of the fall out. I assume that weight A was meant to be used for phrase searches, but maybe the late revert of some of that code left weight A as a vestigial result, and now that phrase searches have gone back to matching against the (now normalized) value field, we're kind of screwed for 2.4?

Your assumption is incorrect. And while we're far from screwed, we are in a bind for those on 2.4.0. I'll try to explain as best I can.

The revert you mention had nothing to do with that part of the code. There was discussion regarding the ability to use the A-weight slot in the future for relevance bump calculation as part of the baseline rank, but phrase matching (that is, downcased exact string matching) has not been strongly considered as a use for that slot -- though it might be possible.

Regarding the issue at hand, the code that calculates the value and index_vector columns of *field_entry tables was indeed changed in a significant way by Thomas' work, and it threw away an essential feature of indexing that existed before.

Specifically, the documented[1] behavior of index normalizers was such that those with a "pos" value (used to order the application of normalizers) of less than 0 would be applied in such a way that their effects would show up in the value column, while the effects of those with a "pos" value greater than or equal to 0 would show up only in the index_vector tsvector column. Unless Thomas can explain why this is, I contend that this behavioral change must be reverted, and I will be posting a branch to do just that, soon. Further, facet values may currently be adversely effected by this behavioral change. It's unfortunate that this behavioral change slipped through -- perhaps Thomas didn't know about the purpose of the previous code layout, or didn't think it was important -- but I do not recall this change being documented in any way in his proposal, the discussions that lead to the eventual code, or the commit messages that accompanied the code. If I've simply missed that, I'd be happy to see a pointer to that.

That does mean a re-ingest, but with the patch already provided we've improved the situation. Given that, I think a rolling re-ingest is reasonable once the ingest regression is addressed. Something as simple as this seems appropriate:

~$ psql evergreen -tc 'select $$update biblio.record_entry set id = id where id = $$ || id || $$; select pg_sleep(1);$$ from biblio.record_entry where not deleted' | psql evergreen

... with ingest.reingest.force_on_same_marc enabled, of course.

> As far as I can tell, though, QueryParser cur...

Read more...

Mike Rylander (mrylander) wrote :

I've pushed a branch to the working repo to address the above-mentioned loss of functionality: http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/split-brain-normalization

Changed in evergreen:
importance: Medium → High
Dan Scott (denials) wrote :

In the concerto test data set, a title search for "Alexander's feast" should be a good test of phrase search where apostrophes are included in the phrase.

Dan Scott (denials) wrote :

Tested; searches work well, but we're back to the old problem of metabib.title_field_entry rows being duplicated:

SELECT * FROM metabib.title_field_entry WHERE length(value) < 20 AND field = 6 ORDER BY source;

 id | source | field | value | index_vector
-----+--------+-------+---------------------+---------------------------------------------------------------
   3 | 2 | 6 | Le concerto / | 'concerto':2A,4C 'le':1A,3C
   4 | 2 | 6 | Le concerto / | 'concerto':2A,4C 'le':1A,3C
  64 | 22 | 6 | The concerto / | 'concerto':2A,4C 'the':1A,3C
  65 | 22 | 6 | The concerto / | 'concerto':2A,4C 'the':1A,3C
 207 | 51 | 6 | Cello concerto. | 'cello':1A,3C 'concerto':2A,4C
 208 | 51 | 6 | Cello concerto. | 'cello':1A,3C 'concerto':2A,4C
 219 | 53 | 6 | 6 Double concertos. | '6':1A,4C 'concerto':6C 'concertos':3A 'doubl':5C 'double':2A
 220 | 53 | 6 | 6 Double concertos. | '6':1A,4C 'concerto':6C 'concertos':3A 'doubl':5C 'double':2A

Oh, wait - that's bug 1032208 which never ended up getting resolved (but you would think that trimming out a large percentage of rows should have some effect on search performance). So I guess this is back to normal, then. Excellent!

Dan Scott (denials) wrote :

I've pushed collab/miker/unify-QP-initialization to rel_2_4 and master, but as collab/miker/split-brain-normalization requires a database upgrade script, I haven't really touched that yet.

However, I have pushed http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/another_metabib_ingest_fix which includes Mike's split-brain-normalization commit as well as another commit to fix the doubled entries in metabib.title_field_entry. Still no database upgrade script, but I'm feeling a lot better about how this is looking.

Mike Rylander (mrylander) wrote :

Awesome! Thanks, Dan.

One question I have is regarding the change in your top commit. Previous to the change, we only cared about adding rows to metabib.{whatever}_field_entry if search_field was true, but now we would add if either that or browse_field were true. We already have a (uniquified) metabib.browse_entry table for browsing and autosuggest, so I'm not understanding that part of the change you made. I'm sure I'm missing something, but could you help me out?

Bill Erickson (berick) wrote :

It's easy to get lost in this ticket..

1. I see 3 commits in working/user/dbs/another_metabib_ingest_fix. Is that everything that needs merging?

2. We appear to still be missing the upgrade script. Is that true?

Dan Scott (denials) wrote :

Bill:

1. Yes, all 3 commits need merging.

2. Yes, the upgrade script is missing. Replacing the guts of the metabib ingest methods is straightforward, the problem is deciding what to do about the reingest.

Mike's reingest approach was working for us, but at a glacial pace (it was going to take at least 13 days to reingest everything and our WAL files were growing at a crazy rate because of the churn of metabib.full_rec & control numbers & authority linking etc; fortunately we still had non-munged metabib.*_field_entry.value columns, so switching to metabib.reingest_metabib_field_entries(id) is about 4 to 5 times faster).

Sites that went live on 2.4.0, or upgraded to 2.4.0 and have already reingested their records, will need to recreate the metabib.*_field_entry rows entirely, because the value columns will have been munged; sites that upgrade from 2.3 or earlier should defer the 2.4.0 reingest and go directly to a metabib.reingest_metabib_field_entries() approach.

Perhaps just echoing that advice at the end of the upgrade script would be enough to get 2.4.1 out the door? Although I would heartily recommend doing something at the 2.4.1 time frame to adjust the 2.3-2.4.0 upgrade script to also prevent people from going through a needless & lengthy reingest at the 2.4.0 portion of the upgrade. Maybe update the metabib function bodies accordingly in the 2.3-2.4.0 script?

Bill Erickson (berick) wrote :

Thanks, Dan. Well, I'm not familiar w/ the code in question, so I'm just a cheerleader here, but I think your suggestions sound sane. Certainly, avoiding a double-re-ingest is justification for futzing with the 2.3-2.4.0 script.

Mike Rylander (mrylander) wrote :

I've pushed a branch to http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/another_metabib_ingest_fix-with-upgrade that includes the two modified functions and a modified "now run this" note. I've included a suggestion to use Dan's metabib.reingest_metabib_field_entries-only reingest method.

Dan Scott (denials) wrote :

Bizarre.

The \qecho statements work perfectly for text, but as soon as they run into a \ command, they actually run the command. So instead of showing users what they should run, it ran the commands immediately. At least for me (on my Fedora 19 system).

I quoted the arguments for \qecho and escaped the backslashed commands, and that appears to have done the trick. Therefore, I signed off on the commits and pushed them (along with the upgrade script) into rel_2_4 and master.

Ben Shum (bshum) on 2013-07-23
Changed in evergreen:
status: Confirmed → Fix Committed
Ben Shum (bshum) on 2013-11-11
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers