Evergreen

Comment 11 for bug 1187433

Mike Rylander (mrylander) wrote on 2013-06-12:

Some out of order copy/paste for context below ... I'll try to make it readable, but this web UI...

> It may be "just" a 2.4+ issue, but it's still a significant issue, and quite the regression from what just worked in 2.3.

My "just" was used to highlight the exclusion of the patch I posted as being involved in this change, not as a relative value or priority judgement of the issue.

> Looks like this problem is turning up because in 2.4 we decided to start normalizing the content in the value field, instead of letting the index_vector take care of the normalizing, and this is part of the fall out. I assume that weight A was meant to be used for phrase searches, but maybe the late revert of some of that code left weight A as a vestigial result, and now that phrase searches have gone back to matching against the (now normalized) value field, we're kind of screwed for 2.4?

Your assumption is incorrect. And while we're far from screwed, we are in a bind for those on 2.4.0. I'll try to explain as best I can.

The revert you mention had nothing to do with that part of the code. There was discussion regarding the ability to use the A-weight slot in the future for relevance bump calculation as part of the baseline rank, but phrase matching (that is, downcased exact string matching) has not been strongly considered as a use for that slot -- though it might be possible.

Regarding the issue at hand, the code that calculates the value and index_vector columns of *field_entry tables was indeed changed in a significant way by Thomas' work, and it threw away an essential feature of indexing that existed before.

Specifically, the documented[1] behavior of index normalizers was such that those with a "pos" value (used to order the application of normalizers) of less than 0 would be applied in such a way that their effects would show up in the value column, while the effects of those with a "pos" value greater than or equal to 0 would show up only in the index_vector tsvector column. Unless Thomas can explain why this is, I contend that this behavioral change must be reverted, and I will be posting a branch to do just that, soon. Further, facet values may currently be adversely affected by this behavioral change. It's unfortunate that this behavioral change slipped through -- perhaps Thomas didn't know about the purpose of the previous code layout, or didn't think it was important -- but I do not recall this change being documented in any way in his proposal, the discussions that led to the eventual code, or the commit messages that accompanied the code. If I've simply missed that, I'd be happy to see a pointer to that.
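To make the documented "pos" rule concrete, here is a minimal Python sketch. It is not Evergreen's actual ingest code: the function apply_normalizers and the two example normalizers are hypothetical, and it assumes that the negative-pos normalizers also feed into the text that gets indexed, which the rule above implies but does not spell out.

```python
# Hypothetical illustration of the documented "pos" rule.
# This is NOT Evergreen's ingest code; all names here are made up.

def apply_normalizers(raw, normalizers):
    """Apply (pos, func) normalizers in ascending pos order.

    Returns (value, index_source):
      value        -- what would land in the value column (pos < 0 only)
      index_source -- the text that would be turned into index_vector
                      (assumed: pos < 0 first, then pos >= 0)
    """
    ordered = sorted(normalizers, key=lambda n: n[0])

    # Normalizers with pos < 0 affect the stored value.
    value = raw
    for pos, func in ordered:
        if pos < 0:
            value = func(value)

    # Normalizers with pos >= 0 affect only the indexed text.
    index_source = value
    for pos, func in ordered:
        if pos >= 0:
            index_source = func(index_source)

    return value, index_source


def strip_trailing_punct(s):
    # Example normalizer: drop trailing ISBD-style punctuation and spaces.
    return s.rstrip(" /:;,.")


def lowercase(s):
    # Example normalizer: downcase the string.
    return s.lower()


value, index_source = apply_normalizers(
    "The Hobbit /",
    [(0, lowercase), (-1, strip_trailing_punct)],
)
```

Under those assumptions, the stored value keeps its case (only the pos -1 normalizer touched it), while the indexed text is additionally downcased.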

That does mean a re-ingest, but with the patch already provided we've improved the situation. Given that, I think a rolling re-ingest is reasonable once the ingest regression is addressed. Something as simple as this seems appropriate:

~$ psql evergreen -tc 'select $$update biblio.record_entry set id = id where id = $$ || id || $$; select pg_sleep(1);$$ from biblio.record_entry where not deleted' | psql evergreen

... with ingest.reingest.force_on_same_marc enabled, of course.

> As far as I can tell, though, QueryParser currently has no good way of telling whether to apply English or French stemming algorithms against the incoming search query.

Sure it does. The OpenSRF session's locale is translated to a "preferred_language" modifier, which is then combined with all of the "always-use" ts config mappings (if present for the requested class or field). That original locale is derived from user choice, or failing that, an Apache setting.

> If I understand you correctly, you're suggesting that we enable sites to configure a secondary language (such as French, or Spanish), and then index & match _all_ incoming queries as both a normalized primary language (for weight C) and a secondary language (for weight D) -- and enable stopwords for D, and maybe (or maybe not) stopwords for C, but not for A, as well?

You're not understanding me correctly, or at least not completely. I'll try at some time in the near future to explain more clearly, but this bug isn't the place, especially since I believe I have identified the cause of the phrase-related regression that was of primary concern in your follow-up. Making the future better can wait until we fix present mistakes (I'm certain you agree).

> The suggestion to make use of weight D for a secondary language _would_ enable Evergreen to offer some level of support for a secondary language, I guess, but it's a hard cap at a second language. Unless, of course, we throw a third language into weight B...

No, it could contain all language-specific parsings of the original string. Simply add the languages of interest to the D-weight slot, and the index vector would end up containing multiple, differently parsed sub-tsvectors derived from the original.

> You said "conventionally reserved for alternate language indexing"--I'm really interested in learning more about conventions with respect to PostgreSQL full text search. Do you have a reference for that?

By our convention, as alluded to in the design documents Thomas created[2], and before that discussed at some length in IRC this past January.

[1] http://evergreen-ils.org/dokuwiki/doku.php?id=documentation:indexing#field_normalization_settings
[2] http://evergreen-ils.org/dokuwiki/doku.php?id=dev:search_changes
