Did You Mean optimization fails for some data sets
Bug #1931162 reported by Mike Rylander
This bug affects 8 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Evergreen | Fix Released | High | Unassigned | 3.7.2 |
Bug Description
Evergreen Versions: 3.7+
PostgreSQL Versions: all supported
OpenSRF versions: n/a
For some data sets and some queries the Did You Mean search suggestion logic can be much too slow. This is mainly due to cases where a "misspelled" word longer than the symspell prefix length is checked against many short prefixes that have many long suggestions attached to them.
A branch with a drop-in update to the search suggestion logic is linked in the comments below.
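To make the performance problem concrete, here is a minimal sketch of a symspell-style lookup. This is not Evergreen's actual PL/pgSQL implementation; the names (`deletes`, `build_index`, `suggest`) and the `PREFIX_LEN`/`MAX_DIST` values are hypothetical. The key point is that every candidate pulled from a prefix key still needs an edit-distance check, so a short prefix with many long suggestions forces many expensive checks.

```python
PREFIX_LEN = 6  # assumed symspell prefix length
MAX_DIST = 2    # assumed maximum edit distance

def levenshtein(a, b):
    """Plain edit distance; the expensive per-candidate test."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def deletes(word, max_dist=MAX_DIST):
    """All strings reachable from the word's prefix by up to max_dist deletions."""
    prefix = word[:PREFIX_LEN]
    out, frontier = {prefix}, {prefix}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        out |= frontier
    return out

def build_index(dictionary):
    """Map each delete-variant prefix key to the words that produce it."""
    index = {}
    for word in dictionary:
        for d in deletes(word):
            index.setdefault(d, []).append(word)
    return index

def suggest(index, query):
    """Gather candidates via shared delete keys, then verify each one.

    The verification loop is where the cost blows up: a short prefix key
    can carry many long suggestions, each needing a full distance check.
    """
    candidates = set()
    for d in deletes(query):
        candidates.update(index.get(d, ()))
    return [w for w in candidates if levenshtein(query, w) <= MAX_DIST]
```

For example, `suggest(build_index(["search", "sample"]), "serch")` finds "search" because deleting one character from "search" yields the query's own prefix key.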
Changed in evergreen:
  assignee: Mike Rylander (mrylander) → nobody
  importance: Undecided → High
Changed in evergreen:
  status: New → Confirmed
Changed in evergreen:
  milestone: 3.7.1 → 3.7.2
Changed in evergreen:
  status: Fix Committed → Fix Released
  tags: added: didyoumean
Branch is available here for testing:
https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1931162-DYM-optimization (branch user/miker/lp-1931162-DYM-optimization)
From the commit:
For some data sets and some queries the Did You Mean search suggestion logic can be much too slow. This is mainly in cases where a "misspelled" word of sufficient length greater than the symspell prefix length is checked against many short prefixes that have many long suggestions attached to them.
This commit optimizes for that case in particular by testing the length of suggestions and prefix keys against the user input to avoid unnecessary tests. Further, it captures the edit distance of suggestions that pass that test in-line, avoiding expensive retesting, and caches the short-cutoff edit distance when in low-verbosity mode to avoid future different-but-not-too-different suggestions coming from the same prefix key.
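The length test described above can be sketched as follows. This is a hypothetical illustration, not the committed PL/pgSQL: the insight is that if two strings differ in length by more than the maximum edit distance, their distance must exceed it, so the expensive computation can be skipped outright, and each distance that is computed is captured once rather than retested.

```python
def levenshtein(a, b):
    """Plain edit distance; the expensive test we want to avoid or run once."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def filter_suggestions(query, candidates, max_dist=2):
    """Length-prune first, then compute each surviving distance exactly once."""
    results = {}
    for cand in candidates:
        # Length test: |len(cand) - len(query)| > max_dist implies the edit
        # distance also exceeds max_dist, so skip the computation entirely.
        if abs(len(cand) - len(query)) > max_dist:
            continue
        if cand not in results:  # capture each distance in-line, never retest
            d = levenshtein(query, cand)
            if d <= max_dist:
                results[cand] = d
    # Order by distance, then length (echoing the length-ordering micro-opt).
    return sorted(results, key=lambda w: (results[w], len(w)))
```

Here a long candidate such as "searching" is rejected by the length test alone for the query "serch", without ever running the distance computation.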
It additionally provides a general optimization by batching the capture of suggest counts to avoid per-suggestion secondary lookups, and a micro-optimization of ordering suggestions by length at distance cache time.
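The batching idea can be illustrated with a toy store standing in for the database (the `ToyStore` class and its methods are hypothetical, not Evergreen's schema): instead of one count lookup per suggestion, all counts are fetched in a single round trip, analogous to a single SQL query over the whole batch.

```python
class ToyStore:
    """Toy stand-in for a table of suggestion counts (hypothetical)."""
    def __init__(self, counts):
        self.counts = counts
        self.calls = 0  # number of "queries" issued against the store

    def lookup(self, key):
        self.calls += 1           # one round trip per key
        return self.counts[key]

    def lookup_many(self, keys):
        self.calls += 1           # one round trip for the entire batch
        return {k: self.counts[k] for k in keys}

def counts_per_item(store, suggestions):
    """Naive approach: a secondary lookup for every suggestion."""
    return {s: store.lookup(s) for s in suggestions}

def counts_batched(store, suggestions):
    """Batched approach: capture all suggest counts in one pass."""
    return store.lookup_many(suggestions)
```

Both functions return the same counts, but the batched version issues a single query regardless of how many suggestions there are.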