Did You Mean optimization fails for some data sets
Bug #1931162 reported by Mike Rylander
This bug affects 8 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Evergreen | Fix Released | High | Unassigned | 3.7.2 |
Bug Description
Evergreen Versions: 3.7+
PostgreSQL Versions: all supported
OpenSRF versions: n/a
For some data sets and some queries the Did You Mean search suggestion logic can be much too slow. This is mainly due to cases where a "misspelled" word longer than the symspell prefix length is checked against many short prefixes that have many long suggestions attached to them.
A branch with a drop-in update to the search suggestion logic is linked in the comments below.
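To make the performance problem concrete, here is a minimal sketch of a symspell-style lookup. This is not Evergreen's actual PL/pgSQL implementation; the names (`deletes`, `build_index`, `suggest`) and the `PREFIX_LEN`/`MAX_DIST` values are hypothetical. The key point is that every candidate pulled from a prefix key still needs an edit-distance check, so a short prefix with many long suggestions forces many expensive checks.

```python
PREFIX_LEN = 6  # assumed symspell prefix length
MAX_DIST = 2    # assumed maximum edit distance

def levenshtein(a, b):
    """Plain edit distance; the expensive per-candidate test."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def deletes(word, max_dist=MAX_DIST):
    """All strings reachable from the word's prefix by up to max_dist deletions."""
    prefix = word[:PREFIX_LEN]
    out, frontier = {prefix}, {prefix}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        out |= frontier
    return out

def build_index(dictionary):
    """Map each delete-variant prefix key to the words that produce it."""
    index = {}
    for word in dictionary:
        for d in deletes(word):
            index.setdefault(d, []).append(word)
    return index

def suggest(index, query):
    """Gather candidates via shared delete keys, then verify each one.

    The verification loop is where the cost blows up: a short prefix key
    can carry many long suggestions, each needing a full distance check.
    """
    candidates = set()
    for d in deletes(query):
        candidates.update(index.get(d, ()))
    return [w for w in candidates if levenshtein(query, w) <= MAX_DIST]
```

For example, `suggest(build_index(["search", "sample"]), "serch")` finds "search" because deleting one character from "search" yields the query's own prefix key.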
Changed in evergreen:
  assignee: Mike Rylander (mrylander) → nobody
  importance: Undecided → High
Changed in evergreen:
  status: New → Confirmed
Changed in evergreen:
  milestone: 3.7.1 → 3.7.2
Changed in evergreen:
  status: Fix Committed → Fix Released
  tags: added: didyoumean
Branch is available here for testing:
https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1931162-DYM-optimization (branch user/miker/lp-1931162-DYM-optimization)
From the commit:
For some data sets and some queries the Did You Mean search suggestion logic can be much too slow. This is mainly in cases where a "misspelled" word of sufficient length greater than the symspell prefix length is checked against many short prefixes that have many long suggestions attached to them.
This commit optimizes for that case in particular by testing the length of suggestions and prefix keys against the user input to avoid unnecessary tests. Further, it captures the edit distance of suggestions that pass that test in-line, avoiding expensive retesting, and caches the short-cutoff edit distance when in low-verbosity mode to avoid future different-but-not-too-different suggestions coming from the same prefix key.
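The length test described above can be sketched as follows. This is a hypothetical illustration, not the committed PL/pgSQL: the insight is that if two strings differ in length by more than the maximum edit distance, their distance must exceed it, so the expensive computation can be skipped outright, and each distance that is computed is captured once rather than retested.

```python
def levenshtein(a, b):
    """Plain edit distance; the expensive test we want to avoid or run once."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def filter_suggestions(query, candidates, max_dist=2):
    """Length-prune first, then compute each surviving distance exactly once."""
    results = {}
    for cand in candidates:
        # Length test: |len(cand) - len(query)| > max_dist implies the edit
        # distance also exceeds max_dist, so skip the computation entirely.
        if abs(len(cand) - len(query)) > max_dist:
            continue
        if cand not in results:  # capture each distance in-line, never retest
            d = levenshtein(query, cand)
            if d <= max_dist:
                results[cand] = d
    # Order by distance, then length (echoing the length-ordering micro-opt).
    return sorted(results, key=lambda w: (results[w], len(w)))
```

Here a long candidate such as "searching" is rejected by the length test alone for the query "serch", without ever running the distance computation.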
It additionally provides a general optimization by batching the capture of suggest counts to avoid per-suggestion secondary lookups, and a micro-optimization of ordering suggestions by length at distance cache time.
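The batching idea can be illustrated with a toy store standing in for the database (the `ToyStore` class and its methods are hypothetical, not Evergreen's schema): instead of one count lookup per suggestion, all counts are fetched in a single round trip, analogous to a single SQL query over the whole batch.

```python
class ToyStore:
    """Toy stand-in for a table of suggestion counts (hypothetical)."""
    def __init__(self, counts):
        self.counts = counts
        self.calls = 0  # number of "queries" issued against the store

    def lookup(self, key):
        self.calls += 1           # one round trip per key
        return self.counts[key]

    def lookup_many(self, keys):
        self.calls += 1           # one round trip for the entire batch
        return {k: self.counts[k] for k in keys}

def counts_per_item(store, suggestions):
    """Naive approach: a secondary lookup for every suggestion."""
    return {s: store.lookup(s) for s in suggestions}

def counts_batched(store, suggestions):
    """Batched approach: capture all suggest counts in one pass."""
    return store.lookup_many(suggestions)
```

Both functions return the same counts, but the batched version issues a single query regardless of how many suggestions there are.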