Did You Mean Symspell dictionary updates can significantly slow record ingest
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | Fix Released | High | Unassigned |
3.7 | Fix Released | High | Unassigned |
Bug Description
In sufficiently large databases, the process of updating the Symspell dictionary during record ingest can become rather slow.
In particular, we observed that merging text arrays containing tens or hundreds of thousands of suggestions for a given prefix key during an update can take half a second or longer per merge, which adds up quickly for any given bib record.
Cases where prefix keys had far too many suggestions to efficiently process included ISBNs, where prefix keys of '978' and '979' would have their suggestion list contain effectively every ISBN value in the system.
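To illustrate why short keys such as '978' accumulate so many suggestions, here is a minimal sketch of a generic symmetric-delete dictionary build. It is not Evergreen's actual implementation: the function names, the prefix length of 6, and the maximum edit distance of 3 are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Assumed settings for illustration only; not read from any Evergreen config.
PREFIX_LENGTH = 6
MAX_EDIT_DISTANCE = 3


def prefix_keys(word, prefix_length=PREFIX_LENGTH,
                max_edit_distance=MAX_EDIT_DISTANCE):
    """Return the word's prefix plus every string made by deleting
    up to max_edit_distance characters from that prefix."""
    prefix = word[:prefix_length]
    keys = {prefix}
    for n in range(1, min(max_edit_distance, len(prefix)) + 1):
        # Choose which character positions to keep after n deletions.
        for kept in combinations(range(len(prefix)), len(prefix) - n):
            keys.add("".join(prefix[i] for i in kept))
    return keys


def build_dictionary(words):
    """Map each prefix key to the set of words (suggestions) that produce it."""
    dictionary = defaultdict(set)
    for word in words:
        for key in prefix_keys(word):
            dictionary[key].add(word)
    return dictionary


isbns = ["9780141439518", "9780306406157", "9781861972712", "9791090636071"]
dictionary = build_dictionary(isbns)
```

Under these assumptions, every ISBN beginning with 978 yields the key '978' once the three characters after it are deleted, so that single key's suggestion list grows with the number of ISBNs in the system. Words of three characters or fewer also yield an empty key in this scheme.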
Patches are forthcoming that will:
- Omit suggestions whose length is longer than the prefix key length when the prefix key length is less than or equal to the maximum prefix key length minus the maximum edit distance
- Omit words that contain a run of 5 or more digits. This will drop most identifiers from the dictionary while still allowing suggestions to happen for year values
- Omit empty keys from the dictionary
- Add a small speedup to evergreen.
Besides improving reingest speed, the patches will also make the search.
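The first three filtering rules listed above can be sketched as simple predicates. This is a literal reading of the bug description, not the patch itself: `keep_word`, `keep_entry`, and the two constants are hypothetical names and values introduced only for this example.

```python
import re

# Assumed settings for illustration only; not read from any Evergreen config.
PREFIX_LENGTH = 6
MAX_EDIT_DISTANCE = 3

# A run of 5 or more digits marks a word as a likely identifier (ISBN, barcode).
LONG_DIGIT_RUN = re.compile(r"[0-9]{5,}")


def keep_word(word):
    """Reject words containing 5 or more consecutive digits.
    Four-digit years such as '1984' still pass."""
    return LONG_DIGIT_RUN.search(word) is None


def keep_entry(key, suggestion):
    """Decide whether a (prefix key, suggestion) pair belongs in the dictionary."""
    if key == "":
        # Omit empty keys entirely.
        return False
    short_key = len(key) <= PREFIX_LENGTH - MAX_EDIT_DISTANCE
    if short_key and len(suggestion) > len(key):
        # Short keys keep only suggestions no longer than the key itself.
        return False
    return True
```

With these assumed settings, a key of three characters or fewer retains only suggestions of its own length or shorter, which is what keeps keys like '978' from collecting every ISBN.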
tags: | added: search |
description: | updated |
tags: | added: performance |
Changed in evergreen: | |
status: | New → Confirmed |
assignee: | nobody → Galen Charlton (gmc) |
Changed in evergreen: | |
status: | Fix Committed → Fix Released |
tags: | added: didyoumean |
The branch implementing the fixes described above can be found at:
https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1947173-symspell-ingest-speed