Did You Mean Symspell dictionary updates can significantly slow record ingest

Bug #1947173 reported by Galen Charlton
This bug affects 3 people
Affects      Status          Importance   Assigned to   Milestone
Evergreen    Fix Released    High         Unassigned
3.7          Fix Released    High         Unassigned

Bug Description

In sufficiently large databases, the process of updating the Symspell dictionary during record ingest can become rather slow.

In particular, we observed that merging text arrays containing tens or hundreds of thousands of suggestions for a given prefix key during an update can take half a second or longer, which adds up for any given bib record.
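
To illustrate why this merge step is costly, here is a minimal Python sketch of an order-preserving unique merge. It is not the Evergreen implementation (evergreen.text_array_merge_unique() is a database function); the name merge_unique and the sample data are hypothetical. The point is only that each update has to walk the entire existing suggestion list, even when it contributes a single new word.

```python
def merge_unique(existing, incoming):
    """Return the values of existing followed by any new values from incoming.

    Order is preserved and duplicates are dropped. Like the patched
    database function described in this report, this sketch assumes
    neither input contains null (None) values.
    """
    seen = set()
    merged = []
    for value in list(existing) + list(incoming):
        if value not in seen:
            seen.add(value)
            merged.append(value)
    return merged


# Hypothetical suggestion list for one prefix key, with one new word arriving.
existing = [f"word{i}" for i in range(100_000)]
merged = merge_unique(existing, ["word5", "newword"])
print(len(merged))  # 100001
```

Even with a hash-based lookup, the work per merge grows with the size of the existing array, so a handful of oversized prefix keys can dominate the ingest time of a record.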

Cases where prefix keys had far too many suggestions to efficiently process included ISBNs, where prefix keys of '978' and '979' would have their suggestion list contain effectively every ISBN value in the system.
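
The ISBN case can be sketched as follows. This assumes, purely for illustration, that each word is filed under its first three characters; the real dictionary derives its prefix keys differently, and build_prefix_index and the generated values are hypothetical.

```python
from collections import defaultdict


def build_prefix_index(words, key_length=3):
    """Group words by their first key_length characters (illustrative only)."""
    index = defaultdict(list)
    for word in words:
        index[word[:key_length]].append(word)
    return index


# Made-up 13-digit strings that all start with 978, plus a few ordinary words.
isbn_like = [f"978{n:010d}" for n in range(50_000)]
index = build_prefix_index(isbn_like + ["history", "histories", "hiss"])

print(len(index["978"]))  # 50000
print(len(index["his"]))  # 3
```

Ordinary words spread across many keys, while identifier-like values that share a fixed leading sequence all pile into one suggestion list.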

Patches are forthcoming that will:

- Omit suggestions whose length is longer than the prefix key length when the prefix key length is less than or equal to the maximum prefix key length minus the maximum edit distance
- Omit words that contain a run of 5 or more digits. This will drop most identifiers from the dictionary while still allowing suggestions to happen for year values
- Omit empty keys from the dictionary
- Add a small speedup to evergreen.text_array_merge_unique() by making it assume that arrays passed to it do not have null values
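
The first three rules above could be sketched in Python roughly as follows. This is an illustration of the rules as worded in this report, not the patch itself (which changes database functions). The function names are hypothetical, and the default values for the maximum prefix key length (6) and maximum edit distance (3) are assumptions.

```python
import re

# A run of 5 or more consecutive digits marks a word as identifier-like.
LONG_DIGIT_RUN = re.compile(r"\d{5,}")


def keep_word(word):
    """Keep a word unless it contains a run of 5 or more digits."""
    return LONG_DIGIT_RUN.search(word) is None


def keep_entry(prefix_key, suggestion, max_prefix_length=6, max_edit_distance=3):
    """Decide whether (prefix_key, suggestion) belongs in the dictionary."""
    # Empty keys are omitted outright.
    if prefix_key == "":
        return False
    # Identifier-like words are omitted.
    if not keep_word(suggestion):
        return False
    # For short keys, omit suggestions that are longer than the key itself.
    short_key = len(prefix_key) <= max_prefix_length - max_edit_distance
    if short_key and len(suggestion) > len(prefix_key):
        return False
    return True


print(keep_word("1984"))                # True: a year has only 4 digits
print(keep_word("9780306406157"))       # False: 13-digit run
print(keep_entry("", "cat"))            # False: empty key
print(keep_entry("978", "9781"))        # False: short key, longer suggestion
print(keep_entry("histor", "history"))  # True: key is longer than 6 - 3
```

Under these assumed defaults, a three-character key such as '978' can only keep suggestions of three characters or fewer, which is what prevents it from collecting every ISBN in the system.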

Besides improving reingest speed, the patches will also make the search.symspell_dictionary table significantly smaller.

Galen Charlton (gmc)
tags: added: search
description: updated
tags: added: performance
Mike Rylander (mrylander) wrote :
tags: added: pullrequest
Changed in evergreen:
milestone: none → 3.8-rc
importance: Undecided → High
Garry Collum (gcollum) wrote :

This was applied to our production Evergreen (3.7.1) on 10-14. The time it took to import MARC records improved vastly after the patch was applied.

Galen Charlton (gmc)
Changed in evergreen:
status: New → Confirmed
assignee: nobody → Galen Charlton (gmc)
Galen Charlton (gmc) wrote :

Pushed all the way down to rel_3_7. Thanks, Mike and Garry!

Changed in evergreen:
status: Confirmed → Fix Committed
assignee: Galen Charlton (gmc) → nobody
Blake GH (bmagic) wrote :

Linking a related bug for more surface area

bug 1931737

Galen Charlton (gmc)
Changed in evergreen:
status: Fix Committed → Fix Released
tags: added: didyoumean