Wishlist: Did you mean? Single word, single class substitutions

Bug #1893997 reported by Andrea Neiman
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Wishlist
Unassigned

Bug Description

This project is sponsored by the Evergreen Community Development Initiative, with development work performed by Equinox.

As part of a larger search-suggestions project, Equinox will be creating a suggestion engine for Evergreen intended to provide specific, relevant, and bibliographic-based search suggestions for single-word single-class searches.

This will include:
* Database-level implementation of the SymSpell algorithm
* Options for administrators to apply suggestion mechanisms, including Levenshtein distance, Soundex similarity, Trigram similarity, QWERTY distance, and global term frequency as derived from the bibliographic data of a local Evergreen instance
* Changes to the core search API to support these suggestions
* Various flags & YAOUSen to control suggestion options

Full specifications can be seen here: https://yeti.equinoxinitiative.org/dev/public/techspecs/dym_stage2_20200131.pdf

Revision history for this message
Mike Rylander (mrylander) wrote :
tags: added: pullrequest
Revision history for this message
Andrea Neiman (aneiman) wrote :
Andrea Neiman (aneiman)
Changed in evergreen:
milestone: none → 3.7-beta
Revision history for this message
Ruth Frasur Davis (redavis) wrote :

This work has been tested by the Evergreen Community Development Initiative and works as described and expected.

Revision history for this message
Elizabeth Thomsen (et-8) wrote :

This development is only supposed to suggest words found in the database, so when testing this on a server with the concerto database, it's not going do well on a lot of common words: a search on alfabet suggested planet. But test this using misspellings of music related words, and you'll see it working as expected. I searched voilin, basoon, rachmaniov, tumpet, strumpet, peano, Kachaturian and similar words and it did great!

Revision history for this message
Mike Rylander (mrylander) wrote :

I missed one line needed for the bootstrap opac addition. There's a new commit at the top of the existing branch with that change. Future committer, please feel free to just squash that (or request a new branch).

Revision history for this message
Gina Monti (gmonti90) wrote :

Feedback Fest 2/11 note: only works if the search query has one word in the string and not multiple as far as I'm testing.

Revision history for this message
Gina Monti (gmonti90) wrote :

Sorry, I just saw in the notes for Feedback Fest that this is a single word query feature.

I, Gina Monti, sign off on this bug fix. Single-word queries with incorrect spelling do feature suggestions.

Andrea Neiman (aneiman)
tags: added: signedoff
Revision history for this message
Chris Sharp (chrissharp123) wrote :

The upgrade script on a copy of PINES production is taking an unduly long time - I began it Tuesday morning, March 2 and it's still running right now, I'm guessing about halfway through (on "identifier"), followed by a reingest. Mike Rylander has confirmed that the script should take a "long time", but this may create a problem for admins applying this feature to an upgrade.

Revision history for this message
Mike Rylander (mrylander) wrote :

I've added a script to be used at upgrade time that generates the dictionary data based on an export of certain table data. For ref, the current branch is at:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1893997-did-you-mean

Revision history for this message
Mike Rylander (mrylander) wrote :

I've pushed a rebased version of this branch, along with a followup commit that improves the RAM and time performance of the side-loader and updates the instructions for its use:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1893997-did-you-mean-rebase-2021-03-11

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Mike, with the latest incarnation of the side load procedure, I was able to generate all of the load files. It still took about 8 hours in parallel, mainly because it took about 8 hours to generate the keyword SQL file.

It still consumed a lot of memory. When I stopped watching around 5:00 PM yesterday, two of the processes were still reading lines and consuming over 20GB of RAM each.

This morning, when I rand a sideload.sql made by copying and pasting the code from the db upgrade script comments, I got the following output:

time psql -U evergreen -h localhost -p 5434 -f sideload.sql
ALTER TABLE
TRUNCATE TABLE
psql:identifier.sql:5: ERROR: syntax error at or near "FROM"
LINE 5: ) FROM STDIN;
          ^
psql:identifier.sql:7: ERROR: cross-database references are not implemented: "search.search.symspell_dictionary_partial_identifier"
psql:identifier.sql:6400686: error: out of memory

real 1m14.796s
user 0m27.587s
sys 0m4.825s

I also have a concern about setting the table to unlogged as this will cause the table to not be replicated to any replication servers. This suggests that we need a release note explaining this side load process and its many implications.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Here's my sideload.sql for the sake of completeness

Revision history for this message
Mike Rylander (mrylander) wrote :

Thanks, Jason.

I'll obviously fix the typo in the generated SQL. Thanks for testing, I hope you were able to remove the obvious issue and try it.

As for the table being unlogged, as soon as it is switched to logged, it is replicated. So, I don't see the problem there. It loads much faster that way because it doesn't generate the WAL of the upserts, just a representation of the table when we switch it back to logged.

Thanks again.

Revision history for this message
Mike Rylander (mrylander) wrote :

I've pushed a commit (the last 4 of which on the branch can be squashed into the main one by me, upon request, or by a committer) to address the copy/paste issue Jason found.

Just to be explicit, there is no replication-related issue due to how the script uses ALTER TABLE...SET UNLOGGED. The table is only unlogged while it's being written (and then rewritten for speed by CLUSTER, and then reindexed for index compaction), and then is replicated once it's finalized via ALTER TABLE...SET LOGGED. There's no purpose in replicating the dictionary during build, other than to test your disk and network I/O.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Mike, I have corrected the two typos in each of the SQL files, and I'm running the script again.

I wasn't sure if the changes to an unlogged table would be replicated after it was set to logged. The manual only says that unlogged tables are not replicated, and I didn't see anything about what happens after an unlogged table is set to logged. I didn't look any further than the create table documentation, so if it is somewhere else in the documentation, then I apologize.

I didn't jump on this right away because I moved on to something else in the meantime.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

After fixing the previous typos, I get this:

CREATE TABLE
COPY 314233
psql:series.sql:314247: ERROR: syntax error at or near "CONFILICT"
LINE 3: ON CONFILICT (prefix_key) DO UPDATE

I'll see if I can fix those, or if you have addressed it in the latest commit, I can start over.

Revision history for this message
Mike Rylander (mrylander) wrote :

I'll make sure that's fixed as well, but I don't see any reason to spend 8 more hours if a couple shell commands can rewrite the typo.

Thanks again.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

After much wrangling with data, I have tested this feature and pushed a signoff branch to the working repository:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp-1893997-did-you-mean-squash-signoff

I think a release note that explains the data preparation process should be added, but that can come later.

Since the beta cutoff has come and gone, I'm not pushing this to master. I will leave it up to the release managers to work out whether or not this goes in for 3.7.

Revision history for this message
Mike Rylander (mrylander) wrote :

And here's a branch with release notes at the top, to be improved. But I'm out of brain for the moment^Wday^Wweek.

Thanks, Jason, for chasing this with me for the last couple days!

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1893997-did-you-mean-signoff-release-notes

Galen Charlton (gmc)
Changed in evergreen:
assignee: nobody → Galen Charlton (gmc)
Revision history for this message
Galen Charlton (gmc) wrote :

Pushed to master for inclusion in 3.7. Thanks, Mike, Gina, Jason, and Chris!

Changed in evergreen:
assignee: Galen Charlton (gmc) → nobody
status: New → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.