Did You Mean's search.symspell_dictionary can get significantly bloated

Bug #1998355 reported by Galen Charlton
Affects      Status         Importance   Assigned to   Milestone
Evergreen    Status tracked in Main
  3.10       Fix Released   Medium       Unassigned
  3.9        Fix Released   Medium       Unassigned
  Main       Fix Released   Medium       Unassigned

Bug Description

We observed that search.symspell_dictionary can become significantly bloated (growing by gigabytes) in certain scenarios. One in particular was:

- loading ~50K bib records in a large database
- doing bib-to-authority linking using authority_control_fields.pl

The combination resulted in a couple hundred GB of disk space getting used unexpectedly.

Focusing on authority_control_fields.pl in particular, that job has the effect of adding $0 to the bib records. However, since $0 is not generally included in the keyword indexes, the changes that job makes would typically not affect the DYM dictionary.

However, in practice such changes result in the following:

- bib gets reingested (because it's actually different with the $0's)
- all metabib field entries get deleted and recreated. Because the field entry rows are recreated rather than updated, the check in search.symspell_maintain_entries() that avoids adding a null change to the queue of updates in search.symspell_dictionary_updates is bypassed
- when search.symspell_dictionary_updates gets processed, either during the main reingest transaction or later, all of the relevant rows in search.symspell_dictionary get updated. Those updates do not actually change the contents of the rows, but each one still creates a new row version.
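
To make the last step concrete, here is a minimal sketch of how a no-op UPDATE still leaves a new row version behind in PostgreSQL, and how a guard on the UPDATE avoids that. The demo_dictionary table and its columns are hypothetical stand-ins, not the real definition of search.symspell_dictionary, and the guard shown is one general technique rather than necessarily what the patch does.

```sql
-- Hypothetical stand-in table; NOT the real search.symspell_dictionary schema.
CREATE TEMP TABLE demo_dictionary (
    prefix_key    TEXT PRIMARY KEY,
    keyword_count INT NOT NULL
);
INSERT INTO demo_dictionary VALUES ('evergreen', 3);

-- ctid is the physical location of the current row version.
SELECT ctid, * FROM demo_dictionary;

-- A no-op update: the value is unchanged, yet PostgreSQL writes a new
-- row version and leaves the old one behind as a dead tuple.
UPDATE demo_dictionary SET keyword_count = 3 WHERE prefix_key = 'evergreen';

-- The ctid now differs from the one reported by the first SELECT.
SELECT ctid, * FROM demo_dictionary;

-- Guarded update: rows whose value would not change are filtered out,
-- so no new row version is written (this statement updates zero rows).
UPDATE demo_dictionary
   SET keyword_count = 3
 WHERE prefix_key = 'evergreen'
   AND keyword_count IS DISTINCT FROM 3;
```

IS DISTINCT FROM is used instead of <> so that the comparison also behaves sensibly when either side is NULL.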

The kicker: empirically, autovacuum simply cannot keep up well enough to keep the size of search.symspell_dictionary in check. More aggressive autovacuum settings for that table may help, but we also observed that lock contention prevented manual vacuums from marking rows as dead.
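
As a rough sketch, checking for bloat and tightening autovacuum for this one table could look like the following. The threshold values are illustrative assumptions chosen only to show the syntax, not tested recommendations for any particular Evergreen installation.

```sql
-- How large is the table, and how many dead row versions do the
-- cumulative statistics currently report for it?
SELECT pg_size_pretty(pg_total_relation_size('search.symspell_dictionary')) AS total_size,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
  FROM pg_stat_user_tables
 WHERE schemaname = 'search'
   AND relname = 'symspell_dictionary';

-- Per-table autovacuum overrides (assumed values). The PostgreSQL
-- defaults are a threshold of 50 rows plus a scale factor of 0.2,
-- i.e. 20% of the table must be dead before autovacuum kicks in.
ALTER TABLE search.symspell_dictionary SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 1000
);
```

Tuning like this only addresses the symptom; the redundant updates themselves still happen.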

A patch is forthcoming.

Galen Charlton (gmc)
Changed in evergreen:
importance: Undecided → Medium
Galen Charlton (gmc) wrote :
Galen Charlton (gmc)
tags: added: database performance pullrequest search
Jason Stephenson (jstephenson) wrote :

The branch needs a database upgrade script.

tags: added: didyoumean
Galen Charlton (gmc) wrote :

Oops. I've force-pushed to the working branch a version that is rebased against master and contains the upgrade script.

Jason Boyer (jboyer) wrote :

I'm going to be looking at this more soon, but I wanted to drop this here since we probably have a use for it in a variety of places: https://www.postgresql.org/docs/14/functions-trigger.html The suppress_redundant_updates_trigger function is available in PostgreSQL 8.4+.
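
For reference, a minimal sketch of attaching that trigger, following the example in the linked PostgreSQL documentation. Applying it to search.symspell_dictionary and the trigger name z_min_update are assumptions for illustration; this is not necessarily what the fix for this bug does.

```sql
-- Skip UPDATEs that would not change any column of the row.
-- The trigger name starts with "z" so that it sorts, and therefore fires,
-- after other BEFORE UPDATE triggers that might modify the row.
CREATE TRIGGER z_min_update
    BEFORE UPDATE ON search.symspell_dictionary
    FOR EACH ROW
    EXECUTE FUNCTION suppress_redundant_updates_trigger();
```

EXECUTE FUNCTION requires PostgreSQL 11 or later; older releases spell it EXECUTE PROCEDURE. The trigger adds a small per-row cost to every update, so it pays off mainly on tables where redundant updates are common.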

Jason Boyer (jboyer)
Changed in evergreen:
status: New → Confirmed
Jason Boyer (jboyer) wrote :

Does what it says and has been doing it for some time. Pushed through rel_3_9, thanks Galen!
