Did you mean breaks parallel reingest and causes deadlocks when loading/overlaying bib records in the client

Bug #1931737 reported by Blake GH
44
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
High
Unassigned
3.8
Fix Committed
High
Unassigned

Bug Description

PG 9.5
EG 3.7.0

We've found that if we reingest the Evergreen database (using pingest.pl) more than one record at a time, PostgreSQL will trhow errors.

Example:

./pingest.pl --user evergreen --host 127.0.0.1 --db evergreen --password password --port 5432 --max-child 4

PostgreSQL logs results with:

DBD::Pg::st execute failed: ERROR: deadlock detected
DETAIL: Process 48054 waits for ShareLock on transaction 11063061; blocked by process 48055.
Process 48055 waits for ShareLock on transaction 11063133; blocked by process 48054.
HINT: See server log for query details.
CONTEXT: while rechecking updated tuple (98465,12) in relation "symspell_dictionary"
PL/pgSQL function search.symspell_build_and_merge_entries(text,text,text,boolean) line 34 at RETURN QUERY
SQL statement "SELECT * FROM search.symspell_build_and_merge_entries(new_value, search_class, old_value)"
PL/pgSQL function search.symspell_maintain_entries() line 17 at PERFORM
SQL statement "DELETE FROM metabib.identifier_field_entry WHERE source = 10210"
PL/pgSQL function metabib.reingest_metabib_field_entries(bigint,boolean,boolean,boolean,boolean,integer[]) line 35 at EXECUTE at ./pingest.pl line 291.
metabib.reingest_metabib_field_entries failed for record 10210 at ./pingest.pl line 294.

It goes away when you do this:

ALTER TABLE metabib.title_field_entry DISABLE TRIGGER maintain_symspell_entries_tgr;
ALTER TABLE metabib.author_field_entry DISABLE TRIGGER maintain_symspell_entries_tgr;
ALTER TABLE metabib.subject_field_entry DISABLE TRIGGER maintain_symspell_entries_tgr;
ALTER TABLE metabib.series_field_entry DISABLE TRIGGER maintain_symspell_entries_tgr;
ALTER TABLE metabib.keyword_field_entry DISABLE TRIGGER maintain_symspell_entries_tgr;
ALTER TABLE metabib.identifier_field_entry DISABLE TRIGGER maintain_symspell_entries_tgr;

Blake GH (bmagic)
summary: - Reingest in parrallel broken in 3.7
+ Reingest in parallel broken in 3.7
Revision history for this message
Jason Stephenson (jstephenson) wrote : Re: Did you mean breaks parallel reingest

I have confirmed this behavior with PostgreSQL 12 and Evergreen 3.5.3 with the did you mean/symspell code backported.

I have also verified that the behavior continues after applying the patch from bug 1931162.

summary: - Reingest in parallel broken in 3.7
+ Did you mean breaks parallel reingest
Changed in evergreen:
status: New → Confirmed
tags: added: database
Revision history for this message
Lynn Floyd (lfloyd) wrote :

Running 9.6 with 3.7.1 still does not work .

Revision history for this message
Jane Sandberg (sandbergja) wrote :

Just adding a note for anybody wanting to reproduce this using the concerto data set: I was able to get this error on the concerto data set with Postgres 9.6 by running `./pingest.pl --batch-size 100` (or 50, 200... the batch size didn't really matter, they all caused this error).

Revision history for this message
Michele Morgan (mmorgan) wrote :
Download full text (3.7 KiB)

We have seen the same errors on our production system while loading and overlaying bib records. Our production system is running:

Evergreen 3.7.2
Postgres 9.6
Patch for bug 1947173 applied

Prior to 3.7.2, no deadlocks were logged. Since loading 3.7.2 with the additional patch, we are seeing deadlocks fairly regularly which are NOT related to running pingest.pl. Deadlocks are happening during normal record loading, adding and overlaying via the staff client.

Here's a snippet from the logs:

process 65396 detected deadlock while waiting for ShareLock on transaction 1126196004 after 1000.133 ms
DETAIL: Process holding the lock: 67365. Wait queue: .
CONTEXT: while updating tuple (355971,7) in relation "symspell_dictionary"
PL/pgSQL function search.symspell_build_and_merge_entries(text,text,text,boolean) line 34 at RETURN QUERY
SQL statement "SELECT * FROM search.symspell_build_and_merge_entries(new_value, search_class, old_value)"
PL/pgSQL function search.symspell_maintain_entries() line 17 at PERFORM
SQL statement "DELETE FROM metabib.identifier_field_entry WHERE source = 3798324"
[]) line 35 at EXECUTE
SQL statement "SELECT metabib.reingest_metabib_field_entries(NEW.id)"
PL/pgSQL function biblio.indexing_ingest_or_delete() line 55 at PERFORM
STATEMENT: UPDATE biblio.record_entry
SET marc = '<?xml version="1.0"?>
<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/MARC21/slim" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"><leader>05691cam a2200625Ki 4500</leader><controlfield tag="001">3798324</controlfield><controlfield tag="003">NOBLE</controlfield><controlfield tag="005">20180905042051.7</controlfield><controlfield tag="006">m o d </controlfield><controlfield tag="007">cr cnu---unuuu</controlfield><controlfield tag="008">140630r19861975enk ob 000 0 eng d</controlfield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">N$T</subfield><subfield code="b">eng</subfield><subfield code="e">rda</subfield><subfield code="e">pn</subfield><subfield code="c">N$T</subfield><subfield code="d">VLB</subfield><subfield code="d">YDXCP</subfield><subfield
code="d">EBLCP</subfield><subfield code="d">ICA</subfield><subfield code="d">AGLDB</subfield><subfield code="d">OCLCQ</subfield><subfield code="d">VTS</subfield></datafield><datafield tag="019" ind1=" " ind2=" "><subfield code="a">888549609</subfield><subfield code="a">907152912</subfield><subfield code="a">908511865</subfield><subfield code="a">923184802</subfield><subfield code="a">963324551</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">9780830897971</subfield><subfield code="q">(electronic bk.)</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">0830897976</subfield><subfield code="q">(electronic bk.)</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">0877842930</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">9780877842934</subfield></datafield><datafield
tag="020" ind1=" " ind2=" "><subfield code="z">0851107397</subfield><subfield code="q">(pbk. UK)</subf...

Read more...

tags: added: performance search
Changed in evergreen:
importance: Undecided → High
summary: - Did you mean breaks parallel reingest
+ Did you mean breaks parallel reingest and causes deadlocks when
+ loading/overlaying bib records in the client
Revision history for this message
Mike Rylander (mrylander) wrote :

Hi Michele (and anyone else experiencing this issue), can you provide the output of the queries below? And, did you upgrade straight to 3.7.2, or were you on 3.7.1 or 3.7.0 before that? Versions prior to 3.7.2 had an issue that would bloat certain data and lead to the possibility for this symptom, and the fix for that is to use the newer symspell side loader to reload the dictionary after upgrading to 3.7.2.

Thanks!

-- Queries of interest
select count(*) from search.symspell_dictionary;
select pg_size_pretty(pg_table_size('search.symspell_dictionary'));
select pg_size_pretty(pg_relation_size('search.symspell_dictionary'));
select pg_size_pretty(pg_relation_size('search.symspell_dictionary_pkey'));

Revision history for this message
Michele Morgan (mmorgan) wrote :

Hi Mike,

We upgraded from 3.6.4 to 3.7.2 and, as previously mentioned, applied the patch for bug 1947173.

Here are the results of the queries of interest:

4843731 - select count(*) from search.symspell_dictionary;
3478 MB - select pg_size_pretty(pg_table_size('search.symspell_dictionary'));
3036 MB - select pg_size_pretty(pg_relation_size('search.symspell_dictionary'));
144 MB - select pg_size_pretty(pg_relation_size('search.symspell_dictionary_pkey'));

Revision history for this message
Mike Rylander (mrylander) wrote :

Thanks, Michele.

You're table data is definitely much larger than I'd expect given the prefix key count (by a factor of about 3), but the index size is in the right range. That suggests the initial DYM upgrade step was done with the pre-patched side loader.

Did you apply those patches after the upgrade, or did you install from a git branch made of 3.7.2 plus those patches, or some other way? I ask because if you loaded the symspell dictionary using the base 3.7.2 code, you'll definitely want to reload it with the patched script and functions in place. That's safe assuming a quiescent database for the rewriting part. You can use the same "export -> process -> unlogged -> truncate -> SQL import -> logged" that was requested at upgrade time, just with the new code, and follow up with an ANALYZE on the dictionary table.

You can also try using CLUSTER to shrink the table data size, if it's not suffering from the pre-patch issues.

Revision history for this message
Martha Driscoll (mjdriscoll) wrote :

Mike,
The new patched symspell-sideload.pl script was used during our upgrade. The process was:

1. export from author, title, subject etc.
2. use the patched symspell-sideload.pl to create the sql scripts for loading.
2. install Evergreen 3.7.2
3. run version upgrade scripts
4. run xxxx.schema.symspell-speed-ingest.sql from lp#1947173 to update the database functions
5. import the suggestions created earlier (set unlogged, truncate, import, reindex, set logged, vacuum analyze)

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I was going to add that I've seen the pingest failures on a database where the latest sideload code was used but Martha beat me to it.

Hoever, I also wanted to test on an upgraded database before doing the symspell load to see if that makes a difference. I don't think the deadlocks have anything to do with the number of rows in the tables. If the number of rows matters, then doing a parallel ingest with empty symspell tables should work.

I plan to test this next week after a test database refresh and upgrade from 3.5.3 to 3.7.

I was going to wait for the results of those tests before adding more noise to this bug, but figured I'd add a "Me, too" after seeing Martha's comment.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Running parallel ingest (pingest.pl) on a database with no symsqell entries, I get many messages like the following:

DBD::Pg::st execute failed: ERROR: deadlock detected
DETAIL: Process 114527 waits for ShareLock on transaction 280452035; blocked by
 process 114526.
Process 114526 waits for ShareLock on transaction 280452086; blocked by process
114527.
HINT: See server log for query details.
CONTEXT: while updating tuple (28509,21) in relation "symspell_dictionary"
PL/pgSQL function search.symspell_build_and_merge_entries(text,text,text,boolean
) line 34 at RETURN QUERY
SQL statement "SELECT * FROM search.symspell_build_and_merge_entries(new_value,
search_class, old_value)"
PL/pgSQL function search.symspell_maintain_entries() line 17 at PERFORM
SQL statement "DELETE FROM metabib.keyword_field_entry WHERE source = 42494"
PL/pgSQL function metabib.reingest_metabib_field_entries(bigint,boolean,boolean,
boolean,boolean,integer[]) line 35 at EXECUTE at /openils/bin/pingest.pl line 30
6.
metabib.reingest_metabib_field_entries failed for record 42494 at /openils/bin/p
ingest.pl line 309.

It looks like the symspell triggers need to be disabled for parallel ingest to work.

Revision history for this message
Mike Rylander (mrylander) wrote :

Hi all,

I've pushed a branch to https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1931737-dym-deadlock-fix that addresses the deadlocks by using INSERT ... ON CONFLICT to guarantee that things happen in a predictable order, which side-steps the deadlock issue.

I'm happy to help backport for 3.8, and 3.7 if applicable based on timing, once some eyes are on it.

tags: added: pullrequest
Revision history for this message
Jane Sandberg (sandbergja) wrote :

This works well for me, Mike! Thank you! Signoff pushed to user/sandbergja/lp-1931737-dym-deadlock-fix-signoff

tags: added: signedoff
Changed in evergreen:
milestone: none → 3.9-rc
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I get the following error from pingest on a database upgraded from 3.5.3 to the latest rel_3_7 with this branch applied:

LINE 1: SELECT ind_data.browse_nocase
               ^
QUERY: SELECT ind_data.browse_nocase
CONTEXT: PL/pgSQL function metabib.reingest_metabib_field_entries(bigint,boolean,boolean,boolean,boolean,integer[]) line 80 at IF at /openils/bin/pingest.pl line 259.
Browse ingest failed for record 435 at /openils/bin/pingest.pl line 262.

I have not thoroughly investigated, yet, but it could be that a required upgrade is missing from rel_3_7.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Ooops. I missed this important part of the paste:

Browse ingest failed for record 433 at /openils/bin/pingest.pl line 262.
DBD::Pg::st execute failed: ERROR: record "ind_data" has no field "browse_nocase"

It could also be something that I did "wrong" in generating the DB upgrade, but I did pretty much exactly what the make_release script does.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

The db upgrade for 1298 is missing from rel_3_7. The 1298 upgrade added case insensitivity in browse search. Assuming that is considered a new feature in 3.8, this bug needs a separate branch for rel_3_7.

I will see if I can work that out.

Revision history for this message
Mike Rylander (mrylander) wrote :

Sorry if I wasn't clear in comment 11 about the backporting need, Jason.

I suspect it would be safest to create a version-specific upgrade script for both 3.8 and 3.7, with a new number for each, and I'm happy to assist.

The change to an existing function is just a 1-line addition, but it's a function that has changed from 3.7 to 3.8, so it needs to be specialized. A 3.8 upgrade for that function /also/ has to be run when going from 3.7 to 3.8, though, because of unrelated changes. so we may need to add different numbered script to rel_3_7 and rel_3_8, the latter basically being a copy of just the metabib.reingest_metabib_field_entries() function from 1298.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I'm testing this and so far it is working for me, I pushed a signoff branch to https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dyrcona/lp1931737-dym-deadlock-fix-signoff

And a branch with an additional commit to make it work on rel_3_7 to https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dyrcona/lp1931737-dym-deadlock-fix-rel_3_7

I have not pushed these to the main repository because my parallel ingest is still running. Also, this can probably wait until early next week to go in.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Mike, I missed your comment because Lp doesn't refresh in the browser, and I didn't check my bug email in the mean time.

The db upgrade for 3.7 and the db upgrade for 3.8/master will need different numbers. Otherwise, the fix will get lost when someone upgrades from 3.7 to 3.8 or later.

I think this is basically what you are saying, just in different words.

Also, when a bug has more than a few comments, I often skim over them, so I think I missed your previous remarks about backporting.

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Revision history for this message
Jason Stephenson (jstephenson) wrote :

This probably warrants its own bug, but I want to mention that while parallel ingest works now, it seems to be taking much longer with symspell enabled than with it disabled. I have a parallel ingest running for almost 3 days now, and it has completed 5 out of 228 batches. Normally by this point it would be over halfway done.

Checking the running batches of 10,000 records each, they've been running for 1 day and 5 hours.

I am only running 5 batches simultaneously, and I normally run 8. When running 8 on this setup, the batches of 10,000 usually finish in 4 to 6 hours.

Revision history for this message
Mike Rylander (mrylander) wrote :

Since the current branch fixes the deadlock issue, I think we should commit it as is. While making the pingest tool less parallel is not ideal, I think fixing the potential for spurious failures is more important in the immediate term, and I have a plan (but not the time, before 3.9) to provide a solution to the pingest serialization issue. From IRC:

[22-04-15 12:28:21] <miker> Dyrcona: I won't be able to finish it for you before 3.9.0, I don't think, but so you know that I know (that you know that...) the "deadlock fix" does introduce some serialization of ingest (WITH DELETE ... RETURNING * INSERT ... ON CONFLICT guarantees deadlock-free ACID semantics, but by forcing some locking), but(!) it also introduces a place for us to hook into that we can use avoid that serialization in the main DYM part of the ingest code. if we invent a way of skipping the call to search.symspell_dictionary_reify() in the trigger (global flag, a new "Skip this part" bool for the function, etc) then we can just run a reification function (slightly different than the current reify function, because it should not care about which transaction id generated the change(s)) after all are done, and push DYM changes into the dictionary in one go at the end.

Changed in evergreen:
milestone: 3.9-rc → 3.9.1
Revision history for this message
Mike Rylander (mrylander) wrote :

I've rebased the master/rel_3_9 targeted branch against current master, and added a commit to address the serialization issue for pingest.pl. The new branch is available at

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp1931737-dym-deadlock-plus-pingest

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Removed the signedoff tag because a new branch was posted.

tags: removed: signedoff
Changed in evergreen:
milestone: 3.9.1 → none
Revision history for this message
Mike Rylander (mrylander) wrote :

Hi all,

I've rebased this again, and am re-up'ing for eyeballs.

The two commits together provide protection (garuanteed by Postgres) against the deadlocks that concurrent symspell work can cause, and provide a way to delay symspell's lock-heavy work so that parallelization is effective. This should substantially remove the visible ingest-time performance effects of the symspell.

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp1931737-dym-deadlock-plus-pingest-rebase

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I've been testing this for the past few days, and while the new code works and seems to resolve the performance issues on 3.7, I'm setting this to "Won't Fix" for 3.7. My reasoning is that 3.7 is supposed to be security-only fixes at this point, and the backport for 3.7 will conflict with the fix for 3.8/3.9. This means that we'd have to have different upgrade numbers for 3.7 viz the other branches.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I have pushed a signoff branch to the working repository:

user/dyrcona/lp1931737-dym-deadlock-plus-pingest-rebase-signoff

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp1931737-dym-deadlock-plus-pingest-rebase-signoff

If anyone else wants to have a look, feel free. I'm satisfied with the patch and have tested it on 3.7 and 3.9 with production data. I'll push it in a few days unless someone else beats me to it.

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
milestone: none → 3.9.1
tags: added: signedoff
Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote (last edit ):

I am rescinding my signoff and changing the pullrequest tag to needwork. The database server filled up while running pingest while using the --delay-symspell option:

DBD::Pg::db do failed: ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp232233.1084": No space left on device
CONTEXT: SQL function "symspell_dictionary_full_reify" statement 1 at /openils/bin/pingest.pl line 269.

The above happened on a database server with 1.1TB of free space in a database that is normally 400GB in size.

I am reconfiguring the server over the weekend and will delete several unnecessary databases to free up some space. However, this would have crashed on our production database server as it has a 1.4TB SSD for the same database.

tags: added: needswork
removed: pullrequest signedoff
Revision history for this message
Blake GH (bmagic) wrote :

All,

I tested this branch on a copy of a production database. On Postgres 10. The database was pg_dumped from pg9.6, then restored onto a pg10 instance. Then upgraded EG3.7 -> EG3.9 with this patch. I performed the sideloader and then pingest. Before I started the ingest, I noted the size of the database on disk: 16GB. The first time I used pingest.pl, I did not pass the --delay-symspell switch. The DB grew 2GB, to 18GB. Funny enough, there were no deadlocks without the switch. Then I ran it again with the switch. When it finished, it grew to 20GB. In conclusion, it seems that the database grew by 25% of it's original size on disk. (4 / 16) * 100 = 25%. That's not great, however, this database had previously not had symspell enabled. This test database does not have an external script to deal with the WAL logs, so they are getting rolled off immediately. Not soaking up the disk. The pg_wal folder is 529MB.

FWIW, I've given my sign-off, I did find a missing semicolon which I committed here as well:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/blake/lp1931737-dym-deadlock-plus-pingest-rebase-signoff

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
assignee: nobody → Mike Rylander (mrylander)
Revision history for this message
Mike Rylander (mrylander) wrote :

After further testing and not-on-LP discussion, I've merged this, with Blake's fix, to master, rel_3_9 and rel_3_8.

Thanks, folks!

Changed in evergreen:
assignee: Mike Rylander (mrylander) → nobody
status: Confirmed → Fix Committed
no longer affects: evergreen/3.7
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers