Queued Ingest functionality

Bug #1979071 reported by Mike Rylander
82
This bug affects 21 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Wishlist
Unassigned

Bug Description

Bib and authority record processing in Evergreen is handled by database triggers that fire whenever a record is inserted, updated, or deleted. This makes sure that processes such as catalog search and record display always have correct and up to date data. However, there are situations when the amount of processing time required to process, in particular, a large set of records is detrimental not only to staff, but can cause an action to time out entirely and fail processing.

Queued Ingest is meant to address this by allowing the system to remove time-intensive record processing from the core business logic workflow, and have that work be done in the background (nearly immediately, and in parallel where possible) without blocking staff work.

The in-database ingest logic has been refactored to add checks that decide when and how much processing should happen in-line, and alternatively, when to record the request for particular record ingest actions for later processing. Upon successful processing request queuing, the triggers will signal via Postgres' NOTIFY feature that there are now queued requests.

These decisions are made based on the state of a set of Global Flags, as well as a real-time, live overriding mechanism that can be used to impact specific ingest events, so that the system, or an administrator, can disable Queued Ingest in specific circumstances without having to change the state of the Global Flags for that one need.

The queued requests are handled by an external script that connects to the database, checks for overdue outstanding requests to process, and then waits to receive a notification, via Postgres' LISTEN feature, that one or more processing requests have been enqueued. Upon receiving notification, this script will gather outstanding requests and process them, optionally in parallel.

This script also provides a command line interface for creating named and staff-owned queues, adding processing requests (either to named/owned queues, for tracking purposes, or to the main anonymous request queue), and investigating the current and historical processing status of queued requests.

Revision history for this message
Mike Rylander (mrylander) wrote :

A branch implementing Queued ingest is available at:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1979071-queued-ingest

NOTE: The baseline schema has intentionally /not/ been reified with the changes from the DB upgrade script yet. To test, you will still need to run the DB upgrade (Open-ILS/src/sql/Pg/upgrade/XXXX.schema.queued_ingest.sql) after a fresh installation. Likewise, you can test this in a existing (relatively recent) installation by just running that upgrade script. Reification is left out at the moment because there are at least two other commits that may make it into master before Queued Ingest, and delaying baseline update will remove one potential source of bugs that could sneak in during merge cleanup.

tags: added: pullrequest
Changed in evergreen:
assignee: Mike Rylander (mrylander) → nobody
Revision history for this message
Andrea Neiman (aneiman) wrote :

This work is sponsored by King County Public Library.

Revision history for this message
Mike Rylander (mrylander) wrote :

NOTE: Queued Ingest addresses bug #1193490, authority propagation timeout.

Revision history for this message
Jane Sandberg (sandbergja) wrote (last edit ):

A few notes I took today in the group code review session:

1. We should merge bug 1931737 before merging this. Without this, you will run into deadlocks and failed action.ingest_queue_entry entries.
2. To stop the coordinator, you have to manually kill the process and remove the lockfile. It would be helpful to have a way to gracefully stop the coordinator all at once.

Revision history for this message
Jane Sandberg (sandbergja) wrote :

Ooh, one more thought. Any chance we could get some pgtap coverage on this feature? To my eye, it would be helpful to have some tests that exercise the first part of the new evergreen.indexing_ingest_or_delete where it consults all those flags and determines whether to enqueue or process inline (lines 511-546).

Revision history for this message
Mike Rylander (mrylander) wrote :

I've rebased the main commit, added a --stop suboption to the --coordinator option, and picked the pingest-on-queued-ingest commit for consideration. If we like the shape of the --stop option I'd suggest we squash it into the main commit. Rebased branch (with additions) is here:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1979071-queued-ingest-rebase-with-stop-option

Regarding Jane's testing suggestion, I'll give some thought to how we might set up pgtap tests, but if anyone has time and it's obvious to them how to do so, please don't wait on me for that!

Changed in evergreen:
status: New → Confirmed
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I plan to take a look at this very soon with production data. I've made a test branch with Mike's latest rebase and the first commit on Blake's signoff branch for bug 1931737 and that is the only commit necessary to really test this.

Right off the bat, I think it would be good to have ingest_ctl installed in /openils/bin as part of make install.

I can't guarantee that I'll have enough time to review this in the immediate future, so I'm not assigning the bug to myself just yet. If anyone else wants to look at this, too, feel free.

Revision history for this message
Jane Sandberg (sandbergja) wrote :

In case they are helpful to anyone, here are my notes from today's ensemble code review: https://gist.githubusercontent.com/sandbergja/47bdc5fa46e767aa21ebe6ce6abdd30f/raw/f2c60fad62d032b803809681c71347ea0642d231/notes

Revision history for this message
Jason Stephenson (jstephenson) wrote :

While testing this, I ran a script to remove located URIs from the Ebrary vendor for one of our member libraries.

I had ingest.queued.all set to enabled and ingest_ctl running in --coordinator mode. I started the ingest coordinator after a 1,000 or records had been processed, and it picked up right away and things appeared to be working well. I checked the progress this morning and here is the stats output:

Record processing stats
============================

* Scheduled processing date: 2022-10-17
  Record Type: biblio; Action: update
    Pending: 147525
         ... 0 with a named queue
         ... 147525 without a named queue
  Processed: 4102
 Overridden: 0
   Complete: 4102
     Failed: 0
      Total: 151627

--------------------------------------------------

  Record Type: biblio; Action: update
  Totals
  ------
    Pending: 147525
         ... 0 with a named queue
         ... 147525 without a named queue
  Processed: 4102
 Overridden: 0
   Complete: 4102
     Failed: 0
      Total: 151627

--------------------------------------------------

I have attached the log from /tmp/queued-ingest-coordinator-LOG

Revision history for this message
Jason Stephenson (jstephenson) wrote (last edit ):

After restarting the Ingest Coordinator, I accidentally ran ingest_ctl with --status as an option. While the program reported that --status is an invalid option, it started printing errors on the console about updating some records. I had to interrupt the process with Ctrl-c.

Here's the warning message for the first queued record. The others are all the same, just the number changes:

WARNING: Ingest action of update on biblio.record_entry 1156086 for queue entry 4211 failed

If an invalid CLI option is supplied the program should probably exit immediately rather than go on with some default behavior.

Revision history for this message
Mike Rylander (mrylander) wrote :

I've pushed a small commit to the current branch that stops ingest_ctl when it gets an option it doesn't understand.

Jason, I don't see the log you mentioned, but did that seem to point you in any useful direction?

Thanks!

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Here's the missing log.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

While loading a batch of 314 new electronic resource records as a test this moring, 8 of them got a fail_time set. After consulting with Mike Rylander in IRC, I did the following:

1. Stopped the ingest_ctl coordinator.
2. Dumped the 8 rows from action.ingest_queue_entry with a fail_time to a file.
3. Updated the rows to set the fail_time to null.
4. Ran the action.process_ingest_queue_enty function on each row.

I was hoping to see a database error that would have caused each record to fail. Instead, they all processesed without failure this time.

When I check the PostgreSQL logs, I see a warning message about each of the records failing the first time, but there's no corresponding error in the logs that I can find.

Two take-aways:

1. We have mysterious failures on our hands that could be related to my environment or some other factors.

2. It would be useful to add a column to the action.ingest_queue_entry table to capture the database error (if any) when a record fails to ingest. If that could also be exposed with a --failures or --errors option to ingest_ctl, that would be a nice bonus.

I will keep looking at this branch.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I got the same myserious failure with a batch of 28,860 records this afternoon, that I got from Monday's test. ingest_ctl stopped running with the same database error.

This batch indicated some record issues duing the load, so I ran the 3 bad records manually, to see if that produced any useful output, but no. They ingested just fine.

I think someone else should test this with large-ish batch loads and updates to make sure this isn't just a problem with my test environment.

Galen Charlton (gmc)
Changed in evergreen:
assignee: nobody → Galen Charlton (gmc)
Revision history for this message
Jason Stephenson (jstephenson) wrote :

ingest_ctl died while running in coordinator mode again today while another big batch of updates was running:

DBD::Pg::db selectall_arrayref failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request. at /openils/bin/ingest_ctl line 828.
Can't use an undefined value as an ARRAY reference at /openils/bin/ingest_ctl line 708.

I have applied all of Mike's latest changes.

Revision history for this message
Mike Rylander (mrylander) wrote :

Thanks, Jason. I still can't replicate this issue, but I've pushed a small change that should keep the coordinator from falling over at line 708. Instead, it will end up refreshing the db again, and waiting at most one more 10s cycle to process a batch.

Still using the same branch at:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1979071-queued-ingest-rebase-with-stop-option

Revision history for this message
Mike Rylander (mrylander) wrote :

I've squashed the fix-up commits the previous branch and rebased on top of master, which now includes the early symspell deadlock fix branch. I've also updated the baseline schema. This new branch includes everything in the previously linked branch, along with one fix for a cut-and-paste-o error related to restricting record attribute reingest to a subset via the QI infrastructure.

This should now be installable from a checkout without running the upgrade scripts. Or, you can upgrade an existing master database that includes 1340 and 1341 with the provisional upgrade SQL.

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1979071-queued-ingest-rebase-and-baseline-2022-10-26

Changed in evergreen:
milestone: 3.10-beta → 3.next
Andrea Neiman (aneiman)
Changed in evergreen:
milestone: 3.next → 3.11-beta
tags: added: didyoumean
tags: removed: didyoumean
Galen Charlton (gmc)
Changed in evergreen:
assignee: Galen Charlton (gmc) → nobody
Revision history for this message
Bill Erickson (berick) wrote :

Fix to resolve this error:

---
psql:assets_concerto.sql:58: WARNING: \nPL/pgSQL function action.enqueue_ingest_entry(bigint,text,timestamp with time zone,integer,text,text) line 11 at SQL statement
SQL statement "SELECT action.enqueue_ingest_entry( NEW.id, TG_TABLE_SCHEMA, NOW(), ingest_queue, new_action, old_state_data)"
PL/pgSQL function indexing_ingest_or_delete() line 72 at SQL statement
psql:assets_concerto.sql:58: WARNING: Enqueuing of biblio.record_entry 10 for ingest failed, attempting direct ingest
psql:assets_concerto.sql:58: WARNING: It could refer to either a PL/pgSQL variable or a table column.\nPL/pgSQL function metabib.indexing_delete(biblio.record_entry,text) line 19 at SQL statement
SQL statement "SELECT metabib.indexing_delete(NEW.*, old_state_data)"
PL/pgSQL function indexing_ingest_or_delete() line 83 at SQL statement
psql:assets_concerto.sql:58: WARNING: Ingest of biblio.record_entry 10 failed
---

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp1979071-queued-ingest-rebase-and-baseline-2022-10-26-wee-fix

Revision history for this message
Bill Erickson (berick) wrote :

Just a note I have run the basic functionality on the working branch and offer my sign-off for merging: Bill Erickson <email address hidden>

Galen Charlton (gmc)
Changed in evergreen:
assignee: nobody → Galen Charlton (gmc)
assignee: Galen Charlton (gmc) → nobody
Galen Charlton (gmc)
Changed in evergreen:
assignee: nobody → Galen Charlton (gmc)
Revision history for this message
Galen Charlton (gmc) wrote :

I've also run it through its paces and have merged it for inclusion to 3.11 with the following follow-ups:

* do various updates to the schema and DB updates to account for code drift and resolve some glitches
* ensure that ingest_ctl is installed in the bin dir by default
* add release notes

Thanks, Mike and Bill (and Jane and Jason)!

Changed in evergreen:
status: Confirmed → Fix Committed
assignee: Galen Charlton (gmc) → nobody
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.