Loading records with located URIs should not delete and recreate call_numbers

Bug #1482757 reported by Michele Morgan
Affects          Status         Importance   Assigned to
Evergreen        Fix Released   Low          Unassigned
Evergreen 3.7    Fix Released   Low          Unassigned

Bug Description

When loading MARC records with 856 fields, matching records with 856 fields may already exist in the database. The resulting record includes all of the previously existing, updated, and newly added 856 fields.

In creating the located URIs, the function

biblio.extract_located_uris("bib_id" bigint, "marcxml" text, "editor_id" integer)

first deletes any URI call numbers already associated with the record id, then recreates new rows based on the 856 links in the resulting MARC record.
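
In outline, the current behavior looks like the following minimal Python sketch. The real logic lives in PL/pgSQL, and the row layout here is illustrative, not the actual asset.call_number schema:

```python
# Sketch of the delete-and-recreate behavior described above.
# The "table" is modeled as a list of dicts; field names are
# stand-ins, not the real Evergreen columns.

def extract_located_uris_sketch(call_numbers, record_id, uris_from_856):
    """Delete every URI call number for record_id, then recreate one
    row per 856-derived URI, even for URIs that did not change."""
    # Step 1: drop all existing URI call numbers for this record.
    kept = [cn for cn in call_numbers if cn["record"] != record_id]
    # Step 2: recreate a fresh row for every 856 link, unchanged or not.
    for uri in uris_from_856:
        kept.append({"record": record_id, "label": "##URI##", "uri": uri})
    return kept
```

Note that even a URI that is identical before and after the load still gets a brand-new row, which is the source of the churn described below.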

This deletion and recreation of call numbers is problematic in a few ways:

- It causes inflation of the asset.call_number and auditor.asset_call_number_history tables

- It makes it impossible to compile monthly statistics based on the number of URI entries added to and deleted from the asset.call_number table.

As a use case, we continuously load files of records for collections of electronic resources purchased by our libraries. Some of these files contain very large numbers of records.

Each record in the file may already exist in our database with multiple 856 fields, one for each library that already owns the resource.

Say five of our libraries already own the collection. When the file is loaded for the new library, five existing URI call numbers are deleted and six new URI call numbers are created for each record. So loading a collection of 100,000 records would cause the deletion of 500,000 call number records and the addition of 600,000.

Instead of always deleting and readding call numbers, biblio.extract_located_uris() should only delete URI call numbers when there is no associated 856 in the MARC record, and only update existing URI call numbers when a change is required based on the associated 856 field. Other existing URI call numbers should not be affected.
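
The proposed behavior amounts to a set difference between what is on the record and what is in the incoming MARC. A hedged Python sketch (bare URI strings stand in for the full call number rows; in reality the key would also involve the owning library):

```python
def reconcile_uris(existing, incoming):
    """Diff the URIs already on a record against those in the incoming
    856 fields: delete only what disappeared, add only what is new,
    and leave everything else untouched."""
    to_add = incoming - existing        # new 856 links
    to_delete = existing - incoming     # 856 links that went away
    untouched = existing & incoming     # unchanged links, left alone
    return to_add, to_delete, untouched
```

In the use case above, existing would hold the five owning libraries' URIs and incoming would hold those five plus the new library's, yielding one addition and zero deletions per record.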

In the use case above, this change to the function would produce the more appropriate result of zero call numbers deleted and 100,000 added. The call number create date would then represent the date the resource was procured by the library, and could be used for statistical purposes.

Revision history for this message
Blake GH (bmagic) wrote :

Confirmed in 2.8.2

Changed in evergreen:
status: New → Confirmed
Michele Morgan (mmorgan)
Changed in evergreen:
assignee: nobody → Michele Morgan (mmorgan)
Revision history for this message
Kathy Lussier (klussier) wrote :

Adding a link to an open-ils-dev email with a suggested approach for this bug:

http://markmail.org/message/n6tqndljslrrlgnz

Revision history for this message
Michele Morgan (mmorgan) wrote :

I added a text file with the diff from the email thread Kathy links to above for clarity.

While this suggested approach does address the issue of deleting and re-adding all entries in the call_number table for each record, it doesn't handle the URI mappings properly when more than one 856 link for a given library is involved.

This can happen if the same title exists in two different purchased electronic collections, which is fairly common.

If a library already owns a title from one electronic collection, and a new record is loaded from another electronic collection, the new URI link will not be added.

Revision history for this message
Michele Morgan (mmorgan) wrote :

I used Mike's suggested fix for the call number issue so that call numbers are no longer deleted and re-added when a record is saved. I also reworked the biblio.extract_located_uris function to better handle the uris.

Previously, entries were never removed from asset.uri, only asset.call_number and asset.uri_call_number_map entries were deleted. This left unused entries in an ever-growing asset.uri table. Additionally, URIs extracted from the 856 fields were compared via text string comparison to existing entries in asset.uri in order to map them.

The reworked function preserves call numbers, and removes the mapped uris as well as the mappings before re-adding new uris and new mapping entries.

The hope with these additional changes was to speed up loading of records with electronic resources. Initial tests (not in a production environment) have not shown much of an improvement, but additional testing is most welcome.

The changes to this function presume that two call numbers are never mapped to the same uri, so it may generate errors with existing data. Also, for existing data, unmapped entries in asset.uri should be deleted.

Here's a link to the branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/mmorgan/LP1482757_reworked_extract_located_uri

tags: added: pullrequest
Kathy Lussier (klussier)
Changed in evergreen:
milestone: none → 2.next
Michele Morgan (mmorgan)
Changed in evergreen:
assignee: Michele Morgan (mmorgan) → nobody
Revision history for this message
Mike Rylander (mrylander) wrote :

Michele,

Unfortunately, one of the properties of asset.uri rows is that they should be usable by multiple records. I believe your goal with the change from my suggestion was to avoid "orphaned" URIs, which is good, I think. To that end, how about this:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp1482757-cleaner-uri-extraction

Thanks!

Revision history for this message
Michele Morgan (mmorgan) wrote :

Mike,

Thanks for looking at this!

Yes, my goals did evolve from the original call number issue. The rewrite was intended to address these concerns:

- eliminate call number churn
- avoid orphaned uris
- avoid string comparisons in hopes of speed improvements
- properly handle multiple uris per call number on an ongoing basis

Since there's no uniqueness constraint on asset.uri, it seemed cleaner to avoid using string comparisons to match to existing uris and just remove the old mappings and uris and regenerate them. I guess I'm not clear on the importance of allowing multiple call numbers to map to the same uri.

I took a look at your latest branch at:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp1482757-cleaner-uri-extraction

and while it does solve the problem of orphaned uris as well as the call number issue, it doesn't appear to handle multiple uris per call number properly when a second 856 for a library is removed.

Here's a use case we see often:

A single title is available through multiple electronic collections. Here's an example:

http://evergreen.noblenet.org/eg/opac/record/3736349

Should the above record change over time, here's the list of links, showing two for Phillips Academy:

Access for Bunker Hill Community College via ebrary
Access for Endicott College via EBSCOhost
Access for Gordon College via EBSCOhost
Access for Merrimack College via ebrary
Access for Middlesex Community College via EBSCOhost
Access for North Shore Community College via ebrary
Access for Northern Essex Community College via ebrary
Access for Phillips Academy via EBSCOhost
Access for Phillips Academy via ebrary
Access for Salem State University via EBSCOhost

The title exists in both EBSCOhost and ebrary collections. Phillips decides to discontinue the ebrary collection that contains this title so the corresponding 856 link will be removed from the MARC record. Since a call number is not being added or deleted, the function doesn't remove the ebrary uri.
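
Keying the diff on (owning library, URI) pairs rather than on call numbers alone would catch this case. A hedged sketch (the library code and URLs are made up for illustration):

```python
def pairs_to_remove(existing_pairs, incoming_pairs):
    """A (owning_lib, uri) pair can disappear from the incoming 856
    data even when that library's call number on the record survives,
    so diff the pairs, not the call numbers."""
    return existing_pairs - incoming_pairs

# Phillips currently has two links on the record...
existing = {
    ("PHILLIPS", "https://ebscohost.example/title"),
    ("PHILLIPS", "https://ebrary.example/title"),
}
# ...but the incoming MARC carries only the EBSCOhost link.
incoming = {("PHILLIPS", "https://ebscohost.example/title")}
removed = pairs_to_remove(existing, incoming)
# The ebrary link is flagged for removal even though Phillips still
# owns a call number on the record.
```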

Can you clarify the importance of multiple records using the same uri?

Thanks!

Galen Charlton (gmc)
Changed in evergreen:
assignee: nobody → Mike Rylander (mrylander)
Revision history for this message
Mike Rylander (mrylander) wrote :

Michele (and all interested parties),

I've updated the stored procedure to track mappings between URIs and call numbers, which allows us to remove disused URIs.

As for using URIs across records, it's about keeping URIs unique (think: linked data) rather than a record-oriented view of where they're used today. By keeping them canonical and authoritative we can (potentially) use them for other things in the future. Put another way (in relational terms), novel uses for data are simpler when you have a set (unique items) rather than a bag (potential duplicates). Does that help explain?
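
The set-versus-bag distinction can be illustrated with a small interning sketch (the function and table shape are hypothetical, not the Evergreen API):

```python
def intern_uri(uri_table, href):
    """Treat the URI store as a set: reuse the existing row when the
    href is already known, and create a new one only when it is new.
    uri_table maps an integer id to an href string."""
    for uid, existing in uri_table.items():
        if existing == href:
            return uid          # canonical row already exists
    uid = max(uri_table, default=0) + 1
    uri_table[uid] = href       # first sighting: create the row
    return uid
```

Two records interning the same href end up sharing one canonical id, which is what makes record-independent (linked-data-style) uses of the URI possible later.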

Thanks!

Changed in evergreen:
assignee: Mike Rylander (mrylander) → nobody
Revision history for this message
Kathy Lussier (klussier) wrote :
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I have batches of URI deletes where I can test this. I also have one batch that leaves an orphaned URI behind with the current code. This would make a decent test case.

In the end, we'll probably want some pgtap tests and a release note, but I'll give this a whirl for now and report back my thoughts/findings.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Maybe I don't understand what this branch is ultimately trying to do, but it seems to make things worse for me.

Let me explain what I am doing. We have records with multiple located URIs for different libraries and sometimes even different vendors. To make it easier to locate these records for specific vendors and libraries, I added a table of vendor information and created a view with asset.uri and friends to link vendors with bib records and call numbers (the latter to get the owning library). I have used this view to find and remove URIs from records when the library sends a MARC file of records to delete or when they completely drop a vendor.

Let me explain why this branch makes things "worse" from my point of view. I have recently removed several batches of URIs using the above view. This is done by locating the record using the view, pulling the MARC into Perl, locating and deleting the specific 856 tag(s) using MARC::Record, and then updating the biblio.record_entry with the new MARC in the database.

I have checked with two libraries where I removed all of their URLs for specific vendors. When I did this in production without this branch installed, for one library, there were no asset.uri entries left behind, and hence nothing in my view. For the other library, 21 asset.uri entries were left behind. I checked the MARC on these records and the 856 tags were gone. However, a couple of the other libraries that had URIs on these records, had 2 subfield 9s. This caused two asset.uri entries to be created. (So, it looks like the old code only removes 1 asset.uri entry when more than 1 exist.)

When I did this in a test database with the same data and these function changes applied, more asset.uris were left behind even though there were none in the MARC. In the case of the library that had 0 asset.uris left behind in production, there are 714 asset.uris orphaned in my test database. The library that had 21 left behind in production has 592 left behind in the test database. I have not checked all of these, but a spot check of 5 to 10 records from each library shows that the 856 tags were removed from the records.

It appears that the second commit "to remove all orphaned URI" is not doing what it says on the tin. I have not looked any deeper than this, and these scripts are taking so long to run on my test environment that I will likely not be able to test again until next week.

Revision history for this message
Mike Rylander (mrylander) wrote :

Thanks for testing, Jason. I obviously got something very wrong... I'll try to look at this again soon.

Revision history for this message
Dan Wells (dbw2) wrote :

I'd like to see this get into 3.1. Mike, are you still working on this?

I looked over the code, and it generally seems sound, and it isn't obvious what's causing Jason's issues. Here are two quick thoughts:

1) One possible "crack" I see for data to fall through is the dead map selector. The "AND cn.label = '##URI##' AND NOT cn.deleted" seems overly restrictive. Now, I realize we *shouldn't* have mappings to deleted or non-##URI## call numbers, but it isn't impossible, and I don't think there would be any harm in leaving those restrictions out. If a map isn't used, let's get rid of it regardless, I think.

2) (side note) orphaned_uri_list doesn't seem to be used any more, so it should be removed.

Any branch that reduces data churn is good in my book. Thanks, all!
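
Dan's first point, in a hedged Python sketch (row shapes are hypothetical stand-ins for the SQL): the broad version declares a mapping dead purely by usage, with no test of the call number's label or deleted flag:

```python
def select_dead_maps(maps, live_uri_ids):
    """A mapping is dead when its URI is no longer carried by the
    record's 856 fields. No cn.label or cn.deleted condition, so
    stray maps to deleted or non-##URI## call numbers are swept
    up as well instead of falling through the crack."""
    return [m for m in maps if m["uri"] not in live_uri_ids]
```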

Changed in evergreen:
milestone: 3.next → 3.1-beta
Changed in evergreen:
milestone: 3.1-beta → 3.1-rc
Dan Wells (dbw2)
Changed in evergreen:
importance: Undecided → Low
Changed in evergreen:
milestone: 3.1-rc → 3.1.1
Changed in evergreen:
milestone: 3.1.1 → 3.1.2
Changed in evergreen:
milestone: 3.1.2 → 3.1.3
Changed in evergreen:
milestone: 3.1.3 → 3.1.4
Changed in evergreen:
milestone: 3.1.4 → 3.1.5
Changed in evergreen:
milestone: 3.1.5 → 3.1.6
Kathy Lussier (klussier)
tags: added: needsrepatch
removed: pullrequest
Changed in evergreen:
milestone: 3.1.6 → 3.2.1
Changed in evergreen:
milestone: 3.2.1 → 3.2.2
Changed in evergreen:
milestone: 3.2.2 → 3.2.3
Changed in evergreen:
milestone: 3.2.3 → 3.3-beta1
status: Confirmed → New
Changed in evergreen:
milestone: 3.3-beta1 → 3.3-rc
Changed in evergreen:
milestone: 3.3-rc → 3.3.1
Changed in evergreen:
milestone: 3.3.1 → 3.3.2
Changed in evergreen:
milestone: 3.3.2 → 3.3.3
Changed in evergreen:
assignee: nobody → Rogan Hamby (rogan-hamby)
Changed in evergreen:
milestone: 3.3.3 → 3.3.4
Changed in evergreen:
milestone: 3.3.4 → 3.3.5
Changed in evergreen:
milestone: 3.3.5 → none
Revision history for this message
Michele Morgan (mmorgan) wrote :

We have done extensive testing of the reworked extract_located_uris function on a test system with production data and have been pleased with the results. We loaded several files of 50 and 1,000 records with located URIs. These consisted of new records and updated records which both added and removed URIs.

In all cases, rows in the asset.call_number and asset.uri_call_number_map tables were updated or created only as necessary.

Prior to testing, we deleted 1.2 million rows from asset.uri which were orphaned from record loading activity in the past. Our testing generated no orphaned URIs.

I've pushed a branch here:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/mmorgan/lp1482757_tweaks_and_signoff

This branch includes my signoff and two additional commits.

The first makes the minor tweaks Dan Wells suggested in comment #12.

The second adds a query to the upgrade script to delete any pre-existing orphaned URIs from the database.

Here's hoping this can move forward!

Adding back the pullrequest tag. Not sure if adding the signedoff tag is appropriate, but it has mine.

tags: added: pullrequest
removed: needsrepatch
Michele Morgan (mmorgan)
Changed in evergreen:
milestone: none → 3.5.0
status: New → Confirmed
Michele Morgan (mmorgan)
tags: added: signedoff
Changed in evergreen:
assignee: Rogan Hamby (rogan-hamby) → nobody
Revision history for this message
Michele Morgan (mmorgan) wrote :

We have been using this function in our production system and that has revealed a problem:

If a MARC record is updated such that all 856 links are removed, the URIs, mappings, and call numbers from the now-removed 856 fields do not get deleted. Consequently, the record cannot be deleted, which was the goal of removing the 856 fields in the first place.

The function needs to handle the case where all 856 fields are removed from the MARC record.
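
The missed case is the empty incoming set: when the last 856 goes away, the diff must still delete everything attached to the record. A minimal sketch of the required behavior:

```python
def reconcile(existing, incoming):
    """When incoming is empty, existing - incoming is all of existing,
    so every URI call number (and its mappings) must be deleted,
    which frees the record itself for deletion."""
    to_add = incoming - existing
    to_delete = existing - incoming
    return to_add, to_delete
```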

Removing the signedoff and pullrequest tags and adding needsrepatch for now

tags: added: needsrepatch
removed: pullrequest signedoff
Changed in evergreen:
assignee: nobody → Michele Morgan (mmorgan)
Changed in evergreen:
milestone: 3.5.0 → 3.5.1
Revision history for this message
Michele Morgan (mmorgan) wrote :

I've force pushed an updated branch to the same location:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/mmorgan/lp1482757_tweaks_and_signoff

This branch makes sure that all URIs and call numbers are deleted when all 856 fields are removed from a MARC record. It also provides for tracking the edit_date and editor in asset.call_number when a call number is deleted.

We are using this updated function in our production system.

tags: added: pullrequest
removed: needsrepatch
Changed in evergreen:
assignee: Michele Morgan (mmorgan) → nobody
Changed in evergreen:
milestone: 3.5.1 → 3.5.2
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I'm testing this again with our production data at CW MARS. Part of my testing is to see if this makes any difference in the performance of my URI removal script on Pg 10, 11, and 12.

One thing that I have noticed is that the delete portion of the upgrade script does not take the foreign key link between asset.uri and serial.item into account. We don't use serials, so I'm asking whether this should be a concern or not.

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I am still testing this for any possible effect on performance of bib record updates.

I have pushed a collab branch with my signoff on Mike's and Michele's work so far, plus an extra commit on top to speed up the deletion of orphaned URIs in the upgrade script:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dyrcona/lp1482757

I may have more commits to add later if performance degrades.

I will not be done looking at this until next week.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I tested this branch with a large deletion of 51,655 URIs from 51,422 bib records over the past week. I ran the deletion twice on Pg 10 and twice on Pg 9.6. The first run was without this branch installed, and the second run with. Both databases use the default PostgreSQL configuration, except for the port.

For the purposes of this test, I added code using the Time::HiRes Perl module to the script that I normally run to delete located URIs by vendor and library in order to time the record updates. My script works by first finding the records that need to have URIs removed, extracting the MARC from the record, making a MARC::Record object, deleting the appropriate 856 tags, and then updating the biblio.record_entry with the modified MARC. The timing code is wrapped around the DBI call to do the database update, so it should be an accurate measure of the impact of this branch on the database.
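
In outline, the measurement wraps only the database call. A Python stand-in for the Perl Time::HiRes harness (execute_update is a hypothetical callable representing the DBI update):

```python
import time

def timed_update(execute_update, record_id, marcxml):
    """Time just the biblio.record_entry update, excluding the MARC
    manipulation done beforehand, so the figure reflects the cost of
    the stored procedure rather than the client-side work."""
    start = time.perf_counter()
    execute_update(record_id, marcxml)
    return time.perf_counter() - start
```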

Between test runs, I use pg_restore to load the same dump used in previous tests, and ran vacuumdb with the -f and -z options on the restored database.

It should be noted that this database server runs multiple instances of different versions of PostgreSQL. However, the other instances were not used during the test runs.

I got differing results from Pg 10 vs. Pg 9.6. (NOTE: I rounded the times below to the nearest hundredth of a second. Time::HiRes gives much finer-grained results.)

On Pg 10 WITHOUT the patch, the average update time for a bib record with located URIs was 1.85 seconds. WITH the patch on Pg 10, the average time to update a record increased to 2.25 seconds.

On Pg 9.6 the average time to update a record was 2.76 seconds WITHOUT the patch, and it decreased to 2.38 seconds WITH the patch.

In order to be scientifically useful, the tests should be run thousands of times in each configuration and the results averaged. Since it takes about 3 days to run a complete test including the database reload, etc., this is not feasible at this time.

Given what I've seen with this test, I have added my signoff along with an extra commit to speed up the deletion of orphaned asset.uri entries in the upgrade script:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dyrcona/lp1482757

I think more testing may be warranted for this branch, particularly since it seems to degrade performance on PG 10. Given the cursory nature of my testing, I cannot rule out some other factor having an effect on this number.

tags: added: signedoff
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
milestone: 3.5.2 → 3.6.1
tags: added: performance
Changed in evergreen:
milestone: 3.6.1 → 3.6.2
Changed in evergreen:
milestone: 3.6.2 → 3.6.3
Changed in evergreen:
milestone: 3.6.3 → 3.6.4
Changed in evergreen:
milestone: 3.6.4 → 3.7.2
no longer affects: evergreen/3.1
no longer affects: evergreen/3.2
no longer affects: evergreen/3.3
no longer affects: evergreen/3.4
no longer affects: evergreen/3.5
Changed in evergreen:
milestone: 3.7.2 → 3.7.3
Revision history for this message
Jessica Woolford (jwoolford) wrote :

I have tested this code and consent to sign off with my name Jessica Woolford and email <email address hidden>

Michele Morgan (mmorgan)
Changed in evergreen:
assignee: nobody → Michele Morgan (mmorgan)
Revision history for this message
Michele Morgan (mmorgan) wrote :

Pushed to rel_3_7, rel_3_8, and master. Thanks Mike, Jason, Jessica, and everyone else who put eyes on it!

Changed in evergreen:
milestone: 3.7.3 → 3.8.1
no longer affects: evergreen/3.6
Changed in evergreen:
assignee: Michele Morgan (mmorgan) → nobody
status: Confirmed → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released