marc_export - performance issues with larger numbers of copies per bib

Bug #2041364 reported by Josh Stompro
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Medium
Unassigned
3.10
Won't Fix
Medium
Unassigned
3.11
Fix Released
Medium
Unassigned

Bug Description

As reported by Jason Stephenson, marc_export was taking a long time to export data in the case of including items (--items) and with bibs that had a large number of copies. (Hundreds to thousands of copies per bib).

Testing on EG 3.11.1
libmarc-record-perl 2.0.7-1

On a test system, I identified our top 100 or so bibs with the most copies and duplicated the copies on each bib 4 times, so went from 8K to 120K copies so I could trigger the issue. The bibs with the most number of copies had 16K, 6K, 4K and 3K copies each.

I used libdevel-profile-perl to profile each run to see what was being called the most.

The problem is that the code that adds each 852 tag uses $marc->insert_grouped_field, which searches through all the marc tags every time it is called to find the end of the marc tag grouping. Then it uses insert_fields_before, which searches through all the tags again before performing the insert. So for each newly inserted 852 field, the amount of work increases by 2 comparisons.

https://metacpan.org/release/GMCHARLT/MARC-Record-2.0.7/source/lib/MARC/Record.pm#L399

By re-working how the 852s get added, by using insert_grouped_field for the first item's insert, and then one call to insert_fields_after for the rest of the fields, the runtime went down 78% for my test setup.

I'm not sure how much the memory usage has gone up yet, I'll check that next.

The xml output was identical also, so I think I haven't changed the ordering at all.

I ran this with just the list of bibs with the lots of copies with the --pipe option.

Example command line runs with profiling.

## EG 3.11 version
$ cat ids-most-items.txt | perl -d:Profile /openils/bin/marc_export --pipe --items --format xml > test3.xml

time elapsed (wall): 820.5668
time running program: 609.7940 (74.31%)
time profiling (est.): 210.7729 (25.69%)
number of calls: 286352387
number of exceptions: 1

%Time Sec. #calls sec/call F name
46.07 280.9194 122512 0.002293 MARC::Record::insert_grouped_field
23.84 145.3557 234275686 0.000001 MARC::Field::tag
 7.46 45.5201 122512 0.000372 MARC::Record::insert_fields_before
 7.12 43.4227 105 0.413549 Marque::Biblio::acps_for_bre
 6.32 38.5100 38895909 0.000001 <anon>:/openils/bin/marc_export:643
 1.78 10.8453 123770 0.000088 Fieldmapper::properties
 1.25 7.6325 122722 0.000062 MARC::Record::fields
 1.22 7.4176 106 0.069977 Marque::Biblio::next

$ cat ids-most-items.txt | perl -d:Profile ./marc_export_testing2 --pipe --items --format xml > test-testing-fancyappend.xml

time elapsed (wall): 165.3634
time running program: 129.9096 (78.56%)
time profiling (est.): 35.4539 (21.44%)
number of calls: 51348206
number of exceptions: 1

%Time Sec. #calls sec/call F name
32.55 42.2917 105 0.402778 Marque::Biblio::acps_for_bre
31.70 41.1810 38895909 0.000001 <anon>:./marc_export_testing2:643
 8.66 11.2528 123770 0.000091 Fieldmapper::properties
 5.01 6.5075 106 0.061391 Marque::Biblio::next
 3.39 4.4023 105 0.041927 MARC::File::XML::record
 2.14 2.7745 2427108 0.000001 MARC::File::XML::escape
 2.09 2.7136 123770 0.000022 Fieldmapper::from_bare_hash
 2.02 2.6192 2089846 0.000001 <anon>:./marc_export_testing2:658

I'll put my full testing notes at
https://gitlab.com/LARL/evergreen-larl/-/snippets/3615366

And a working branch will be on the way.
Josh

Revision history for this message
Josh Stompro (u-launchpad-stompro-org) wrote :

The memory usage increase seems slight, less than 1%.

(kbytes): 227200
to
(kbytes): 229004

= 0.794014% increase

tags: added: cat-importexport
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Josh,

Thanks for sharing all of that information and the snippets in particular.

For the sake of posterity, here's a patch of some changes I made to marc_export to log the time spent in each subroutine. I planned to use the output to see where all of the time was being spent and the do more granular logging at that point.

I wonder if this merits a bug on MARC::Record?

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I deleted the previous patch to replace it with a more complete one. The previous patch lacked a change I made that was not committed to the local git repository to format the microseconds in the timestamps.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Here's a patch based on Josh Stompro's bug description and some discussion in IRC. I dumped records for 1 of our academic libraries, and it made no difference in performance. I am trying it with all of our records, an export that would take over 5 days without this patch. We'll see if that makes a difference. If it does, I'll start a collab branch. Josh has identified other areas that could possibly use some work.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

My testing so far shows that this patch has minimal impact on performance for my case. I am running marc_export without feeding it a list of database ids, so it is retrieving all of the records in my database and looping through them one by one. My monitoring of the CPU and RAM usage suggests that this is the main bottleneck in my case. While this probably warrants another bug, I am going to piggy back on this one, since I do think the previous patch is an improvement.

I plan to add a --batch-size option to marc_export to break the main query up with LIMIT and OFFSET.

Currently, my test appears to be exporting about 20,000 records per hour. I have seen marc_export go faster when exporting 40,000 or so bibs with items for a single library.

I'll set up a collab branch with the above patch and one for --batch-size. If any other improvements turn up as a result of profiling, then they can be added there.

Revision history for this message
Josh Stompro (u-launchpad-stompro-org) wrote :

I updated the collab branch with a commit that switches a bunch of grep array lookups to hash lookups.

Also, setup the acp query to use a prepared statement.

collab/dyrcona/lp2041364-marc_export_improvements

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Thanks for sharing the collab branch, Josh. I meant to add it here, but forgot.

I'll take a look at your most recent changes.

The improvements have so far not had a major impact on the performance of my exports with the --items option.

Changed in evergreen:
milestone: none → 3.12-beta
assignee: Jason Stephenson (jstephenson) → nobody
tags: added: pullrequest
Revision history for this message
Jason Stephenson (jstephenson) wrote (last edit ):

I have pushed a branch with my signoffs added to Josh Stompro's commits. (I also cleaned up one of Josh's commits when signing it: removed extra whitespace and fixed the commit message.)

There are also two (unrelated) commits in the branch: 1 fixes an error messages with a bad --type parameter is used and the other cleans up two instances of excess white space at the ends of lines. I think it's acceptable to include these here.

The new branch that I'm making the pull request for is at working/collab/dyrcona/lp2041364-marc_export_improvements-partial-signoff

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dyrcona/lp2041364-marc_export_improvements-partial-signoff

The most recent changes to the marc_export improve export times for my data with the --items flag by a factor of 70 or more. A test export did in 45 minutes with this branch what took 60 hours with the stock marc_export.

Changed in evergreen:
assignee: nobody → Jane Sandberg (sandbergja)
Changed in evergreen:
status: Confirmed → Fix Committed
tags: added: signedoff
Revision history for this message
Jane Sandberg (sandbergja) wrote :

Thanks for this great work, Josh and Jason! I merged this to main as part of bug 2015484.

There are merge conflicts when attempting to cherry-pick to rel_3_11, mainly since 3.11 doesn't include the new --852b flag. I did my best to resolve them here: https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/sandbergja/lp2041364-marc_export_improvements-rel_3_11 . This branch also signs off on all 7 commits involved in this work.

From my casual testing, collab/sandbergja/lp2041364-marc_export_improvements-rel_3_11 produces the same output (and same warnings on special characters) as rel_3_11, and is a bit faster. But I am not feeling 100% sure that I handled the merge conflicts in the best way, and I would appreciate a double check before it gets pushed to rel_3_11.

Changed in evergreen:
assignee: Jane Sandberg (sandbergja) → nobody
Changed in evergreen:
status: Fix Committed → Fix Released
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Thanks, Jane! Your conflict resolution looks correct and works for me.

Pushed to rel_3_11 for inclusion in 3.11.3!

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.