BibMagic - Enhanced MARC importer script bib_magic_importer.pl

Bug #1947898 reported by Blake GH
44
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Evergreen
Confirmed
Wishlist
Unassigned

Bug Description

I would like to submit my work to the Evergreen repository, making it easier to share. We've created a Swiss Army Knife Perl script which imports MARC records from several supported sources:

Local Folder
FTP (recursive or not)
CloudLibrary
MARCive https

Files are only downloaded when the filename matches the configured filename fragment. They are designated as "adds" or "deletes" or "authority control" also by configured filename fragment.

More than reaching out and getting these MARC records, this tool can be configured to edit the MARC records as they are imported. It performs work automatically that otherwise would be manual:

- Appending $9 to 856's (when configured)

- Editing/removing/replacing any MARC field(s) (when configured)

- Processing imported Authority Control records

- Setting up and continuing to import into a configured "Bib Source"

- Matching incoming records onto the same record that was previous imported by this tool. This is accomplished with a special $7 tag on the 856s

- Deletes will only remove a related 856, but will also remove the whole bib when zero 856's remain.

- Allows for matching records outside of the previously-imported-records-by-this-tool (usually configured for the first run)

- Sends an email when it begins and an email with the result summary upon completion

- The results are also logged to a CSV file

- The results are also in its database schema

- Another script to "sync" the $9s to all of the already-imported bibs if and when the participating libraries change.

This tool requires it's own schema in the database. The script creates the schema when it's used for the first time, but I'll be refactoring it to integrate that component into the Evergreen/OpenSRF/src/sql/Pg folder. Making it easier to upgrade from one version to another when changes to the schema might be required.

Branch incoming.

Blake GH (bmagic)
tags: added: supportscripts
Revision history for this message
Blake GH (bmagic) wrote :

Stubbing the files in the working repo. Merged all of the external Perl Module functionality into the single script.

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/blake/enhanced_marc_importer_lp1947898

Changed in evergreen:
importance: Undecided → Wishlist
Michele Morgan (mmorgan)
tags: added: pullrequest
Revision history for this message
Blake GH (bmagic) wrote :

Alright then! I've got the code to a place where I feel like it's time to invite others to review. I've confirmed that Evergreen can build and run with this patch. The only danger is fm_IDL.xml which seems to be sound. I've ran some reports in the interface against the new schema.

Changed in evergreen:
status: New → Confirmed
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Blake GH (bmagic) wrote :

Updated this with a small patch for the 901c match not matching when running with argument --match_901c.

Revision history for this message
Blake GH (bmagic) wrote :

Added a bug fix for deep matching on 001 when the incoming 001 is too small (less than 6 characters) or non-existent. Also fixed a bug for marc editing for non-matching fields.

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote (last edit ):

These are my "first impression" comments from having looked at the code and the documentation, but not having run it, yet. These are intended as constructive/actionable suggestions to make it fit better with the existing Evergreen code base and way of doing things. These are my opinions, so take them for what they're worth. Nothing here is meant to be negative.

This code adds the following dependencies to Evergreen:

* DateTime::Format::Duration
* Digest::SHA1
* Email::Sender::Simple
* pQuery

Of the above, the functionality of Digest::SHA1 can be replaced by the sha1 functions of Digest::SHA, which this new code also uses. I suggest refactoring the code to use Digest::SHA and dropping the Digest::SHA1 dependency.

Evergreen currently uses Email::Send to deliver email. Since Email::Send is "deprecated" by the maintainer (see: https://metacpan.org/pod/Email::Send#WAIT!-ACHTUNG!), we could do well to replace Email::Send in other code with Email::Sender::Simple. That's a totally different bug, of course.

Whatever modules stay, they should be added to the prerequisites installation.

libdatetime-format-duration-perl is a deb on all of the Debian and Ubuntu releases currently supported by the Evergreen community. The others do not have packages and need to be installed via CPAN.

Why is the documentation under the development category? Shouldn't it be under cataloging or possibly admin?

Change git://git.esilibrary.com/migration-tools.git to https://github.com/EquinoxOpenLibraryInitiative/migration-tools.git. This is the new home of the tools.

The crontab examples should be changed to remove the sourcing of .bashrc and to not cd to /openils/bin. We should encourage people to set up proper crontab files with the environment set appropriately. This means that we should also fix our other example crontab files as well. (That's another bug that I've been meaning to open for some time.)

The instructions should tell the user how to install the program to /openils/bin, or it should be installed as part of make install.--I favor installing it automatically.

I also think that the script and configuration file should be moved out of the Open-ILS/src/support-scripts/bib_magic_importer subdirectory. The script should go in the parent directory (Open-ILS/src/support-scripts) and the configuration example should be moved to Open-ILS/examples. The example configuration should also be installed to SYSCONFDIR as part of the make install process.

To fit better with MARC/LoC nomenclature, change "marc_edit_standard" to "marc_edit_data" as the proper name for these is "Data Field." There is no "Standard Field" in the MARC documentation.

I'll follow up in a few days once I've had a chance to try the code out.

Revision history for this message
Jason Stephenson (jstephenson) wrote (last edit ):

The module below should only be used when needed, i.e. if the code that uses it is invoked.

Can't locate REST/Client.pm in @INC (you may need to install the REST::Client module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.34.0 /usr/local/share/perl/5.34.0 /usr/lib/x86_64-linux-gnu/perl5/5.34 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl-base /usr/lib/x86_64-linux-gnu/perl/5.34 /usr/share/perl/5.34 /usr/local/lib/site_perl) at /openils/bin/bib_magic_importer line 42.
BEGIN failed--compilation aborted at /openils/bin/bib_magic_importer line 42.

I missed it in the documentation, but I suppose it needs to be added to the prerequisite installation as well.

Revision history for this message
Jason Stephenson (jstephenson) wrote :
Download full text (3.3 KiB)

Follow up comments after trying to run 3 batches of records through it:

Configuration example suggestions:

* The example configuration file should be renamed to bib_magic_importer.conf.example.

* Delete the comment about installing the prerequisite modules.

* Provide default directories that match sample Evergreen layout: /openils/var/log/, etc.

* Boolean options should accept Boolean values: 0, 1, true, false, yes, no. They shouldn't need to be commented out to be set to false.

* Can we dispense with the need for the _{N} on the field edits?

The documentation should warn the user that a recursive descent is done through the incomingmarcfolder, so any subdirectories are scanned as well. This was not obvious to me, and I only found during my first test run.

It should not try to send email if no email addresses are configured. I commented out the erroremaillist, successemaillist, and alwaysemail options and got the following error at the end of the run:

Use of uninitialized value for string at /usr/share/perl5/Email/MIME/Enc[7/1838]
ine 67.
no recipients

Trace begun at /usr/local/share/perl/5.34.0/Email/Sender/Simple.pm line 116
Email::Sender::Simple::send_email('Email::Sender::Simple', 'Email::Abstract=ARRA
Y(0x565547895338)', 'HASH(0x565547ad1f48)') called at /usr/local/share/perl/5.34
.0/Email/Sender/Role/CommonSending.pm line 45

Email::Sender::Role::CommonSending::try {...} at /usr/share/perl5/Try/Tiny.pm line 102 eval {...} at /usr/share/perl5/Try/Tiny.pm line 93 Try::Tiny::try('CODE(0x56554788f7a8)', 'Try::Tiny::Catch=REF(0x565547896f80)') called at /usr/local/share/perl/5.34.0/Email/Sender/Role/CommonSending.pm line 58

Email::Sender::Role::CommonSending::send('Email::Sender::Simple', 'Email::MIME=HASH(0x565547af3330)') called at /usr/share/perl5/Sub/Exporter/Util.pm line 57

Sub::Exporter::Util::__ANON__('Email::MIME=HASH(0x565547af3330)') called at /ope nils/bin/bib_magic_importer line 4183

main::email_send('HASH(0x5655478bd150)', 'Evergreen Electronic Import Summary - custom_electronic Import Report Job # 3 WINDING UP', 'Hello,^J^JJust letting you know that I have begun processing the provided files:^JKanopy_MARC_Records__add itions__springfieldlibrary.mrc^M^J^JThis software is configured to perform deep search matches against the database. This is slow but thorough.^JDepending on th e number of records, it could be days before you receive the finished message. F YI.^JI\'ll send a follow-up email when I\'m done.^J^JYours Truly,^JThe friendly server^J') called at /openils/bin/bib_magic_importer line 1103 main::sendWelcomeMessage('ARRAY(0x5655472a5c70)') called at /openils/bin/bib_mag ic_importer line 252

I am testing this on a test environment that is not set up to send email.

I think the above may have blown up my attempts to process records because nothing seemed to happen after that and no records were loaded.

Looking at the data in bib_magic.import_status, t...

Read more...

tags: added: needswork
removed: pullrequest
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Another suggestion regarding the configuration file: Have you considering using YAML or XML or some other format? There are Perl modules for parsing that which would eliminate the need for the parsing code in the program itself. Other parts of Evergreen/OpenSRF use XML, but YAML or JSON are more compact alternatives.

Revision history for this message
Blake GH (bmagic) wrote :

Committed a change to the matching. Introducing 250a, 245n, 245p match points to prevent the matching/merging for bibs that are different on these data points.

Jason - This is great feedback! Sorry for the delayed response. Thank you for reviewing the code. I agree with everything. I'll get back to this as soon as I can!

Blake GH (bmagic)
summary: - Enhanced MARC importer script electronic_marc_import.pl
+ BibMagic - Enhanced MARC importer script bib_magic_importer.pl
Revision history for this message
Blake GH (bmagic) wrote :

Jason,

I've done the following:

1. Moved the perl script up a directory.
2. Moved the example config to the examples folder.
3. Referenced both files in the Makefile.am so that they are copied into the Evergreen installation folders.
4. Moved the documentation to the admin "module" and moved the navigation references to point to it.
5. Removed the perl module requirements in the config file and documentation.
6. Setup Makefile.[debian-bullseye, debian-buster, fedora, ubuntu-focal, ubuntu-jammy] to install the required perl modules, either from debian repository or CPAN. In Fedora's case, I found the Fedora-style perl package and referenced that. I realize that Fedora is basically not supported but the Evergreen reference file still exists.
7. Removed Digest::SHA1 in favor of plain old Digest::SHA, and updated the one line in the perl code that used it. The other perl code that interacts with that perl module are still compatible with Digest::SHA.
8. Changed the MARC edit references to marc_edit_standard to marc_edit_data.
9 Updated the crontab examples, dropping the .bashrc and change directory commands
10. All boolean type config switches now require "yes", "y", 1, "true" for the switch to be enabled. Anything else (including non-existent/commented out) will evaluate false.
11. An email will not be attempted unless there is at least one recipient.
12. Better documentation about source type: folder. Including note about the sub folders being traversed recursively.

You made reference to changes to the 856 that were not made. Did the record show your changes when viewed in the Evergreen opac/staff client. As opposed to looking at bib_magic.import_status? What configuration line did you have setup for the marc manipulation?

I tested it just now and it did what I told it to do.

I can confirm that when the email piece bombs, the rest of the import doesn't complete. The latest patch will skip the email attempt if the config doesn't have any recipients.

-Blake-

Blake GH (bmagic)
tags: added: pullrequest
removed: needswork
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Thanks for the updates, Blake. I'll have another look.

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Mike Rylander (mrylander) wrote :

Apologies in advance for any offense, but ...

Having looking closely at the code, I strongly believe this should be maintained in its own public repository and not pushed into the official Evergreen code base. Definitely not in its current state, and for many of the same reasons that the various migration tools are not merged with Evergreen (the need to be more agile than the Evergreen release cadence; significant dependence on various non-Evergreen code; coding and tooling style mismatches; assumptions about documentation and audience) perhaps not in the future.

Here are some (but not all) of the strong concerns I have about pushing this into the main Evergreen repo:

 * The basic Perl syntax used is confusing, and arguably broken. For instance, this works by accident and is dangerous:
     - if(@subfields[$subs] eq @shortnames[$t])
 * There are external dependencies that provide core functionality claimed by the script (authority records are loaded using other unofficial tools)
 * The script injects arbitrary, persistent schema elements into the database without using the Evergreen standard mechanisms
 * It ignores all existing Evergreen configuration, and introduces new, duplicate configuration
 * The coding style is completely different from the rest of the Evergreen code base, making it nearly unmaintainable except by the original author
 * The documentation is still out of line with the code, and does not explain what is actually happening based on admin input/configuration
 * It does not make use of transactions to guard against partial changes to the database
 * It violates the LoC MARC standard for 856$7: https://www.loc.gov/marc/bibliographic/bd856.html

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I'm going to defer to Mike on this one.

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Revision history for this message
Blake GH (bmagic) wrote :

Mike,

Well said. I pretty much agree. When I started converting the code to "fit", I was trying to keep it isolated from Evergreen as much as possible. But slowly, I added things like filling out the fm_IDL/(there is a XXXX upgrade script in there, so the perl could drop that portion), adding the perl modules into the Makefile, etc. And now we have sort of a half-integrated tool with all of the outstanding pieces that you've highlighted. It needs to go "the rest" of the way.

I can do these things:

* Matching the Evergreen perl style guide.
* Refactor the accidental-working and dangerous perl.
* DB transactions.
* Better documentation.
* DB connection config via ENV or referencing opensrf.xml.

But this one:

Authority load using a different program.

I'm not sure how to address.

Do you have a suggestion on addressing the authority record load portion? Rather than calling another* non-Evergreen tool? I suppose the alternative is to integrate the pieces we need into this program? Or maybe merge that program into support-scripts?

The use of the $7 is something I hadn't revisited since 2016 or so. Evergreen makes use of a non-LoC subfield ($9), so maybe this can use $1 instead of an already-claimed $7 subfield?

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.