Add clean_marc function to OpenILS::Utils::Normalize

Bug #888572 reported by Jason Stephenson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Wishlist
Unassigned

Bug Description

The branch below adds a clean_marc function to OpenILS::Utils::Normalize.

It also replaces redundant code in OpenILS::Application::Vandelay and OpenILS::Application::Acq::Order with calls to the new function.

I have imported 1,000 records with Vandelay since applying the branch, so it doesn't seem to cause any breakage there. I don't have a way to test Acquisitions, so someone might want to test that before merging into master.

For sharing O-U-N-clean_marc via the remote repo working:

You need only add the remote once, regardless of how many branches you wish to look at.
To add this repo as working:
 Read-only (git protocol):
  git remote add working git://git.evergreen-ils.org/working/Evergreen
 Read-write (ssh protocol):
  git remote add working <email address hidden>:working/Evergreen

Once you have the remote added you can check out this branch:
git checkout -b O-U-N-clean_marc working/user/dyrcona/O-U-N-clean_marc

Tags: pullrequest
Changed in evergreen:
importance: Undecided → Wishlist
description: updated
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I rebased and force-pushed the branch to resolve a conflict with Dan's code that adds the search_normalize function to the same file.

Revision history for this message
Dan Scott (denials) wrote :

I signed off on the commit but had to resolve the conflict again, and pushed it to user/dbs/LP888572_clean_marc_fix_corruption - along with another commit, below, to add some unit tests and resolve a potentially severe problem:

*IMPORTANT* In the course of adding some unit tests to ensure that we were generating what we expected, I found that clean_marc() (and very likely the functionality as it exists within Vandelay) introduces subtle but significant corruption to upper case diacritic characters. Check Open-ILS/src/perlmods/t/14-OpenILS-Utils.t to see an example; in the same commit that I added the test, I changed the order in which entityize() was called to resolve the corruption.

To run the Perl unit tests:
cd Open-ILS/src/perlmods
prove -l t

(or "prove -l t/<specific_file_of_tests>").

This may warrant backporting the same fix back to rel_2_1 and rel_2_0 if testing Vandelay directly bears out the same results.

Also note that I'm getting the warning: "UNIVERSAL->import is deprecated and will be removed in a future perl" (for the use of "isa" as a function). This was introduced in OpenILS/Utils/Normalize by your changes, but it's a problem throughout our code base and deserves a bug of its own to have someone address it by the use of Scalar::Util instead.

Revision history for this message
Dan Scott (denials) wrote :

Ah, one more commit with some happiness on the corruption front - it turned out that as the input from the test case did not have decode_utf8() invoked on it, we ran into the misery that we saw.

 So, my extra commit does the precautionary decode_utf8() call on the input because that's the sane thing to do, and ensures that the regexes know that they're dealing with a Unicode string instead of some random binary string and can behave accordingly. I've restored the order of the entityize() call in this commit as well.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Rebased, cherry-picked, signed off and pushed:

working/user/Dyrcona/lp888572

Revision history for this message
Dan Scott (denials) wrote :

Committed - thanks!

Changed in evergreen:
status: New → Fix Committed
milestone: none → 2.2.0alpha2
Changed in evergreen:
milestone: 2.2.0alpha2 → 2.2.0alpha3
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.