marc_export: want to delete fields/subfields

Bug #1754455 reported by Dan Pearl
This bug affects 2 people
Affects: Evergreen
Status: Fix Released
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Wishlist

LOC MARC specification for Bibliographic Records describes subcode 0 thusly:

"Subfield $0 contains the system control number of the related authority or classification record, or a standard identifier such as an International Standard Name Identifier (ISNI). These identifiers may be in the form of text or a Uniform Resource Identifier (URI). If the identifier is text, the control number or identifier is preceded by the appropriate MARC Organization code (for a related authority record) or the Standard Identifier source code (for a standard identifier scheme), enclosed in parentheses. When the identifier is given in the form of a Web retrieval protocol, e.g., HTTP URI, no preceding parenthetical is used.

Subfield $0 may contain a URI that identifies a name or label for an entity. When dereferenced, the URI points to information describing that name. A URI that directly identifies the entity itself is contained in subfield $1.

See MARC Code List for Organizations for a listing of organization codes and Standard Identifier Source Codes for code systems for standard identifiers. Subfield $0 is repeatable for different control numbers or identifiers."

CWMARS -- and presumably other Evergreen users -- use subfield 0 in ways that (I presume) may impair data interchange. It would be useful if marc_export could optionally suppress the output of these subfields.

Dan Pearl (dpearl)
Changed in evergreen:
assignee: nobody → Dan Pearl (dpearl)
Mike Rylander (mrylander) wrote :

I would argue that we are following the standard. The format of our $0 is:

$0(FOO) 123456

where "FOO" is configurable to be a proper MARC Organization code, and "123456" is our system control number for the relevant authority record. Since we specifically check for a preceding parenthetical when inspecting the $0, staff (or outside record sources) can safely use URIs, which don't have the preceding parenthetical.

That said, we could make the parenthetical check more robust, and require the value in the record to match the configured value for the Evergreen instance.
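
As a rough illustration of the check described above (not Evergreen's actual code), the sketch below treats a $0 value as a local authority link only when it starts with a parenthetical, and optionally requires that parenthetical to match a configured organization code. The function names, the regular expression, and the sample values are all assumptions made for the example.

```python
import re

# Hypothetical pattern: "(ORGCODE)" followed by optional whitespace and a
# control number. URIs do not start with a parenthetical, so they never match.
CONTROL_NUMBER_RE = re.compile(r"^\(([^)]+)\)\s*(\S+)$")


def parse_subfield_0(value):
    """Return (org_code, control_number) for a parenthetical $0, else None."""
    match = CONTROL_NUMBER_RE.match(value.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def is_local_authority_link(value, configured_code):
    """Stricter check: the parenthetical must equal the configured code."""
    parsed = parse_subfield_0(value)
    return parsed is not None and parsed[0] == configured_code
```

With this stricter form, a $0 copied in from another installation, such as "(BAR) 123456" on an instance configured for "FOO", would no longer be mistaken for a local link.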

I also don't think it's a /bad/ thing to be able to strip $0s, but I imagine it might be good to consider this in the larger context of a single "strip what I ask you to strip" feature rather than adding a new option for each field we may strip at some point. Located URI data comes to mind here...

Thoughts?

Rogan Hamby (rogan-hamby) wrote : Re: [Bug 1754455] Re: marc_export should (optionally) remove 0 subfields

"That said, we could make the parenthetical check more robust, and
require the value in the record to match the configured value for the
Evergreen instance."

I think this is the more significant issue. I've had to help
libraries who copy cataloged from other Evergreen installations and had
issues with the imported authority links. In a perfect world an option to
strip them out of marc import would be nice too.


Elaine Hardy (ehardy) wrote : Re: marc_export should (optionally) remove 0 subfields

I think one should tread lightly removing all instances of this subfield from a bib record on import or export since it may be essential to future linked data. I think it would be better to address problems with specific $0s rather than strip all $0s, whether correct or not. However, having the option to strip them out or not would allow for local control. A perfect world would be a way to identify the problem subfield for resolution, while leaving the remaining ones in place.

Jason Stephenson (jstephenson) wrote :

We have been asked specifically if it is possible to remove the subfield 0 from all tags when doing exports. So, we started a branch that adds a command line option to marc_export that can be used to remove the subfield 0 from all tags. We want to share that with the community, since others might have a similar need.

I see now that a more generic feature would be more useful. After reading the above comments, what I think is called for is a command line option that can be used to specify:

1. any tag(s) to be removed completely during export but we thought others might find it useful but we thought others might find it useful
2. any subfield(s) to be removed from all fields during export
3. any subfield(s) to be removed from certain tag(s) during export
4. the ability to specify multiple combinations of the above
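
A minimal sketch of how the four requirements above could fit together, assuming a made-up "TAG_REGEX/SUBFIELD_REGEX" spec string and a simplified record model of (tag, [(code, value), ...]) tuples for data fields only. Neither the spec syntax nor the function names are taken from marc_export; they are placeholders for illustration.

```python
import re


def parse_spec(spec):
    """Split a hypothetical 'TAG_REGEX/SUBFIELD_REGEX' spec string.

    An empty tag part matches every tag; an empty subfield part means
    "remove the whole field".
    """
    tag_re, _, sub_re = spec.partition("/")
    return re.compile(tag_re or ".*"), (re.compile(sub_re) if sub_re else None)


def strip_record(fields, specs):
    """Return a new field list with the requested fields/subfields removed.

    fields: list of (tag, [(code, value), ...]) tuples (data fields only).
    specs:  list of spec strings, applied in order (requirement 4).
    """
    parsed = [parse_spec(s) for s in specs]
    kept = []
    for tag, subfields in fields:
        drop_field = False
        for tag_re, sub_re in parsed:
            if not tag_re.fullmatch(tag):
                continue
            if sub_re is None:
                drop_field = True  # requirement 1: remove the tag completely
                break
            # requirements 2 and 3: remove matching subfields only
            subfields = [(c, v) for c, v in subfields if not sub_re.fullmatch(c)]
        # A data field left with no subfields is dropped as well.
        if not drop_field and subfields:
            kept.append((tag, subfields))
    return kept
```

Under these assumptions, `strip_record(record, ["856", "/0"])` would drop every 856 field and remove subfield 0 from all remaining fields, while `["650/0"]` would touch only subfield 0 of 650 fields.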

Perhaps Dan's remark about Evergreen not following the spec is a) erroneous and b) a red herring. The purpose of this branch is not to conform to any specification. It is to answer a perceived need at our institution. Someone has asked us to strip certain subfields when exporting records, so we thought that would be a useful feature.

Any discussion of improvements to Evergreen's use of subfield 0 should be taken to its own bug as should any discussion of stripping fields/subfields on import. Neither one of these is the goal of this bug or branch.

Jason Stephenson (jstephenson) wrote :

Number 1 in my previous comment got a little botched from my palm resting on my touchpad. I fixed most of the mess, but guess I missed that bit.

Dan Pearl (dpearl) wrote :

For safekeeping, I have created a branch that supports this basic functionality. No Pullrequest.

user/dpearl/LP1754455_marc_export_zero

Changed in evergreen:
assignee: Dan Pearl (dpearl) → nobody
Dan Pearl (dpearl) wrote :

Here is a branch that supports regular expressions. Thanks for the suggestion, Mike!

user/dpearl/LP1754455_marc_export_regex

Dan Pearl (dpearl)
summary: - marc_export should (optionally) remove 0 subfields
+ marc_export: want facility to delete fields/subfields
summary: - marc_export: want facility to delete fields/subfields
+ marc_export: want to delete fields/subfields
Changed in evergreen:
milestone: none → 3.next
Jason Stephenson (jstephenson) wrote :

I'm testing this with some export that I will need to do in May, and it occurred to me that an additional feature that would be nice to have, that this branch does not provide, would be to delete tags from the output based on subfield values. My main use case would be to delete 856 tags where the subfield 9 is not for one of the libraries whose records I am exporting.
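
To make the 856 use case concrete, here is a small sketch of value-based deletion over the same kind of simplified (tag, [(code, value), ...]) record model. The function name, its parameters, and the choice to keep 856 fields that carry no subfield 9 at all are assumptions for the example, not a description of what the branch does.

```python
def drop_foreign_uris(fields, wanted_libraries, tag="856", code="9"):
    """Drop `tag` fields whose `code` subfields name none of the wanted libraries.

    fields: list of (tag, [(code, value), ...]) tuples.
    wanted_libraries: set of library short names being exported.
    Fields of that tag with no `code` subfield are kept (an assumption:
    they are not located URIs and belong to no particular library).
    """
    kept = []
    for field_tag, subfields in fields:
        if field_tag == tag:
            owners = [value for sub_code, value in subfields if sub_code == code]
            if owners and not any(owner in wanted_libraries for owner in owners):
                continue  # every owner is a library we are not exporting
        kept.append((field_tag, subfields))
    return kept
```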

I will add that feature and write up some release notes, which this branch needs.

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Dan Pearl (dpearl) wrote : Re: [Bug 1754455] Re: marc_export: want to delete fields/subfields

There are release notes:
docs/RELEASE_NOTES_NEXT/marc_export_strip.adoc

Jason Stephenson (jstephenson) wrote :

OK. I expected to find them under Administration or Cataloging. I'll update them with my additions and put them in one of those two places. Guess I'll check the docs to see where marc_export goes in the regular documentation, if it appears there at all.

Jason Stephenson (jstephenson) wrote :

I'm not working on this at the moment. Most of the reason for our need has passed. I do intend to revisit this later, which might very well be never.

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
milestone: 3.next → none
tags: added: cat-importexport
Jason Stephenson (jstephenson) wrote :

I finally cleaned up Dan Pearl's code and pushed to a collab branch:

collab/dyrcona/LP1754455_marc_export_regex

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dyrcona/LP1754455_marc_export_regex

Other than fixing white space issues, cleaning up the release notes and commit message, I have not looked at this code in 4 years.

It rebased cleanly onto master, and I've added the pullrequest tag to maybe get some eyes on it at last.

tags: added: pullrequest
Jason Boyer (jboyer) wrote :

I've tried it out and like it. Signoff is here: https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/jboyer/lp1988472_marque_signoff / working/user/jboyer/lp1988472_marque_signoff

I also included a followup that ignores --strip / because outputting nothing but a leader is useless. It also auto-strips 852s when using --items because that implies that you want accurate, current data and there's no reason to require 2 params to get it.

Changed in evergreen:
milestone: none → 3.10.1
tags: added: signedoff
Galen Charlton (gmc)
Changed in evergreen:
status: New → Confirmed
importance: Undecided → Wishlist
milestone: 3.10.1 → 3.11-beta
Galen Charlton (gmc) wrote :

Pushed to master for inclusion in 3.11-beta. Thanks, Dan, Jason, and Jason!

Changed in evergreen:
status: Confirmed → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released