marc_export creating MARC data that yaz-marcdump dislikes

Bug #1671845 reported by Jeff Godin
This bug affects 3 people
Affects: Evergreen
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Observed with:
Evergreen 2.11.3
Debian Jessie:
YAZ version: 4.2.30 98864b44c654645bc16b2c54f822dc2e45a93031
Perl 5.20.2

Perl MARC modules installed from Debian Jessie packages:
libmarc-charset-perl 1.35-1
libmarc-record-perl 2.0.6-1
libmarc-xml-perl 1.0.3-1

Tested with some healthy-appearing records in a migration system; I have not yet attempted to reproduce with the Concerto sample bibs.

Per marc_export's --help output, using marc_export without passing --format or --encoding should default to USMARC encoded as MARC8:

 --format or -f Output format (USMARC, UNIMARC, XML, BRE, ARE) [USMARC]
 --encoding or -e Output encoding (UTF-8, ISO-8859-?, MARC8) [MARC8]

# export bib ids 123 and 456
echo -e "123\n456" | marc_export > test.mrc

I would expect "yaz-marcdump test.mrc" to be able to output the two records without issue, other than possible display-time encoding quirks due to my terminal not supporting MARC8.

Also tried with:

yaz-marcdump -n -p test.mrc
yaz-marcdump -n test.mrc
yaz-marcdump -f MARC8 -t UTF8 test.mrc

The following are some examples of warnings generated by the above yaz-marcdump commands:
Separator but not at end of field length=48
Bad indicator data. Skipping 1 bytes
Separator but not at end of field length=65
Separator but not at end of field length=29
Separator but not at end of field length=62
Bad indicator data. Skipping 2 bytes
No separator at end of field length=6
Bad indicator data. Skipping 2 bytes

The warnings and errors suggest a problem with the directory in the records: perhaps the field lengths and offsets recorded there become incorrect when a multi-byte character is changed to a single-byte character during the encoding change from UTF-8 (as stored in the database) to MARC8.
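One way to test the directory hypothesis without relying on yaz-marcdump is to walk each record's directory directly. The following is a minimal sketch, assuming the standard ISO 2709 layout (24-byte leader, 12-byte directory entries, field terminator 0x1E, record terminator 0x1D). The names `check_directory` and `build_record` are hypothetical helpers written for this illustration; `build_record` only simulates a character-count versus byte-count mismatch and is not a confirmed reproduction of what marc_export does.

```python
FIELD_TERMINATOR = 0x1E
RECORD_TERMINATOR = 0x1D
LEADER_LEN = 24
ENTRY_LEN = 12  # tag (3) + field length (4) + starting offset (5)


def check_directory(record: bytes) -> list[str]:
    """Return a list of inconsistencies between the directory and the data."""
    problems = []
    declared_len = int(record[0:5])
    base = int(record[12:17])  # base address of data
    if declared_len != len(record):
        problems.append(f"leader says {declared_len} bytes, record has {len(record)}")
    directory = record[LEADER_LEN:base - 1]  # base - 1 is the directory terminator
    if len(directory) % ENTRY_LEN:
        problems.append("directory length is not a multiple of 12")
        return problems
    for i in range(0, len(directory), ENTRY_LEN):
        entry = directory[i:i + ENTRY_LEN]
        tag = entry[0:3].decode("ascii", "replace")
        length = int(entry[3:7])
        start = int(entry[7:12])
        field = record[base + start:base + start + length]
        if not field or field[-1] != FIELD_TERMINATOR:
            problems.append(f"{tag}: no terminator at end of field length={length}")
        elif FIELD_TERMINATOR in field[:-1]:
            problems.append(f"{tag}: terminator before end of field length={length}")
    return problems


def build_record(fields, count_bytes=True):
    """Hypothetical test helper: build a tiny UTF-8 record.

    With count_bytes=False the directory lengths are computed from character
    counts instead of byte counts, which simulates an encoding-related mismatch.
    """
    directory = b""
    data = b""
    offset = 0
    for tag, text in fields:
        body = text.encode("utf-8") + bytes([FIELD_TERMINATOR])
        length = len(body) if count_bytes else len(text) + 1
        directory += f"{tag}{length:04d}{offset:05d}".encode("ascii")
        data += body
        offset += length
    directory += bytes([FIELD_TERMINATOR])
    base = LEADER_LEN + len(directory)
    total = base + len(data) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + data + bytes([RECORD_TERMINATOR])
```

Running `check_directory` over each record in test.mrc (split on the 0x1D byte) would show whether the directory entries really point past or short of the field terminators.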

When specifying --encoding UTF-8, the resulting MARC output does not have the above errors. As a workaround, you should be able to output UTF-8 records from marc_export and then convert them to MARC8 with yaz-marcdump or other tools.
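A minimal sketch of the second half of that workaround, wrapped in Python for scripting. It assumes yaz-marcdump is on the PATH and that the options `-f`, `-t`, `-o marc` and `-l 9=32` (set leader position 9 to a space, the MARC-8 marker) behave as described in the yaz-marcdump manual; this invocation has not been checked against the setup in this report, and the function names are hypothetical.

```python
import subprocess


def marc8_conversion_command(src_path: str) -> list[str]:
    """Build a yaz-marcdump command that re-encodes UTF-8 MARC as MARC-8."""
    return [
        "yaz-marcdump",
        "-f", "UTF-8",   # encoding of the input file
        "-t", "MARC-8",  # encoding to write
        "-o", "marc",    # write binary MARC rather than line format
        "-l", "9=32",    # assumption: leader/09 becomes a space for MARC-8
        src_path,
    ]


def convert_to_marc8(src_path: str, dst_path: str) -> None:
    """Run the conversion, writing the converted records to dst_path."""
    with open(dst_path, "wb") as out:
        subprocess.run(marc8_conversion_command(src_path), stdout=out, check=True)
```

For example, after `marc_export --encoding UTF-8 > test.utf8.mrc`, calling `convert_to_marc8("test.utf8.mrc", "test.marc8.mrc")` would produce a file that can then be re-checked with `yaz-marcdump -n`.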

It is quite possible that this is not a bug in marc_export or even Evergreen, but an issue upstream. Looking for reported (and possibly fixed!) bugs there may be a good next step.

Tags: cat-marc
Jeff Godin (jgodin) wrote :

A total of 17 error / warning messages are generated by yaz-marcdump when attempting to parse the two records in the above scenario, and the errors appear to begin AFTER the first occurrence of a character such as é or ©.
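To illustrate why characters such as é or © could shift everything after them, the sketch below compares how many characters and bytes each one occupies in different encodings. The MARC-8 byte sequences are hard-coded assumptions taken from the MARC-8 code tables as I understand them (combining acute 0xE2 placed before the base letter, copyright sign 0xC3); Python has no built-in MARC-8 codec, and `lengths` is a hypothetical helper.

```python
import unicodedata

# Assumed MARC-8 encodings for two sample characters (not a full codec).
MARC8_SAMPLE = {
    "\u00e9": b"\xe2e",  # é: combining acute, then the base letter "e"
    "\u00a9": b"\xc3",   # ©: single byte
}


def lengths(ch: str) -> dict[str, int]:
    """Report the size of one character in several representations."""
    nfc = unicodedata.normalize("NFC", ch)
    nfd = unicodedata.normalize("NFD", ch)
    return {
        "chars": len(nfc),
        "utf8_nfc": len(nfc.encode("utf-8")),
        "utf8_nfd": len(nfd.encode("utf-8")),
        "marc8": len(MARC8_SAMPLE[nfc]),
    }
```

Under these assumptions © takes two bytes in UTF-8 but one in MARC-8, so a directory length computed from one representation and applied to the other would be off by one for every such character in a field.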

In some tests with a single record, I was unable to get yaz-marcdump to emit anything at all, other than an initial "<!-- Record 1 offset 0 (0x0) -->" when using -p.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Jeff Godin (jgodin) wrote :

Confirmed same symptoms with a set of sample records loaded with --load-all-sample

$ marc_export --all > sample-all.mrc
Warning from bibliographic record 233: no mapping found at position 4 in Arii︠a︡ kni︠a︡zi︠a︡. at /usr/share/perl5/MARC/Charset.pm line 384.
Warning from bibliographic record 233: Use of uninitialized value in join or string at /usr/share/perl5/MARC/Field.pm line 696.

(The above string after "at position 4 in" may not appear correctly in launchpad.)

Counting lines of output with yaz-marcdump -n (should be zero lines):
$ yaz-marcdump -n sample-all.mrc | wc -l
1479

Counting '<!-- Record ' in yaz-marcdump -n -p output:
$ yaz-marcdump -n -p sample-all.mrc | grep -c '^<!-- Record '
149

Summarizing errors/warnings:
$ yaz-marcdump -n sample-all.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c
    221 Bad indicator data. Skipping 1 bytes
    226 Bad indicator data. Skipping 2 bytes
     59 No separator at end of fieldlength=XX
    973 Separator but not at end of fieldlength=XX
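The same summary can be produced in Python, which keeps the space before "length=" that the sed expression above drops. This is a small sketch; `summarize_warnings` is a hypothetical name, and it assumes every variable part of a message is a trailing `length=<digits>`.

```python
import re
from collections import Counter


def summarize_warnings(lines) -> Counter:
    """Count yaz-marcdump messages, ignoring the specific length values."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[re.sub(r"length=\d+$", "length=XX", line)] += 1
    return counts

# Possible usage, with the yaz-marcdump -n output saved to a file first:
#   with open("warnings.txt") as fh:
#       for message, n in summarize_warnings(fh).most_common():
#           print(n, message)
```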

For purposes of testing, I confirmed that if I exclude bibliographic record 233 I receive no warnings from marc_export but the resulting output file still shows the same symptoms as elsewhere in this bug.

psql -c 'SELECT id FROM biblio.record_entry where id <> 233 and not deleted and id > 0;' -A -t > bibs-not-233.txt

$ cat bibs-not-233.txt | marc_export > sample-some.mrc
[no warnings/errors from marc_export, since we excluded record id 233]

$ yaz-marcdump -n -p sample-some.mrc | grep -c '^<!-- Record '
148

$ yaz-marcdump -n sample-some.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c
    221 Bad indicator data. Skipping 1 bytes
    226 Bad indicator data. Skipping 2 bytes
     59 No separator at end of fieldlength=XX
    973 Separator but not at end of fieldlength=XX

Using --encoding=UTF-8 with marc_export results in no apparent issues:

$ marc_export --all --encoding=UTF-8 > sample-all.utf8.mrc
$ yaz-marcdump -n -p sample-all.utf8.mrc | grep -c '^<!-- Record '
235
$ yaz-marcdump -n sample-all.utf8.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c

$ cat bibs-not-233.txt | marc_export --encoding=UTF-8 > sample-some.utf8.mrc
$ yaz-marcdump -n -p sample-some.utf8.mrc | grep -c '^<!-- Record '
234
$ yaz-marcdump -n sample-some.utf8.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c

The issue with the non-UTF-8 files confuses yaz-marcdump enough that it reports 86 fewer records than are expected to be present in the file.

File sizes do not seem to suggest that there are actually 86 fewer records in the non-UTF-8 files:

271839 sample-all.mrc
264189 sample-some.mrc

271438 sample-all.utf8.mrc
263755 sample-some.utf8.mrc
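A more direct check than file size is to count records in a way that does not depend on the directory. The sketch below counts record terminator bytes (0x1D) and, separately, walks the file using only the record length in each leader. Both function names are hypothetical, and the sketch assumes 0x1D never occurs inside field data.

```python
RECORD_TERMINATOR = 0x1D
LEADER_LEN = 24


def count_by_terminator(data: bytes) -> int:
    """Count records by counting record terminator bytes."""
    return data.count(bytes([RECORD_TERMINATOR]))


def count_by_leader(data: bytes) -> int:
    """Count records by following the record length in each leader.

    Stops at the first record whose declared length does not end on a
    record terminator, so a short count points at a bad leader length.
    """
    pos = 0
    count = 0
    while pos + 5 <= len(data):
        head = data[pos:pos + 5]
        if not head.isdigit():
            break
        end = pos + int(head)
        if int(head) < LEADER_LEN or end > len(data):
            break
        if data[end - 1] != RECORD_TERMINATOR:
            break
        count += 1
        pos = end
    return count

# Possible usage:
#   data = open("sample-all.mrc", "rb").read()
#   print(count_by_terminator(data), count_by_leader(data))
```

If both counts come out at the expected number of records for the MARC8 files, the leader lengths are intact and the damage is confined to the directory entries or indicators within each record.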

Jeff Godin (jgodin) wrote :

Attaching lp1671845.tar.gz containing sample marc_export output for comparison.

Josh Stompro (u-launchpad-stompro-org) wrote :

Hello, I just ran into this issue.

Here is a snippet from marcdump output showing that it is confused about the indicators: it displays the next field's indicator as the last character of the preceding field's last subfield.

700 _aHeadland, Leslye,
       _efilm director.1
700 _aBrie, Alison,
       _eactor.1
700 _aSudeikis, Jason,
       _eactor.1
700 _aCarlos, Jordan,
       _eactor.1
700 _aLevieva, Margarita,
       _d1985-
       _eactor.

So would this be the directory header being off? Maybe the length is off by 1 or 2?

If there is no apparent fix for this, then maybe the defaults should be changed so that running it with no options gets valid results.

Josh

tags: added: marc
Elaine Hardy (ehardy)
tags: added: cat-marc
removed: marc