Comment 2 for bug 1671845

Jeff Godin (jgodin) wrote :

Confirmed the same symptoms with a set of sample records loaded with --load-all-sample:

$ marc_export --all > sample-all.mrc
Warning from bibliographic record 233: no mapping found at position 4 in Arii︠a︡ kni︠a︡zi︠a︡. at /usr/share/perl5/MARC/Charset.pm line 384.
Warning from bibliographic record 233: Use of uninitialized value in join or string at /usr/share/perl5/MARC/Field.pm line 696.

(The string after "at position 4 in" above may not display correctly in Launchpad.)

Counting lines of output with yaz-marcdump -n (should be zero lines):
$ yaz-marcdump -n sample-all.mrc | wc -l
1479

Counting '<!-- Record ' in yaz-marcdump -n -p output:
$ yaz-marcdump -n -p sample-all.mrc | grep -c '^<!-- Record '
149

Summarizing errors/warnings:
$ yaz-marcdump -n sample-all.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c
    221 Bad indicator data. Skipping 1 bytes
    226 Bad indicator data. Skipping 2 bytes
     59 No separator at end of fieldlength=XX
    973 Separator but not at end of fieldlength=XX

For testing purposes, I confirmed that excluding bibliographic record 233 eliminates the warnings from marc_export, but the resulting output file still shows the same symptoms described elsewhere in this bug.

$ psql -c 'SELECT id FROM biblio.record_entry where id <> 233 and not deleted and id > 0;' -A -t > bibs-not-233.txt

$ cat bibs-not-233.txt | marc_export > sample-some.mrc
[no warnings/errors from marc_export, since we excluded record id 233]

$ yaz-marcdump -n -p sample-some.mrc | grep -c '^<!-- Record '
148

$ yaz-marcdump -n sample-some.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c
    221 Bad indicator data. Skipping 1 bytes
    226 Bad indicator data. Skipping 2 bytes
     59 No separator at end of fieldlength=XX
    973 Separator but not at end of fieldlength=XX

Using --encoding=UTF-8 with marc_export results in no apparent issues:

$ marc_export --all --encoding=UTF-8 > sample-all.utf8.mrc
$ yaz-marcdump -n -p sample-all.utf8.mrc | grep -c '^<!-- Record '
235
$ yaz-marcdump -n sample-all.utf8.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c

$ cat bibs-not-233.txt | marc_export --encoding=UTF-8 > sample-some.utf8.mrc
$ yaz-marcdump -n -p sample-some.utf8.mrc | grep -c '^<!-- Record '
234
$ yaz-marcdump -n sample-some.utf8.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c

The problem in the non-UTF-8 files confuses yaz-marcdump enough that it reports 86 fewer records than expected in each file (149 vs. 235 for the full export, 148 vs. 234 with record 233 excluded).
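
One way to cross-check the record count without relying on yaz-marcdump's parser is to count record terminator bytes, since each ISO 2709 (MARC) record ends with 0x1D. This is only a sketch: the function name count_marc_records is made up for illustration, and it assumes 0x1D never appears inside field data.

```shell
# Hypothetical helper: count ISO 2709 record terminators (0x1D, octal 035).
# Assumes 0x1D occurs only as the end-of-record byte, never inside field data.
count_marc_records() {
    # Delete every byte except 0x1D, then count what remains.
    tr -cd '\035' < "$1" | wc -c | tr -d ' '
}
```

Usage would be along the lines of `count_marc_records sample-all.mrc`, compared against the number of record ids that were fed to marc_export.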

File sizes do not seem to suggest that there are actually 86 fewer records in the non-UTF-8 files:

271839 sample-all.mrc
264189 sample-some.mrc

271438 sample-all.utf8.mrc
263755 sample-some.utf8.mrc
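
Another possible check is to walk the file using the record length stored in the first five bytes of each leader. If a leader length does not match the record's actual byte length, the walk lands mid-record and stops with an error. This is a rough sketch, not something I ran against the files above; walk_marc_leaders is a made-up name, and the byte-at-a-time dd calls make it slow on large files.

```shell
# Hypothetical helper: step through a MARC file using each leader's
# 5-digit record length (leader bytes 0-4) and count the records visited.
walk_marc_leaders() {
    file=$1
    size=$(wc -c < "$file" | tr -d ' ')
    offset=0
    count=0
    while [ "$offset" -lt "$size" ]; do
        # Read the 5-byte length field at the current offset.
        len=$(dd if="$file" bs=1 skip="$offset" count=5 2>/dev/null)
        case $len in
            [0-9][0-9][0-9][0-9][0-9]) ;;
            *) echo "non-numeric leader length at offset $offset" >&2
               return 1 ;;
        esac
        # Strip leading zeros so the shell does not treat the value as octal.
        len=$(echo "$len" | sed 's/^0*//')
        # A length of zero would loop forever, so treat it as an error.
        [ -n "$len" ] || return 1
        offset=$((offset + len))
        count=$((count + 1))
    done
    echo "$count"
}
```

Running `walk_marc_leaders sample-all.mrc` and `walk_marc_leaders sample-all.utf8.mrc` side by side would show whether the leader lengths stay consistent with the file contents in each encoding.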