Confirmed same symptoms with a set of sample records loaded with --load-all-sample
$ marc_export --all > sample-all.mrc
Warning from bibliographic record 233: no mapping found at position 4 in Arii︠a︡ kni︠a︡zi︠a︡. at /usr/share/perl5/MARC/Charset.pm line 384.
Warning from bibliographic record 233: Use of uninitialized value in join or string at /usr/share/perl5/MARC/Field.pm line 696.
(The above string after "at position 4 in" may not appear correctly in Launchpad.)
Counting lines of output with yaz-marcdump -n (should be zero lines):
$ yaz-marcdump -n sample-all.mrc | wc -l
1479
Counting '<!-- Record ' in yaz-marcdump -n -p output:
$ yaz-marcdump -n -p sample-all.mrc | grep -c '^<!-- Record '
149
Summarizing errors/warnings:
$ yaz-marcdump -n sample-all.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c
221 Bad indicator data. Skipping 1 bytes
226 Bad indicator data. Skipping 2 bytes
59 No separator at end of fieldlength=XX
973 Separator but not at end of fieldlength=XX
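The same summary can be wrapped in a small shell function so it can be reused on each export file. This is only a sketch: the name summarize_marcdump is made up here, and unlike the one-liner above it keeps the space before length=, so the normalized messages read "field length=XX".

```shell
# Hypothetical helper (not part of yaz or Evergreen): reads yaz-marcdump
# messages on stdin, masks the variable length= value, and counts each
# distinct message.
summarize_marcdump() {
    sed -e 's/ length=.*$/ length=XX/' | sort | uniq -c
}
```

It would be used as: yaz-marcdump -n sample-all.mrc | summarize_marcdump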
For testing purposes, I confirmed that if I exclude bibliographic record 233, marc_export emits no warnings, but the resulting output file still shows the same symptoms described elsewhere in this bug.
$ psql -c 'SELECT id FROM biblio.record_entry where id <> 233 and not deleted and id > 0;' -A -t > bibs-not-233.txt
$ cat bibs-not-233.txt | marc_export > sample-some.mrc
[no warnings/errors from marc_export, since we excluded record id 233]
$ yaz-marcdump -n -p sample-some.mrc | grep -c '^<!-- Record '
148
$ yaz-marcdump -n sample-some.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c
221 Bad indicator data. Skipping 1 bytes
226 Bad indicator data. Skipping 2 bytes
59 No separator at end of fieldlength=XX
973 Separator but not at end of fieldlength=XX
Using --encoding=UTF-8 with marc_export results in no apparent issues:
$ marc_export --all --encoding=UTF-8 > sample-all.utf8.mrc
$ yaz-marcdump -n -p sample-all.utf8.mrc | grep -c '^<!-- Record '
235
$ yaz-marcdump -n sample-all.utf8.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c
$ cat bibs-not-233.txt | marc_export --encoding=UTF-8 > sample-some.utf8.mrc
$ yaz-marcdump -n -p sample-some.utf8.mrc | grep -c '^<!-- Record '
234
$ yaz-marcdump -n sample-some.utf8.mrc | sort | sed -e 's/ length=.*$/length=XX/' | sort | uniq -c
The issue with the non-UTF-8 files confuses yaz-marcdump enough that it reports 86 fewer records than expected (149 instead of 235 for the full export, and 148 instead of 234 with record 233 excluded).
File sizes do not seem to suggest that there are actually 86 fewer records in the non-UTF-8 files:
271839 sample-all.mrc
264189 sample-some.mrc
271438 sample-all.utf8.mrc
263755 sample-some.utf8.mrc
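One way to check the record count without relying on yaz-marcdump's parsing is to count record terminator bytes, since each record in an ISO 2709 / MARC 21 file ends with the byte 0x1D. The following is a minimal sketch, assuming the terminator does not also occur as a stray byte inside damaged field data; count_marc_records is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: count MARC record terminators (0x1D, octal 035)
# in the file given as $1. LC_ALL=C makes tr treat the input as raw bytes.
count_marc_records() {
    LC_ALL=C tr -cd '\035' < "$1" | wc -c | tr -d ' '
}
```

Running count_marc_records on sample-all.mrc and sample-all.utf8.mrc and comparing the results with the 149 and 235 reported above would indicate whether the records are actually present in the non-UTF-8 file or are only being miscounted by yaz-marcdump.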