Supercat encoding problems with MODS output (Zotero)

Bug #1442276 reported by Dan Scott
This bug affects 3 people
Affects    Status        Importance  Assigned to  Milestone
Evergreen  Fix Released  High        Unassigned
2.10       Fix Released  High        Unassigned
2.11       Fix Released  High        Unassigned

Bug Description

Per http://libmail.georgialibraries.org/pipermail/open-ils-general/2015-April/011486.html, current versions of Evergreen show character-encoding problems in Zotero.

Zotero uses unAPI to retrieve MODS3 output. Comparing MODS with MODS3, MODS32, and MODS33 output shows that the MODS output contains correctly encoded characters, while the MODS3, MODS32, and MODS33 output contains incorrectly encoded characters.

The very simple difference between MODS and MODS3* stylesheets boils down to this:

/openils/var/xsl/MARC21slim2MODS.xsl does not set an explicit output encoding:
<xsl:output indent="yes" method="xml"/>

whereas /openils/var/xsl/MARC21slim2MODS32.xsl does:
<xsl:output indent="yes" method="xml" encoding="UTF-8"/>

My reading of the XML::LibXSLT documentation suggests that the deprecated output_string() method is the reason this explicit output encoding causes the problem; http://search.cpan.org/~shlomif/XML-LibXSLT-1.94/LibXSLT.pm says:

output_string(result)
DEPRECATED: This method is something between output_as_bytes(result) and output_as_chars(result): The scalar returned by this function appears to Perl as characters (UTF8 flag is on) if the output encoding specified in the XSLT stylesheet was UTF-8 and as bytes if no output encoding was specified or if the output encoding was other than UTF-8. Since the behavior of this function depends on the particular stylesheet, it is deprecated in favor of output_as_bytes(result) and output_as_chars(result).
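
As a rough analogy only (this is Python's standard library, not XML::LibXSLT or Evergreen code), the sketch below shows a serializer whose return type depends on the encoding it is asked for. The element and variable names are made up for illustration. Callers that do not know which type they received are the ones that end up corrupting accented characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element; the title text is the one used in this bug's test plan.
title = ET.Element("title")
title.text = "Mystères de Montréal :"

# Asking for a byte encoding returns a bytes object (UTF-8 encoded octets).
as_bytes = ET.tostring(title, encoding="utf-8")

# Asking for "unicode" returns a character string instead.
as_chars = ET.tostring(title, encoding="unicode")

print(type(as_bytes).__name__, type(as_chars).__name__)
```

The point of the analogy is that the two results hold the same text but need different handling downstream: the bytes must be decoded exactly once, and the characters must be encoded exactly once.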

(Aside: changing output_string() to output_as_bytes() in OpenILS::WWW::SuperCat::Feed fixes a similar-but-different encoding problem for the RIS and MARCTXT formats, whereas output_as_chars(), which one would expect to be the right choice, perpetuates the encoding problem.)

Eva Cerninakova (ece) wrote :

The bug is still relevant in 2.12 (version 0master.53150bb).

Dan Scott (denials) wrote :

I've confirmed that changing output_string() to output_as_chars() in the evergreen.oils_xslt_process() database function (as defined in 002.schema.config.sql) resolves the issue with MODS3, MODS32, and MODS33 output, without causing a regression for MODS output.

description: updated
Dan Scott (denials) wrote :

The top two commits on http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/lp1442276_decorrupt_mods3_ris fix the MODS3 output and the MARCTXT/RIS output, respectively.

Changed in evergreen:
milestone: none → 2.12-rc
importance: Undecided → High
status: New → Confirmed
tags: added: pullrequest
Dan Scott (denials) wrote :

I have added a third commit which contains a pgTAP regression test for config.oils_xslt_process().

Dan Scott (denials) wrote :

I've determined that the problem seen in retrieving a SuperCat request like https://laurentian.concat.ca/opac/extras/supercat/retrieve/mods/record/2505029 also manifests in OpenILS::Application::SuperCat, in that the use of the $result->toString method generates a byte string instead of characters.

I have added one more commit to the referenced branch to address this source of character corruption, for a total of four commits.
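
One common way a byte string turns into corrupted output is double encoding: UTF-8 bytes are misread as one character per byte and then encoded again. The Python sketch below illustrates that general mechanism with the sample title from this bug. It is an assumption that this is exactly what happens inside SuperCat, and the variable names are illustrative.

```python
original = "Mystères de Montréal :"

# Correct UTF-8 bytes for the title.
utf8_bytes = original.encode("utf-8")

# Mistake: treat each UTF-8 byte as a separate Latin-1 character.
misread = utf8_bytes.decode("latin-1")

# Encoding the misread text again yields double-encoded output.
double_encoded = misread.encode("utf-8")

print(misread)  # MystÃ¨res de MontrÃ©al :

# The damage is reversible only if the wrong step is undone exactly.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```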

Dan Scott (denials) wrote :

Manual test plan for the SuperCat output:

1. Install current master and run eg_db_config.pl with --load-all-sample.
2. Check http://hostname/opac/extras/supercat/retrieve/mods/record/147 - you should see that <title>Mystères de Montréal :</title> appears with the appropriate accents.
3. Check http://hostname/opac/extras/supercat/retrieve/mods33/record/147 - you should see that <title>Mystères de Montréal :</title> appears with corrupted characters instead of the accented characters.
4. Add the four commits from this branch. Reinstall to get the updated Perl modules in place, and run eg_db_config.pl with --load-all-sample.
5. Repeat steps 2 and 3; this time, both MODS and MODS33 output will show the appropriate accents.
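
The visual check in steps 2, 3, and 5 could be scripted along the following lines. This is a minimal sketch, not part of the branch: the function names are hypothetical, the sample documents are hand-built stand-ins for real SuperCat responses, and fetching the URLs from a test server is left out.

```python
import xml.etree.ElementTree as ET

EXPECTED_TITLE = "Mystères de Montréal :"


def titles_in(mods_xml):
    """Return the text of every <title> element, ignoring XML namespaces."""
    root = ET.fromstring(mods_xml)
    return [el.text or "" for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "title"]


def title_is_intact(mods_xml):
    """True if the expected title appears with its accents preserved."""
    return EXPECTED_TITLE in titles_in(mods_xml)


# Hand-built stand-in for a correct response body (UTF-8 bytes).
good = ('<mods xmlns="http://www.loc.gov/mods/v3"><titleInfo><title>'
        'Mystères de Montréal :</title></titleInfo></mods>').encode("utf-8")

# Stand-in for a corrupted response: the same bytes, double encoded.
bad = good.decode("latin-1").encode("utf-8")

print(title_is_intact(good), title_is_intact(bad))  # True False
```

In practice the bytes passed to title_is_intact() would be the response bodies of the mods and mods33 URLs from the test plan, fetched before and after applying the commits.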

Dan Scott (denials)
tags: added: i18n
Kathy Lussier (klussier) wrote :

Dan already mentioned this in IRC, but I want to add a comment here so that it doesn't get missed by whoever merges it. We will need an upgrade script before this code is merged.

Kathy Lussier (klussier) wrote :

It works for me. I've signed off on Dan's commits and added one more commit for the upgrade script that needs a signoff. New working branch at http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/kmlussier/lp1442276_decorrupt_mods3_ris

I'm also assigning targets for 2.10 and 2.11 since this is a bug fix.

Thanks Dan!

Dan Scott (denials) wrote :

Thank you so much, Kathy. I've signed off on your upgrade script, stamped it with 1030, and applied the commits to master, 2.11, and 2.10.

Changed in evergreen:
status: Confirmed → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released