Added content encoding for Open Library is not correct

Bug #1610678 reported by Linda Jansova
This bug affects 2 people
Affects: Evergreen
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We are using:
Evergreen 2.10.5
PostgreSQL 9.4.8
Debian 8 Jessie

Added content from Open Library (the table of contents in particular) is not displayed with correct character encoding in the online catalog.

A sample Open Library table of contents with Czech characters can be found here:

https://openlibrary.org/works/OL577950W/U%CC%81stavni%CC%81_pe%CC%81c%CC%8Ce

We have also checked how the same record looks in our production system (Evergreen 2.8.3), and we can report that it is corrupted there as well, as you can see at:

http://www.jabok.cuni.cz/eg/opac/record/13907?locg=102 (please click on Additional Content in the lower part of the page)

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I'm guessing that you're being sent ISO 8859-2 characters and you need to convert them to UTF-8.

Another possibility is that you're being sent some other Unicode encoding that is being interpreted as UTF-8.

Can you verify from the vendor what character set they are sending you?
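
A minimal sketch of that conversion with Perl's Encode module, assuming (hypothetically) that the vendor really is sending ISO-8859-2 bytes; the variable names are illustrative, not Evergreen code:

use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw response bytes from the vendor, assumed ISO-8859-2.
my $raw_bytes = "\xE8as";                            # "čas" (č = 0xE8 in Latin-2)

my $text       = decode('ISO-8859-2', $raw_bytes);   # bytes -> Perl characters
my $utf8_bytes = encode('UTF-8', $text);             # characters -> UTF-8 bytes

print $utf8_bytes, "\n";                             # UTF-8 bytes for "čas"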

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Or you're double-encoding, i.e. converting the text that is already UTF-8 into UTF-8 again. Figuring out where that is happening can be tricky.

One final possibility: Perl uses some character set internally, and I'm not entirely certain what it is. You could also try using the Encode module and telling it to explicitly encode the output as UTF-8.
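
For reference, a small sketch of that Encode approach (an illustration only, not Evergreen code): Perl keeps strings internally as sequences of characters, and the encode step turns them into UTF-8 bytes exactly once on output:

use strict;
use warnings;
use Encode qw(encode);

# A Perl character string (stored internally as codepoints, not bytes).
my $text = "P\x{e9}\x{10d}e";       # "Péče"

# Explicitly encode to UTF-8 bytes before printing...
print encode('UTF-8', $text), "\n";

# ...or set an encoding layer on the handle once and print characters:
# binmode(STDOUT, ':encoding(UTF-8)');
# print $text, "\n";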

Welcome to the fun world of Unicode! ;)

Revision history for this message
Linda Jansova (skolkova-s) wrote :

Well, this is strange, because in this case we are using Evergreen's default added content provider... There is some developer documentation regarding Open Library (available at https://openlibrary.org/developers).

We hope that once content from Open Library makes it through Evergreen without losing its encoding along the way, we will be able to fix our own added content module in a similar fashion...

No doubt it shall be fun ;-)...

Revision history for this message
Jakub Kotrla (0-jakub) wrote :

Hi, I am a developer working with Linda; I am creating an AddedContent plugin that loads data from the obalkyknih.cz server, which provides, among other things, tables of contents (TOCs).

The encoding of the TOC shown in Evergreen via our new plugin was wrong (accented letters were replaced by strange symbols).

I've tried a lot of tricks and investigated quite a bit, and I've found the following:
- the obalkyknih.cz server provides the TOC as UTF-8
- the TOC shown in Evergreen via our plugin is double-encoded: when I took the original UTF-8 TOC and re-encoded it from "ISO-8859-1" to "UTF-8", I got the same result as what Evergreen showed (see the sketch after this list)
- therefore I suspect that somewhere in the process the already-UTF-8-encoded TOC is encoded again (as if converting "ISO-8859-1" to "UTF-8")
- I've tried using the Encode and utf8 Perl modules
- I've tried logging the TOC to the logger, and the content of the log file is correct, perhaps because the Evergreen Logger calls binmode(SINK, ':utf8'); in sub _write_file
- I've tried adding the line binmode(STDOUT, ":utf8"); to the AddedContent.pm module, with no success
- I've even tried adding the encoding to the content-type part of the returned added content by using the following line in our AddedContent handler:
  return { content_type => 'text/html; charset=utf-8', content => $c };
- interestingly, at a URL of the form http://evergreen-server/opac/extras/ac/toc/html/r/23225 I can see the TOC with the correct encoding
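
A minimal sketch (not Evergreen code) that reproduces the corruption described in the second bullet: already-valid UTF-8 bytes are treated as ISO-8859-1 and encoded to UTF-8 a second time:

use strict;
use warnings;
use Encode qw(decode encode);

# A TOC fragment that is already valid UTF-8 bytes ("péče").
my $utf8_bytes = encode('UTF-8', "p\x{e9}\x{10d}e");

# The suspected bug: the UTF-8 bytes are mistaken for ISO-8859-1 and
# encoded to UTF-8 again, producing the mojibake seen in the catalog.
my $double_encoded = encode('UTF-8', decode('ISO-8859-1', $utf8_bytes));

print $double_encoded, "\n";   # garbled output instead of "péče"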

What I do not understand is where the TOC gets encoded incorrectly. How is the AddedContent.pm handler called? Who calls it? The only thing I have understood is that the AddedContent.pm handler is called from some other part of Evergreen via some kind of network call, because the handler, in sub print_content, first writes a line that looks like an HTTP header (print "Content-type: $ct\n\n";) and then the content itself.

I do not know how Evergreen and OpenSRF work internally, but it seems to me that the AddedContent.pm module provides the correct TOC in the correct encoding, and some other part of Evergreen messes with it and shows the TOC double-encoded.

Any ideas, help, or an explanation of what calls the AddedContent.pm handler would be greatly appreciated.

Revision history for this message
Josh Stompro (u-launchpad-stompro-org) wrote :

EG 2.10

We are seeing a similar issue with Content Cafe; I'm going to add to this ticket since it might be related.

When added content containing an em dash is displayed via the added content service, the em dash is displayed as the garbled sequence 'â€”' (its UTF-8 bytes read as Windows-1252).

Example:
http://egcatalog.larl.org/opac/extras/ac/reviews/html/r/242765

I see the content-type set to "text/html" in the headers, but not the '; charset=utf-8' that I think should be there.

Display from booklist.
https://www.booklistonline.com/The-Leavers-Lisa-Ko/pid=8559950

The xml view seems to be fine -
http://egcatalog.larl.org/opac/extras/ac/reviews/xml/r/242765

So I wonder if it comes down to the send_html sub in Open-ILS/src/perlmods/lib/OpenILS/WWW/AddedContent/ContentCafe.pm using "return { content_type => 'text/html', content => $HTML };"

I'll see if adding '; charset=utf-8' to that call gets me the correct results.

http://git.evergreen-ils.org/?p=Evergreen.git;a=blob;f=Open-ILS/src/perlmods/lib/OpenILS/WWW/AddedContent/ContentCafe.pm#l351

Josh

Revision history for this message
Linda Jansova (skolkova-s) wrote :

Hi,

It definitely looks related to me. BTW, our developer eventually used HTML entities to make sure the encoding is okay (please see https://www.mail-archive.com/search?<email address hidden>&q=subject:%22%5C%5BOPEN%5C-ILS%5C-GENERAL%5C%5D+Record+webpage+and+programming+documentation%5C%3F%22&o=newest&f=1 for more details)...

Linda

Revision history for this message
Josh Stompro (u-launchpad-stompro-org) wrote :

Thanks, Linda. For the case of calling the added content link directly, setting the charset to utf-8 seems to have worked for me; the browser correctly shows the page in Unicode mode in that case.

return { content_type => 'text/html; charset=utf-8', content => $HTML };

That doesn't seem to affect the display of the review on the record detail page, though; it is still corrupted.

I've been trying to understand how Perl and Template Toolkit handle Unicode and what the issue could be; https://www.lemoda.net/perl/template-encoding/index.html has been an interesting read about what can go wrong with TT and UTF-8.
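
For anyone following along, here is a minimal sketch of the pitfall that article describes (not Evergreen's actual template setup): TT's ENCODING option controls how template files are decoded, and the resulting character string must then be encoded exactly once on output:

use strict;
use warnings;
use Template;

# ENCODING => 'utf8' tells TT to decode template *files* as UTF-8; the
# inline scalar-ref template below is already a Perl character string.
my $tt = Template->new({ ENCODING => 'utf8' }) or die Template->error;

my $output = '';
$tt->process(\'Review: [% review %]', { review => "Mi\x{e9}ville" }, \$output)
    or die $tt->error;

# $output is a character string; encode it exactly once on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
print $output, "\n";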

Josh

Revision history for this message
Josh Stompro (u-launchpad-stompro-org) wrote :

Here is an example that displays incorrectly in Unicode encoding but correctly in Western encoding (testing in Firefox, using View -> Text Encoding), which is the opposite of the em dash problem.

The text includes \u00e9 for "Miéville's" name.

http://egcatalog.larl.org/opac/extras/ac/reviews/html/r/212636

Which makes me think that the send_html charset is not the issue I'm looking for.
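
One possible way to cope with mixed vendor encodings (a sketch only, with a hypothetical helper, not something in the Evergreen tree): try to decode the vendor bytes strictly as UTF-8 and fall back to Windows-1252 when that fails:

use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical helper: decode bytes as UTF-8 when they are valid UTF-8,
# otherwise assume cp1252 (Windows-1252, a superset of ISO-8859-1).
sub vendor_bytes_to_text {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return defined $text ? $text : decode('cp1252', $bytes);
}

# The lone 0xE9 byte is not valid UTF-8, so this falls back to cp1252.
my $text = vendor_bytes_to_text("Mi\xE9ville");
print encode('UTF-8', $text), "\n";   # "Miéville" as proper UTF-8 bytes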

Josh

Revision history for this message
Linda Jansova (skolkova-s) wrote :

Josh, did you eventually manage to hit the nail on the head?

We have done some more testing (not with Open Library but with Obalkyknih) and reported our interim results to the open-ils-general and open-ils-dev mailing lists: http://libmail.georgialibraries.org/pipermail/open-ils-general/2018-November/015488.html.

Is there anything that would help you identify the troublesome piece of code or setting?
