Added content encoding for Open Library is not correct

Bug #1610678 reported by Linda Jansova on 2016-08-07
This bug affects 2 people

Bug Description

We are using:
Evergreen 2.10.5
PostgreSQL 9.4.8
Debian 8 Jessie

Open Library added content (table of contents in particular) does not provide correctly encoded data in the online catalog.

A sample Open Library table of contents with Czech characters can be found here:

We have also checked what the same record looks like in our production system (Evergreen 2.8.3), and we can report that it is corrupted there as well, as you can see at: (please click on Additional Content in the lower part of the page)

Jason Stephenson (jstephenson) wrote :

I'm guessing that you're being sent ISO 8859-2 characters and you need to convert them to UTF-8.

Another possibility is you're being sent some other Unicode format that is being interpreted as UTF-8.

Can you verify from the vendor what character set they are sending you?

Jason Stephenson (jstephenson) wrote :

Or you're double-encoding, i.e. converting the text that is already UTF-8 into UTF-8 again. Figuring out where that is happening can be tricky.

One final possibility: Perl uses some character set internally, and I'm not entirely certain what it is. You could also try using the Encode module and tell it to explicitly encode the output in UTF-8.

Welcome to the fun world of Unicode! ;)
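
A minimal sketch of the explicit approach suggested above, using the core Encode module: decode incoming bytes once at the input boundary and encode once at the output boundary. The sample string and variable names are illustrative assumptions, not taken from Evergreen or Open Library.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: the raw UTF-8 bytes for "Dvořák"
# (ř = C5 99, á = C3 A1), as a provider might send them.
my $bytes = "Dvo\xC5\x99\xC3\xA1k";

# Decode once on input: UTF-8 bytes -> Perl character string.
my $chars = decode('UTF-8', $bytes);

# Encode once on output: character string -> UTF-8 bytes.
my $out = encode('UTF-8', $chars);

# Prints "characters: 6, bytes: 8"
printf "characters: %d, bytes: %d\n", length($chars), length($out);
```

If the data is decoded exactly once and encoded exactly once, the output bytes are identical to the input bytes. Problems of the kind discussed here usually mean one of the two steps happened twice or not at all.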

Linda Jansova (skolkova-s) wrote :

Well, this is strange because in this case we are using Evergreen's default added content provider... There is some developer documentation regarding Open Library (available at

We hope that if content from Open Library can get through Evergreen without losing its encoding along the way, we could correct our own added content module in the same way...

No doubt it shall be fun ;-)...

Jakub Kotrla (0-jakub) wrote :

Hi, I am a developer working with Linda. I am creating an AddedContent plugin that loads data (e.g. the toc) from a server.

The encoding of the toc shown in Evergreen using our new plugin was wrong (letters with accents were replaced by strange symbols).

I've tried a lot of tricks and investigated a lot. Here is what I've found:
- the server provides the toc as UTF-8
- the toc shown in Evergreen using our plugin is double-encoded: when I converted the original UTF-8 toc from "ISO-8859-1" to "UTF-8", I got the same result as what Evergreen showed
- therefore I suspect that somewhere in the process the already UTF-8-encoded toc is encoded again (as if converting "ISO-8859-1" to "UTF-8")
- I've tried using the Encode and utf8 Perl modules
- I've tried logging the toc to the logger, and the content of the log file is correct, maybe because the Evergreen Logger calls binmode(SINK, ':utf8'); in sub _write_file
- I've tried adding the line binmode(STDOUT, ":utf8"); to the module, with no success
- I've even tried adding the encoding to the content-type part of the returned added content, using the following line in our AddedContent handler:
return { content_type => 'text/html; charset=utf-8', content => $c };
- interestingly, at a URL of the form http://evergreen-server/opac/extras/ac/toc/html/r/23225 I could see the toc in the correct encoding
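
The double encoding described in the list above can be reproduced in isolation. This is only a sketch under assumptions: the sample character (č) and the helper name hex_bytes are hypothetical, and it does not show where in Evergreen the second encoding actually happens.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Helper (hypothetical name): show a byte string as hex pairs.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

# Correct UTF-8 bytes for the Czech letter "č" (U+010D): C4 8D.
my $utf8_bytes = encode('UTF-8', "\x{10D}");

# Misread those bytes as ISO-8859-1, then encode to UTF-8 again.
# Each original byte becomes its own two-byte sequence.
my $double = encode('UTF-8', decode('ISO-8859-1', $utf8_bytes));

print "single: ", hex_bytes($utf8_bytes), "\n";   # single: C4 8D
print "double: ", hex_bytes($double), "\n";       # double: C3 84 C2 8D

# Calling encode() on a string that is already UTF-8 bytes has the
# same effect, because Perl treats each byte as a Latin-1 character.
my $implicit = encode('UTF-8', $utf8_bytes);
print "implicit double encoding matches\n" if $implicit eq $double;

# Reversing one layer of encoding recovers the original bytes.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $double));
print "repaired: ", hex_bytes($repaired), "\n";   # repaired: C4 8D
```

Comparing hex dumps like these at each stage (provider response, handler return value, final page) is one way to narrow down which step adds the extra layer.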

What I do not understand is where the toc gets encoded wrongly. How is the handler called? Who calls it? The only thing I've understood is that the handler is called from some other part of Evergreen via some kind of network call, because in sub print_content the handler first writes a line that looks like an HTTP header (print "Content-type: $ct\n\n";) and then the content itself.

I do not know how Evergreen and OpenSRF work internally, but it seems to me that the module provides the toc in the correct encoding and some other part of Evergreen messes with it and shows the toc double-encoded.

Any ideas, help or explanation of who calls handler would be greatly appreciated.

EG 2.10

We are seeing a similar issue with Content Cafe. I'm adding it to this ticket since it might be related.

When added content with an em dash is displayed via the added content, the em dash is displayed as '—'


I see the content-type set to "text/html" in the headers, but not the '; encoding=utf8' that I think should be there.

Display from booklist.

The xml view seems to be fine -

So I wonder if it comes down to the send_html sub in Open-ILS/src/perlmods/lib/OpenILS/WWW/AddedContent/ using "return { content_type => 'text/html', content => $HTML };"

I'll see if adding the '; encoding=utf8' to that call gets me the correct results.


Linda Jansova (skolkova-s) wrote :


It definitely looks related to me - BTW, our developer has eventually used HTML entities to make sure the encoding is okay (please see the mailing list thread "[OPEN-ILS-GENERAL] Record webpage and programming documentation?" for more details)...


Thanks, Linda. For the case of calling the added content link directly, setting the charset to utf-8 seems to have worked for me. The browser correctly shows the page in Unicode mode in that case.

return { content_type => 'text/html; charset=utf-8', content => $HTML };

That doesn't seem to affect the display of the review in the record detail page though; it is still corrupted.

I've been trying to understand how Perl and Template Toolkit handle Unicode and what the issue could be; has been an interesting read about what can go wrong with TT and UTF-8.


Here is an example that displays incorrectly in Unicode mode but correctly in Western encoding (tested in Firefox, using View -> Text Encoding), which is the opposite of the em dash problem.

The text includes \u00e9 for "Miéville's" name.

Which makes me think that the send_html charset is not the issue I'm looking for.
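
One plausible explanation for this "opposite" case, offered only as a sketch: a \u00e9 escape decodes to a Perl character string, and if that string is printed through a filehandle with no encoding layer, Perl emits the single Latin-1 byte E9 rather than the UTF-8 sequence C3 A9. The JSON payload and the helper name emit are hypothetical. The example uses in-memory filehandles to capture what would be written.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Assumed payload shape; only the \u00e9 escape comes from the report.
my $json = '{"author":"China Mi\u00e9ville"}';
my $name = decode_json($json)->{author};   # character string with U+00E9

# Helper (hypothetical name): print $text through the given I/O layer
# into an in-memory buffer and return the bytes that were written.
sub emit {
    my ($layer, $text) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open failed: $!";
    print {$fh} $text;
    close $fh;
    return $buf;
}

my $raw  = emit('', $name);                  # no encoding layer
my $utf8 = emit(':encoding(UTF-8)', $name);  # explicit UTF-8 layer

# Without a layer, é goes out as one byte (E9), which is valid
# Latin-1 but not valid UTF-8. With the layer it becomes C3 A9.
printf "raw bytes: %d, utf8 bytes: %d\n", length($raw), length($utf8);
```

A page that mixes both kinds of output (some fragments already UTF-8 bytes, others unencoded character strings) would show exactly this pattern: one fragment looks right only in Unicode mode, the other only in Western mode.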


Linda Jansova (skolkova-s) wrote :

Josh, have you eventually managed to hit the nail on the head?

We have done some more testing (not using Open Library but using Obalkyknih) and reported our interim results to open-ils-general and open-ils-dev mailing lists:

Is there anything that would help you identify the troublesome piece of code or setting?
