TPAC search doesn't handle UTF well

Bug #1104004 reported by Pasi Kallinen
This bug affects 1 person
Affects: Evergreen
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

Non-ASCII characters in OPAC searches may get garbled (see image).
This usually seems to happen when switching the language or pressing Enter in the search bar.

Tags: i18n opac
Pasi Kallinen (paxed) wrote :

The issue looks like this: BERGENDAL, GÖRAN vs BERGENDAL, GÃRAN
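
A minimal sketch of one way this kind of garbling can arise, assuming the UTF-8 bytes of the search string are decoded as Latin-1 (ISO-8859-1) somewhere along the way. That assumption is not confirmed in this report, and the helper names are hypothetical:

```python
import unicodedata


def misdecode_utf8_as_latin1(text):
    """Encode text as UTF-8, then (wrongly) decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def strip_control_chars(text):
    """Drop control characters (category Cc), which are usually not rendered."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


# "Ö" is the two bytes C3 96 in UTF-8; read as Latin-1 they become
# "Ã" followed by the invisible control character U+0096.
garbled = misdecode_utf8_as_latin1("BERGENDAL, GÖRAN")
print(repr(garbled))
print(strip_control_chars(garbled))
```

With the invisible control character dropped, the result matches the garbled form quoted above.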

Pasi Kallinen (paxed) wrote :

And this is the MARC record:

<?xml version="1.0"?>
<collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:marc="http://www.loc.gov/MARC21/slim">
  <record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
    <leader> cam a22 4a 4500</leader>
    <controlfield tag="001">119400</controlfield>
    <controlfield tag="003">JOKUNEN</controlfield>
    <controlfield tag="005">2002-11-29 08:59:54+02</controlfield>
    <controlfield tag="008"> s1981 sw |||||||||| ||||0|swe|c</controlfield>
    <datafield tag="020" ind1=" " ind2=" ">
      <subfield code="c">2.02 EUR</subfield>
    </datafield>
    <datafield tag="041" ind1="0" ind2=" ">
      <subfield code="a">swe</subfield>
    </datafield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">BERGENDAL, G&#xD6;RAN.</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">MUSIKEN P&#xC5; ISLAND :</subfield>
      <subfield code="b">OM ISOLERING OCH INTERNATIONALISM /</subfield>
      <subfield code="c">G&#xD6;RAN BERGENDAL.</subfield>
    </datafield>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="c">1981.</subfield>
    </datafield>
    <datafield tag="300" ind1=" " ind2=" ">
      <subfield code="a">55 S. :</subfield>
      <subfield code="b">KUV. ;</subfield>
      <subfield code="c">21 CM.</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="7">
      <subfield code="a">s&#xE4;velt&#xE4;j&#xE4;t</subfield>
      <subfield code="2">ysa</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="7">
      <subfield code="a">islanti</subfield>
      <subfield code="2">ysa</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="7">
      <subfield code="a">musiikkiel&#xE4;m&#xE4;</subfield>
      <subfield code="2">ysa</subfield>
    </datafield>
    <datafield tag="852" ind1=" " ind2=" ">
      <subfield code="a">FI-Jm</subfield>
      <subfield code="h">78.93</subfield>
    </datafield>
    <datafield tag="901" ind1=" " ind2=" ">
      <subfield code="a">119400</subfield>
      <subfield code="b">AUTOGEN</subfield>
      <subfield code="c">119400</subfield>
      <subfield code="t">biblio</subfield>
    </datafield>
  </record>
</collection>
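
The record stores its non-ASCII letters as numeric character references (for example `&#xD6;`), which any XML parser resolves to the real characters regardless of byte encoding. A rough sketch using Python's standard library on a cut-down copy of the 100 field; the `subfield_values` helper is hypothetical and only illustrates that the record itself decodes cleanly:

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Cut-down copy of the record above: only the 100 field is kept.
SAMPLE = """<?xml version="1.0"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">BERGENDAL, G&#xD6;RAN.</subfield>
    </datafield>
  </record>
</collection>"""


def subfield_values(marcxml, tag, code):
    """Return the text of every subfield `code` inside datafields with `tag`."""
    root = ET.fromstring(marcxml)
    xpath = f".//marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text or "" for el in root.findall(xpath, MARC_NS)]


authors = subfield_values(SAMPLE, "100", "a")
print(authors)
```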

Lebbeous Fogle-Weekley (lebbeous) wrote :

I'm not saying this necessarily explains the issue, but in trying to import your record and duplicate the issue, I noticed (well, Galen pointed it out to me when I got warnings from Evergreen) that the LDR/09 is not 'a' as it should be for a UTF-8 record, but '4'.
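
For reference, LDR/09 is the character at zero-based offset 9 of the 24-character MARC 21 leader. A small sketch of that check; the leader strings below are made up for illustration and are not taken from the record in this report:

```python
def leader_coding_scheme(leader):
    """Return LDR/09 (character coding scheme) from a 24-character MARC leader."""
    if len(leader) != 24:
        raise ValueError(f"expected a 24-character leader, got {len(leader)}")
    return leader[9]


def is_marked_utf8(leader):
    """In MARC 21, LDR/09 == 'a' marks the record as UCS/Unicode."""
    return leader_coding_scheme(leader) == "a"


# Hypothetical leaders, for illustration only.
good_leader = "00000cam a2200000 a 4500"
bad_leader = "00000cam 42200000 a 4500"

print(is_marked_utf8(good_leader))
print(is_marked_utf8(bad_leader))
```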

Pasi Kallinen (paxed) wrote :

Lebbeous, thanks!

We're still writing the conversion script from our current closed-source system to EG, so things are a bit iffy. I'll point this out to the guy doing that...

Ben Shum (bshum) wrote :

Marking incomplete pending further testing to either confirm or invalidate this bug report.

Changed in evergreen:
status: New → Incomplete
Pasi Kallinen (paxed) wrote :

This happens even with the correct LDR/09.

Perhaps this is caused by Apache caching (similar to bug 1096871), or the UTF-8 header doesn't get sent for some (other) reason. I used the Firefox Live Headers addon to capture the headers for the query "öylätti" and saved them in http://bilious.alt.org/~paxed/eg/oylatti_headers.txt. That query showed the garbled ö and ä letters.
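
A hedged sketch of the two things worth checking in such a capture: whether the response's Content-Type declares a charset, and what the query looks like once percent-encoded. The header values here are hypothetical examples, not the contents of the linked file, and `declared_charset` is an illustrative helper:

```python
from email.message import Message
from urllib.parse import quote, unquote


def declared_charset(content_type):
    """Return the lowercased charset from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Hypothetical header values: one with a charset, one without.
print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))

# The query as a browser would percent-encode it (UTF-8 bytes).
encoded = quote("öylätti")
print(encoded)

# Decoding those bytes as Latin-1 instead of UTF-8 garbles ö and ä.
print(unquote(encoded, encoding="latin-1"))
```

If the charset is missing from the response, the browser has to guess the encoding, which is one way the same page could render correctly on some requests and garbled on others.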

tags: removed: tpac
Eva Cerninakova (ece) wrote :

If I remember correctly, we first encountered this problem in version 2.8 and kept encountering it in later versions. When a search containing diacritics was performed, the characters were occasionally garbled and the search returned no results. At first it happened only occasionally, but it became more frequent, and in the end it was no longer possible to search for words with diacritics at all. The problem affected not only catalog searches but also self-registration (registrations ended up with garbled characters in the pending patron record) and pending patron addresses (see the attachment). When we examined the logs, it turned out that the problem was usually (but not always) related to search bots crawling our website. A quick fix was restarting Apache, which helped for a while. A more lasting solution turned out to be changing the settings in /etc/apache2/mods-available/mpm_prefork.conf.

A similar issue occurs from time to time in our production catalog running Evergreen 3.1. When it occurs, it also affects pending patrons and pending addresses. Unlike in previous versions, though, it is usually only temporary and disappears without any intervention (I guess it has something to do with the current Apache settings).
