Elasticsearch choking on non-ASCII characters

Bug #1487274 reported by Aaron Wells
This bug affects 2 people
Affects: Mahara
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

In 15.10 I've added code to "quarantine" records that Elasticsearch won't index. That is, if Elasticsearch errors out while processing a batch of records, then I re-try each record individually. And if it errors out while processing one of those individual records, I mark the record as quarantined, and keep it in the search_elasticsearch_queue table.
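The retry-then-quarantine flow described above can be sketched roughly as follows. This is only an illustration in Python, not Mahara's actual PHP code: `index_with_quarantine`, `bulk_index`, and `fake_bulk_index` are hypothetical names, and the fake indexer simply rejects non-ASCII titles to stand in for whatever makes Elasticsearch error out.

```python
from typing import Callable, List, Sequence


def index_with_quarantine(
    records: Sequence[dict],
    bulk_index: Callable[[Sequence[dict]], None],
) -> List[dict]:
    """Index a batch; on failure retry each record alone.

    Returns the records that still fail individually, i.e. the ones
    that would be marked as quarantined and left in the queue.
    """
    try:
        bulk_index(records)
        return []  # the whole batch went through
    except Exception:
        pass  # fall through to per-record retries

    quarantined = []
    for record in records:
        try:
            bulk_index([record])
        except Exception:
            quarantined.append(record)
    return quarantined


def fake_bulk_index(batch: Sequence[dict]) -> None:
    """Hypothetical stand-in for the indexer: fails on non-ASCII titles."""
    for record in batch:
        if not record["title"].isascii():
            raise ValueError(f"cannot index record {record['id']}")


if __name__ == "__main__":
    queue = [
        {"id": 1, "title": "Plain title"},
        {"id": 2, "title": "João"},
        {"id": 3, "title": "Another plain title"},
    ]
    bad = index_with_quarantine(queue, fake_bulk_index)
    print([r["id"] for r in bad])
```

Retrying one record at a time costs extra requests, but only for batches that have already failed, and it isolates the offending records so the rest of the batch still gets indexed.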

I've backported that to one of our large 15.04 sites, and since then I've taken a look at the data in the records that have caused Elasticsearch to choke. They all contain non-ASCII characters, i.e. Unicode characters. These can be as simple as "e with an accent over it", all the way up to exotic ones like emoji and the Unicode snowman.

I was not able to replicate this when testing on my local machine, but it certainly occurs on our production servers, and bugs such as Bug 1408577 make me think it's probably present on some other servers as well.

Aaron Wells (u-aaronw) wrote :

For testing purposes, here are a few sample words (in page titles, artefact titles, and user names) that have caused Elasticsearch to choke:

João
Jiménez
Māori

It's not clear from our situation whether the problem lies in our Elasticsearch setup, or in Mahara's code. I think it may be something peculiar to our server setup because I haven't been able to replicate the problem on my local machine.

Robert Lyon (robertl-9) wrote :
Robert Lyon (robertl-9)
Changed in mahara:
milestone: 16.04.1 → 16.10.0
Robert Lyon (robertl-9)
Changed in mahara:
milestone: 16.10.0 → 16.10.1
Robert Lyon (robertl-9)
no longer affects: mahara/17.04
Changed in mahara:
milestone: 16.10.1 → 17.04.0
no longer affects: mahara/15.04
no longer affects: mahara/1.9
no longer affects: mahara/1.10
Kristina Hoeppner (kris-hoeppner) wrote :

Similar report at bug #1408577

Changed in mahara:
milestone: 17.04.0 → 17.10.0
no longer affects: mahara/15.10
no longer affects: mahara/16.04
no longer affects: mahara/16.10
Kristina Hoeppner (kris-hoeppner) wrote :

We'll look at this issue when upgrading Elasticsearch to a newer version. Apparently, Elasticsearch can have a few problems with certain languages.

Robert Lyon (robertl-9) wrote :

I notice that on my local machine Mahara 17.04+ saves

João Jiménez Māori

in the title fields as 'João Jiménez Māori'
and in the description fields as 'João Jiménez Māori'

But on cluster machines in 16.10 it saves

in the title fields as 'Jo<C3><A3>o Jim<C3><A9>nez M<C4><81>ori'
in the description fields as 'Jo&atilde;o Jim&eacute;nez M<C4><81>ori'

If I do a

 SELECT 'João Jiménez Māori' AS test;

Both show the special characters correctly.

But if I do

 UPDATE view SET description = '<p>João Jiménez Māori</p>' where id = 10;

My local machine stores it as in the local example above, but the cluster stores it as in the cluster example above.

So the cluster's PostgreSQL setup must differ in the way it handles special UTF-8 characters.
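To make the two cluster renderings easier to read, here is a small Python sketch that reproduces the notation. It rests on assumptions not confirmed in this report: that `<C3><A3>` is each non-ASCII UTF-8 byte printed as hex in angle brackets, and that the description field had HTML 4 named entities applied before storage. The helper names `show_raw_bytes` and `html4_entities` are made up for illustration. It shows what the strings encode, not why the cluster behaves differently.

```python
from html.entities import codepoint2name  # HTML 4.01 named entities


def show_raw_bytes(text: str) -> str:
    """Print ASCII bytes as-is and every other UTF-8 byte as <XX>."""
    return "".join(
        chr(b) if b < 0x80 else f"<{b:02X}>" for b in text.encode("utf-8")
    )


def html4_entities(text: str) -> str:
    """Replace non-ASCII characters that have an HTML 4 named entity."""
    return "".join(
        f"&{codepoint2name[ord(c)]};"
        if ord(c) > 127 and ord(c) in codepoint2name
        else c
        for c in text
    )


if __name__ == "__main__":
    sample = "João Jiménez Māori"
    # Title-style rendering: raw UTF-8 bytes for every accented letter.
    print(show_raw_bytes(sample))
    # Description-style rendering: entities where HTML 4 defines one,
    # raw bytes for the rest (HTML 4 has no named entity for "ā").
    print(show_raw_bytes(html4_entities(sample)))
```

Under these assumptions the mixed form in the description field comes from "ã" and "é" having HTML 4 entity names while "ā" does not, so "ā" alone stays as raw bytes.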

Robert Lyon (robertl-9) wrote :

Will push this problem out to a later milestone to see whether Elasticsearch, once we are using elasticsearch-php, fixes things.

Changed in mahara:
milestone: 17.10.0 → 18.04.0
Changed in mahara:
status: Confirmed → Incomplete
Robert Lyon (robertl-9) wrote :

Testing this with elasticsearch-php 5.x and Elasticsearch 5.x, I was able to index items called 'João Jiménez Māori' and find them again by searching.

Robert Lyon (robertl-9)
Changed in mahara:
status: Incomplete → Fix Committed
Robert Lyon (robertl-9)
Changed in mahara:
status: Fix Committed → Fix Released