Elasticsearch choking on non-ASCII characters

Bug #1487274 reported by Aaron Wells on 2015-08-21
This bug affects 2 people
Affects: Mahara
Importance: High
Assigned to: Unassigned

Bug Description

In 15.10 I've added code to "quarantine" records that Elasticsearch won't index. That is, if Elasticsearch errors out while processing a batch of records, then I re-try each record individually. And if it errors out while processing one of those individual records, I mark the record as quarantined, and keep it in the search_elasticsearch_queue table.
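The retry-then-quarantine flow described above can be sketched roughly as follows. This is only an illustration in Python, not Mahara's actual PHP code: `index_with_quarantine` and `fake_bulk_index` are hypothetical names, and the fake indexer's rule of rejecting any non-ASCII record is a stand-in for whatever makes the real Elasticsearch request fail.

```python
from typing import Callable, List, Sequence


def index_with_quarantine(
    records: Sequence[str],
    bulk_index: Callable[[Sequence[str]], None],
) -> List[str]:
    """Try the whole batch first; on failure retry one record at a time.

    Returns the records that still fail on their own, i.e. the ones that
    would be marked as quarantined and left in the queue table.
    """
    try:
        bulk_index(records)
        return []  # whole batch indexed, nothing to quarantine
    except Exception:
        pass  # fall through to the per-record retry

    quarantined: List[str] = []
    for record in records:
        try:
            bulk_index([record])
        except Exception:
            quarantined.append(record)
    return quarantined


def fake_bulk_index(batch: Sequence[str]) -> None:
    """Hypothetical indexer: fails on any non-ASCII record (assumption)."""
    for record in batch:
        if not record.isascii():
            raise ValueError(f"cannot index {record!r}")


quarantined = index_with_quarantine(
    ["Alice", "João", "Bob", "Māori"], fake_bulk_index
)
print(quarantined)
```

The point of retrying individually is that one bad record no longer blocks the rest of its batch from being indexed.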

I've backported that to one of our large 15.04 sites, and since then I've taken a look at the data in the records that have caused Elasticsearch to choke. They all contain non-ASCII characters, i.e. characters outside the basic ASCII range. These can be as simple as "e with an accent over it", all the way up to exotic ones like emoji and the Unicode snowman.

I was not able to replicate this when testing on my local machine, but it is certainly happening on our production servers, and reports such as bug #1408577 make me think it's probably also present on some other servers.

Aaron Wells (u-aaronw) wrote :

For testing purposes, here are a few sample words (in page titles, artefact titles, and user names) that have caused Elasticsearch to choke:

João
Jiménez
Māori

It's not clear from our situation whether the problem lies in our Elasticsearch setup, or in Mahara's code. I think it may be something peculiar to our server setup because I haven't been able to replicate the problem on my local machine.

Robert Lyon (robertl-9) on 2016-06-08
Changed in mahara:
milestone: 16.04.1 → 16.10.0
Robert Lyon (robertl-9) on 2016-10-20
Changed in mahara:
milestone: 16.10.0 → 16.10.1
Robert Lyon (robertl-9) on 2016-10-21
no longer affects: mahara/17.04
Changed in mahara:
milestone: 16.10.1 → 17.04.0
no longer affects: mahara/15.04
no longer affects: mahara/1.9
no longer affects: mahara/1.10

Similar report at bug #1408577

Changed in mahara:
milestone: 17.04.0 → 17.10.0
no longer affects: mahara/15.10
no longer affects: mahara/16.04
no longer affects: mahara/16.10

We'll look at this issue when upgrading Elasticsearch to a newer version. Apparently, Elasticsearch can have a few problems with certain languages.

Robert Lyon (robertl-9) wrote :

I notice that on my local machine Mahara 17.04+ saves

João Jiménez Māori

in the title fields as 'João Jiménez Māori'
and in the description fields as 'João Jiménez Māori'

But on cluster machines in 16.10 it saves

in the title fields as 'Jo<C3><A3>o Jim<C3><A9>nez M<C4><81>ori'
in the description fields as 'Jo&atilde;o Jim&eacute;nez M<C4><81>ori'

If I do a

 SELECT 'João Jiménez Māori' AS test;

They both show the special characters correctly.

But if I do

 UPDATE view SET description = '<p>João Jiménez Māori</p>' where id = 10;

My local machine stores it as in the local example above, but the cluster stores it as in the cluster example above.

So the PostgreSQL setup on the cluster must differ in the way it handles non-ASCII UTF-8 characters.
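For anyone comparing the two outputs, here is a small Python sketch of how the cluster strings relate to the original text. It assumes the `<C3><A3>` notation is simply a terminal or pager printing each raw non-ASCII UTF-8 byte in hex, and that the `&atilde;` form is ordinary HTML entity encoding; `show_raw_bytes` is a hypothetical helper written for this illustration.

```python
import html


def show_raw_bytes(text: str) -> str:
    """Render non-ASCII UTF-8 bytes as <XX>, leaving ASCII bytes readable."""
    parts = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            parts.append(chr(byte))       # plain ASCII byte
        else:
            parts.append(f"<{byte:02X}>")  # one byte of a multi-byte sequence
    return "".join(parts)


sample = "João Jiménez Māori"

# Byte-level view of the correctly encoded UTF-8 string.
rendered = show_raw_bytes(sample)
print(rendered)

# HTML entities decode back to the same accented characters.
decoded = html.unescape("Jo&atilde;o Jim&eacute;nez")
print(decoded)
```

If that assumption holds, the title field on the cluster contains valid UTF-8 that is merely displayed byte by byte, which would point at a client or display encoding difference rather than corrupted data.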

Robert Lyon (robertl-9) wrote :

Will push this problem out to a later milestone to see whether Elasticsearch, accessed through elasticsearch-php, fixes things.

Changed in mahara:
milestone: 17.10.0 → 18.04.0
Changed in mahara:
status: Confirmed → Incomplete
Robert Lyon (robertl-9) wrote :

Testing this with elasticsearch-php 5.x and Elasticsearch 5.x, I was able to index items called 'João Jiménez Māori' and find them again through search.

Robert Lyon (robertl-9) on 2018-03-06
Changed in mahara:
status: Incomplete → Fix Committed
Robert Lyon (robertl-9) on 2018-04-05
Changed in mahara:
status: Fix Committed → Fix Released