Quoter (de)serialize UTF8 data fails on CentOS 5.6

Bug #932327 reported by Daniel Nichter
This bug affects 2 people
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Invalid
Importance: Medium
Assigned to: Brian Fraser
Milestone: none

Bug Description

CentOS 5.6, Perl 5.8.8:

not ok 46 - Serialize [ニ\,è,a][!!!!__!*`,`\]
# Failed test 'Serialize [ニ\,è,a][!!!!__!*`,`\]'
# in lib/Quoter.t at line 197.
# Structures begin differing at:
# $got->[0] = 'ニ\,è,a'
# $expected->[0] = 'ニ\,è,a'

That's the test with UTF8 data. Maybe it's a false-positive due to how Perl 5.8 or CentOS 5.6 handles UTF8, but in any case the test is failing.

tags: added: charset
Daniel Nichter (daniel-nichter) wrote :

DBD::mysql 3.0007
DBI 1.52

Changed in percona-toolkit:
status: New → Confirmed
milestone: none → 2.0.4
tags: added: all-tools
Changed in percona-toolkit:
status: Confirmed → In Progress
Daniel Nichter (daniel-nichter) wrote :

The root problem, iirc, is that DBD::mysql 3.x does not properly encode or set a flag for utf8 data (Brian knows the details). So utf8 goes into MySQL one way and comes out another, hence the failing tests. DBD::mysql 4.x does not have this problem.
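
As a rough illustration of why $got and $expected can print identically yet compare unequal, here is a minimal Python sketch. It is an analogy, not the Quoter or DBD::mysql code: Python has no utf8 flag, so decoding the same bytes as UTF-8 versus latin1 stands in (by assumption) for Perl's flagged and unflagged views of one string. All variable names are made up for the example.

```python
# Analogy only: the two decodings below stand in for Perl's "flagged"
# and "unflagged" views of the same underlying bytes.
original = "ニ\\,è,a"                      # the value from the failing test
wire_bytes = original.encode("utf-8")      # the bytes as stored or sent

flagged = wire_bytes.decode("utf-8")       # bytes interpreted as UTF-8 text
unflagged = wire_bytes.decode("latin-1")   # every byte treated as one character

# The two views are different strings of different lengths...
print(flagged == original)
print(unflagged == original)
print(len(flagged), len(unflagged))

# ...yet written back out byte-for-byte they are indistinguishable,
# which is why a terminal can show the same text for both.
same_output = unflagged.encode("latin-1") == flagged.encode("utf-8")
print(same_output)
```

If the analogy holds, a comparison in the test fails even though the diagnostic output shows two identical-looking values.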

A working although not perfect solution is:

- if DBD::mysql::VERSION ge '4.000', just quotemeta (the original code); no encode/decode is needed because DBD::mysql 4+ and quotemeta work with utf8
- else (DBD::mysql 3.x): encode if value ($res) is_utf8 and then always decode the $part

So this solution really only applies to DBD::mysql 3.x with utf8 encoded strings. It seems to work because tests show that decoding a string, even if it was not encoded and even if it's latin1, does not garble the string. There was debate whether this is reliable. Imo and based on my understanding of utf8, a latin1 string cannot be mistaken for utf8 because of the way utf8 uses leading and trailing bytes with special high-order bits. But, iirc, Brian thinks it is possible that just the right combination of latin1 chars could be mistaken for a utf8 char.
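
The disagreement above can be checked directly. The sketch below is in Python rather than Perl and uses a hypothetical helper name (looks_like_utf8). It shows that a typical latin1 string is rejected by a strict UTF-8 decoder, but that at least one two-character latin1 sequence is byte-for-byte valid UTF-8, which appears to be the kind of case Brian had in mind.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# latin1 "è,a" is b'\xe8,a': 0xE8 is a UTF-8 lead byte with no
# continuation bytes after it, so strict decoding fails.
typical = "\u00e8,a".encode("latin-1")

# latin1 "Ã¨" (U+00C3, U+00A8) is b'\xc3\xa8', which is also exactly
# the UTF-8 encoding of "è" (U+00E8).
ambiguous = "\u00c3\u00a8".encode("latin-1")

print(looks_like_utf8(typical))
print(looks_like_utf8(ambiguous))
print(ambiguous.decode("utf-8") == "\u00e8")
```

So an unconditional decode is safe for most latin1 text but not for all of it, and arbitrary binary data stored in a character column makes such collisions more likely.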

In any case, this seems to be the only simple, non-invasive solution for DBD::mysql 3.x and the tests work so I think it's worth trying. Plus, the current code (with only quotemeta) is clearly failing on DBD::mysql 3.x with utf8 strings, so even if this solution isn't perfect, it's slightly better.

tags: added: dbd-mysql utf8
Daniel Nichter (daniel-nichter) wrote :

I'm going to untarget this from 2.0.4 because the issue is too subtle to fix easily. Baron noted: "I think the issue is that people can put binary data into what we think is a latin1 character. A latin1 character can't be mistaken for a utf8 character, but a lot of people put non-characters into their 'character' columns."

Changed in percona-toolkit:
milestone: 2.0.4 → none
tags: removed: utf8
Baron Schwartz (baron-xaprb) wrote :
Daniel Nichter (daniel-nichter) wrote :
Changed in percona-toolkit:
status: In Progress → Invalid
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-472
