pt-archiver --bulk-insert may corrupt data

Reported by Alex Geis on 2013-02-16
This bug affects 1 person
Affects: Percona Toolkit
Importance: High
Assigned to: Brian Fraser

Bug Description

This bug seems to appear whenever all of the following conditions hold for a table-to-table archive copy:

1. The tables have a TEXT field that includes utf8 characters, such as foreign-language text
2. --charset utf8 is used
3. --bulk-insert is used

When all three conditions are true, pt-archiver immediately returns with a "Wide character" error on line 3950. This looks similar to bug #940253 and appears to be linked to the utf8 encoding of the temporary file created for the bulk load, as the problem goes away as soon as I turn off --bulk-insert or set --no-check-charset. For now, I'm sacrificing speed (about 3x) by not using --bulk-insert to get around this problem; otherwise I would need to run pt-table-sync once finished to repair all rows with encoding mismatches.
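A minimal sketch of the suspected failure mode, assuming the error comes from printing decoded (character) rows to a temp file that has no encoding layer. The sample row and the 'demo' suffix are made up for illustration; this is not pt-archiver's own code.

```perl
use strict;
use warnings;
use File::Temp;

# A decoded character string with code points above 0xFF (hypothetical row).
my $row = "caf\x{e9} \x{4e2d}\x{6587}\n";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Case 1: no encoding layer on the handle. Perl cannot downgrade the
# string to bytes, so print emits "Wide character in print".
my $raw = File::Temp->new( SUFFIX => 'demo' )
   or die "Cannot open temp file: $!\n";
print {$raw} $row;
$raw->flush;

# Case 2: the handle is told to encode characters as UTF-8 on output.
my $utf8 = File::Temp->new( SUFFIX => 'demo' )
   or die "Cannot open temp file: $!\n";
binmode($utf8, ':utf8');
print {$utf8} $row;
$utf8->flush;

print "warnings seen: ", scalar(@warnings), "\n";
print "first warning: $warnings[0]" if @warnings;
```

Only the first print should warn; the second handle has an output layer, so the same string is written without complaint.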

Brian Fraser (fraserbn) on 2013-02-21
Changed in percona-toolkit:
status: New → Confirmed
Alex Geis (ageis) wrote :

In pt-archiver version 2.1.8, the error has changed to:
Wide character in print at /usr/local/bin/pt-archiver line 5840.

Alex Geis (ageis) wrote :

I appear to have solved this problem with the following additions:
100k insert on 2.1.8 build without bulk-insert: 53.4850s
100k insert with edits with bulk-insert w/o error: 7.7311s

# diff -u /usr/local/bin/pt-archiver /usr/local/bin/pt-archiver.1

--- /usr/local/bin/pt-archiver 2013-03-28 05:30:27.071386414 -0400
+++ /usr/local/bin/pt-archiver.1 2013-03-28 06:03:48.391462519 -0400
@@ -5740,7 +5740,7 @@
       require File::Temp;
       $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
          or die "Cannot open temp file: $OS_ERROR\n";
- binmode($bulkins_file, ':utf8');
+ binmode($bulkins_file,":utf8");
    }

    # This row is the first row fetched from each 'chunk'.
@@ -5967,7 +5967,8 @@
          if ( $o->get('bulk-insert') ) {
             $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
                or die "Cannot open temp file: $OS_ERROR\n";
- }
+ binmode($bulkins_file,":utf8");
+ }
       } # no next row (do bulk operations)
       else {
          PTDEBUG && _d('Got another row in this chunk');

Alex Geis (ageis) wrote :

I couldn't edit my previous comment, and it looks like there was an extra line in the 2.1.8 build on my volume. Here is a better diff:

# diff -u /usr/local/bin/pt-archiver /usr/local/bin/pt-archiver.1

--- /usr/local/bin/pt-archiver 2013-03-28 06:11:27.143479965 -0400
+++ /usr/local/bin/pt-archiver.1 2013-03-28 06:03:48.391462519 -0400
@@ -5740,6 +5740,7 @@
       require File::Temp;
       $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
          or die "Cannot open temp file: $OS_ERROR\n";
+ binmode($bulkins_file,":utf8");
    }

    # This row is the first row fetched from each 'chunk'.
@@ -5966,7 +5967,8 @@
          if ( $o->get('bulk-insert') ) {
             $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
                or die "Cannot open temp file: $OS_ERROR\n";

+ binmode($bulkins_file,":utf8");
    }
       } # no next row (do bulk operations)
       else {
          PTDEBUG && _d('Got another row in this chunk');

Changed in percona-toolkit:
milestone: none → 2.2.2
tags: added: charset pt-archiver
Brian Fraser (fraserbn) wrote :

Possible workaround for previous versions: Try running the tool as

$ perl -Mopen=utf8 /path/to/pt-archiver ...

But not setting the encoding on the bulk-insert filehandle is a glaring oversight. This will be fixed in 2.2.2.

Brian Fraser (fraserbn) wrote :

Having looked more into this, I have to amend my previous, overly optimistic message. Please do not use that workaround and, at least until 2.2.2, do not use --bulk-insert with anything besides binary data or latin1: it may corrupt your data by double-encoding it.
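To make "double-encoding" concrete, here is a small sketch using the core Encode module. It assumes the corruption arises when bytes that are already UTF-8 are treated as latin1 characters and encoded a second time; the exact path inside pt-archiver and DBD::mysql may differ.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes {
   my ($bytes) = @_;
   return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $chars = "\x{e9}";                   # one character: e with acute accent
my $once  = encode('UTF-8', $chars);    # correct UTF-8 bytes

# The mistake: read the UTF-8 bytes as if they were latin1 characters,
# then encode to UTF-8 again.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

print "encoded once:  ", hex_bytes($once),  "\n";   # C3 A9
print "encoded twice: ", hex_bytes($twice), "\n";   # C3 83 C2 A9
```

The twice-encoded bytes are still valid UTF-8, which is why this kind of corruption can go unnoticed until the data is read back.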

There were two issues here: first, the missing encoding on the bulk-insert filehandle, and second, a missing 'CHARACTER SET ...' clause in the LOAD DATA LOCAL INFILE statement. Once this is properly fixed in trunk, I'll try posting a workaround for previous versions of pt-archiver here.
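The second issue can be sketched as follows. The helper name build_load_data_sql, its arguments, and the file path are hypothetical and are not taken from pt-archiver; the sketch only shows where MySQL's CHARACTER SET clause sits in a LOAD DATA LOCAL INFILE statement, so that the server reads the file in the same encoding it was written in.

```perl
use strict;
use warnings;

# Hypothetical helper: build a LOAD DATA statement, adding a
# CHARACTER SET clause only when a charset was requested.
sub build_load_data_sql {
   my (%args) = @_;
   my ($file, $table, $cols, $charset) = @args{qw(file table cols charset)};

   my $sql = "LOAD DATA LOCAL INFILE '$file' INTO TABLE $table";
   $sql   .= " CHARACTER SET $charset" if $charset;
   $sql   .= ' (' . join(', ', @$cols) . ')';
   return $sql;
}

my $sql = build_load_data_sql(
   file    => '/tmp/rows.txt',          # made-up path
   table   => '`db`.`dest`',            # made-up table
   cols    => [ '`id`', '`body`' ],
   charset => 'utf8',
);
print "$sql\n";
```

Without the clause, the server interprets the file according to its character_set_database setting, which may not match the encoding used when the file was written.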

Changed in percona-toolkit:
assignee: nobody → Brian Fraser (fraserbn)
Brian Fraser (fraserbn) on 2013-04-02
Changed in percona-toolkit:
importance: Undecided → High
Brian Fraser (fraserbn) on 2013-04-16
Changed in percona-toolkit:
status: Confirmed → Fix Committed
Alex Geis (ageis) wrote :

Appreciate you getting this one fixed for 2.2.2. This was a huge one for our workflow. Many thanks!

Changed in percona-toolkit:
status: Fix Committed → In Progress
Brian Fraser (fraserbn) on 2013-04-19
summary: - pt-archiver wide character
+ pt-archiver --charset and --bulk-insert fail, may corrupt data
summary: - pt-archiver --charset and --bulk-insert fail, may corrupt data
+ pt-archiver --bulk-insert may corrupt data
tags: added: dbd-mysql risk
Changed in percona-toolkit:
status: Fix Committed → Fix Released