pt-archiver --bulk-insert may corrupt data

Bug #1127450 reported by Alex Geis
This bug affects 1 person
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Fix Released
Importance: High
Assigned to: Brian Fraser
Milestone: 2.2.2

Bug Description

This bug seems to pop up whenever the following conditions hold for a table-to-table archive copy:

1. The tables have a TEXT field that includes utf8 characters, such as foreign-language text
2. --charset utf8 is used
3. --bulk-insert is used

When these 3 conditions are true, pt-archiver immediately exits with a Wide Character error on line 3950. This seems similar to bug #940253 and appears to be linked to utf8 encoding of the temporary bulk-load insert file that pt-archiver creates: the problem goes away immediately when I turn off --bulk-insert or set --no-check-charset. For now, I'm sacrificing speed (about 3x) by not using --bulk-insert to work around this problem; otherwise I need to run pt-table-sync once finished to repair all rows with encoding mismatches.
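
For reference, here is a minimal sketch of how this class of error arises; this is not pt-archiver code, and the variable names are made up. Printing a string that contains a code point above 0xFF to a filehandle with no encoding layer triggers Perl's "Wide character in print" warning, while setting a ':utf8' layer first makes the same print clean:

use strict;
use warnings;
use File::Temp;

my $fh  = File::Temp->new( SUFFIX => 'pt-archiver' );
my $row = "sm\x{263A}ley";   # decoded string with a code point above 0xFF

print $fh "$row\n";          # warns: Wide character in print at ...

binmode($fh, ':utf8');       # add an encoding layer, as the fix below does
print $fh "$row\n";          # the same print is now clean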

Brian Fraser (fraserbn)
Changed in percona-toolkit:
status: New → Confirmed
Alex Geis (ageis) wrote :

In pt-archiver version 2.1.8, the error has changed to:
Wide character in print at /usr/local/bin/pt-archiver line 5840.

Alex Geis (ageis) wrote :

Looks like I've solved this problem with the following additions...
100k insert on the stock 2.1.8 build without bulk-insert: 53.4850s
100k insert on the edited build with bulk-insert, no error: 7.7311s

# diff -u /usr/local/bin/pt-archiver /usr/local/bin/pt-archiver.1

--- /usr/local/bin/pt-archiver 2013-03-28 05:30:27.071386414 -0400
+++ /usr/local/bin/pt-archiver.1 2013-03-28 06:03:48.391462519 -0400
@@ -5740,7 +5740,7 @@
       require File::Temp;
       $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
          or die "Cannot open temp file: $OS_ERROR\n";
-      binmode($bulkins_file, ':utf8');
+      binmode($bulkins_file,":utf8");
    }

    # This row is the first row fetched from each 'chunk'.
@@ -5967,7 +5967,8 @@
          if ( $o->get('bulk-insert') ) {
             $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
                or die "Cannot open temp file: $OS_ERROR\n";
-         }
+            binmode($bulkins_file,":utf8");
+         }
       } # no next row (do bulk operations)
       else {
          PTDEBUG && _d('Got another row in this chunk');

Alex Geis (ageis) wrote :

Couldn't edit my previous comment, and it looks like there was already a line in the 2.1.8 build on my volume. Better diff:

# diff -u /usr/local/bin/pt-archiver /usr/local/bin/pt-archiver.1

--- /usr/local/bin/pt-archiver 2013-03-28 06:11:27.143479965 -0400
+++ /usr/local/bin/pt-archiver.1 2013-03-28 06:03:48.391462519 -0400
@@ -5740,6 +5740,7 @@
       require File::Temp;
       $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
          or die "Cannot open temp file: $OS_ERROR\n";
+      binmode($bulkins_file,":utf8");
    }

    # This row is the first row fetched from each 'chunk'.
@@ -5966,7 +5967,8 @@
          if ( $o->get('bulk-insert') ) {
             $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
               or die "Cannot open temp file: $OS_ERROR\n";
+            binmode($bulkins_file,":utf8");
          }
       } # no next row (do bulk operations)
       else {
          PTDEBUG && _d('Got another row in this chunk');
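
A quick round-trip check of the patched behavior (again a sketch, not pt-archiver code; the row data is made up): with the binmode line in place, a multi-byte row written to the bulk-insert temp file reads back intact.

use strict;
use warnings;
use File::Temp;

my $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
   or die "Cannot open temp file: $!\n";
binmode($bulkins_file, ':utf8');   # the one-line fix from the diff above

print $bulkins_file "r\x{e9}sum\x{e9}\t\x{65e5}\x{672c}\n";   # a utf8 row
close $bulkins_file;

open my $check, '<:utf8', $bulkins_file->filename
   or die "Cannot reopen temp file: $!\n";
my $line = <$check>;
print length($line) == 10 ? "round-trip ok\n" : "corrupted\n";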

Changed in percona-toolkit:
milestone: none → 2.2.2
tags: added: charset pt-archiver
Brian Fraser (fraserbn) wrote :

Possible workaround for previous versions: Try running the tool as

$ perl -Mopen=utf8 /path/to/pt-archiver ...

But this aside, not setting the encoding on the bulk-insert filehandle is a glaring oversight. It will be fixed in 2.2.2.
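
For the curious, -Mopen=utf8 is roughly equivalent to starting the script with the pragma below, which makes ':utf8' the default PerlIO layer for every filehandle opened in the main script's scope, including the bulk-insert temp file (though, as the next comment explains, this turned out to be too blunt an instrument):

# Roughly what perl -Mopen=utf8 does, expressed as a pragma:
use open IO => ':utf8';   # default ':utf8' layer for handles opened in this scope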

Brian Fraser (fraserbn) wrote :

Having looked more into this, I have to amend my previous, overly optimistic message. Please do not use that workaround, and, at least until 2.2.2, do not use --bulk-insert with anything besides binary data / latin1 -- it may corrupt your data by double-encoding things.

There were two issues here: first, the missing encoding layer on the bulk-insert filehandle, and second, a missing 'CHARACTER SET ...' clause in the LOAD DATA LOCAL INFILE statement. Once this is properly fixed in trunk, I'll try posting a workaround for previous versions of pt-archiver here.
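
Putting the two issues together, here is a hedged sketch of what the complete fix has to do; the connection options, table name, and row data below are hypothetical, not the actual pt-archiver patch:

use strict;
use warnings;
use DBI;
use File::Temp;

# mysql_enable_utf8 roughly corresponds to running with --charset utf8;
# mysql_local_infile is needed for LOAD DATA LOCAL INFILE.
my $dbh = DBI->connect(
   'DBI:mysql:database=archive_db;mysql_local_infile=1',
   'user', 'pass',
   { RaiseError => 1, mysql_enable_utf8 => 1 },
);

my $bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' );
binmode($bulkins_file, ':utf8');            # issue 1: encode the filehandle
print $bulkins_file "r\x{e9}sum\x{e9}\n";   # decoded row is written as utf8 bytes
close $bulkins_file;

# Issue 2: declare the file's character set, so the server does not
# reinterpret the utf8 bytes in its default charset and double-encode.
my $sql = sprintf(
   'LOAD DATA LOCAL INFILE %s INTO TABLE `archive_tbl` CHARACTER SET utf8',
   $dbh->quote( $bulkins_file->filename )
);
$dbh->do($sql);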

Changed in percona-toolkit:
assignee: nobody → Brian Fraser (fraserbn)
Brian Fraser (fraserbn)
Changed in percona-toolkit:
importance: Undecided → High
Brian Fraser (fraserbn)
Changed in percona-toolkit:
status: Confirmed → Fix Committed
Alex Geis (ageis) wrote :

Appreciate you getting this one fixed for 2.2.2. This was a huge one for our workflow. Many thanks!

Changed in percona-toolkit:
status: Fix Committed → In Progress
Brian Fraser (fraserbn)
summary: - pt-archiver wide character
+ pt-archiver --charset and --bulk-insert fail, may corrupt data
Daniel Nichter (daniel-nichter) wrote : Re: pt-archiver --charset and --bulk-insert fail, may corrupt data
Changed in percona-toolkit:
status: In Progress → Fix Committed
summary: - pt-archiver --charset and --bulk-insert fail, may corrupt data
+ pt-archiver --bulk-insert may corrupt data
tags: added: dbd-mysql risk
Changed in percona-toolkit:
status: Fix Committed → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-354
