pt-archiver --bulk-insert may corrupt data
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Percona Toolkit moved to https://jira.percona.com/projects/PT | Fix Released | High | Brian Fraser | |
Bug Description
This bug seems to appear whenever the following conditions hold for a table-to-table archive copy:
1. The tables have a TEXT field that includes utf8 characters, such as foreign-language text
2. --charset utf8 is used
3. --bulk-insert is used
When all 3 conditions are true, pt-archiver immediately exits with a Wide Character error on line 3950. This looks similar to bug #940253 and appears to be linked to the utf8 encoding of the temporary bulk-load file that pt-archiver creates: the problem goes away as soon as I turn off --bulk-insert or set --no-check-charset. For now, I'm sacrificing speed (about 3x) by not using --bulk-insert to get around this problem. Otherwise I would need to run pt-table-sync afterwards to repair all rows with encoding mismatches.
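The error class described above can be reproduced outside pt-archiver. The following is a minimal sketch, not pt-archiver's code: it assumes the rows arrive as Perl character strings (as they typically would with --charset utf8) and that the temp file is opened without an encoding layer. The helper name `write_row` is made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: print one character string to a temp file, optionally
# after applying an I/O layer, and return the warnings Perl raised.
sub write_row {
    my ($text, $layer) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my ($fh, $filename) = tempfile(UNLINK => 1);
    binmode($fh, $layer) if $layer;
    print {$fh} $text, "\n";
    close $fh or die "Cannot close temp file: $!\n";

    return \@warnings;
}

# Three characters above U+00FF, standing in for a TEXT column value.
my $text = "\x{65E5}\x{672C}\x{8A9E}";

my $warn_raw  = write_row($text);                      # no encoding layer
my $warn_utf8 = write_row($text, ':encoding(UTF-8)');  # explicit layer

printf "no layer:   %d warning(s)\n", scalar @$warn_raw;
printf "utf8 layer: %d warning(s)\n", scalar @$warn_utf8;
print  "first warning: $warn_raw->[0]" if @$warn_raw;
```

Without a layer, Perl emits "Wide character in print"; with the layer set through binmode, the same print is silent.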
Related branches
- Daniel Nichter: Approve on 2013-04-12
- Diff: 161 lines (+77/-19), 3 files modified
  bin/pt-archiver (+29/-19)
  t/pt-archiver/bulk_insert.t (+36/-0)
  t/pt-archiver/samples/bug_1127450.sql (+12/-0)
Changed in percona-toolkit: | |
status: | New → Confirmed |
Alex Geis (ageis) wrote : | #1 |
Alex Geis (ageis) wrote : | #2 |
I appear to have solved this problem with the following additions:
100k insert on 2.1.8 build without bulk-insert: 53.4850s
100k insert with edits with bulk-insert w/o error: 7.7311s
# diff -u /usr/local/
--- /usr/local/
+++ /usr/local/
@@ -5740,7 +5740,7 @@
require File::Temp;
or die "Cannot open temp file: $OS_ERROR\n";
- binmode(
+ binmode(
}
# This row is the first row fetched from each 'chunk'.
@@ -5967,7 +5967,8 @@
if ( $o->get(
or die "Cannot open temp file: $OS_ERROR\n";
- }
+ binmode(
+ }
} # no next row (do bulk operations)
else {
PTDEBUG && _d('Got another row in this chunk');
Alex Geis (ageis) wrote : | #3 |
I couldn't edit my previous comment, and it looks like there was an extra line in the 2.1.8 build on my volume. A better diff:
# diff -u /usr/local/
--- /usr/local/
+++ /usr/local/
@@ -5740,6 +5740,7 @@
require File::Temp;
or die "Cannot open temp file: $OS_ERROR\n";
+ binmode(
}
# This row is the first row fetched from each 'chunk'.
@@ -5966,7 +5967,8 @@
if ( $o->get(
or die "Cannot open temp file: $OS_ERROR\n";
+ binmode(
}
} # no next row (do bulk operations)
else {
PTDEBUG && _d('Got another row in this chunk');
Changed in percona-toolkit: | |
milestone: | none → 2.2.2 |
tags: | added: charset pt-archiver |
Brian Fraser (fraserbn) wrote : | #4 |
Possible workaround for previous versions: Try running the tool as
$ perl -Mopen=utf8 /path/to/
But not setting the encoding on the bulk-insert filehandle is a glaring oversight. This will be fixed in 2.2.2.
Brian Fraser (fraserbn) wrote : | #5 |
Having looked more into this, I have to amend my previous, overly optimistic message. Please do not use that workaround and, at least until 2.2.2, do not use --bulk-insert with anything besides binary data or latin1: it may corrupt your data by double-encoding it.
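As a general illustration of what double-encoding means (a sketch using the core Encode module, not a trace of what pt-archiver does internally): if bytes that are already UTF-8 are treated as Latin-1 characters and encoded again, each non-ASCII character grows into a longer, wrong byte sequence.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex, e.g. "C3 A9".
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $char  = "\x{E9}";                 # U+00E9, e with acute accent
my $once  = encode('UTF-8', $char);   # correct UTF-8 bytes
my $twice = encode('UTF-8', $once);   # the bytes misread as characters, encoded again

print "encoded once:  ", hex_bytes($once),  "\n";   # C3 A9
print "encoded twice: ", hex_bytes($twice), "\n";   # C3 83 C2 A9
```

A reader expecting UTF-8 decodes the second sequence as two characters instead of one, which is the kind of stored corruption that later needs row-by-row repair.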
There were two issues here: First, the missing encodings for the bulk-insert filehandle, and second, a missing 'CHARACTER SET ...' for the LOAD DATA LOCAL INFILE statement. Once this is properly fixed in trunk, I'll try posting a workaround for previous versions of pt-archiver here.
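The second issue can be sketched as follows. This is a hypothetical helper, not the statement pt-archiver builds: the sub name, table name, column names, and file path are all made up for illustration. It only shows where MySQL's `CHARACTER SET` clause sits in a `LOAD DATA LOCAL INFILE` statement, so that the server is told how to interpret the file instead of guessing.

```perl
use strict;
use warnings;

# Hypothetical helper (not pt-archiver's code): build a LOAD DATA statement
# that names the character set of the file explicitly.
sub build_load_data_sql {
    my (%args) = @_;
    my ($file, $table, $cols, $charset) = @args{qw(file table cols charset)};

    # Charset names are plain identifiers such as utf8 or latin1.
    die "Invalid charset: $charset\n"
        if defined $charset && $charset !~ /\A\w+\z/;

    (my $quoted_file = $file) =~ s/([\\'])/\\$1/g;   # escape backslash and quote
    my $col_list = join ', ', map { "`$_`" } @$cols;

    my $sql = "LOAD DATA LOCAL INFILE '$quoted_file' INTO TABLE `$table`";
    $sql .= " CHARACTER SET $charset" if $charset;
    $sql .= " ($col_list)";
    return $sql;
}

print build_load_data_sql(
    file    => '/tmp/bulk.txt',       # made-up path
    table   => 'archive_dest',        # made-up table
    cols    => [qw(id body)],         # made-up columns
    charset => 'utf8',
), "\n";
```

The clause has to agree with the encoding layer on the filehandle that wrote the file; setting only one of the two leaves room for the mismatch described in this comment.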
Changed in percona-toolkit: | |
assignee: | nobody → Brian Fraser (fraserbn) |
Changed in percona-toolkit: | |
importance: | Undecided → High |
Changed in percona-toolkit: | |
status: | Confirmed → Fix Committed |
Alex Geis (ageis) wrote : | #6 |
Appreciate you getting this one fixed for 2.2.2. This was a huge one for our workflow. Many thanks!
Changed in percona-toolkit: | |
status: | Fix Committed → In Progress |
summary: |
- pt-archiver wide character
+ pt-archiver --charset and --bulk-insert fail, may corrupt data |
Daniel Nichter (daniel-nichter) wrote : Re: pt-archiver --charset and --bulk-insert fail, may corrupt data | #7 |
Another fix was made for this: http://
Changed in percona-toolkit: | |
status: | In Progress → Fix Committed |
summary: |
- pt-archiver --charset and --bulk-insert fail, may corrupt data
+ pt-archiver --bulk-insert may corrupt data |
tags: | added: dbd-mysql risk |
Changed in percona-toolkit: | |
status: | Fix Committed → Fix Released |
Shahriyar Rzayev (rzayev-sehriyar) wrote : | #8 |
Percona now uses JIRA for bug reports so this bug report is migrated to: https:/
In pt-archiver version 2.1.8, the error has changed to: bin/pt-archiver line 5840.
Wide character in print at /usr/local/