Percona Toolkit moved to https://jira.percona.com/projects/PT

Quoter::serialize_list() doesn't handle multiple NULL values

Bug #1087319 reported by Daniel Nichter on 2012-12-06

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Percona Toolkit moved to https://jira.percona.com/projects/PT	Fix Released	Medium	Brian Fraser	Percona Toolkit moved to https://jira.percona.com/projects/PT 2.2.1

Bug Description

Quoter::seralize_list() dies on multiple NULL values because at first we couldn't represent multiple NULLs because <,> could be two empty strings or two NULLs, and <NULL,NULL> could be two literal string values of "NULL". But there is a way to do this: MySQL does it when it writes outfiles (SELECT INTO OUTFILE) by representing \N as NULL (so \\N is a literal "\N" string).

See original description

Tags:

Related branches

lp:~percona-toolkit-dev/percona-toolkit/fix-1087319-quoter-multiple-nulls

Merged into lp:percona-toolkit/2.2 at revision 527

Daniel Nichter: Approve on 2013-02-19

Brian Fraser (community): Approve on 2013-02-18

Daniel Nichter (daniel-nichter) on 2012-12-06

description:	updated
description:	updated

Brian Fraser (fraserbn) on 2012-12-11

Changed in percona-toolkit:
assignee:	nobody → Brian Fraser (fraserbn)

Brian Fraser (fraserbn) on 2012-12-12

Changed in percona-toolkit:
status:	Triaged → In Progress

Daniel Nichter (daniel-nichter) on 2013-02-19

Changed in percona-toolkit:
importance:	Undecided → Medium
status:	In Progress → Fix Committed

Daniel Nichter (daniel-nichter) on 2013-02-19

tags:

added: pt-online-schema-change pt-table-checksum pt-table-sync value-quoting
removed: all-tools

Revision history for this message

Daniel Nichter (daniel-nichter) wrote on 2013-02-19:

Download full text (5.1 KiB)

This affects pt-online-schema-change pt-table-checksum, and pt-table-sync. The first two use NibbleIterator; pt-table-sync uses the --replicate table, hence the serialized boundary values, from pt-table-checksum.

In 2.1, the relevant subs (Quoter::serialize_list() and Quoter::deserialize_list()) acted like:

"""
die "Cannot serialize multiple values with undef/NULL"
if grep { !defined $_ } @args;

return join ',', map { quotemeta } @args;
"""

then,

"""
   my @escaped_parts = $string =~ /
         \G # Start of string, or end of previous match.
         ( # Each of these is an element in the original list.
            [^\\,]* # Anything not a backslash or a comma
            (?: # When we get here, we found one of the above.
               \\. # A backslash followed by something so we can continue
               [^\\,]* # Same as above.
            )* # Repeat zero of more times.
         )
         , # Comma dividing elements
      /sxgc;

push @escaped_parts, pos($string) ? substr( $string, pos($string) ) : $string;

   my @unescaped_parts = map {
      my $part = $_;
      # Here be weirdness. Unfortunately quotemeta() is broken, and exposes
      # the internal representation of scalars. Namely, the latin-1 range,
      # \128-\377 (\p{Latin1} in newer Perls) is all escaped in downgraded
      # strings, but left alone in UTF-8 strings. Thus, this.

      # TODO: quotemeta() might change in 5.16 to mean
      # qr/(?=\p{ASCII})\W|\p{Pattern_Syntax}/
      # And also fix this whole weird behavior under
      # use feature 'unicode_strings' -- If/once that's
      # implemented, this will have to change.
      my $char_class = utf8::is_utf8($part) # If it's a UTF-8 string,
                     ? qr/(?=\p{ASCII})\W/ # We only care about non-word
                                             # characters in the ASCII range
                     : qr/(?=\p{ASCII})\W|[\x{80}-\x{FF}]/; # Otherwise,
                                             # same as above, but also
                                             # unescape the latin-1 range.
      $part =~ s/\\($char_class)/$1/g;
      $part;
   } @escaped_parts;

return @unescaped_parts;
"""

So rather complex. The UTF-8 stuff was motivated by a bug in DBD::mysql 3.0007 which CentOS 5 uses: it doesn't flag UTF-8 data as UTF-8. Brian says stuff was quoted because "so that we didn't run afoul of any perl / MYsql INTERACTIONS like inserting '\n' and getting back "\n"".

As for the UTF-8 bug, we tested and discuseed and decided that it should be ok, i.e. it won't affect anything. I'll blog about this later.

As for quoting, I don't think it's necessary because boundary values are straight from the table, so we save exactly whatever we got.

Consequently, this branch, which will be the new standard in 2.2, works like:

"""
   my @parts;
   foreach my $arg ( @args ) {
      if ( defined $arg ) {
         $arg =~ s/,/\\,/g; # escape commas
         $arg =~ s/\\N/\\\\N/g; # escape literal \N
         push @parts, $arg;
      }
      else {
         push @parts, '\N';
      }
   }

my $string = join(',',...

This affects pt-online-schema-change pt-table-checksum, and pt-table-sync.  The first two use NibbleIterator; pt-table-sync uses the --replicate table, hence the serialized boundary values, from pt-table-checksum.

In 2.1, the relevant subs (Quoter::serialize_list() and Quoter::deserialize_list()) acted like:

"""
   die "Cannot serialize multiple values with undef/NULL"
      if grep { !defined $_ } @args;

return join ',', map { quotemeta } @args;
"""

then,

"""
   my @escaped_parts = $string =~ /
         \G             # Start of string, or end of previous match.
         (              # Each of these is an element in the original list.
            [^\\,]*     # Anything not a backslash or a comma
            (?:         # When we get here, we found one of the above.
               \\.      # A backslash followed by something so we can continue
               [^\\,]*  # Same as above.
            )*          # Repeat zero of more times.
         )
         ,              # Comma dividing elements
      /sxgc;

push @escaped_parts, pos($string) ? substr( $string, pos($string) ) : $string;

# TODO: quotemeta() might change in 5.16 to mean
      # qr/(?=\p{ASCII})\W|\p{Pattern_Syntax}/
      # And also fix this whole weird behavior under
      # use feature 'unicode_strings' --  If/once that's
      # implemented, this will have to change.
      my $char_class = utf8::is_utf8($part)  # If it's a UTF-8 string,
                     ? qr/(?=\p{ASCII})\W/   # We only care about non-word
                                             # characters in the ASCII range
                     : qr/(?=\p{ASCII})\W|[\x{80}-\x{FF}]/; # Otherwise,
                                             # same as above, but also
                                             # unescape the latin-1 range.
      $part =~ s/\\($char_class)/$1/g;
      $part;
   } @escaped_parts;

return @unescaped_parts;
"""

So rather complex.  The UTF-8 stuff was motivated by a bug in DBD::mysql 3.0007 which CentOS 5 uses: it doesn't flag UTF-8 data as UTF-8.  Brian says stuff was quoted because "so that we didn't run afoul of any perl / MYsql INTERACTIONS like inserting '\n' and getting back "\n"".

As for the UTF-8 bug, we tested and discuseed and decided that it should be ok, i.e. it won't affect anything.  I'll blog about this later.

As for quoting, I don't think it's necessary because boundary values are straight from the table, so we save exactly whatever we got.

Consequently, this branch, which will be the new standard in 2.2, works like:

"""
   my @parts;
   foreach my $arg  ( @args ) {
      if ( defined $arg ) {
         $arg =~ s/,/\\,/g;      # escape commas
         $arg =~ s/\\N/\\\\N/g;  # escape literal \N
         push @parts, $arg;
      }
      else {
         push @parts, '\N';
      }
   }

my $string = join(',', @parts);
   return $string;
"""

then,

"""
   my @parts;
   foreach my $arg ( split(/(?<!\\),/, $string) ) {
      if ( $arg eq '\N' ) {
         $arg = undef;
      }
      else {
         $arg =~ s/\\,/,/g;
         $arg =~ s/\\\\N/\\N/g;
      }
      push @parts, $arg;
   }

if ( !@parts ) {
      # Perl split() won't split ",,", so handle it manually.
      my $n_empty_strings = $string =~ tr/,//;
      $n_empty_strings++;
      PTDEBUG && _d($n_empty_strings, 'empty strings');
      map { push @parts, '' } 1..$n_empty_strings;
   }
   elsif ( $string =~ m/(?<!\\),$/ ) {
      PTDEBUG && _d('Last value is an empty string');
      push @parts, '';
   }

return @parts;
"""

So the first notable thing is this fix for this bug: NULL MySQL values are serialized as literal \N, just like SELECT INTO OUTFILE.  Unlike that, however, since we can't tab-separate the values (well, we could, but it would be messy in a column value), we comma-separate the values, which means that literal commas have to be escaped.  So "split(/(?<!\\),/, $string)" means "split on commas that are not preceded (i.e. escaped by) a black-slash."  Similarly, we have to (un)escape literal \N.  This all is well-tested and works.

As for UTF-8: there's no special handling.  It's the user's responsibility to know that if their data is UTF-8, then

1) They should create the --replicate table with CHARSET=utf8
2) They should run the tool with -A utf8 (A=utf8, --charset utf8, etc.)

Perl and MySQL handle the rest, and that DBD::mysql bug does not affect anything (so far at least) because even though Perl thinks its UTF-8 data is Latin1, it sends/receives the same sequence of bytes which MySQL correctly knows and handles as UTF-8 (because of the aforementioned two user responsibilities).

The results are positive: whereas before a value like "Hello world!" would have been stored as "Hello world\!", i.e. not the same value, now it's stored as the exact same value.  The only exception to this is escaped commas and literal \N, but there's no way around that.

Daniel Nichter (daniel-nichter) on 2013-03-22

Changed in percona-toolkit:
status:	Fix Committed → Fix Released

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-24:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-599

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.