pt-online-schema-change aborts on --execute while --dry-run succeeds ("Use of uninitialized value in printf", line 8489)

Bug #1693614 reported by Xan Charbonnet
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Fix Released
High
Carlos Salguero

Bug Description

percona-toolkit 3.0.3-1.wheezy

MariaDB 5.5.56 on this machine, one slave is also MariaDB 5.5.56 (in a master-master), the other is MariaDB 10.1.23. I've also tried this from the other master, where both slaves are MariaDB 5.5.56.

Running pt-online-schema-change like so:
-------------------
pt-online-schema-change --recursion-method dsn="D=percona,t=dsns_master" --execute --alter "ADD COLUMN blocked_from_crawl BOOL DEFAULT FALSE" D=atoz,t=c_bibliographies
-------------------

Results in:
-------------------
Found 2 slaves:
soft7 -> other-master:3307
Use of uninitialized value in printf at /usr/bin/pt-online-schema-change line 8489.
-------------------
and exit code 255.

Running the same thing (which has worked many many times in the past, by the way) with dry-run works just fine.

-------------------
pt-online-schema-change --recursion-method dsn="D=percona,t=dsns_master" --dry-run --alter "ADD COLUMN blocked_from_crawl BOOL DEFAULT FALSE" D=atoz,t=c_bibliographies
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Starting a dry run. `atoz`.`c_bibliographies` will not be altered. Specify --execute instead of --dry-run to alter the table.
Creating new table...
Created new table atoz._c_bibliographies_new OK.
Altering new table...
Altered `atoz`.`_c_bibliographies_new` OK.
Not creating triggers because this is a dry run.
Not copying rows because this is a dry run.
Not swapping tables because this is a dry run.
Not dropping old table because this is a dry run.
Not dropping triggers because this is a dry run.
2017-05-25T15:41:00 Dropping new table...
2017-05-25T15:41:00 Dropped new table OK.
Dry run complete. `atoz`.`c_bibliographies` was not altered.
-------------------

I'd like to try version 3.0.2, because I suspect this broke in the recent update, but it seems that the repository only has the latest version of the 3.x branch.

Tags: pt148
Revision history for this message
Carlos Salguero (carlos-salguero) wrote :

Hi,

Could you clarify how is your config please?

MariaDB 5.5.56 on this machine, one slave is also MariaDB 5.5.56 (in a master-master), the other is MariaDB 10.1.23 ...

which ones are the masters?

I've tried to reproduce it but it is not giving me an error.

Revision history for this message
Xan Charbonnet (xan-biblionix) wrote :

Thanks very much for looking into this.

This machine is a master, as is one of the slaves. They're in a master/master configuration. The other slave is just a slave.

Here's the DSN table which my command refers to:

MariaDB [percona]> select * from dsns_master;
+----+-----------+---------------------------------------------------+
| id | parent_id | dsn |
+----+-----------+---------------------------------------------------+
| 2 | NULL | h=other-master,F=/etc/mysql/my.cnf.percona,P=3307 |
| 3 | NULL | h=localhost,F=/etc/mysql/my.cnf.percona |
+----+-----------+---------------------------------------------------+

"other-master" has an entry to /etc/hosts pointing it to 127.0.0.1. I'm using SSH tunneling to have the other master's database appear on localhost port 3307.

This all may sound a little weird, but I've used pt-online-schema-change many times over many years in this configuration with no problem.

Revision history for this message
Xan Charbonnet (xan-biblionix) wrote :

If you can recommend a way for me to get pt-online-schema-change version 3.0.2, then that might shed some light. I'm pretty sure it worked with that version. Then I might be able to track down what's throwing it off.

Revision history for this message
Xan Charbonnet (xan-biblionix) wrote :

I found a copy. I'll try it and report back here.

Revision history for this message
Xan Charbonnet (xan-biblionix) wrote :

Found it. The problem was that the following:
printf("%s -> %s:%s\n", $cxn->name(), $cxn->{dsn}->{h}, $cxn->{dsn}->{P});
dies when $cxn->{dsn}->{P} is undefined. In my case, it is, because it's connecting to localhost via a UNIX socket. No port number.

I can work around it by specifying 3306 in my DSN, which forces TCP instead of UNIX sockets. Then everything works.

A fix is to do something like:
$cxn->{dsn}->{P} || 'socket'
instead of simply
$cxn->{dsn}->{P}
in that printf.

One issue is that the same printf appears three different times in the code. It might make sense to centralize that. I'm attaching a diff that I believe fixes everything.

Revision history for this message
Xan Charbonnet (xan-biblionix) wrote :
Revision history for this message
Carlos Salguero (carlos-salguero) wrote :

Thanks for the fix you submitted. I have one more question.
Is that slave, the one producing the error, a regular slave? Why is it not using a port?

Revision history for this message
Xan Charbonnet (xan-biblionix) wrote :

The one producing the error is actually the local machine. It is a slave in the system, since this is a master-master configuration. I specify it because I want pt-online-schema-change to check for replication delays on both masters, not just one.

tags: added: pt148
Revision history for this message
Carlos Salguero (carlos-salguero) wrote :
Changed in percona-toolkit:
status: New → Fix Committed
assignee: nobody → Carlos Salguero (carlos-salguero)
milestone: none → 3.0.4
importance: Undecided → High
Revision history for this message
Xan Charbonnet (xan-biblionix) wrote :

Thanks very much for committing the fix!

Revision history for this message
Stanley C (stanleyxcnx) wrote :

I actually just ran into this issue running pt-online-schema-change against a master with one slave attached and --check-slave-lag. Command options are:

pt-online-schema-change -F ./.my.cnf \
        --chunk-time=30 \
        --progress=time,30 \
        --check-slave-lag h=xxx.xxx.xxx.xxx,u=username,p='secret' \
        --max-lag=120 \
        --recursion-method=none \
 --statistics \
        --execute

I didn't specify the port number to check-slave-lag because I didn't think I needed to because slave accepts connection on standard mysql port.

Both master and slave are mysql 5.1.x

I am not sure printing 'Socket' would be correct in my case.

Revision history for this message
Stanley C (stanleyxcnx) wrote :

Instead of print out host and port from dsn structure, print dsn_name of cxn instead. That way, it's consistent with debug print from connect() subroutine of Cxn package. In connect() subroutine, line 3889: PTDEBUG && _d($dbh, 'Connected dbh to', $self->{hostname},$self->{dsn_name});

So replace line 8538:

printf("%s -> %s:%s\n", $cxn->name(), $cxn->{dsn}->{h}, $cxn->{dsn}->{P});

with:

printf("%s -> %s\n", $cxn->name(), $cxn->{dsn_name});

Revision history for this message
John A. Barbuto (jbarbuto) wrote :

This issue still appears when running with PTDEBUG=1, fixed with the attached patch. There's probably a cleaner fix possible, but I wasn't able to use the new Cxn->description() method since I don't see a Cxn instance in the scope of the affected methods.

Changed in percona-toolkit:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-403

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.