[FR] Add recursion-method=dsn in pt-table-sync

Bug #1253537 reported by Przemek on 2013-11-21
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Percona Toolkit
Frank Cizmich

Bug Description

In some cases, available recursion methods in pt-table-sync are not capable of discovering the slave hosts. The mysqlsandbox replication sandbox is a good example, where both TCP port and report_host are custom:

master [localhost] {msandbox} ((none)) > show processlist;
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | Rows_read |
| 1 | rsandbox | localhost:49943 | NULL | Binlog Dump | 73470 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL | 0 | 0 | 0 |
| 2 | rsandbox | localhost:49944 | NULL | Binlog Dump | 73468 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL | 0 | 0 | 0 |
| 125 | msandbox | localhost | NULL | Query | 0 | NULL | show processlist | 0 | 0 | 0 |
3 rows in set (0.00 sec)

master [localhost] {msandbox} ((none)) > show slave hosts;
| Server_id | Host | Port | Master_id |
| 101 | SBslave1 | 23189 | 1 |
| 102 | SBslave2 | 23190 | 1 |
2 rows in set (0.00 sec)

In the same case, you can successfully use --recursion-method=dsn=... in pt-table-checksum.

Please, implement --recursion-method=dsn also in pt-table-sync.

Thank you for the reasonable feature request

Changed in percona-toolkit:
status: New → Confirmed
Matthew B (utdrmac) wrote :

Is there a workaround? This seems more like a bug.

Frank Cizmich (frank-cizmich) wrote :

It is possible to use pt-table-checksum first, with --recursion-method=dsn ,
and then run pt-table-sync using the --replicate option to use the data gathered in the checksums table


I think --recursion-method=dsn should be ported from pt-table-checksum I have a case (using tungsten replicator) where SHOW PROCESSLIST and SHOW SLAVE HOSTS doesn't work so using --replicate is not working.

Using PTDEBUG=1 throws:

# MasterSlave:6641 21976 Recursion methods: processlist hosts
# MasterSlave:6651 21976 Connected to h=masterxxx.net
# MasterSlave:6660 21976 SELECT @@SERVER_ID
# MasterSlave:6662 21976 Working on server ID 4000
# TableChecksum:4931 21976 SELECT db, tbl, CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM finance.cksums_finance_db_server WHERE master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)
# MasterSlave:6695 21976 Looking for slaves on h=masterxxx.net using methods processlist hosts
# MasterSlave:6702 21976 Finding slaves with _find_slaves_by_processlist
# MasterSlave:6764 21976 DBI::db=HASH(0x6f4a2b0) SHOW GRANTS FOR CURRENT_USER()
# MasterSlave:6794 21976 DBI::db=HASH(0x6f4a2b0) SHOW PROCESSLIST
# MasterSlave:6702 21976 Finding slaves with _find_slaves_by_hosts
# MasterSlave:6736 21976 DBI::db=HASH(0x6f4a2b0) SHOW SLAVE HOSTS
# MasterSlave:6707 21976 Found 0 slaves
# pt_table_sync:11009 21976 Disconnected dbh DBI::db=HASH(0x6f4a2b0)
# pt_table_sync:11009 21976 Disconnected dbh DBI::db=HASH(0x6f4fbb0)

However pt-table-checksum --replicate-check-only --replicate=percona.cksums reports differences.

Fernando Ipar (fipar) wrote :

Frank: That is not a workaround, at least not in all cases.

A typical use of recursion-method=dsns for me is when I want to exclude some replicas from being monitored by pt-table-checksum. Imagine intentionally delayed replicas (via pt-slave-delay): I don't want pt-table-checksum to wait for those to catch up before proceeding with the checksum process. Another case is when replicas are using non standard ports, or they don't all use the same credentials.

So if master M has replicas R1, R2 and R3, of which R3 is delayed, I'll usually create the percona.dsns table and only insert rows for R1 and R2.

That works well for pt-table-checksum, but since pt-table-sync does not support the option, if the delayed replica R3 is also on a non standard port and/or using different credentials, pt-table-sync will fail with an error like this:

[root@master ~]# pt-table-sync --print --replicate percona.checksums --databases dbname localhost
Cannot connect to h=some.example.ip.address

I think the only workaround would be to run --sync-to-master from some.example.ip.address directly, but in some cases, that can fail too (i.e. if the replication connection from replica to master is via an ssh tunnel, meaning the master is also on localhost but on a different port. It seems in that case pt-table-sync, when trying to connect to the master, connects to localhost:3306, which is the slave itself).

That means that, to me, there is no reliable workaround for this problem.

Changed in percona-toolkit:
status: Confirmed → Triaged
importance: Undecided → Wishlist
assignee: nobody → Frank Cizmich (frank-cizmich)
Frank Cizmich (frank-cizmich) wrote :

After examining both pt-table-checksum and pt-table-table-sync code it seems a direct transfer of this feature is non trivial, being both code bases very different (implemented years appart).
Will continue examining the posibility though since the request is reasonable.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers