pt-archiver misses data when only copying data

Bug #1096274 reported by Pritam Mungse on 2013-01-05
This bug affects 3 people
Affects: Percona Toolkit
Importance: High
Assigned to: Carlos Salguero

Bug Description

When using the '--no-delete' option in pt-archiver (to only copy data to the destination), if the column in the WHERE clause can have duplicate values, pt-archiver will miss records while copying.
e.g. create a table with the following columns:
id, timestamp
data:
| 001-25941882 | 1280490242452 |
| 001-25941883 | 1280490242452 |
| 001-25941884 | 1280490242452 |
| 001-25941885 | 1280490242452 |
| 001-25941886 | 1280495048106 |
| 001-25941887 | 1280495048108 |
| 001-25942993 | 1280495048108 |
| 001-25942994 | 1280495048108 |
| 001-25942995 | 1280490776956 |
| 001-25942996 | 1280493558772 |

With a batch size of 2 (--limit 2) and the --commit-each option, pt-archiver misses copying a few records.
pt-archiver version: 2.1.7
MySQL version: 5.1.40
command:
/usr/bin/pt-archiver --source h=127.0.0.1,P=3306,D=test_db_1,u=root,p=xxx,t=test,i=idx_timestamp,A=utf8 --dest h=127.0.0.1,P=3306,D=test_db_2,u=root,p=xxx,t=test,A=utf8 --where "created < 1280493558772" --limit 2 --sentinel /tmp/sentinel --set-vars "SQL_MODE=''" --no-delete --commit-each --skip-foreign-key-checks --no-safe-auto-increment --why-quit --statistics --progress 2

It's pretty easy to reproduce. It happens because the optimized SQL for the next chunk adds (in the case above) "timestamp > [max value from the last run]" to the WHERE clause, which causes the next run to miss the remaining rows that share that maximum value.
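The effect can be sketched with a small, hypothetical simulation (this is not pt-archiver's actual code): paginating chunks with a strict > on a non-unique column skips every remaining row that shares the previous chunk's maximum value.

```python
# Hypothetical simulation of chunked pagination over a NON-UNIQUE column,
# mimicking the optimized next-chunk clause "col > [max value of last chunk]".
def paginate_strict(rows, limit):
    """Copy sorted rows in chunks; each chunk starts strictly after the
    maximum value seen in the previous chunk."""
    copied, last = [], None
    while True:
        chunk = [r for r in rows if last is None or r > last][:limit]
        if not chunk:
            return copied
        copied += chunk
        last = chunk[-1]  # max value of this chunk becomes the next boundary

timestamps = [100, 100, 100, 100, 200, 300]  # duplicates at a chunk boundary
print(len(paginate_strict(timestamps, limit=2)))  # 4 of 6 rows: two are lost
```

With unique values the same loop copies every row, which is why the bug only shows up when the ascending column admits duplicates.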

Daniel Nichter (daniel-nichter) wrote :

Thanks for looking into this. We'll try to reproduce it on our end too. Bug 820079 is similar, though unconfirmed yet.

tags: added: risk
Changed in percona-toolkit:
status: New → Triaged
Karoly Bujtor (bujtor-karoly) wrote :

It might be caused by the WHERE expression in the pt-archiver file at line 5768, which uses a less-than comparison instead of less-than-or-equal.

walito (walito) wrote :

I have the same problem.
I'm trying to copy a whole table from serverA to serverB. The process copies all rows except the last one, which is not copied, so I have to insert that last row manually.
I tested with pt-archiver 2.2.4 and 2.2.8.

Steps
- In serverA
mysql> use sample
Database changed
mysql> show tables;
+------------------+
| Tables_in_sample |
+------------------+
| random_data      |
+------------------+
1 row in set (0.00 sec)

mysql> select count(*) from random_data;
+----------+
| count(*) |
+----------+
|    30000 |
+----------+
1 row in set (0.06 sec)

mysql> select * from random_data order by id desc limit 5;
+-------+-------+-------+-------+-------+
| id    | c1    | c2    | c3    | c4    |
+-------+-------+-------+-------+-------+
| 30000 | 67938 | 80318 | 97774 | 47916 | <-- this row is not copied
| 29999 | 50083 | 36562 | 32562 | 53125 |
| 29998 | 74207 | 20502 | 79891 | 37951 |
| 29997 | 69171 | 88837 | 36671 | 16844 |
| 29996 | 75947 | 45070 | 97510 | 52340 |
+-------+-------+-------+-------+-------+
5 rows in set (0.00 sec)

- Run pt-archiver in serverA
pt-archiver --source h=localhost,D=sample,t=random_data --where "1=1" --dest h=192.168.88.22 --user sync --ask-pass --txn-size 10000 --no-delete --progress 10000 --limit 10000
Enter password:
Enter password:
TIME                ELAPSED   COUNT
2014-06-05T08:18:00       0       0
2014-06-05T08:18:10       9   10000
2014-06-05T08:18:20      19   20000
2014-06-05T08:18:29      28   29999

- Test serverB
mysql> select * from random_data order by id desc limit 5;
+-------+-------+-------+-------+-------+
| id    | c1    | c2    | c3    | c4    |
+-------+-------+-------+-------+-------+
| 29999 | 50083 | 36562 | 32562 | 53125 |
| 29998 | 74207 | 20502 | 79891 | 37951 |
| 29997 | 69171 | 88837 | 36671 | 16844 |
| 29996 | 75947 | 45070 | 97510 | 52340 |
| 29995 | 82478 | 74942 | 27276 | 11555 |
+-------+-------+-------+-------+-------+
5 rows in set (0.00 sec)

mysql> select count(*) from random_data;
+----------+
| count(*) |
+----------+
|    29999 |
+----------+
1 row in set (0.02 sec)

Thanks in advance

Code in recent version, 2.2.14, still says:

   5804     if ( $o->get('safe-auto-increment')
   5805        && $sel_stmt->{index}
   5806        && scalar(@{$src->{info}->{keys}->{$sel_stmt->{index}}->{cols}}) == 1
   5807        && $src->{info}->{is_autoinc}->{
   5808              $src->{info}->{keys}->{$sel_stmt->{index}}->{cols}->[0]
   5809           }
   5810     ) {
   5811        my $col = $q->quote($sel_stmt->{scols}->[0]);
   5812        my ($val) = $dbh->selectrow_array("SELECT MAX($col) FROM $src->{db_tbl}");
   5813        $first_sql .= " AND ($col < " . $q->quote_val($val) . ")";
   5814     }

So the "last" row (the one holding the maximum auto-increment value) is missing.
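The code quoted above explains walito's result. A hypothetical sketch (not pt-archiver code) of its effect: the first SELECT gains "AND (id < MAX(id))", so the row holding the current maximum auto-increment value is never selected.

```python
# Hypothetical sketch of the safe-auto-increment guard quoted above:
# a strict "<" against the table's MAX(id) excludes the newest row.
ids = [29996, 29997, 29998, 29999, 30000]   # walito's top five rows
max_id = max(ids)                           # SELECT MAX(id) FROM random_data
selected = [i for i in ids if i < max_id]   # strict "<", as on line 5813
print(selected)  # 30000 is left behind
```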

tags: added: i57076
Dave Gregg (dgregg) wrote :

I have a similar issue: when there are records remaining that number LESS than the LIMIT, they are not copied.

In the following there are 20111 records in the data set matching the WHERE condition, but I only end up with 20000 records copied over.

pt-archiver --source h=10.0.10.140,D=db,t=tbl,u=perconatool,p=pswd --dest h=10.0.2.116,D=db,t=tbl_2014Q3_Cont,u=perconatool,p=pswd --where "DTime<='2014-08-19' and DTime>'2014-07-30'" --limit 500 --txn-size 500 --statistics --progress 500 --max-lag 10 --retries 10 --no-delete --password pswd --check-slave-lag h=10.0.10.137 --check-slave-lag h=10.0.10.138 --check-slave-lag h=10.0.10.139

I have played with this for a while now. It seems that every time the LIMIT is set higher than the number of remaining records, it copies over everything except the very last records that fall under the LIMIT.

Also, if you attempt to copy over a small number of records - say 100 - and your LIMIT is set to 500, the tool will not copy ANYTHING! It just hangs!

Dave Gregg (dgregg) wrote :

Also, this does not matter whether you are just copying or trying to delete: it behaves the same, leaving behind the remaining records that number less than the LIMIT value.

I tried ROW by ROW and I tried BULK; both give the same result. Records are left behind!

Dave Gregg (dgregg) wrote :

Here is my example BULK COPY and DELETE -

pt-archiver --source h=10.0.10.140,D=db,t=table,u=perconatool,p=pswd --dest h=10.0.2.116,D=DB,t=table,u=perconatool,p=pswd,L=yes --where "DTime<='2014-08-21' and DTime>'2014-07-30'" --limit 400 --txn-size 400 --statistics --progress 1000 --max-lag 10 --retries 10 --bulk-insert --bulk-delete --password pswd --check-slave-lag h=10.0.10.137 --check-slave-lag h=10.0.10.138 --check-slave-lag h=10.0.10.139

Dave Gregg (dgregg) wrote :

Another thing: the tool just HANGS after it finishes the copy or delete. It hangs forever; you need to kill it.

Frank Cizmich (frank-cizmich) wrote :

I hear you, Dave. As soon as I have a moment I'll look into this.
Meanwhile, if you can provide a debug log that would be great, e.g.: PTDEBUG=1 pt-archiver params > logfile 2>&1

Dave Gregg (dgregg) wrote :

Also, this is 2.2.14, the latest release.
Interestingly, sometimes the jobs just hang right away, as below; after 10 minutes it times out.
THEN, if I restart the job, it will work??????

TIME                ELAPSED   COUNT
2015-07-31T16:33:14       0       0
Exhausted retries while bulk_deleting; rolling back 0 rows.
2015-07-31T16:33:15       0     400
Started at 2015-07-31T16:33:14, ended at 2015-07-31T16:42:36
Source: D=DB,h=10.0.10.140,p=...,t=table,u=perconatool
Dest: D=DB,L=yes,h=10.0.2.116,p=...,t=Summary_2014Q3_Cont,u=perconatool
SELECT 400
INSERT 400
DELETE 0
Action          Count       Time    Pct
select              2     0.5062   0.09
rollback            2     0.0568   0.01
bulk_inserting      1     0.0427   0.01
commit              4     0.0020   0.00
print_bulkfile    400     0.0013   0.00
other               0   561.6044  99.89

Changed in percona-toolkit:
assignee: nobody → Daniel Nichter (daniel-nichter)
importance: Undecided → High
status: Triaged → In Progress

I am not able to reproduce any of these reports, but I don't have enough info to do a true reproduction. To reproduce this, we need the full table structure (SHOW CREATE TABLE) being used. Two pseudo-table structures were provided, but indexes are critical, and I think that's probably the difference. PTDEBUG output might help, too.

In various attempts, the latest pt-archiver copied exactly the number of rows it should have. When it leaves a single row, this is due to not using --no-safe-auto-increment.

I could not verify this claim either: "if you attempt to copy over a small number of records - say 100 and your LIMIT is set to 500 - the tool will not Copy ANYTHING!". For example, using MySQL's sakila example database:

$ ./pt-archiver --source F=/tmp/12345/my.sandbox.cnf,D=sakila,t=rental --dest D=sakila,t=dst --limit 2000 --statistics --progress 2000 --no-delete --no-check-charset --where "rental_date < '2005-06-01'" --no-safe-auto-increment
TIME                ELAPSED   COUNT
2015-11-04T09:27:48       0       0
2015-11-04T09:27:49       0    1156
Started at 2015-11-04T09:27:48, ended at 2015-11-04T09:27:49
Source: D=sakila,F=/tmp/12345/my.sandbox.cnf,t=rental
Dest: D=sakila,F=/tmp/12345/my.sandbox.cnf,t=dst
SELECT 1156
INSERT 1156
DELETE 0

There are only 1156 rows in that range, fewer than the 2000 limit, but all rows are copied. Results are the same with --limit=5000 and with or without a matching --txn-size.

Changed in percona-toolkit:
status: In Progress → Incomplete
assignee: Daniel Nichter (daniel-nichter) → nobody
assignee: nobody → Daniel Nichter (daniel-nichter)
Jesse McLaughlin (nzjess) wrote :

Here's a test case that reproduces this issue:

# STEP 1. Create some test data (create table and inserts below):

$ mysql tmp

Server version: 5.5.37-MariaDB-log MariaDB Server
MariaDB [tmp]>

CREATE TABLE `tiger_pony` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `ref` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

insert into tiger_pony (ref) values ('Timber');
insert into tiger_pony (ref) values ('Shiny');

select * from tiger_pony;
+----+--------+
| id | ref    |
+----+--------+
|  1 | Timber |
|  2 | Shiny  |
+----+--------+

# STEP 2. Run the Percona archiver utility on the table:

$ /opt/percona-toolkit-2.2.17/bin/pt-archiver -u <user> -p <password> --source h=<host>,D=tmp,t=tiger_pony --where 'true' --no-delete --limit 1000 --commit-each --file ./tmp.txt

$ cat tmp.txt
1 Timber

# So the original select gives two rows, and the Percona archiver utility only outputs one.

Jesse McLaughlin (nzjess) wrote :

FYI,

With reference to earlier in this thread, the following patch fixes the test case above:

$ diff pt-archiver pt-archiver-patched
6247c6247
< $first_sql .= " AND ($col < " . $q->quote_val($val) . ")";
---
> $first_sql .= " AND ($col <= " . $q->quote_val($val) . ")";
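A hypothetical illustration (not pt-archiver code) of what this one-character patch changes: with a strict <, the row holding the table's maximum key value is excluded from the first chunk; with <=, it is included, which is exactly what the safe-auto-increment guard exists to prevent.

```python
# Hypothetical contrast of the "<" vs "<=" comparison from the patch above.
ids = [1, 2]                 # tiger_pony test case: two rows, max id = 2
max_id = max(ids)            # SELECT MAX(id) FROM tiger_pony

shipped = [i for i in ids if i < max_id]    # shipped code: max row excluded
patched = [i for i in ids if i <= max_id]   # patched code: max row included

print(shipped)  # row 2 ('Shiny') is never archived
print(patched)  # all rows archived, guard disabled
```

This is why the patch makes the nzjess test case pass while breaking the existing tests that assert the guard's behavior.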

Changed in percona-toolkit:
assignee: Daniel Nichter (daniel-nichter) → Carlos Salguero (carlos-salguero)
milestone: none → 2.2.20
status: Incomplete → Triaged
Changed in percona-toolkit:
status: Triaged → In Progress

That 'fix' breaks all the tests

Jesse McLaughlin (nzjess) wrote :

Although if the test case listed above were added to the current code, it would also break. So either way you have broken tests.

This is the default behavior.
From the docs:

--[no]safe-auto-increment
default: yes

Do not archive row with max AUTO_INCREMENT.

Adds an extra WHERE clause to prevent pt-archiver from removing the newest row when ascending a single-column AUTO_INCREMENT key. This guards against re-using AUTO_INCREMENT values if the server restarts, and is enabled by default.

The extra WHERE clause contains the maximum value of the auto-increment column as of the beginning of the archive or purge job. If new rows are inserted while pt-archiver is running, it will not see them.
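Given that default, the documented way to also copy the final row is to disable the guard with --no-safe-auto-increment. A hedged sketch based on walito's command (hosts and credentials are from his comment, not verified here):

```
pt-archiver --source h=localhost,D=sample,t=random_data --where "1=1" \
  --dest h=192.168.88.22 --user sync --ask-pass \
  --txn-size 10000 --limit 10000 --no-delete --progress 10000 \
  --no-safe-auto-increment
```

Since --no-delete leaves every source row in place, the restart-time AUTO_INCREMENT re-use that the guard protects against should not arise in this copy-only scenario.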

Changed in percona-toolkit:
status: In Progress → Fix Released
milestone: 2.2.20 → none
status: Fix Released → Won't Fix