pt-archiver dies with "MySQL server has gone away" when innodb_kill_idle_transaction set to low value and bulk insert/delete process takes too long time

Bug #1452895 reported by Sveta Smirnova
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Fix Released
Medium
Frank Cizmich

Bug Description

If innodb_kill_idle_transaction set to low value and bulk insert process takes too long time pt-archiver fails with "MySQL server has gone away" error.

How to repeat:

1. Modify pt-archiver as follow:

--- /home/sveta/Downloads/percona-toolkit-2.2.13/bin/pt-archiver 2015-05-07 23:49:27.674726547 +0300
+++ pt-archiver 2015-05-07 23:49:23.583956700 +0300
@@ -6198,6 +6198,7 @@
       if ( $get_sth->{Active} ) { # Fetch until exhausted
          $row = $get_sth->fetchrow_arrayref();
       }
+ sleep(1);
       if ( !$row ) {
          PTDEBUG && _d('No more rows in this chunk; doing bulk operations');

This way we imitate long running INSERT

2. Start MySQL server (cd mysql-test; ./mtr --start innodb &)

3.
USE test
CREATE TABLE user LIKE mysql.user;
ALTER TABLE user DROP PRIMARY KEY;
ALTER TABLE user ENGINE=InnoDB;
INSERT INTO user SELECT * FROM mysql.user;
INSERT INTO user SELECT * FROM user;
INSERT INTO user SELECT * FROM user;
INSERT INTO user SELECT * FROM user;
...
repeat until you have, say, 10000 rows.
ALTER TABLE user ADD id INT NOT NULL PRIMARY KEY AUTO_INCREMENT;
CREATE TABLE user_copy LIKE user;

4. Run pt-archiver:

 ~/Downloads/percona-toolkit-2.2.13/bin/pt-archiver --purge --no-delete --quick --no-check-charset --statistics --buffer --bulk-delete --primary-key-only --source h=127.0.0.1,P=13001,u=root,D=test,t=user --where="user='root'" --limit 200 --progress 200 --commit-each --dest h=127.0.0.1,P=13001,u=root,D=test,t=user_copy
TIME ELAPSED COUNT
2015-05-07T23:40:16 0 0
2015-05-07T23:43:39 202 200
DBD::mysql::db commit failed: MySQL server has gone away at /home/sveta/Downloads/percona-toolkit-2.2.13/bin/pt-archiver line 6453.

Suggested fix: don't allow connection to sleep for long time, run dummy queries or implement similar solutions.

Changed in percona-toolkit:
status: New → Confirmed
Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

Version 2.2.14

tags: added: i59430
Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

Better test case.

1. Create replication sandbox using MySQL Sandbox
2. Load employees example database (or any other)
3. Connect to master and run:

create database assistly;
use assistly
create table agent_sessions like employees.employees;
insert into agent_sessions select * from employees.employees;

4. Connect to slave1 (s1) and run:

create database backup;
use backup
create table agent_sessions_r like assistly.agent_sessions;

5. Ensure on all 3 servers innodb_kill_idle_transaction=1:

set global innodb_kill_idle_transaction=1;

6. Create load script:

cat pt-archiver-03.sh
#!/bin/bash

pr=127.0.0.1
sl=127.0.0.1
ba=127.0.0.1

u=msandbox
p=msandbox

for ti in \
agent_sessions:primary
do
t=${ti%:*}
i=${ti#*:}
date
PTDEBUG=0 /home/sveta/Downloads/percona-toolkit-2.2.15/bin/pt-archiver \
--purge \
--quick \
--no-check-charset \
--statistics \
--buffer \
--bulk-delete \
--max-lag 585016 \
--check-interval 1 \
--check-slave-lag h=$sl,P=21893,u=$u,p=$p \
--source h=$pr,P=21891,u=$u,p=$p,D=assistly,t=$t,i=$i \
--dest h=$ba,P=21892,u=$u,p=$p,D=backup,t=${t}_r \
--where="emp_no in (select a.emp_no from (select 53364 emp_no, sleep(120) sl from dual) a)" \
--limit 100 \
--progress 1000 \
--commit-each
date
done

exit 0

7. Run it:

./pt-archiver-03.sh
Thu Sep 17 01:55:08 EEST 2015
TIME ELAPSED COUNT
2015-09-17T01:55:08 0 0
^C^C^C^CExiting on SIGINT.
2015-09-17T01:55:08 0 0
Started at 2015-09-17T01:55:08, ended at 2015-09-17T01:57:08
Source: D=assistly,P=21891,h=127.0.0.1,i=primary,p=...,t=agent_sessions,u=msandbox
Dest: D=backup,P=21892,h=127.0.0.1,i=primary,p=...,t=agent_sessions_r,u=msandbox
SELECT 1
INSERT 0
DELETE 0
Action Count Time Pct
select 1 120.0006 100.00
commit 2 0.0002 0.00
other 0 0.0001 0.00

sveta@thinkie:~/SharedData/percona/issues/59430> ./pt-archiver-03.sh
Thu Sep 17 01:57:12 EEST 2015
TIME ELAPSED COUNT
2015-09-17T01:57:12 0 0
DBD::mysql::db commit failed: MySQL server has gone away at /home/sveta/Downloads/percona-toolkit-2.2.15/bin/pt-archiver line 6899.
Thu Sep 17 02:01:12 EEST 2015

summary: pt-archiver dies with "MySQL server has gone away" when
- innodb_kill_idle_transaction set to low value and bulk insert process
- takes too long time
+ innodb_kill_idle_transaction set to low value and bulk insert/delete
+ process takes too long time
Revision history for this message
Michael Wang (xw73) wrote :

The slow insert can be simulated by "lock tables <table_name> write" instead of modifying the code, I believe.

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Thanks for the test cases.

I'm attaching a modified version of the latest pt-archiver which does as suggested and fixed the problem in my tests.
It issues dummy 'keepalive' queries in two parts of the code, each of which independently caused trouble during my tests:
- the loading of the queries into the temp file (used for bulk-insert and bulk-delete)
- and after each bulk insert

This kept the "bulk delete" connection alive in the tests I've run, even with innodb_kill_idle_transaction = 1

Any feedback is appreciated.

Regards

Changed in percona-toolkit:
assignee: nobody → Frank Cizmich (frank-cizmich)
importance: Undecided → Medium
status: Confirmed → Fix Committed
milestone: none → 2.3.1
Changed in percona-toolkit:
milestone: 2.3.1 → 2.2.16
Changed in percona-toolkit:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-682

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.