pt-online-schema-change should set session-level lock_wait_timeout for schema change operations

Bug #1100557 reported by Geoffrey Anderson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Triaged
Undecided
Unassigned

Bug Description

Version: pt-online-schema-change 2.1.2
Mysqld version: Percona-Server-server-55-5.5.28-rel29.2.360.rhel5

Description:
When running pt-online-schema-change against a mysql instance, some of the operations it performs require having a table metadata lock. This cannot be granted until all locks (including read locks) are released on the table. Since operations in pt-online-schema change are affecting the schema of "live" tables, if it gets blocked waiting for a metadata lock, INSERT statements to the same table will be blocked and stack up behind it. If there's a long-running transaction (longer than 60 seconds, for instance), a busy mysql server will quickly stack up connections and potentially blow out its connection limit. I've observed that pt-online-schema-change requires an exclusive metadata lock when doing the following operations:
* creating triggers on source table that is being altered
* renaming the "new" and "old" tables after a schema change is complete

This is further exacerbated by the default setting of "lock_wait_timeout" being set to "31536000" (365 days). Since pt-online-schema-change obeys lock_wait_timeout and the fact that lock_wait_timeout is also a session-level variable, a suggested fix is to have pt-online-schema-change set a session level "lock_wait_timeout" to a relatively low number (say....60 seconds) and additionally have a parameter to change this value on the command-line.

Steps to reproduce:
1) Create a test table:
CREATE DATABASE test;
CREATE TABLE test2 (
  id int(10) unsigned not null auto_increment,
  c int(11) default null,
  primary key (id)
) engine=innodb auto_increment=5102195 default charset=latin1;

2) Start a transaction that will take a read lock on this table:
BEGIN;
USE test;
SELECT * FROM test2 LIMIT 1;

3) In a different window, begin concurrent inserts (these will insert and complete without being blocked by the transaction above):
for i in {1..1000}; do usleep 500; (mysql -S /var/lib/mysql/mysql.sock test -e "insert into test2 (c) VALUES(${i});" &) ; done

4) In a third window, while the concurrent inserts from above are running:
pt-online-schema-change --alter "add index idx_c (c)" --execute h=127.0.0.1,P=3306,u=root,p=testpass,D=test,t=test2 2>&1 | tee pt-online-schema-change.log

5) If you check the processlist on the server, you'll see inserts from step 2 stacking up behind the "CREATE TRIGGER" operation being performed by pt-online-schema-change in step 3.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Yes, the default value of lock_wait_timeout is 365 days.

However, to set any session level variable you can use --set-vars
(http://www.percona.com/doc/percona-toolkit/2.1/pt-online-schema-change.html )

Since lock_wait_timeout is also a session level variable, it
should do.

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

Where do you all see that the default for lock_wait_timeout is 365 days?

The tool sets lock wait timeout to 1s by default:

=item --lock-wait-timeout

type: int; default: 1

Set the session value of C<innodb_lock_wait_timeout>.

And for MySQL: http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout

tags: added: pt-online-schema-change
Changed in percona-toolkit:
status: New → Triaged
tags: added: locking
Revision history for this message
Geoffrey Anderson (geoff-geoffreyanderson) wrote :

@Raghavendra D Prabhu: Ah, that may actually provide a solution to what I'm seeing, I'll try that out and report back. Whether or not it works, I feel that the tool should perhaps be a little more proactive in this scenario and set the variable automatically since some of the operations the tool does can cause some nasty locking in a highly concurrent environment.

@Daniel Nichter: There are 2 "lock wait timeout" variables in play in the server. I experienced this problem on Percona Server 5.5.13-20 and found the following DEFINE in source is what "lock_wait_timeout" is set to: http://bazaar.launchpad.net/~percona-core/percona-server/5.5/view/head:/Percona-Server/sql/sql_const.h#L225

This variable doesn't seem to appear anywhere in the documentation. The other variable is "innodb_lock_wait_timeout" which is set to 50s by default (as per the documentation). I guess I'm curious if the tool is setting "innodb_lock_wait_timeout" and then watching "lock_wait_timeout" to determine when it should timeout.

Revision history for this message
Geoffrey Anderson (geoff-geoffreyanderson) wrote :

I can confirm that setting the lock_wait_timeout to a low value using the '--set-vars' option works the expected behavior. I followed the same steps in the bug description and simply added "--set-vars 'lock_wait_timeout=5'" as a parameter to pt-online-schema-change. Output below:

ganderson@db01:~$ pt-online-schema-change --alter "add index idx_c (c)" --set-vars 'lock_wait_timeout=5' --execute h=127.0.0.1,P=3306,u=root,p=testpass,D=test,t=test2 2>&1| tee pt-online-schema-change.log
Altering `test`.`test2`...
Creating new table...
Created new table test._test2_new OK.
Altering new table...
Altered `test`.`_test2_new` OK.
Creating triggers...
Dropping triggers...
Dropped triggers OK.
Dropping new table...
Dropped new table OK.
`test`.`test2` was not altered.
Error creating triggers: DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction at /usr/bin/pt-online-schema-change line 7020.

I've also confirmed that setting "innodb_lock_wait_timeout" INSTEAD of "lock_wait_timeout" doesn't fix this issue. Is it possible/safe to have the tool default to setting the session level variable "lock_wait_timeout" to a relatively low number (e.g. 5 seconds or less)?

An additional thought -- using this workaround also more readily exposes bug #1093016 (https://bugs.launchpad.net/percona-toolkit/+bug/1093016) since some of these metadata-locking changes won't be retried (e.g. dropping triggers).

Revision history for this message
Daniel Nichter (daniel-nichter) wrote :

This is fixed in 2.2.1. See bug 1113301.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.