pt-online-schema-change gives spurious error messages on failure if DROP TRIGGER also fails

Bug #1422535 reported by Gillian Gunson
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
New
Undecided
Unassigned

Bug Description

If pt-osc fails during a "--alter-foreign-keys-method=drop_swap" run and then fails while trying to drop triggers on cleanup, here's the error output (clipped for brevity):

...
Copying `db_prod`.`tblname`: 90% 07:34 remain
2015-02-15T09:29:30 Copied rows OK.
2015-02-15T09:29:30 Drop-swapping tables...
2015-02-15T09:29:49 Dropping triggers...
2015-02-15T09:30:08 Error dropping trigger: 2015-02-15T09:30:08 DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction at /usr/bin/pt-online-schema-change line 10161.

2015-02-15T09:30:27 Error dropping trigger: 2015-02-15T09:30:27 DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction at /usr/bin/pt-online-schema-change line 10161.

2015-02-15T09:30:46 Error dropping trigger: 2015-02-15T09:30:46 DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction at /usr/bin/pt-online-schema-change line 10161.

2015-02-15T09:30:46 To try dropping the triggers again, execute:
DROP TRIGGER IF EXISTS `db_prod`.`pt_osc_db_prod_tblname_del`;
...
Not dropping the new table `db_prod`.`_tblname_new` because dropping these triggers failed:
  DROP TRIGGER IF EXISTS `db_prod`.`pt_osc_db_prod_tblname_del`;
...
These triggers must be dropped before dropping `db_prod`.`_tblname_new`, else writing to `db_prod`.`tblname` will cause MySQL error 1146 (42S02): "Table `db_prod`.`_tblname_new` doesn't exist".
Altered `db_prod`.`tblname` but there were errors or warnings.
Error updating foreign key constraints: 2015-02-15T09:29:49 DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction at /usr/bin/pt-online-schema-change line 10018.

The output suggests that pt-osc only failed while trying to drop triggers, but the old and new tables are still in place, and the drop and rename are supposed to happen before the dropping of triggers. Based on the last message about line 10018 one can tell 'DROP TABLE tblname' failed and pt-osc immediately went into cleanup mode by (attempting to) drop triggers.

I was able to repeat a slightly different failure getting a read lock on the old table during the copy process and causing it to time out and the drop triggers to also fail. The log was:

2015-02-16T14:41:29 Copying approximately 2164 rows...
2015-02-16T14:42:01 Dropping triggers...
2015-02-16T14:51:01 Error dropping trigger: 2015-02-16T14:51:01 DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction [for Statement "DROP TRIGGER IF EXISTS `test`.`pt_osc_test_parent_del`;"] at /usr/bin/pt-online-schema-change line 10409.
...

2015-02-16T15:09:01 To try dropping the triggers again, execute:
DROP TRIGGER IF EXISTS `test`.`pt_osc_test_parent_del`;
...

Again it looks like it failed because it was trying to drop triggers and couldn't, but it failed before it could copy data first.

So it looks like the pattern is that the failure happens at the last step before "Dropping triggers...", but this isn't obvious from the output or the final error message.

Is it possible to print an error message at the point of pt-osc failing, or to tell at which step it occurred in the final line's error message, even when the cleanup also fails?

This was found in versions 2.2.7 and 2.2.13.

Original command: pt-online-schema-change -u... -p... --max-load Threads_running=50 --critical-load Threads_running=75 --set-vars lock_wait_timeout=1 --alter "drop column val" --alter-foreign-keys-method=drop_swap --execute D=db_prod,t=tblname

Note this is unlikely related to drop_swap method or foreign keys, they just existed in the examples.

Gillian Gunson (ggunson)
description: updated
summary: - pt-online-schema-change gives spurious error messages on failure
+ pt-online-schema-change gives spurious error messages on failure if DROP
+ TRIGGER also fails
Revision history for this message
Nilnandan Joshi (nilnandan-joshi) wrote :

Unable to reproduce the same with 2.2.13.

nilnandan@desktop:~$ pt-online-schema-change -uroot -proot --max-load Threads_running=50 --critical-load Threads_running=75 --set-vars lock_wait_timeout=1 --alter "drop column city" --alter-foreign-keys-method=drop_swap --execute D=nil,t=test1
No slaves found. See --recursion-method if host desktop has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Child tables:
  `nil`.`test2` (approx. 14 rows)
Will use the drop_swap method to update foreign keys.
Altering `nil`.`test1`...
Creating new table...
Created new table nil._test1_new OK.
Altering new table...
Altered `nil`.`_test1_new` OK.
2015-03-02T09:25:13 Creating triggers...
2015-03-02T09:25:13 Created triggers OK.
2015-03-02T09:25:13 Copying approximately 1047066 rows...
2015-03-02T09:25:17 Copied rows OK.
2015-03-02T09:25:17 Drop-swapping tables...
2015-03-02T09:25:17 Dropped and swapped tables OK.
Not dropping old table because --no-drop-old-table was specified.
2015-03-02T09:25:17 Dropping triggers...
2015-03-02T09:25:17 Dropped triggers OK.
Successfully altered `nil`.`test1`.
nilnandan@desktop:~$

Is it possible for you provide table structures/small data set to reproduce this? Also run it with PTDEBUG=1 and provide us the output.

i.e
PTDEBUG=1 pt-online-schema-change -u... -p... --max-load ....

Changed in percona-toolkit:
status: New → Incomplete
Revision history for this message
Gillian Gunson (ggunson) wrote :

Hi Nilnandan,

The bug is about what pt-osc reports when it fails on failure cleanup, not that it's failing. My point was that if the pt-osc fails and starts the cleanup process of dropping the triggers and the _new table, and then it fails *again* on dropping triggers due to a metadata lock or some other issue, the output reads as if the cause of pt-osc failing was the failed dropping of triggers. So the original cause of why pt-osc failed is lost because of the 2nd failure at cleanup.

I believe I repeated the problem by starting a pt-osc on a table and then having a second connection do a "select * from original_tbl for update" during the copying rows part. This would then stop the drop-swap from happening due to a metadata lock, and then when the cleanup started it would cause the first drop trigger to also fail on the same metadata lock. Yet the output says nothing about the failure occurring when attempting to drop the original table, and I was only able to figure that out from looking at the referenced "line 10018" in the error message.

It would be useful to have some output saying which stage of the pt-osc failed (copying rows, drop-swap or renaming of tables, etc.) to help the user understand why it failed.

Maybe part of the problem is this message: "Altered `db_prod`.`tblname` but there were errors or warnings." I'd need to test if this is always the case, but maybe this is being printed if the cleanup drop triggers fails, even when the pt-osc fails prior to the cleanup. It's misleading as it suggests that the pt-osc worked and just the cleanup failed.

Changed in percona-toolkit:
status: Incomplete → New
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1267

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.