pt-online-schema-change doesn't ignore signals during drop-swap

Bug #1368244 reported by Daniel Bogatev on 2014-09-11
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Fix Released
High
Frank Cizmich

Bug Description

Hi,

I tried to use pt-online-schema-change to alter one table. First attempt (adding a few columns) ran without any issues, after that I tried another alter and after this attempt I ended with missing table! I restored the table from backup and tried again and this time the command ran without no problems. However I'm worried that this could lead to loosing some data.
Here is the command line and the output from the failed execution:

root@DB:~# pt-online-schema-change --set-vars innodb_lock_wait_timeout=1 --execute -p**************** --alter-foreign-keys-method auto --alter 'change default_vast_url default_vast_url varchar(512), change phone_vast_url phone_vast_url varchar(512), change tablet_vast_url tablet_vast_url varchar(512)' D=ims,t=feed_items
Cannot connect to D=ims,h=XXX.XXX.XXX.XX1,p=...
Cannot connect to D=ims,h=XXX.XXX.XXX.XX2,p=...
Cannot connect to D=ims,h=XXX.XXX.XXX.XX3,p=...
Child tables:
  `ims`.`video_thumbs` (approx. 2249874 rows)
Will automatically choose the method to update foreign keys.
Altering `ims`.`feed_items`...
Creating new table...
Created new table ims._feed_items_new OK.
Altering new table...
Altered `ims`.`_feed_items_new` OK.
Creating triggers...
Created triggers OK.
Copying approximately 489575 rows...
Copying `ims`.`feed_items`: 13% 03:10 remain
Copying `ims`.`feed_items`: 20% 03:54 remain
Copying `ims`.`feed_items`: 26% 04:06 remain
Copying `ims`.`feed_items`: 32% 04:07 remain
Copying `ims`.`feed_items`: 38% 04:01 remain
Copying `ims`.`feed_items`: 43% 03:57 remain
Copying `ims`.`feed_items`: 48% 03:46 remain
Copying `ims`.`feed_items`: 53% 03:30 remain
Copying `ims`.`feed_items`: 58% 03:13 remain
Copying `ims`.`feed_items`: 62% 03:00 remain
Copying `ims`.`feed_items`: 66% 02:44 remain
Copying `ims`.`feed_items`: 70% 02:31 remain
Copying `ims`.`feed_items`: 74% 02:14 remain
Copying `ims`.`feed_items`: 77% 01:58 remain
Copying `ims`.`feed_items`: 81% 01:39 remain
Copying `ims`.`feed_items`: 85% 01:19 remain
Copying `ims`.`feed_items`: 89% 00:57 remain
Copied rows OK.
Max rows for the rebuild_constraints method: 1722
Determining the method to update foreign keys...
  `ims`.`video_thumbs`: too many rows: 2236639; must use drop_swap
Drop-swapping tables...
Dropping triggers...
Dropped triggers OK.
Dropping new table...
Dropped new table OK.
Altered `ims`.`feed_items` but there were errors or warnings.
        (in cleanup) Error updating foreign key constraints: DBD::mysql::db do failed: Table './ims/feed_items' already exists [for Statement "RENAME TABLE `ims`.`_feed_items_new` TO `ims`.`feed_items`"] at /usr/bin/pt-online-schema-change line 6944.

Dropping triggers...
Dropped triggers OK.
Altered `ims`.`feed_items` but there were errors or warnings.

- We use one master with 3 slave replicas. All MySQL servers are version 5.5.33
- pt-online-schema-change is version 2.1.2 (latest version installed from Debian 7.1 repository)

Hi, Let me describe here, how exactly pt-online-schema works.

"pt-online-schema-change works by creating an empty copy of the table to alter, modifying it as desired, and then copying rows from the original table into the new table. When the copy is complete, it moves away the original table and replaces it with the new one. By default, it also drops the original table."

In your case, you get error while rename new table to original table,
Error updating foreign key constraints: DBD::mysql::db do failed: Table './ims/feed_items' already exists [for Statement "RENAME TABLE `ims`.`_feed_items_new` TO `ims`.`feed_items`"]

So you must have table feed_items in the database. It should not dropped. Even you should also have table like _feed_items_new. I have noticed one more thing that even if you have given option --alter-foreign-keys-method auto it uses drop-swap.

Determining the method to update foreign keys...
  `ims`.`video_thumbs`: too many rows: 2236639; must use drop_swap
Drop-swapping tables...

Can you please check again and provide the exact test case that how exactly we can reproduce it?
MySQL error log and table structure will be helpful.

Changed in percona-toolkit:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for Percona Toolkit because there has been no activity for 60 days.]

Changed in percona-toolkit:
status: Incomplete → Expired
summary: - Table deleted during alter with pt-online-schema-change
+ pt-online-schema-change --alter-foreign-keys-method=drop-swap is not
+ atomic
tags: added: i50649
Download full text (3.4 KiB)

There is another problem here: the DROP/RENAME table used by --alter-foreign-keys-method=drop-swap is not atomic, and thus it can be interrupted (which is different from failing as reported here originally).

Output shows "the tool was interrupted":

Child tables:
`EXAMPLEDB`.`STATUS_ITEM` (approx. 140890126 rows)
Will use the drop_swap method to update foreign keys.
Altering `EXAMPLEDB`.`JOB_ITEM`...
Creating new table...
Created new table EXAMPLEDB._JOB_ITEM_new OK.
Altering new table...
Altered `EXAMPLEDB`.`_JOB_ITEM_new` OK.
2015-02-04T18:52:44 Creating triggers...
2015-02-04T18:52:44 Created triggers OK.
2015-02-04T18:52:44 Copying approximately 127945887 rows...
2015-02-04T20:29:41 Copied rows OK.
2015-02-04T20:29:41 Drop-swapping tables...
Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS `EXAMPLEDB`.`pt_osc_EXAMPLEDB_JOB_ITEM_del`;
DROP TRIGGER IF EXISTS `EXAMPLEDB`.`pt_osc_EXAMPLEDB_JOB_ITEM_upd`;
DROP TRIGGER IF EXISTS `EXAMPLEDB`.`pt_osc_EXAMPLEDB_JOB_ITEM_ins`;
Not dropping the new table `EXAMPLEDB`.`_JOB_ITEM_new` because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS `EXAMPLEDB`.`_JOB_ITEM_new`;
Altered `EXAMPLEDB`.`JOB_ITEM` but there were errors or warnings.

And this was run through nohup pt-osc-scripted.sh and the nohup.out file contained:

Copying `EXAMPLEDB`.`JOB_ITEM`: 95% 04:19 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 95% 04:12 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 95% 04:03 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 95% 03:51 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 95% 03:41 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 95% 03:32 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 96% 03:21 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 96% 03:10 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 96% 02:58 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 96% 02:49 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 96% 02:39 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 97% 02:27 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 97% 02:16 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 97% 02:06 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 97% 01:56 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 98% 01:44 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 98% 01:33 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 98% 01:24 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 98% 01:13 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 98% 01:02 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 99% 00:52 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 99% 00:44 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 99% 00:34 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 99% 00:22 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 99% 00:11 remain
Copying `EXAMPLEDB`.`JOB_ITEM`: 99% 00:03 remain
# Exiting on SIGHUP.

So, the SIGHUP certainly arrived at a horrible time, and pt-osc acted on it even when the script was run through "nohup" because it reconnects PERL's "normal-signals" (These are the signals w and leave a table and a non-renamed hich a program might normally expect to encounter and which by default cause it to terminate. They are HUP, INT, PIPE and TERM. http://perldoc.perl.org/sigtrap.html)

What should be done here is make the operation "atomic" in the sense that it can not be interrupted (if the RENAME TABLE fails then it...

Read more...

Changed in percona-toolkit:
status: Expired → New

I think what Marcos described:

"What should be done here is make the operation "atomic" in the sense that it can not be interrupted (if the RENAME TABLE fails then it can't be atomic because we can't rollback the DROP TABLE).

So the suggested fix is: wrap the drop_swap handling code with alternative signal handler which will refuse to exit at the critical time between drop and rename.

And for scripters out there: use "disown" instead of "nohup" to avoid most signals getting there."

is a reasonable feature request. Confirming it as such.

Changed in percona-toolkit:
status: New → Confirmed
Changed in percona-toolkit:
milestone: none → 2.2.14
assignee: nobody → Frank Cizmich (frank-cizmich)
importance: Undecided → Medium
Changed in percona-toolkit:
status: Confirmed → Fix Committed
Changed in percona-toolkit:
importance: Medium → High
Frank Cizmich (frank-cizmich) wrote :

Fixed by disabling common interrupt signals during the critical drop-rename phase.

Changed in percona-toolkit:
status: Fix Committed → Fix Released
summary: - pt-online-schema-change --alter-foreign-keys-method=drop-swap is not
- atomic
+ pt-online-schema-change doesn't ignore signals during drop-swap

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-378

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers