pt-online-schema-change --alter-foreign-keys-method=drop_swap can indirectly cause slave errors

Bug #1161440 reported by Daniel Nichter on 2013-03-28
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Percona Toolkit
Carlos Salguero
Carlos Salguero

Bug Description

pt-online-schema-change --alter-foreign-keys-method=drop_swap can indirectly cause slave errors if the DROP TABLE takes a long time and during that time other processes access child tables which access the table being dropped/altered. In this case, the binary logs contain a sequence of statements like:

DROP TABLE parent;
UPDATE child SET fk_col="foo"
RENAME TABLE _parent_new TO parent

The 2nd query will cause a slave error like:

Last_SQL_Errno: 1452
Last_SQL_Error: Error 'Cannot add or update a child row: a foreign key constraint fails ... FOREIGN KEY (`fk_col`) REFERENCES `parent` (`fk_col`) on query. ... Query: 'UPDATE child SET fk_col="foo"'

Because DROP causes and implicit commit, and because it took awhile, the UPDATE was able to execute between it and the RENAME.

As the docs say about drop_swap:

This method is faster and does not block, but it is riskier for two reasons. First, for a short time between dropping the original table and renaming the temporary table, the table to be altered simply does not exist, and queries against it will result in an error. Secondly, if there is an error and the new table cannot be renamed into the place of the old one, then it is too late to abort, because the old table is gone permanently.

However, the situation is worse with replication and slaves because it can indirectly result in a slave error.

One solution is to lock all child tables before the DROP TABLE then unlock them after the RENAME. Consequently, all child tables are locked for however long the DROP takes. If the DROP takes a long time, then the purpose of drop_swap is defeated and --alter-foreign-keys-method=rebuild_constraints might be a better choice since it blocks too but is safer.

Florian Munz (theflow) wrote :

We just ran into this yesterday. Quite a mess to clean up manually using pt-table-sync. Do you think this is something that can get fixed within pt-osc?


To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers