pt-online-schema-change --alter-foreign-keys-method=drop_swap can indirectly cause slave errors

Bug #1161440 reported by Daniel Nichter
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Triaged
Medium
Carlos Salguero
2.1
Won't Fix
Undecided
Unassigned
2.2
Triaged
Medium
Carlos Salguero

Bug Description

pt-online-schema-change --alter-foreign-keys-method=drop_swap can indirectly cause slave errors if the DROP TABLE takes a long time and during that time other processes access child tables which access the table being dropped/altered. In this case, the binary logs contain a sequence of statements like:

DROP TABLE parent;
UPDATE child SET fk_col="foo"
RENAME TABLE _parent_new TO parent

The 2nd query will cause a slave error like:

Last_SQL_Errno: 1452
Last_SQL_Error: Error 'Cannot add or update a child row: a foreign key constraint fails ... FOREIGN KEY (`fk_col`) REFERENCES `parent` (`fk_col`) on query. ... Query: 'UPDATE child SET fk_col="foo"'

Because DROP causes and implicit commit, and because it took awhile, the UPDATE was able to execute between it and the RENAME.

As the docs say about drop_swap:

"""
This method is faster and does not block, but it is riskier for two reasons. First, for a short time between dropping the original table and renaming the temporary table, the table to be altered simply does not exist, and queries against it will result in an error. Secondly, if there is an error and the new table cannot be renamed into the place of the old one, then it is too late to abort, because the old table is gone permanently.
"""

However, the situation is worse with replication and slaves because it can indirectly result in a slave error.

One solution is to lock all child tables before the DROP TABLE then unlock them after the RENAME. Consequently, all child tables are locked for however long the DROP takes. If the DROP takes a long time, then the purpose of drop_swap is defeated and --alter-foreign-keys-method=rebuild_constraints might be a better choice since it blocks too but is safer.

Revision history for this message
Florian Munz (theflow) wrote :

We just ran into this yesterday. Quite a mess to clean up manually using pt-table-sync. Do you think this is something that can get fixed within pt-osc?

Revision history for this message
Carlos Salguero (carlos-salguero) wrote :

Yes

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-608

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.