pt-online-schema-change should reconnect to slaves

Bug #1402051 reported by Frank Cizmich
This bug affects 2 people
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Fix Released
Importance: Medium
Assigned to: Frank Cizmich

Bug Description

pt-online-schema-change exits if a slave is taken offline or does not respond.
This is problematic for very long-running schema changes.
Some sort of fault-tolerant behavior, optional or otherwise, would be useful.
It should skip checks for slaves that don't respond and include them again once they become available.
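
The requested behavior could be sketched roughly as follows. This is only an illustration of the request, not how pt-online-schema-change is implemented; `poll_replicas` and `get_lag` are hypothetical names, and `get_lag` is assumed to raise `ConnectionError` when a slave cannot be reached.

```python
from typing import Callable, Iterable, List, Tuple


def poll_replicas(
    replicas: Iterable[str],
    get_lag: Callable[[str], float],
) -> Tuple[float, List[str]]:
    """Check lag on every replica once, skipping the ones that do not respond.

    Returns the worst lag among reachable replicas (0.0 if none answered)
    and the list of replicas skipped in this round.
    """
    lags = {}
    skipped = []
    for name in replicas:
        try:
            lags[name] = get_lag(name)  # hypothetical lag probe
        except ConnectionError:
            skipped.append(name)  # unreachable: skip this round only
    worst = max(lags.values(), default=0.0)
    return worst, skipped
```

Because every round probes the full list again, a slave that comes back is included again without extra bookkeeping.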

Changed in percona-toolkit:
status: New → Triaged
assignee: nobody → Frank Cizmich (frank-cizmich)
importance: Undecided → Medium
tags: added: i49004 pt-online-schema-change
Muhammad Irfan (muhammad-irfan) wrote :

I altered a table on the master server with pt-online-schema-change and killed mysqld on slave2 while pt-osc was in progress, to simulate a slave network connectivity issue or mysqld disappearing. Killing the mysqld process on the slave aborts pt-osc, and as a result the table is not altered anywhere, neither on the master nor on slave2.

root@master:~# ./pt-online-schema-change --execute --nodrop-old-table --alter "ADD COLUMN line_number VARCHAR(10) DEFAULT NULL" u=root,p=p3rc0na123,D=world,t=test &>> ptosc9.log

Found 2 slaves:
slave2
slave1
Will check slave lag on:
slave2
slave1
Operation, tries, wait:
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Altering `world`.`test`...
Creating new table...
Created new table world._test_new OK.
Altering new table...
Altered `world`.`_test_new` OK.
2014-12-11T15:59:35 Creating triggers...
2014-12-11T15:59:35 Created triggers OK.
2014-12-11T15:59:35 Copying approximately 58402 rows...
Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS `world`.`pt_osc_world_test_del`;
DROP TRIGGER IF EXISTS `world`.`pt_osc_world_test_upd`;
DROP TRIGGER IF EXISTS `world`.`pt_osc_world_test_ins`;
Not dropping the new table `world`.`_test_new` because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS `world`.`_test_new`;
`world`.`test` was not altered.
(in cleanup) 2014-12-11T15:59:44 Error copying rows from `world`.`test` to `world`.`_test_new`: Lost connection to replica slave2 while attempting to get its lag (DBI connect('world;host=slave2;mysql_read_default_group=client','root',...) failed: Can't connect to MySQL server on 'slave2' (111) at ./pt-online-schema-change line 2261)

Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS `world`.`pt_osc_world_test_del`;
DROP TRIGGER IF EXISTS `world`.`pt_osc_world_test_upd`;
DROP TRIGGER IF EXISTS `world`.`pt_osc_world_test_ins`;
Not dropping the new table `world`.`_test_new` because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS `world`.`_test_new`;
`world`.`test` was not altered.

As you can see from the output, world.test is not altered. This behavior does not seem user friendly: the tool aborted and failed because mysqld temporarily disappeared on one slave.

Changed in percona-toolkit:
milestone: none → 2.2.14
Changed in percona-toolkit:
milestone: 2.2.14 → none
Changed in percona-toolkit:
status: Triaged → In Progress
milestone: none → 2.3.1
Changed in percona-toolkit:
status: In Progress → Fix Committed
summary: - [Feature] pt-osc fault tolerance if slave disconnects
+ pt-online-schema-change should try to reconnect to slaves
summary: - pt-online-schema-change should try to reconnect to slaves
+ pt-online-schema-change should reconnect to slaves
Daniel Nichter (daniel-nichter) wrote :

Going to change this a bit to make the tools consistent; the desired behavior is already documented:
"""
The tool waits forever for replicas to stop lagging. If any
replica is stopped, the tool waits forever until the replica is
started. The data copy continues when all replicas are running and
not lagging too much.
"""
And we did the same to pt-table-checksum in https://github.com/percona/percona-toolkit/pull/21: wait forever for slaves. The problem with checking, removing if dead, then adding back when alive is stated in the PR: "can't tell if slave is really dead or just temporarily unavailable. Moreover, all slaves are detected at startup, so a slave truly disappearing mid-run seems like an edge case not worth handling."
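
A minimal sketch of the "wait forever" approach described above, for illustration only. It is not the tool's code (pt-online-schema-change is written in Perl); `wait_for_replicas` and `get_lag` are hypothetical names, the parameter values are illustrative, and `get_lag` is assumed to return the lag in seconds, return `None` when replication is stopped, and raise `ConnectionError` when the slave cannot be reached.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_replicas(
    replicas: Iterable[str],
    get_lag: Callable[[str], Optional[float]],
    max_lag: float = 1.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until every replica is reachable, running, and within max_lag.

    There is no retry limit: a dead or stopped replica makes the caller
    wait instead of aborting. Returns the number of rounds spent waiting.
    """
    replicas = list(replicas)
    rounds = 0
    while True:
        not_ready = []
        for name in replicas:
            try:
                lag = get_lag(name)  # hypothetical lag probe
            except ConnectionError:
                not_ready.append(name)  # unreachable: keep waiting
                continue
            if lag is None or lag > max_lag:
                not_ready.append(name)  # stopped, or lagging too much
        if not not_ready:
            return rounds
        rounds += 1
        sleep(interval)
```

The row copy would call something like this between chunks, so a slave that is down pauses the copy rather than ending it.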

Daniel Nichter (daniel-nichter) wrote :

https://github.com/percona/percona-toolkit/pull/52

Manually tested and working: pt-osc will wait forever for slaves to 1) be alive, 2) not be lagging. Use --check-slave-lag and --recursion-method to change which slaves are checked (same with other tools).

Changed in percona-toolkit:
milestone: 2.3.1 → 2.2.16
Changed in percona-toolkit:
status: Fix Committed → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-664
