Slave with MTS enabled may get broken after restart

Bug #1454272 reported by Thomas Babut
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
  Status: Fix Released, Importance: High, Assigned to: Unassigned
Affects: 5.6
  Status: Fix Released, Importance: High, Assigned to: Unassigned

Bug Description

If you restart or shut down a MySQL slave with MTS enabled, the slave can end up broken. Please also take a look at report #1420606. Version 5.6.24-72.2 was reported to fix this bug, but the problem still exists, at least in our case. Since we became aware of this behavior we always run "STOP SLAVE;" before we shut down or restart the server.
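
A minimal sketch of that pre-shutdown workaround (the socket path is taken from the slave configuration below; the systemd unit name is an assumption):

mysql --socket=/var/run/mysql/mysqld.sock -e "STOP SLAVE;"
# verify that both replication threads report "No" before shutting down
mysql --socket=/var/run/mysql/mysqld.sock -e "SHOW SLAVE STATUS\G" | grep -E "Slave_(IO|SQL)_Running"
systemctl stop mysql   # assumed unit name; any clean shutdown works here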

For privacy reasons we cannot provide the relay logs. In addition, we are running row-based replication, so I don't know whether the logs would even help.

Configuration on the master:
[mysqld]
port = 6033
user = mysql
basedir = /usr/local
datadir = /zdata/db
slave_load_tmpdir = /zdata/tmp
tmpdir = /ramdisk
bind-address = 0.0.0.0
socket = /var/run/mysql/mysqld.sock
slow_query_log_file = /var/log/mysql/mysql-slow.log
slow_query_log = 1
slow_query_log_timestamp_always = 1
long_query_time = 4
log_slow_verbosity = full
log_queries_not_using_indexes = 1
log-slow-slave-statements = 1
min_examined_row_limit = 20000
server-id = 2
innodb_data_home_dir = /zdata/db
innodb_log_group_home_dir = /ziblog
innodb_log_buffer_size = 64M
innodb_data_file_path = ibdata1:512M:autoextend
innodb_buffer_pool_instances = 8
table_open_cache_instances = 8
innodb_buffer_pool_size = 72G
innodb_old_blocks_time = 1000
innodb_doublewrite = 0
innodb_checksum_algorithm = NONE
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 512M
innodb_status_file = 1
innodb_file_per_table = 1
innodb_purge_threads = 1
innodb_read_io_threads = 8
innodb_write_io_threads = 8
innodb_io_capacity = 10000
innodb_io_capacity_max = 100000
innodb_lru_scan_depth = 2048
innodb_online_alter_log_max_size = 1G
thread_handling = pool-of-threads
slave-parallel-workers = 5
skip-slave-start = 1
log_slave_updates = 1
sync_binlog = 0
max_connections = 10000
log_bin = /zdata/binlog/mysql-bin.log
max_binlog_size = 512M
binlog_format = ROW
binlog_row_image = MINIMAL
expire_logs_days = 5
relay-log = master-relay-bin
relay_log_info_repository = TABLE
relay_log_recovery = 1
master_info_repository = TABLE
slave_net_timeout = 60
read_only = 0
table_open_cache = 4096
table_definition_cache = 4096
query_cache_size = 0
query_cache_type = 0
key_buffer_size = 256M
join_buffer_size = 256K
sort_buffer_size = 256K
max_heap_table_size = 1G
tmp_table_size = 1G
max_allowed_packet = 64M
slave-pending-jobs-size-max = 64M
innodb_buffer_pool_dump_at_shutdown = 1
innodb_buffer_pool_load_at_startup = 1
log_warnings = 1
skip-host-cache
skip-name-resolve
performance_schema = 0
slave_compressed_protocol = 0

Configuration on the slave:
[mysqld]
port = 6033
user = mysql
basedir = /usr/local
datadir = /zdata/db
slave_load_tmpdir = /zdata/db
tmpdir = /zdata/db
bind-address = 0.0.0.0
socket = /var/run/mysql/mysqld.sock
slow_query_log_file = /var/log/mysql/mysql-slow.log
slow_query_log = 1
slow_query_log_timestamp_always = 1
long_query_time = 4
log_slow_verbosity = full
log_queries_not_using_indexes = 1
log-slow-slave-statements = 1
min_examined_row_limit = 8000
server-id = 4
innodb_data_home_dir = /zdata/db
innodb_log_group_home_dir = /ziblog
innodb_log_buffer_size = 64M
innodb_data_file_path = ibdata1:512M:autoextend
innodb_buffer_pool_instances = 4
table_open_cache_instances = 4
innodb_buffer_pool_size = 12G
innodb_old_blocks_time = 1000
innodb_doublewrite = 0
innodb_checksum_algorithm = NONE
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 512M
innodb_status_file = 1
innodb_file_per_table = 1
innodb_purge_threads = 1
innodb_io_capacity = 2000
innodb_io_capacity_max = 20000
innodb_log_block_size = 4096
thread_handling = pool-of-threads
innodb_online_alter_log_max_size = 1G
slave-parallel-workers = 4
skip-slave-start = 1
log_slave_updates = 0
sync_binlog = 0
max_connections = 1000
log_bin = /zdata/binlog/mysql-bin.log
max_binlog_size = 512M
binlog_format = ROW
binlog_row_image = MINIMAL
expire_logs_days = 3
relay-log = /zdata/relaylog/slave-relay-bin
relay_log_info_repository = TABLE
relay_log_recovery = 1
master_info_repository = TABLE
slave_net_timeout = 60
read_only = 1
table_open_cache = 4096
table_definition_cache = 4096
query_cache_size = 0
query_cache_type = 0
key_buffer_size = 256M
join_buffer_size = 256K
sort_buffer_size = 256K
max_heap_table_size = 256M
tmp_table_size = 256M
max_allowed_packet = 64M
slave-pending-jobs-size-max = 64M
innodb_buffer_pool_dump_at_shutdown = 1
innodb_buffer_pool_load_at_startup = 1
log_warnings = 1
skip-host-cache
skip-name-resolve
performance_schema = 0
slave_compressed_protocol = 1

Tags: regression
Valerii Kravchuk (valerii-kravchuk) wrote :

Please provide some more details about the way the slave is "broken". Do you get errors when the slave starts? Can you share any error messages from the error log, or the output of

show slave status\G

that demonstrates the problem?

Valerii Kravchuk (valerii-kravchuk) wrote :

Note that https://bugs.launchpad.net/percona-server/+bug/1420606 is fixed in 5.6.24, so if you had the same problem, please check whether this version works any better for you.

Thomas Babut (thbabut) wrote :

We did test version 5.6.24-72.2 because it was reported to fix the problem, but it does not.

The slave came up with these errors after restarting it:

May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS
May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [ERROR] Failed to initialize the master info structure
May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.

We performed two restarts for the test. The first restart with the new version was fine, but on the second one the slave broke, just like with earlier versions.
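
For context, the error message above points to roughly the following recovery path (a sketch only; whether RESET SLAVE or the gap-filling route below is the right choice depends on the situation):

# restart the slave once with relay log recovery disabled,
# e.g. by temporarily setting relay_log_recovery = 0 in my.cnf,
# then let the worker threads fill the MTS gaps and stop cleanly:
mysql -e "START SLAVE UNTIL SQL_AFTER_MTS_GAPS;"
mysql -e "STOP SLAVE;"
# finally set relay_log_recovery = 1 again and restart the server normally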

Valerii Kravchuk (valerii-kravchuk) wrote :

Would you mind sharing a larger part of the error log? I want to see what happened before the shutdown and during startup and recovery.

Thomas Babut (thbabut) wrote :

No problem:

May 11 14:18:49 slaveserver systemd: Stopping MySQL Percona Server...
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Slave I/O thread killed while reading event
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.031147', position 381969967
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] /usr/sbin/mysqld: Normal shutdown
May 11 14:18:49 slaveserver mysqld:
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Giving 0 client threads a chance to die gracefully
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Event Scheduler: Purging the queue. 0 events
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down slave threads
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Forcefully disconnecting 0 remaining clients
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Binlog end
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'partition'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_CHANGED_PAGES'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_DATAFILES'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_TABLESPACES'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_FOREIGN_COLS'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_FOREIGN'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_FIELDS'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_COLUMNS'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_INDEXES'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_TABLESTATS'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_SYS_TABLES'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_FT_INDEX_TABLE'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_FT_INDEX_CACHE'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_FT_CONFIG'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting down plugin 'INNODB_FT_BEING_DELETED'
May 11 14:18:49 slaveserver mysqld: 2015-05-11 14:18:49 23869 [Note] Shutting ...

Sveta Smirnova (svetasmirnova) wrote :

Thank you for the feedback.

Bug confirmed with version 5.6.25-73.1-log

To repeat:

1. Start the master and slave using the options provided in the initial description of this bug report, start the slave, and ensure everything works fine.
2. Save generate_insert_load.sh from bug #1420606.
3. Run ./generate_insert_load.sh >/dev/null 2>&1 & about 75-100 times, or run generate_insert_load.sh in 75-100 threads using some other method (see the condensed sketch after the log excerpt below).
4. Run a start-stop loop for the slave:

while true; do ./node1/stop ; ./node1/start ; ./s1 -e "start slave; show slave status;"; sleep 10; done

I use MySQL Sandbox; ./node1/stop|start are shortcuts for stopping and starting the server, and ./s1 is a client which connects to this slave.

5. After some time you will notice that the slave cannot be started anymore.

6. Check the error log file and find:

2015-07-30 00:44:01 2786 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2015-07-30 00:44:01 2786 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.000002' at position 12600601
2015-07-30 00:44:01 2786 [Note] /home/sveta/SharedData/Downloads/5.6.25/bin/mysqld: Normal shutdown
...
2015-07-30 00:44:51 27572 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000002, event_master_log_pos 12600798.
2015-07-30 00:44:51 27572 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000002, event_master_log_pos 12600995.
2015-07-30 00:44:51 27572 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000002, event_master_log_pos 12601192.
2015-07-30 00:44:51 27572 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000002, event_master_log_pos 12601389.
2015-07-30 00:44:51 27572 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000002, event_master_log_pos 12601586.
2015-07-30 00:44:51 27572 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000002, event_master_log_pos 12601783.
2015-07-30 00:44:51 27572 [Note] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000002, event_master_log_pos 12601980.
...
2015-07-30 00:52:59 22949 [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS
2015-07-30 00:52:59 22949 [ERROR] Failed to initialize the master info structure
2015-07-30 00:52:59 22949 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.
2015-07-30 00:52:59 22949 [Note] Event Scheduler: Loaded 0 events
2015-07-30 00:52:59 22949 [Note] /home/sveta/SharedData/Downloads/5.6.25/bin/mysqld: ready for connections.
Version: '5.6.25-73.1-log' socket: '/tmp/mysql_sandbox2...
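
A condensed sketch of the load and restart loop from the steps above (./node1 and ./s1 are the MySQL Sandbox shortcuts mentioned in this comment; the exact launch loop for the writers is an assumption):

# launch ~75-100 concurrent writers against the master
# (generate_insert_load.sh is the script from bug #1420606)
for i in $(seq 1 100); do ./generate_insert_load.sh >/dev/null 2>&1 & done

# restart the slave in a loop and try to resume replication each time
while true; do
    ./node1/stop
    ./node1/start
    ./s1 -e "START SLAVE; SHOW SLAVE STATUS\G"
    sleep 10
done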


Sveta Smirnova (svetasmirnova) wrote :

I was not able to reproduce the bug with the upstream MySQL Server.

tags: added: regression
Sergei Glushchenko (sergei.glushchenko) wrote :

Hi Sveta,

Which upstream MySQL version did you try? I can easily reproduce this with 5.6.25, but not with 5.6.26.

Sergei Glushchenko (sergei.glushchenko) wrote :

Might be fixed by commit 37f2e96 or fce5589

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PS-899
