"Deadlock" on START SLAVE with crash-safe former MTS slave and autocommit=OFF

Bug #1471141 reported by Valerii Kravchuk on 2015-07-03
Affects / Status / Importance / Assigned to:

MySQL Server: Unknown / Unknown
Percona Server (moved to https://jira.percona.com/projects/PS), status tracked in 5.7:
  5.6: Fix Released / Medium / Unassigned
  5.7: Fix Released / Medium / Unassigned

Bug Description

It seems START SLAVE can hang in a session with autocommit=OFF, waiting for a lock at the InnoDB level, if another session with autocommit=OFF checks InnoDB locks in the information_schema.

To reproduce this, set up the usual replication sandboxes using MySQL Sandbox, start all nodes, and on one of the slaves add the following lines to the default my.sandbox.cnf:

slave_parallel_workers=8
master_info_repository=TABLE
relay_log_info_repository=TABLE

Now, stop and start the node:

./stop
./start
./my sql
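Once the node is back up, it may help to confirm the repository settings took effect (a small sketch; these are standard 5.6 system variables):

```sql
-- Verify the crash-safe repositories and MTS workers are configured
SELECT @@global.master_info_repository,
       @@global.relay_log_info_repository,
       @@global.slave_parallel_workers;
```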

Start two sessions, and in the first one do:

set @@session.autocommit=OFF;
show slave status\G
select count(*) from mysql.slave_worker_info; -- you should see 8 here
stop slave;
set global slave_parallel_workers=0;
 SELECT t.trx_id , t.trx_mysql_thread_id , p.USER , p.Host , p.TIME , t.trx_state , t.trx_operation_state , t.trx_tables_in_use , t.trx_tables_locked , t.trx_rows_modified , t.trx_started , SUBSTRING(t.trx_query, 1, 50) Query FROM information_schema.innodb_trx t INNER JOIN information_schema.processlist p ON t.trx_mysql_thread_id = p.ID;

Now, in the other session, do:

set @@session.autocommit = OFF;
start slave; -- it hangs

Now, in the first session:

slave2 [localhost] {msandbox} ((none)) > SELECT t.trx_id , t.trx_mysql_thread_id , p.USER , p.Host , p.TIME , t.trx_state , t.trx_operation_state , t.trx_tables_in_use , t.trx_tables_locked , t.trx_rows_modified , t.trx_started , SUBSTRING(t.trx_query, 1, 50) Query FROM information_schema.innodb_trx t INNER JOIN information_schema.processlist p ON t.trx_mysql_thread_id = p.ID;
+--------+---------------------+-------------+-----------+------+-----------+---------------------+-------------------+-------------------+-------------------+---------------------+-------------+
| trx_id | trx_mysql_thread_id | USER | Host | TIME | trx_state | trx_operation_state | trx_tables_in_use | trx_tables_locked | trx_rows_modified | trx_started | Query |
+--------+---------------------+-------------+-----------+------+-----------+---------------------+-------------------+-------------------+-------------------+---------------------+-------------+
| 21056 | 14 | system user | | 3 | LOCK WAIT | starting index read | 1 | 1 | 0 | 2015-07-03 11:15:10 | NULL |
| 21055 | 12 | msandbox | localhost | 3 | RUNNING | NULL | 0 | 0 | 0 | 2015-07-03 11:15:10 | start slave |
+--------+---------------------+-------------+-----------+------+-----------+---------------------+-------------------+-------------------+-------------------+---------------------+-------------+
2 rows in set (0.00 sec)

slave2 [localhost] {msandbox} ((none)) > show processlist;
+----+-------------+-----------+------+---------+------+-----------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+-------------+-----------+------+---------+------+-----------------------------------+------------------+-----------+---------------+
| 11 | msandbox | localhost | NULL | Query | 0 | init | show processlist | 0 | 0 |
| 12 | msandbox | localhost | NULL | Query | 9 | Waiting for slave thread to start | start slave | 0 | 0 |
| 13 | system user | | NULL | Connect | 9 | Connecting to master | NULL | 0 | 0 |
| 14 | system user | | NULL | Connect | 9 | System lock | NULL | 0 | 0 |
+----+-------------+-----------+------+---------+------+-----------------------------------+------------------+-----------+---------------+
4 rows in set (0.00 sec)

slave2 [localhost] {msandbox} ((none)) > show engine innodb status\G
*************************** 1. row ***************************
  Type: InnoDB
  Name:
Status:
=====================================
2015-07-03 11:15:30 7f059fbbe700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 32 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 2 srv_active, 0 srv_shutdown, 152 srv_idle
srv_master_thread log flush and writes: 154
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 6
OS WAIT ARRAY INFO: signal count 6
Mutex spin waits 8, rounds 497, OS waits 2
RW-shared spins 3, rounds 90, OS waits 3
RW-excl spins 0, rounds 0, OS waits 0
Spin rounds per wait: 62.12 mutex, 30.00 RW-shared, 0.00 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 21057
Purge done for trx's n:o < 21054 undo n:o < 0 state: running but idle
History list length 73
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 21052, not started
MySQL thread id 11, OS thread handle 0x7f059fbbe700, query id 18 localhost msandbox init
show engine innodb status
---TRANSACTION 21056, ACTIVE 20 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 360, 1 row lock(s)
MySQL thread id 14, OS thread handle 0x7f05c0617700, query id 0 System lock
------- TRX HAS BEEN WAITING 20 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 5 page no 3 n bits 80 index `PRIMARY` of table `mysql`.`slave_worker_info` trx id 21056 lock mode S waiting
------------------
---TRANSACTION 21055, ACTIVE 20 sec
4 lock struct(s), heap size 1184, 16 row lock(s)
MySQL thread id 12, OS thread handle 0x7f059fb7d700, query id 15 localhost msandbox Waiting for slave thread to start
start slave
--------

...

After some time, in the second session you'll get:

slave2 [localhost] {msandbox} ((none)) > start slave;
ERROR 2013 (HY000): Lost connection to MySQL server during query

and in the first one:

slave2 [localhost] {msandbox} ((none)) > show processlist;
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+------------------+-----------+---------------+
| 1 | system user | | NULL | Connect | 666 | Waiting for master to send event | NULL | 0 | 0 |
| 2 | system user | | NULL | Connect | 666 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0 | 0 |
| 3 | system user | | NULL | Connect | 666 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 4 | system user | | NULL | Connect | 666 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 5 | system user | | NULL | Connect | 666 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 6 | system user | | NULL | Connect | 666 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 7 | system user | | NULL | Connect | 666 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 8 | system user | | NULL | Connect | 666 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 9 | system user | | NULL | Connect | 666 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 10 | system user | | NULL | Connect | 666 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 11 | msandbox | localhost | NULL | Query | 0 | init | show processlist | 0 | 0 |
| 12 | msandbox | localhost | NULL | Sleep | 333 | | NULL | 1 | 0 |
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

The deadlock is not detected by InnoDB, and the slave actually gets started.
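For what it's worth, the wait itself does show up in the lock tables even though no deadlock is reported; a sketch, assuming the standard 5.6 INFORMATION_SCHEMA lock tables:

```sql
-- Which transaction blocks the worker's read of mysql.slave_worker_info?
SELECT w.requesting_trx_id,
       w.blocking_trx_id,
       b.trx_mysql_thread_id AS blocking_thread,
       SUBSTRING(b.trx_query, 1, 50) AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b
  ON b.trx_id = w.blocking_trx_id;
```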

tags: added: i56286

Upstream?

description: updated

Confirmed with upstream 5.6.25 as well.

tags: added: upstream

By the way, the START SLAVE session eventually just crashes; the server is then restarted, and then it's clear why we see 8 workers again... From the error log (upstream 5.6.25 test):

...
/home/openxs/5.6.25/bin/mysqld(my_print_stacktrace+0x35)[0x90cb35]
/home/openxs/5.6.25/bin/mysqld(handle_fatal_signal+0x3d8)[0x673c48]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f119e9f1340]
/home/openxs/5.6.25/bin/mysqld(_ZN14Relay_log_info21mts_finalize_recoveryEv+0x87)[0x8e8517]
/home/openxs/5.6.25/bin/mysqld(_Z19slave_start_workersP14Relay_log_infomPb+0x262)[0x8d4c42]
/home/openxs/5.6.25/bin/mysqld(handle_slave_sql+0x567)[0x8e5677]
/home/openxs/5.6.25/bin/mysqld(pfs_spawn_thread+0x126)[0xaddbc6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f119e9e9182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f119d6af47d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): Connection ID (thread ID): 14
...

Upstream bug report says:

"[8 Dec 9:23] David Moss

Thanks very much for your feedback. This has been fixed in upcoming versions and the following was noted in the 5.6.28 and 5.7.10 change logs:
On a multi-threaded slave configured with master_info_repository=TABLE and relay_log_info_repository=TABLE which had previously been run with autocommit=1, if the slave was stopped and autocommit changed to 0, executing START SLAVE caused the session to appear to hang. After the lock wait timeout, when START SLAVE proceeded the server would stop unexpectedly. The fix ensures that when master_info_repository=TABLE, relay_log_info_repository=TABLE, and autocommit=0 a new transaction is generated for start and commit to avoid deadlocks."

So, it seems to be fixed in 5.6.28.
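On affected versions, a likely workaround (my assumption based on the fix description above, not something verified in this report) is to make sure the session issuing START SLAVE has no open transaction and autocommit enabled, so its writes to the repository tables commit immediately:

```sql
COMMIT;                          -- end any open transaction first
SET @@session.autocommit = ON;   -- repository-table writes commit right away
START SLAVE;
```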

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-1636
