Replication stall with multi-threaded replication
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MySQL Server |
Unknown
|
Unknown
|
|||
Percona Server moved to https://jira.percona.com/projects/PS |
Fix Released
|
Medium
|
Sergei Glushchenko | ||
5.6 |
Fix Released
|
Medium
|
Sergei Glushchenko |
Bug Description
Percona Server version 5.6.17-66.0 compiled from source.
OS: SunOS 5.11 joyent_
Periodically what happens every few days that the replication gets stalled with the coordinator thread in the state "Waiting for Slave Worker to release partition" and the worker threads in the state "Waiting for an event from Coordinator". At that point trying to stop the slave by executing "STOP SLAVE" also gets stalled, together with stalling all the further "SHOW SLAVE STAUS" calls.
Below is how typically the processlist looks in that particular case:
+------
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------
| 2477 | root | localhost | NULL | Sleep | 33355 | | NULL | 1 | 46 |
| 2967 | system user | | NULL | Connect | 24248 | Waiting for master to send event | NULL | 0 | 0 |
| 2968 | system user | | NULL | Connect | 24085 | Waiting for Slave Worker to release partition | NULL | 0 | 0 |
| 2969 | system user | | NULL | Connect | 24248 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 2970 | system user | | NULL | Connect | 24248 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 2971 | system user | | NULL | Connect | 24248 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 2972 | system user | | NULL | Connect | 24248 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 2973 | system user | | NULL | Connect | 24248 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 2974 | system user | | NULL | Connect | 84603 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 2975 | system user | | NULL | Connect | 84603 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 2976 | system user | | NULL | Connect | 84603 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 2977 | system user | | NULL | Connect | 84603 | Waiting for an event from Coordinator | NULL | 0 | 1 |
| 2978 | system user | | NULL | Connect | 84603 | Waiting for an event from Coordinator | NULL | 0 | 0 |
| 4268 | root | localhost | some_db | Query | 871 | Killing slave | stop slave | 0 | 0 |
| 4370 | root | localhost | some_db | Sleep | 4 | | NULL | 0 | 0 |
| 4377 | monitoring | 192.168.1.222:11159 | NULL | Query | 464 | init | show slave status | 0 | 0 |
| 4419 | root | localhost | NULL | Query | 0 | init | show processlist | 0 | 0 |
+------
17 rows in set (0.00 sec)
Looking into this further, the InnoDB status shows us that one of the worker thread actually has an open transaction:
---TRANSACTION 53684133006, ACTIVE 24245 sec
2 lock struct(s), heap size 360, 1 row lock(s), undo log entries 1
MySQL thread id 2977, OS thread handle 0x36, query id 56740540 Waiting for an event from Coordinator
TABLE LOCK table `db2`.`tbl_name` /* Partition `p24` */ trx id 53684133006 lock mode IX
RECORD LOCKS space id 3504 page no 23250 n bits 440 index `PRIMARY` of table `db2`.`tbl_name` /* Partition `p24` */ trx id 53684133006 lock_mode X locks rec but not gap
First off all the transaction is a replicated transaction being executed by the worker thread and is still open for an unknown reason, while holding only a single row lock.
Secondly, apparently the coordinator is blocked because the next even to apply is also to the `db2` database and hence it must wait for the worker thread id 2977 to commit the transaction. This is based on the implementation described here http://
The MySQL configuration is as follows:
[mysqld]
core-file
user = mysql
port = 3306
datadir = /db
socket = /tmp/mysql.sock
default-
skip-external-
log_warnings=1
skip_name_resolve
character-
collation-server = utf8mb4_unicode_ci
group_concat_
server-id = 666
# InnoDB settings
innodb_
innodb_
innodb_
innodb_
innodb_
innodb_
innodb_
innodb_
# allows for row_format=
innodb_
innodb_
innodb_
innodb_
innodb_
innodb_doublewrite = 0
innodb_io_capacity = 1200
innodb_
innodb_
innodb_
# Slow query log settings
# The default logs all full table scans,tmp tables,filesorts on disk queries
#use_global_
#long_query_time = 0.5
slow_query_log_file = slowquery.log
slow_query_log = 1
long_query_time = 3
log_slow_filter = "full_scan,
log_slow_verbosity = "full"
# Other general MySQL settings
sync_binlog = 0
query_cache_type = 0
query_cache_size = 0
max_connections = 3000
thread_cache_size = 1000
back_log = 1024
thread_concurrency = 32
innodb_
tmpdir = /var/tmp
max_allowed_packet = 24M
max_join_size = 4294967295
net_buffer_length = 2K
thread_stack = 512K
tmp_table_size = 64M
max_heap_table_size = 64M
table_open_cache = 2000
# Replication settings (master to slave)
binlog_cache_size = 2M
binlog_format=mixed
log-bin = bin
log-error = error.log
expire_logs_days = 5
slave-parallel-
master-
relay-log-
gtid_mode = ON
enforce_
log-slave-updates
replicate-
replicate-
replicate-
replicate-
# Started tuning slave catchup speed, can use more research
slave-checkpoin
slave-checkpoin
I will be attaching the GDB stack traces in a follow up comment here
Related branches
- Laurynas Biveinis (community): Approve
-
Diff: 526 lines (+409/-1)9 files modifiedmysql-test/suite/rpl/r/percona_bug1331586.result (+139/-0)
mysql-test/suite/rpl/t/percona_bug1331586-master.opt (+3/-0)
mysql-test/suite/rpl/t/percona_bug1331586-slave.opt (+5/-0)
mysql-test/suite/rpl/t/percona_bug1331586.test (+148/-0)
sql/log_event.h (+25/-0)
sql/rpl_master.cc (+10/-0)
sql/rpl_rli.cc (+2/-1)
sql/rpl_rli.h (+23/-0)
sql/rpl_slave.cc (+54/-0)
description: | updated |
thread apply all bt