semi sync seems broken because of Group commit in Percona Server5.5

Bug #1254571 reported by zhai weixiang on 2013-11-25
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
Invalid
Undecided
Unassigned
5.1
Invalid
Undecided
Unassigned
5.5
Triaged
Medium
Unassigned
5.6
Invalid
Undecided
Unassigned

Bug Description

In function MYSQL_BIN_LOG::trx_group_commit_leader, ONLY the leader thread calls after_flush (this function will store the binlog position using my_pthread_setspecific_ptr)

quoted code in log.cc:

5903 if (RUN_HOOK(binlog_storage, after_flush,
5904 (leader->thd, log_file_name, log_file.pos_in_file, synced)))
5905 {

After all transactions committed by the leader thread, other threads were waked up and after_commit was called

quoted code :

29 int Trans_delegate::after_commit(THD *thd, bool all)
230 {
231 Trans_param param;
232 bool is_real_trans= (all || thd->transaction.all.ha_list == 0);
233
234 param.flags = is_real_trans ? TRANS_IS_REAL_TRANS : 0;
235
236 Trans_binlog_info *log_info=
237 my_pthread_getspecific_ptr(Trans_binlog_info*, RPL_TRANS_BINLOG_INFO);
238
239 param.log_file= log_info ? log_info->log_file : 0;
240 param.log_pos= log_info ? log_info->log_pos : 0;
241
242 int ret= 0;
243 FOREACH_OBSERVER(ret, after_commit, thd, (&param));

 Here log_info is NULL if the current thread is not the leader of group commit, So these threads will never waiting for the ACK from slave.

zhai weixiang (zhaiwx1987) wrote :

a simple patch based on the logic of MySQL 5.6. Can be patched on Percona Server 5.5.18 we are using now.

tags: added: contribution
zhai weixiang (zhaiwx1987) wrote :

I found anther bug, after_flush should be called before signal_update, otherwise the binlog dump thread may be waked up first and missed to mark the packet header (to tell the slave that the master needs an ACK)

quoted code:(trx_group_commit_leader, log.cc)

5901 else
5902 {
5903 signal_update();
5904 }
5905
5906 if (RUN_HOOK(binlog_storage, after_flush,
5907 (leader->thd, log_file_name, log_file.pos_in_file, synced)))
5908 {
5909 sql_print_error("Failed to run 'after_flush' hooks");
5910 for (current= queue; current != NULL; current= current->next)
5911 {
5912 if (!current->error)
5913 {

So the attached file didn't fix this bug completely. I'll rewrite the patch based on the latest Percona Server 5.5.34

zhai weixiang (zhaiwx1987) wrote :

patch based on Percona Server 5.5.34

zhai weixiang (zhaiwx1987) wrote :

Sorry , comment of #2 is incorrect.

It's ok to call after_flush after signal_update because the LOCK_log was hold by the leader thread.

(It's a problem for me because I have split the LOCK_log for binlog write/dump).

I do miss some MTR test case in the patch, but hardly it makes any sense to have "New" status for this report. Let developers check and (hopefully) use this contribution.

Zhai -

Have you checked MariaDB 5.5 where our group commit implementation is taken from? Are they affected? Have they fixed this?

zhai weixiang (zhaiwx1987) wrote :

I haven't check the mariadb carefully,just quickly read the related code . I guess mariadb was affected too...Hope someone from maria can help me verifying it.

bellow is a simple test case.

source include/have_semisync_plugin.inc;
source include/have_debug.inc;
source include/have_innodb.inc;
source include/master-slave.inc;

#Install semisync plugin on slave
connection slave;
--replace_regex /\.dll/.so/
eval INSTALL PLUGIN rpl_semi_sync_slave SONAME '$SEMISYNC_SLAVE_PLUGIN';
select @@global.rpl_semi_sync_slave_enabled;
show variables like '%semi%';
SET GLOBAL rpl_semi_sync_slave_enabled= 1;
STOP SLAVE;
START SLAVE;

#Install semisync plugin on master
connection master;
--replace_regex /\.dll/.so/
eval INSTALL PLUGIN rpl_semi_sync_master SONAME '$SEMISYNC_MASTER_PLUGIN';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 100000000;

--sleep 2

CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=innodb;
connect(con1,localhost,root,,);
connect(con2,localhost,root,,);

connection master;
SET GLOBAL rpl_semi_sync_master_wait_no_slave = 1;

connection slave;
#Stop slave, then all transaction on master will wait for restart of slave;
STOP SLAVE;

connection con1;
#leader of group commit
SET DEBUG_SYNC= "commit_after_get_LOCK_log SIGNAL con1_waiting WAIT_FOR
con2_queued";
send INSERT INTO t1 VALUES (1);

connection con2;
#follower of group commit
SET DEBUG_SYNC= "now WAIT_FOR con1_waiting";
SET DEBUG_SYNC= "commit_group_commit_queue SIGNAL con2_queued";
send INSERT INTO t1 VALUES (2);

connection master;
# we expact con1 and con2 should be both waiting for ACK
--sleep 1
SHOW PROCESSLIST;

connection slave;
START SLAVE;
--sleep 1

connection master;
SHOW PROCESSLIST;

#cleanup
DROP TABLE t1;
UNINSTALL PLUGIN rpl_semi_sync_master;

connection slave;
UNINSTALL PLUGIN rpl_semi_sync_slave;

--source include/rpl_end.inc

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-3074

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers