Crash safe slave with gtid + less durable settings

Bug #1411872 reported by Jericho Rivera on 2015-01-16
This bug affects 2 people
Affects          Status                  Importance  Assigned to  Milestone
MySQL Server     Unknown                 Unknown
Percona Server   Status tracked in 5.7
  5.5                                    Undecided   Unassigned
  5.6                                    High        Vlad Lesin
  5.7                                    High        Vlad Lesin

Bug Description

When using GTID with less durable settings (sync_binlog != 1 and innodb_flush_log_at_trx_commit != 1), after crash recovery on a slave, the following problems may happen.

1) The slave loses transactions if the slave's binary log is ahead of InnoDB (the binary log has N transactions that are not persisted in InnoDB => N transactions are lost).

2) The slave stops with a duplicate key error if the binary log is behind InnoDB (InnoDB has N transactions that are not persisted in the binary log).

Without GTID, neither case causes a problem, because the slave can resume replication from the position stored in the slave_relay_log_info InnoDB table.
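The two cases can be illustrated with a toy model. This is only a sketch of the reasoning above, not server code: ToySlave, DuplicateKeyError and the integer "GTIDs" are hypothetical names invented for the illustration, and the model assumes (as the repro below suggests) that after a restart the slave decides what to fetch from what its binary log contains, not from what InnoDB contains.

```python
from dataclasses import dataclass, field


class DuplicateKeyError(Exception):
    """Raised when a row with an existing primary key is inserted."""


@dataclass
class ToySlave:
    # Primary key values of t1 that InnoDB has durably stored.
    innodb_rows: set = field(default_factory=set)
    # Transaction numbers recorded in the slave's own binary log.
    binlog_gtids: set = field(default_factory=set)

    def apply(self, gtid: int, row: int) -> None:
        if row in self.innodb_rows:
            raise DuplicateKeyError(f"duplicate key {row} (gtid {gtid})")
        self.innodb_rows.add(row)
        self.binlog_gtids.add(gtid)

    def resume(self, master_log: list) -> None:
        # Assumption: the slave skips every transaction its binary log
        # already lists, regardless of whether InnoDB really has it.
        for gtid, row in master_log:
            if gtid not in self.binlog_gtids:
                self.apply(gtid, row)


master_log = [(1, 1), (2, 2), (3, 3)]  # (gtid, inserted id)

# Case 1: binary log ahead of InnoDB. Transaction 2 is in the binary log
# but its row never reached InnoDB, so it is skipped and lost.
ahead = ToySlave(innodb_rows={1}, binlog_gtids={1, 2})
ahead.resume(master_log)

# Case 2: binary log behind InnoDB. Row 2 is in InnoDB but transaction 2
# is not in the binary log, so it is applied a second time.
behind = ToySlave(innodb_rows={1, 2}, binlog_gtids={1})
try:
    behind.resume(master_log[:2])
    error = None
except DuplicateKeyError as exc:
    error = str(exc)

print(sorted(ahead.innodb_rows), error)
```

In this model the first slave ends up with rows 1 and 3 only, and the second one raises the duplicate key error on row 2, matching the two symptoms described in this report.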

How to repeat:
Set up master/slave replication with GTID enabled, and use less durable settings:

sync-binlog=0
innodb-flush-log-at-trx-commit=0
log-bin
log-slave-updates
gtid-mode=on
enforce-gtid-consistency
relay_log_info_repository=table
relay_log_recovery=1
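A quick way to tell whether a given option list falls under the condition in this report (GTID on, sync_binlog != 1 and innodb_flush_log_at_trx_commit != 1) is a small check like the one below. It is a minimal sketch: parse_options and is_at_risk are hypothetical helper names, and treating a missing option as 1 is a simplifying assumption, since the real defaults depend on the server version.

```python
def normalize(name: str) -> str:
    # mysqld accepts '-' and '_' interchangeably in option names.
    return name.strip().lower().replace("-", "_")


def parse_options(text: str) -> dict:
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";", "[")):
            continue  # skip blanks, comments and [section] headers
        name, _, value = line.partition("=")
        # A bare option such as "log-bin" is recorded as enabled.
        options[normalize(name)] = value.strip() or "on"
    return options


def is_at_risk(options: dict) -> bool:
    gtid_on = options.get("gtid_mode", "off").lower() == "on"
    # Assumption: an option that is not listed is treated as 1 here.
    binlog_relaxed = options.get("sync_binlog", "1") != "1"
    innodb_relaxed = options.get("innodb_flush_log_at_trx_commit", "1") != "1"
    return gtid_on and binlog_relaxed and innodb_relaxed


repro_config = """
sync-binlog=0
innodb-flush-log-at-trx-commit=0
log-bin
log-slave-updates
gtid-mode=on
enforce-gtid-consistency
relay_log_info_repository=table
relay_log_recovery=1
"""

durable_config = repro_config.replace("sync-binlog=0", "sync-binlog=1")

print(is_at_risk(parse_options(repro_config)))
print(is_at_risk(parse_options(durable_config)))
```

The check follows the wording of the bug description literally, so it flags only configurations where both settings are relaxed.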

master> create table t1 (id int primary key) engine=innodb;
master> insert into t1 values (1);

1) Losing transaction

Attach gdb to the slave's mysqld process, then set a breakpoint on the trx_commit_complete_for_mysql function.

master> insert into t1 values (2);
-- hit breakpoint on the slave
-- binlog on the slave should have "insert into t1 values (2)"

Kill the slave's mysqld with kill -9.

Restart the slave's mysqld, then restart replication.

master> insert into t1 values (3);

The slave has only rows 1 and 3. Row 2 is missing.

2) Duplicate key error

slave> select @@global.gtid_executed; --> x

copy slave's binary log somewhere --> (a)

master> insert into t1 values (2);

slave> select @@global.gtid_executed; --> y

kill -9 slave's mysqld_safe and mysqld

restore slave's binary log taken at (a) (simulating "transactions are persisted in InnoDB but not in binary log")

Restart the slave, then restart replication.
=> the slave stops with a duplicate key error

slave> select @@global.gtid_executed; --> x

The slave therefore tries to resume replication from GTID x and re-applies a transaction that InnoDB already contains, so it fails with a duplicate key error.
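The transactions that get applied twice are exactly those in y but not in x. The server can compute this with its GTID_SUBTRACT() function; the Python below is only a rough offline equivalent for comparing two recorded gtid_executed values. parse_gtid_set and gtid_subtract are hypothetical helpers, the UUID is a made-up example, and the parser assumes the usual "uuid:N-M[:N-M...]" text form with small ranges.

```python
def parse_gtid_set(text: str) -> dict:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: {transaction numbers}}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single number "7" is the interval 7-7.
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def gtid_subtract(a: str, b: str) -> dict:
    """Return the transactions that are in set a but not in set b."""
    left, right = parse_gtid_set(a), parse_gtid_set(b)
    diff = {}
    for uuid, numbers in left.items():
        missing = numbers - right.get(uuid, set())
        if missing:
            diff[uuid] = sorted(missing)
    return diff


# Hypothetical values standing in for x and y from the steps above.
master_uuid = "3e11fa47-71ca-11e1-9e33-c80aa9429562"
x = f"{master_uuid}:1-2"
y = f"{master_uuid}:1-3"

print(gtid_subtract(y, x))
```

With these sample values the difference is transaction 3 from the master's UUID, which is the one the slave would fetch and apply again after the restart.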
