MTS breaks in after restart

Bug #1420606 reported by Jeremy Tinley on 2015-02-11
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
Fix Released
High
Sergei Glushchenko
5.6
Fix Released
High
Sergei Glushchenko

Bug Description

Am able to break multi-threaded slave (SQL thread) in Percona Server 5.6.22-71 with a 1755 error after restarting the server a few times.

Master:
24 schemas (db1, db2.. etc to db24)
Simple bash script inserting an auto-increment value randomly into a different database.

Replica has the following:
master_info_repository = TABLE
relay_log_info_repository = TABLE
slave_parallel_workers = 16

While running ~100 copies of the attached script to generate insert traffic, I would restart the replica. This will eventually lead to an error similar to the following:

2015-02-11 00:09:17 93614 [Note] Slave I/O thread: connected to master 'replication@127.0.0.1:3306',replication started in log 'mts1-bin.000001' at position 495366437
2015-02-11 00:09:17 93614 [Note] Event Scheduler: Loaded 0 events
2015-02-11 00:09:17 93614 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.22-71.0-log' socket: '/var/lib/mysql/mts2/mysql.sock' port: 3307 Percona Server (GPL), Release 71.0, Revision 726
2015-02-11 00:09:17 93614 [Note] Slave SQL thread initialized, starting replication in log 'mts1-bin.000001' at position 369721635, relay log '/var/lib/mysql/mts2/mts2-relay.000284' position: 55398130
2015-02-11 00:09:17 7f524f5f7700 InnoDB: Buffer pool(s) load completed at 150211 0:09:17
2015-02-11 00:09:19 93614 [Warning] Slave SQL: injecting ROLLBACK at the end of cold group, Error_code: 0
2015-02-11 00:09:19 93614 [ERROR] Slave SQL: Cannot execute the current event group in the parallel mode. Encountered event Xid, relay-log name /var/lib/mysql/mts2/mts2-relay.000286, position 435 which prevents execution of this event group in parallel mode. Reason: the event is a part of a group that is unsupported in the parallel execution mode. Error_code: 1755

I'm guessing that this merge: https://code.launchpad.net/~sergei.glushchenko/percona-server/5.6-ST43091-ps-bug1331586/+merge/229168 does not handle MTS correctly. I reproduced it a few times and every time the "injecting ROLLBACK at the end of cold group" message appeared prior to every breakage. The event that breaks is always the first event in the relay log.

I was unable to reproduce this in MYSQL Community Edition 5.6.22.

Attached is 'generate_insert_load.sh', the script I use to generate the inserts on the master. I ran 100 copies to generate inserts.

Changed in percona-server:
assignee: nobody → Muhammad Irfan (muhammad-irfan)
Muhammad Irfan (muhammad-irfan) wrote :

I tried to reproduce it with no luck. I executed more than 100 copies of attached script and restarted slave in between but no error, slave gets back as normal. Can you please provide my.cnf for master and affected slave please along with error log too from slave server.

Jeremy Tinley (techwolf359) wrote :

Attaching my-mts2.cnf

Jeremy Tinley (techwolf359) wrote :

One mistake, the cnf file should have
slave_parallel_workers = 16

Jeremy Tinley (techwolf359) wrote :

Adding error log. This shows a start-up, shutdown, and a start-up which leads to a 1755 error on replication.

In every case of 1755, the following message appears in the log after starting replication:
[Warning] Slave SQL: injecting ROLLBACK at the end of cold group, Error_code: 0

Jeremy Tinley (techwolf359) wrote :

Also worth noting that setting slave_parallel_workers to 0 and starting replication allows it to recover.

Muhammad Irfan (muhammad-irfan) wrote :

I am able to produce the problem as described with slave_parallel_workers = 16

mysql> SHOW GLOBAL VARIABLES LIKE '%version%';
+-------------------------+--------------------------------------------------+
| Variable_name | Value |
+-------------------------+--------------------------------------------------+
| innodb_version | 5.6.22-rel72.0 |
| protocol_version | 10 |
| slave_type_conversions | |
| version | 5.6.22-72.0-log |
| version_comment | Percona Server (GPL), Release 72.0, Revision 738 |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
+-------------------------+--------------------------------------------------+

2015-03-18 00:07:42 6413 [Note] Slave: MTS Recovery has completed at relay log ./mysql_sandbox21088-relay-bin.000007, position 6124140 master log mysql-bin.000007, position 16222833.
2015-03-18 00:07:42 6413 [ERROR] Slave SQL: Cannot execute the current event group in the parallel mode. Encountered event Xid, relay-log name ./mysql_sandbox21088-relay-bin.000009, position 395 which prevents execution of this event group in parallel mode. Reason: the event is a part of a group that is unsupported in the parallel execution mode. Error_code: 1755
2015-03-18 00:07:42 6413 [Warning] Slave: Cannot execute the current event group in the parallel mode. Encountered event Xid, relay-log name ./mysql_sandbox21088-relay-bin.000009, position 395 which prevents execution of this event group in parallel mode. Reason: the event is a part of a group that is unsupported in the parallel execution mode. Error_code: 1755
2015-03-18 00:07:42 6413 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000007' position 15873496

As soon as I enabled relay_log_recovery = 1 It turn's out to different error message as below:

2015-03-18 00:32:52 1731 [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS
2015-03-18 00:32:52 1731 [ERROR] Failed to initialize the master info structure
2015-03-18 00:32:52 1731 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.

mysql> > show slave status\G
*************************** 1. row ***************************
             Slave_IO_Running: No
            Slave_SQL_Running: No
                   Last_Errno: 1872
                   Last_Error: Slave failed to initialize relay log info structure from the repository
               Last_SQL_Errno: 1872
               Last_SQL_Error: Slave failed to initialize relay log info structure from the repository

tags: added: regression
removed: 5.6

Last events from relay log:

SET TIMESTAMP=1427450551/*!*/;
BEGIN
/*!*/;
# at 1803382
# at 1803414
#150327 17:02:31 server id 1 end_log_pos 52539808 CRC32 0x279389ee Intvar
SET INSERT_ID=8528/*!*/;
#150327 17:02:31 server id 1 end_log_pos 52539922 CRC32 0x0f2cdcd8 Query thread_id=211974 exec_time=0 error_code=0
SET TIMESTAMP=1427450551/*!*/;
INSERT INTO db14.insert_test VALUES ( NULL )
/*!*/;
# at 1803528
#150327 17:02:31 server id 1 end_log_pos 52539953 CRC32 0x2a79d9ef Xid = 636398
COMMIT/*!*/;
# at 1803559
#150327 17:02:31 server id 1 end_log_pos 52540028 CRC32 0x17f9929e Query thread_id=211973 exec_time=0 error_code=0
SET TIMESTAMP=1427450551/*!*/;
BEGIN
/*!*/;
# at 1803634
#150327 17:02:32 server id 101 end_log_pos 1803657 CRC32 0x7992e758 Stop
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

The last group has BEGIN but has no statements.

Upstream has it's own fix for bug 1331586.
It passes our test case. I am going to revert our fix.

Author: Joao Gramacho <email address hidden>
Date: Mon Oct 20 08:24:05 2014 +0100

    Bug#18885916 RELAY LOG WITHOUT XID_LOG_EVENT MAY CASE PARALLEL
                 REPLICATION HANG

    Problem:
    =======

    With MTS, GTIDs and auto positioning enabled, when a worker applies a
    partial transaction left on relaylog by an IO thread reconnection, it
    will wait for the XID log event to commit the transaction.

    Unfortunately, the SQL thread coordinator will reach the master's
    ROTATE event on the next relaylog file and will wait for all workers
    to finish their tasks before applying the ROTATE.

    Analysis:
    ========

    As the whole transaction is retrieved again by the IO thread after the
    reconnection, the slave must rollback the partial transaction once
    noticing this ROTATE from the master.

    This bug reports the same issue already fixed by BUG#17326020, and the
    reported issue is not reproducible anymore. So, this patch is just
    adding a new test case.

Errr... Upstream fix is

Author: Joao Gramacho <email address hidden>
Date: Mon Aug 11 13:02:10 2014 +0100

    BUG#17326020 ASSERTION ON SLAVE AFTER STOP/START SLAVE
                 USING MTS+GTID REPLICATION

    Problem:
    =======

    When the IO thread reconnects to a master using GTID auto positioning
    while in the middle of a transaction, it will left the partial
    transaction on the relaylog and will fully retrieve the same
    transaction again.

    If the slave is configured to use MTS, it will hit an assert "!worker"
    once reaching the ROTATE_LOG_EVENT send by the master in the IO thread
    reconnection.

    Analysis:
    ========

    Once a slave with MTS reaches the beginning of a group (by an
    GTID_LOG_EVENT or QUERY(BEGIN)), it will expect to feed the same
    worker with events until reaching the end of the transaction. No
    events that should be applied synchronously (by the MTS coordinator
    with all workers waiting for jobs) can be applied while MTS is feeding
    a worker with events. The assertion exists to stop the server execution
    in such situations.

    The correct behavior of the slave once knowing that a transaction was
    left in the middle and will not finish (as it will be applied again
    from the beginning later) is to abort this transaction.

    STS SQL thread knows how to rollback the incomplete transaction, but
    MTS doesn't know how to do it yet.

    Fix:
    ===

    The SQL slave will now check if it is going to apply a synchronous
    ROTATE_LOG_EVENT sent by the master during GTID auto negotiation after
    IO thread reconnection.

    Before applying this ROTATE_LOG_EVENT, the SQL slave will check also if
    it is in the middle of a group. If it is, it will queue to the current
    worker a QUERY(ROLLBACK) event to make the worker gracefully finish its
    work and, only then, will let the MTS coordinator to apply the
    ROTATE_LOG_EVENT in synchronous mode.

    @ sql/rpl_slave.cc

    Added code into exec_relay_log_event() to inject a QUERY(ROLLBACK)
    event if necessary to make the current worker gracefully finish its
    job before applying an event that needs synchronous MTS execution.

Thomas Babut (thbabut) wrote :

Unfortunately I have to say, that the problem still exists (CentOS 7.1, Percona 5.6.24-72.2). After the first restart of the slave with MTS enabled there was no problem, but after the second restart the slave was broken:

May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS
May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [ERROR] Failed to initialize the master info structure
May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.

Thomas, can you please report it as a new bug? Please include master and slave configuration, and, if possible, the tail of the slave relay log

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-870

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers