Bug #1420606 “MTS breaks in after restart” : Bugs : Percona Server moved to https://jira.percona.com/projects/PS

Jeremy Tinley (techwolf359) wrote on 2015-02-11:

#1

Script to generate lots of insert traffic at a master (379 bytes, text/x-sh)

Muhammad Irfan (muhammad-irfan) on 2015-02-17

Changed in percona-server:
assignee:	nobody → Muhammad Irfan (muhammad-irfan)

Muhammad Irfan (muhammad-irfan) wrote on 2015-02-17:

#2

I tried to reproduce it with no luck. I ran more than 100 copies of the attached script and restarted the slave in between, but got no error; the slave came back as normal. Can you please provide my.cnf for the master and the affected slave, along with the error log from the slave server?

Jeremy Tinley (techwolf359) wrote on 2015-02-17:

#3

my-mts2.cnf (3.6 KiB, text/plain)

Attaching my-mts2.cnf

Jeremy Tinley (techwolf359) wrote on 2015-02-17:

#4

One correction: the cnf file should have
slave_parallel_workers = 16

Jeremy Tinley (techwolf359) wrote on 2015-02-17:

#5

error.txt (11.4 KiB, text/plain)

Adding error log. This shows a start-up, shutdown, and a start-up which leads to a 1755 error on replication.

In every case of 1755, the following message appears in the log after starting replication:
[Warning] Slave SQL: injecting ROLLBACK at the end of cold group, Error_code: 0
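
To check an error log for this pattern, something like the following could be used. It is only a sketch: the function name `summarize_error_log` is made up for illustration, and the match strings are taken from the messages quoted in this report, so they may need adjusting for other server versions.

```python
import re

# Match strings copied from the log messages quoted in this bug report.
COLD_GROUP_WARNING = "injecting ROLLBACK at the end of cold group"
ERROR_1755 = re.compile(r"\[ERROR\] Slave SQL: .*Error_code: 1755")


def summarize_error_log(lines):
    """Count cold-group ROLLBACK warnings and 1755 errors in error-log lines.

    Hypothetical helper: it only counts occurrences, it does not try to
    pair each error with a particular warning.
    """
    warnings = 0
    errors = 0
    for line in lines:
        if COLD_GROUP_WARNING in line:
            warnings += 1
        if ERROR_1755.search(line):
            errors += 1
    return {"cold_group_warnings": warnings, "errors_1755": errors}
```

Feeding it the lines of the slave error log (for example `summarize_error_log(open("error.txt"))`) gives a quick view of whether the warning shows up alongside the 1755 errors.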

Jeremy Tinley (techwolf359) wrote on 2015-02-17:

#6

Also worth noting that setting slave_parallel_workers to 0 and starting replication allows it to recover.

Muhammad Irfan (muhammad-irfan) wrote on 2015-03-18:

#8

I am able to reproduce the problem as described with slave_parallel_workers = 16:

2015-03-18 00:07:42 6413 [Note] Slave: MTS Recovery has completed at relay log ./mysql_sandbox21088-relay-bin.000007, position 6124140 master log mysql-bin.000007, position 16222833.
2015-03-18 00:07:42 6413 [ERROR] Slave SQL: Cannot execute the current event group in the parallel mode. Encountered event Xid, relay-log name ./mysql_sandbox21088-relay-bin.000009, position 395 which prevents execution of this event group in parallel mode. Reason: the event is a part of a group that is unsupported in the parallel execution mode. Error_code: 1755
2015-03-18 00:07:42 6413 [Warning] Slave: Cannot execute the current event group in the parallel mode. Encountered event Xid, relay-log name ./mysql_sandbox21088-relay-bin.000009, position 395 which prevents execution of this event group in parallel mode. Reason: the event is a part of a group that is unsupported in the parallel execution mode. Error_code: 1755
2015-03-18 00:07:42 6413 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000007' position 15873496

As soon as I enabled relay_log_recovery = 1, it turned into a different error message, as below:

2015-03-18 00:32:52 1731 [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS
2015-03-18 00:32:52 1731 [ERROR] Failed to initialize the master info structure
2015-03-18 00:32:52 1731 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_Running: No
            Slave_SQL_Running: No
                   Last_Errno: 1872
                   Last_Error: Slave failed to initialize relay log info structure from the repository
               Last_SQL_Errno: 1872
               Last_SQL_Error: Slave failed to initialize relay log info structure from the repository

Laurynas Biveinis (laurynas-biveinis) on 2015-03-19

tags:

added: regression
removed: 5.6

Sergei Glushchenko (sergei.glushchenko) wrote on 2015-03-28:

#9

Last events from relay log:

SET TIMESTAMP=1427450551/*!*/;
BEGIN
/*!*/;
# at 1803382
# at 1803414
#150327 17:02:31 server id 1 end_log_pos 52539808 CRC32 0x279389ee Intvar
SET INSERT_ID=8528/*!*/;
#150327 17:02:31 server id 1 end_log_pos 52539922 CRC32 0x0f2cdcd8 Query thread_id=211974 exec_time=0 error_code=0
SET TIMESTAMP=1427450551/*!*/;
INSERT INTO db14.insert_test VALUES ( NULL )
/*!*/;
# at 1803528
#150327 17:02:31 server id 1 end_log_pos 52539953 CRC32 0x2a79d9ef Xid = 636398
COMMIT/*!*/;
# at 1803559
#150327 17:02:31 server id 1 end_log_pos 52540028 CRC32 0x17f9929e Query thread_id=211973 exec_time=0 error_code=0
SET TIMESTAMP=1427450551/*!*/;
BEGIN
/*!*/;
# at 1803634
#150327 17:02:32 server id 101 end_log_pos 1803657 CRC32 0x7992e758 Stop
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

The last group has BEGIN but has no statements.
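
A rough way to spot this condition in a text dump produced by mysqlbinlog is sketched below. The function name `last_group_is_open` is hypothetical, and the check is deliberately naive: it assumes BEGIN appears on its own line and COMMIT starts a line, as in the dump above, and it ignores the trailing ROLLBACK that mysqlbinlog itself appends.

```python
def last_group_is_open(mysqlbinlog_text):
    """Return True if the last BEGIN in the dump has no COMMIT after it.

    Naive line-based check for illustration only; a line consisting of
    BEGIN inside a multi-line statement body would confuse it.
    """
    in_group = False
    for raw in mysqlbinlog_text.splitlines():
        line = raw.strip()
        if line == "BEGIN":
            in_group = True
        elif line.startswith("COMMIT"):
            in_group = False
        # "ROLLBACK /* added by mysqlbinlog */" is emitted by the tool,
        # not read from the log, so it is intentionally not treated as
        # the end of a group here.
    return in_group
```

For the relay log tail shown above this returns True, because the final BEGIN is followed only by the Stop event and the end-of-file marker.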

Sergei Glushchenko (sergei.glushchenko) wrote on 2015-03-31:

#10

Upstream has its own fix for bug 1331586.
It passes our test case. I am going to revert our fix.

Author: Joao Gramacho <email address hidden>
Date: Mon Oct 20 08:24:05 2014 +0100

Bug#18885916 RELAY LOG WITHOUT XID_LOG_EVENT MAY CASE PARALLEL
REPLICATION HANG

Problem:
=======

    With MTS, GTIDs and auto positioning enabled, when a worker applies a
    partial transaction left on relaylog by an IO thread reconnection, it
    will wait for the XID log event to commit the transaction.

    Unfortunately, the SQL thread coordinator will reach the master's
    ROTATE event on the next relaylog file and will wait for all workers
    to finish their tasks before applying the ROTATE.

Analysis:
========

    As the whole transaction is retrieved again by the IO thread after the
    reconnection, the slave must rollback the partial transaction once
    noticing this ROTATE from the master.

    This bug reports the same issue already fixed by BUG#17326020, and the
    reported issue is not reproducible anymore. So, this patch is just
    adding a new test case.

Sergei Glushchenko (sergei.glushchenko) wrote on 2015-03-31:

#11

Errr... Upstream fix is

Author: Joao Gramacho <email address hidden>
Date: Mon Aug 11 13:02:10 2014 +0100

BUG#17326020 ASSERTION ON SLAVE AFTER STOP/START SLAVE
USING MTS+GTID REPLICATION

Problem:
=======

    When the IO thread reconnects to a master using GTID auto positioning
    while in the middle of a transaction, it will left the partial
    transaction on the relaylog and will fully retrieve the same
    transaction again.

    If the slave is configured to use MTS, it will hit an assert "!worker"
    once reaching the ROTATE_LOG_EVENT send by the master in the IO thread
    reconnection.

Analysis:
========

    Once a slave with MTS reaches the beginning of a group (by an
    GTID_LOG_EVENT or QUERY(BEGIN)), it will expect to feed the same
    worker with events until reaching the end of the transaction. No
    events that should be applied synchronously (by the MTS coordinator
    with all workers waiting for jobs) can be applied while MTS is feeding
    a worker with events. The assertion exists to stop the server execution
    in such situations.

    The correct behavior of the slave once knowing that a transaction was
    left in the middle and will not finish (as it will be applied again
    from the beginning later) is to abort this transaction.

    STS SQL thread knows how to rollback the incomplete transaction, but
    MTS doesn't know how to do it yet.

Fix:
===

    The SQL slave will now check if it is going to apply a synchronous
    ROTATE_LOG_EVENT sent by the master during GTID auto negotiation after
    IO thread reconnection.

    Before applying this ROTATE_LOG_EVENT, the SQL slave will check also if
    it is in the middle of a group. If it is, it will queue to the current
    worker a QUERY(ROLLBACK) event to make the worker gracefully finish its
    work and, only then, will let the MTS coordinator to apply the
    ROTATE_LOG_EVENT in synchronous mode.

@ sql/rpl_slave.cc

    Added code into exec_relay_log_event() to inject a QUERY(ROLLBACK)
    event if necessary to make the current worker gracefully finish its
    job before applying an event that needs synchronous MTS execution.
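
The idea in the commit message can be illustrated with a toy model. This is not the code from sql/rpl_slave.cc: the function name `schedule` and the string event names are invented for the example, and unlike the real fix it treats every ROTATE the same way instead of only the one sent by the master after an IO thread reconnection.

```python
def schedule(events):
    """Toy model of the coordinator: return events in the order handed out.

    `events` is a list of strings such as "BEGIN", "INSERT", "XID",
    "ROTATE". If a ROTATE arrives while a group is still open, a
    ROLLBACK is injected first so the worker can finish its group.
    """
    out = []
    in_group = False
    for ev in events:
        if ev == "BEGIN":
            in_group = True
        elif ev in ("XID", "COMMIT", "ROLLBACK"):
            in_group = False
        elif ev == "ROTATE" and in_group:
            # Close the partial group before the synchronous event.
            out.append("ROLLBACK (injected)")
            in_group = False
        out.append(ev)
    return out
```

With a partial transaction followed by a ROTATE and the re-fetched transaction, `schedule(["BEGIN", "INSERT", "ROTATE", "BEGIN", "INSERT", "XID"])` places the injected ROLLBACK right before the ROTATE and leaves the complete transaction untouched.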

Laurynas Biveinis (laurynas-biveinis) wrote on 2015-04-30:

#12

https://github.com/percona/percona-server/pull/59

Thomas Babut (thbabut) wrote on 2015-05-11:

#13

Unfortunately, I have to say that the problem still exists (CentOS 7.1, Percona 5.6.24-72.2). After the first restart of the slave with MTS enabled there was no problem, but after the second restart the slave was broken:

May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS
May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [ERROR] Failed to initialize the master info structure
May 11 14:19:13 server mysqld: 2015-05-11 14:19:13 25016 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.

Laurynas Biveinis (laurynas-biveinis) wrote on 2015-05-12:

#14

Thomas, can you please report it as a new bug? Please include the master and slave configuration and, if possible, the tail of the slave relay log.

Laurynas Biveinis (laurynas-biveinis) wrote on 2015-05-13:

#15

Bug 1454272

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-25:

#16

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-870

Affects	Status	Importance	Assigned to	Milestone
Percona Server moved to https://jira.percona.com/projects/PS	Fix Released	High	Sergei Glushchenko	5.6.24-72.2
5.6	Fix Released	High	Sergei Glushchenko	5.6.24-72.2