Slave fails to apply relay log if certain XA transaction exists

Bug #1662725 reported by Sveta Smirnova
14
This bug affects 3 people
Affects Status Importance Assigned to Milestone
MySQL Server
Unknown
Unknown
Percona Server moved to https://jira.percona.com/projects/PS
Status tracked in 5.7
5.5
New
Undecided
Unassigned
5.6
New
Undecided
Unassigned
5.7
Fix Released
Undecided
Vlad Lesin

Bug Description

There is some error in slave SQL thread which makes it to fail applying XA transactions, existing in the binary log.

It fails with different errors for the same binary log:

ast_SQL_Error: Error 'XAER_DUPID: The XID already exists' on query. Default database: 'test'. Query: 'XA START X'65323638313162392d636631372d386433632d666334332d3138343136646666623339662c2d613361642d306535333535646465323066',X'',1'

ast_Error: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the ACTIVE state' on query. Default database: 'test'. Query: 'XA START X'39633161346565632d326238352d326538662d316338622d3530646539373032323666302c2d383633622d663862666239353534383032',X'',1'
Exec_Master_Log_Pos: 109981795

Error in Xid_log_event: Commit could not be completed, 'XAER_RMFAIL: The command cannot be executed when global transaction is in the NON-EXISTING state'

How to repeat:

1. Setup master-slave, but don't start slave yet
2. Copy binary log file into master's data directory, edit index file accordingly
3.

stop slave;
reset slave all;
drop database nsm;
reset master;
change master to master_host='127.0.0.1', master_port=19388, master_user='rsandbox', master_password='rsandbox', master_log_file='CORRUPTED-bin.000001', master_log_pos=4;
start slave;
4. Wait when slave receives error

Unfortunately we were not able to find sequence of actions which lead to creation of such a binary log, but I believe this happens in code which encodes/decodes XID name

All details and data in ticket #159852. Contact Sveta, Przemek or Marcos if you need help in getting corrupted binary log.

Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

I was able to repeat generation of such binary logs on slave using dump of customer's data and entries in their binary log. I cannot repeat same issue with artificial SQL.

Instructions:

1. Use mysql-5.7.17 or Percona Server 5.7.17. I used MySQL, thus master port is 13000 and slave port is 13001. In case if you use Percona Server adjust ports accordingly (13001 and 13002). I also assume you put test files into same mysql-test dir. If not adjust paths accordingly.
2. cd mysql-test
3. ./mtr --start --suite=rpl rpl_alter --mysqld=--gtid_mode=ON --mysqld=--log-slave-updates --mysqld=--enforce-gtid-consistency --mysqld=--binlog-format=row --mysqld=--lower-case-table-names --mysqld=--transaction-isolation=read-committed &
4. ../bin/mysql -uroot -h127.0.0.1 -P13000 -e "create database XXX"
5. ../bin/mysql -uroot -h127.0.0.1 -P13000 XXX < XXX.sql
6. Now connect two clients to master and run tests (order matters!):

session 1> source 36742_nogtids.sql.1

session 2> source 36743_nogtids.sql.1
session 2> source 36743_nogtids.sql.2

session 1> source 36742_nogtids.sql.2
session 1> source 36742_nogtids.sql.3
session 1> source 36742_nogtids.sql.4
session 1> source 36742_nogtids.sql.5
session 1> source 36742_nogtids.sql.6

session 2> source 36743_nogtids.sql.3
session 2> source 36743_nogtids.sql.4
session 2> source 36743_nogtids.sql.5
session 2> source 36743_nogtids.sql.6

session 1> \d ;
session 1> xa commit 'e26811b9-cf17-8d3c-fc43-18416dffb39f,-a3ad-0e5355dde20f';

session 2> \d ;
session 2> xa commit '9c1a4eec-2b85-2e8f-1c8b-50de970226f0,-863b-f8bfb9554802';

7. Now connect to slave and run SHOW SLAVE STATUS\G

You will see something like this:

      Slave_SQL_Running_State: System lock
...
           Retrieved_Gtid_Set: b2184f1d-fa76-11e6-9631-90b11c55312a:1-1429
            Executed_Gtid_Set: b2184f1d-fa76-11e6-9631-90b11c55312a:1-1426

8. After some time slave will fail with error:

Last_Errno: 1205
Last_Error: Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

All data is in the ticket.

Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

I added smaller test case (requires 4 tables instead of full database) to the ticket.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

--source include/have_innodb.inc
--source include/have_binlog_format_row.inc
--source include/master-slave.inc
--source include/count_sessions.inc

CREATE TABLE t1 (t1_pk INT PRIMARY KEY, t1_blob BLOB) ENGINE=InnoDB;

INSERT INTO t1 VALUES (1, REPEAT("a", 65000));
INSERT INTO t1 VALUES (2, REPEAT("a" ,65000));
INSERT INTO t1 VALUES (3, REPEAT("a", 5165));
INSERT INTO t1 VALUES (4, REPEAT("a", 5158));

CREATE TABLE t2 (t2_pk INT PRIMARY KEY, t1_pk INT,
  FOREIGN KEY (t1_pk) REFERENCES t1 (t1_pk)) ENGINE=InnoDB;

XA START '1';

--connect(con2,localhost,root)
XA START '2';
INSERT INTO t1 VALUES (5, NULL);

--connection master
INSERT INTO t1 VALUES (6, REPEAT("a", 8000));

INSERT INTO t2 VALUES (4, 6);

XA END '1';
XA PREPARE '1';

--connection con2
XA END '2';
XA PREPARE '2';

--connection master
XA COMMIT '1';

--connection con2
XA COMMIT '2';

--source include/sync_slave_sql_with_master.inc

--disconnect con2

--connection master
DROP TABLE t2, t1;
--source include/wait_until_count_sessions.inc

--source include/rpl_end.inc

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Master and slave option file content for the above MTR:

--gtid-mode=ON --log-slave-updates --enforce-gtid-consistency --lower-case-table-names --transaction-isolation=read-committed

tags: added: upstream
Revision history for this message
Kenny Gryp (gryp) wrote :

This is what you would get in INNODB_LOCKS:
perconaserver mysql> select * from information_schema.INNODB_LOCKS;
+------------------+-------------+-----------+-----------+------------+------------+------------+-----------+----------+------------------------+
| lock_id | lock_trx_id | lock_mode | lock_type | lock_table | lock_index | lock_space | lock_page | lock_rec | lock_data |
+------------------+-------------+-----------+-----------+------------+------------+------------+-----------+----------+------------------------+
| 2238925:347:13:1 | 2238925 | X | RECORD | `xa`.`t1` | PRIMARY | 347 | 13 | 1 | supremum pseudo-record |
| 2238923:347:13:1 | 2238923 | S | RECORD | `xa`.`t1` | PRIMARY | 347 | 13 | 1 | supremum pseudo-record |
+------------------+-------------+-----------+-----------+------------+------------+------------+-----------+----------+------------------------+

Revision history for this message
Kenny Gryp (gryp) wrote :

SHOW ENGINE INNODB TRANSACTIONS output:

---TRANSACTION 2238938, ACTIVE 2 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 18, OS thread handle 140294237603584, query id 96216 System lock
------- TRX HAS BEEN WAITING 2 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 347 page no 13 n bits 72 index PRIMARY of table `xa`.`t1` trx id 2238938 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

------------------
TABLE LOCK table `xa`.`t1` trx id 2238938 lock mode IX
RECORD LOCKS space id 347 page no 13 n bits 72 index PRIMARY of table `xa`.`t1` trx id 2238938 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

---TRANSACTION 2238923, ACTIVE (PREPARED) 1085 sec recovered trx
4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 2
TABLE LOCK table `xa`.`t1` trx id 2238923 lock mode IX
TABLE LOCK table `xa`.`t2` trx id 2238923 lock mode IX
RECORD LOCKS space id 347 page no 13 n bits 72 index PRIMARY of table `xa`.`t1` trx id 2238923 lock mode S
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

RECORD LOCKS space id 347 page no 14 n bits 72 index PRIMARY of table `xa`.`t1` trx id 2238923 lock_mode X locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 80000006; asc ;;
 1: len 6; hex 0000002229cb; asc ") ;;
 2: len 7; hex ee000001d80110; asc ;;
 3: len 30; hex 616161616161616161616161616161616161616161616161616161616161; asc aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa; (total 8000 bytes);

Revision history for this message
Kenny Gryp (gryp) wrote :

Given the fact that this is also a 'supremum pseudo-record' issue. This bug might be very likely related: https://bugs.mysql.com/bug.php?id=86486

Revision history for this message
Kenny Gryp (gryp) wrote :

NOTE FOR PEOPLE HAVING THIS ISSUE:

You can 'workaround' this issue by restarting the database, which causes the locks by the XA transaction that is already prepared to be gone and replication will just continue.

(no need to restore slaves from backups, skip binlog events...)

Revision history for this message
Vlad Lesin (vlad-lesin) wrote :

The issue was already fixed in 5.7.18 with the following commit:

commit c90aba97ee76f65770f6b4641edef1c32db58279
Author: Debarun Banerjee <email address hidden>
Date: Sun Nov 13 10:31:35 2016 +0530

    BUG#25082593 FOREIGN KEY VALIDATION DOESN'T NEED TO ACQUIRE GAP LOCK
                 IN READ COMMITTED

Revision history for this message
Sergey Korobitsin (thundertaker) wrote :

Confirming this (or very similar) issue still exists in our production configuration:

* percona-server-server 5.7.18-15-1.jessie (master)
* percona-server-server 5.7.19-17-1.jessie (slave)

Revision history for this message
Sergey Korobitsin (thundertaker) wrote :

"Workaround" from Kenny Gryp helps, but (in our workload) for ~30 minutes, then it fails again, restarting, working for some time, fail, etc. Not an option for us.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-3636

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.