Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

All nodes on cluster crash on foreign key check

Bug #1692745 reported by Marcelo Altmann on 2017-05-23

This bug affects 5 people

	Status	Importance	Assigned to
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.6	Fix Released	Critical	Unassigned
5.7	Fix Released	Critical	Unassigned

Bug Description

On a 3 nodes pxc cluster, the whole cluster goes down when foreign key constrain fail.

2017-05-22 20:37:29 983 [ERROR] Slave SQL: Could not execute Delete_rows event on table fk.parent; Cannot delete or update a parent row: a foreign key constraint fails (`fk`.`child`, CONSTRAINT `fk_parent` FOREIGN KEY (`pID`) REFERENCES `parent` (`ID`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 158, Error_code: 1451
2017-05-22 20:37:29 983 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 203
2017-05-22 20:37:29 983 [Warning] WSREP: Failed to apply app buffer: seqno: 203, status: 1
at galera/src/trx_handle.cpp:apply():351

How to reproduce:
simplified verion of https://jira.mariadb.org/browse/MDEV-12398

-- Prepare -

CREATE DATABASE fk;
CREATE TABLE fk.parent (ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY, b varchar(10));
CREATE TABLE fk.child (ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY, pID INT, CONSTRAINT fk_parent FOREIGN KEY (pID) REFERENCES fk.parent (ID) ON DELETE NO ACTION ON UPDATE NO ACTION);
-- Node1
cat << EOF > node1.sh
#!/bin/bash
MYSQL="mysql -u root -psekret"
MYSQLADMIN="mysqladmin -u root -psekret"
while : ; do

    ID=\`\$MYSQL -BN -e "INSERT INTO fk.parent VALUES (NULL, 'test'); SELECT LAST_INSERT_ID();"\`
    sleep 1
    \$MYSQL -e "DELETE FROM fk.parent WHERE ID=\$ID"
    \$MYSQLADMIN ping > /dev/null
    if [[ \$? -ne 0 ]]
    then
       break
    fi
done
EOF

chmod +x node1.sh

-- add delay
tc qdisc add dev eth0 root handle 1: netem delay 35ms

./node1.sh

-- Node2
cat << EOF > node2.sh
#!/bin/bash
MYSQL="mysql -u root -psekret"
MYSQLADMIN="mysqladmin -u root -psekret"
while : ; do

    ID=\`\$MYSQL -BN -e "INSERT INTO fk.child SELECT NULL, ID FROM fk.parent ORDER BY ID DESC LIMIT 1; SELECT LAST_INSERT_ID();"\`
    sleep 2
    \$MYSQL -e "DELETE FROM fk.child WHERE ID=\$ID"
    \$MYSQLADMIN ping > /dev/null
    if [[ \$? -ne 0 ]]
    then
       break
    fi
done
EOF

chmod +x node2.sh

tc qdisc add dev eth0 root handle 1: netem delay 35ms
./node2.sh

-- Cleanup
DELETE FROM fk.child; DELETE FROM fk.parent;

-- tested
5.7.17-13-57
5.6.35-81.0-56

my.cnf attached

Tags:

Revision history for this message

Marcelo Altmann (marcelo.altmann) wrote on 2017-05-23:

my.cnf Edit (1.1 KiB, text/plain)

Revision history for this message

Steve Shipway (sshipway) wrote on 2017-05-24:

This affected us badly during data load last year. We worked around it by shutting down the other nodes of the cluster until all data was loaded, then running on a single thread for the replication.

Revision history for this message

Jericho Rivera (jericho-rivera) wrote on 2017-05-24:

I can confirm the crash on PXC 5.7 (latest), will commence testing on PXC 5.6

Changed in percona-xtradb-cluster:
assignee:	nobody → Jericho Rivera (jericho-rivera)

Revision history for this message

Jericho Rivera (jericho-rivera) wrote on 2017-05-30:

Confirmed on PXC 5.6 as well:

Node2:
May 30 02:51:01 localhost mysqld: 2017-05-30 02:51:01 911 [ERROR] Slave SQL: Could not execute Delete_rows event on table fk.parent; Cannot delete or update a parent row: a foreign key constraint fails (`fk`.`child`, CONSTRAINT `fk_parent` FOREIGN KEY (`pID`) REFERENCES `parent` (`ID`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 160, Error_code: 1451
May 30 02:51:01 localhost mysqld: 2017-05-30 02:51:01 911 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 56
May 30 02:51:01 localhost mysqld: 2017-05-30 02:51:01 911 [Warning] WSREP: Failed to apply app buffer: seqno: 56, status: 1
May 30 02:51:01 localhost mysqld: #011 at galera/src/trx_handle.cpp:apply():351

Node3:
May 30 02:51:01 localhost mysqld: 2017-05-30 02:51:01 936 [ERROR] Slave SQL: Could not execute Delete_rows event on table fk.parent; Cannot delete or update a parent row: a foreign key constraint fails (`fk`.`child`, CONSTRAINT `fk_parent` FOREIGN KEY (`pID`) REFERENCES `parent` (`ID`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 160, Error_code: 1451
May 30 02:51:01 localhost mysqld: 2017-05-30 02:51:01 936 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 56
May 30 02:51:01 localhost mysqld: 2017-05-30 02:51:01 936 [Warning] WSREP: Failed to apply app buffer: seqno: 56, status: 1
May 30 02:51:01 localhost mysqld: #011 at galera/src/trx_handle.cpp:apply():351

Revision history for this message

Florian (kartoffelheinz) wrote on 2017-07-19:

I Can also confirm the issue on PXC 5.6, though we "only" experience 1 out of 3 nodes to crash in a 3-way-master setup (same error though).

We were able to narrow the issue down to 2 consecutive transactions (first deletes from table that is referenced by foreign key which seem to be applied out of order by the 2 slaves. The transaction usually fails only on one node after 4 tries (which crashes the node then, because he thinks he is no longer consistent). The other slave node tries 1 or 2 times and then gets it right. I suspect some kind of out-of-order-committing, maybe there is a race condition in the code that makes sure transaction will be applied in order?

Btw, we double checked repl.commit_order to make sure it has been set to never apply out-of-order on any of the nodes... :)

Would be glad if someone from Percona could have a look if there is a configurable workaround until no patched version is released.

Sveta Smirnova (svetasmirnova) on 2017-09-29

tags:

added: i206952

Revision history for this message

Krunal Bauskar (krunal-bauskar) wrote on 2017-10-03:

Bug was fixed as part of upstream refresh from Codership 5.6.36 and 5.7.18.

PXC recent release PXC-5.6.37 and PXC-5.7.19 carry the said fix.

Release: https://www.percona.com/blog/2017/09/20/percona-xtradb-cluster-5-6-37-26-21-is-now-available/

Release https://www.percona.com/blog/2017/09/22/percona-xtradb-cluster-5-7-19-29-22-now-available/