Joiner hangs after failed SST
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Status tracked in 5.6 | | | |
| 5.5 | Won't Fix | Undecided | Unassigned | |
| 5.6 | Incomplete | Undecided | Kenn Takara | |
| 5.7 | Fix Committed | Undecided | Kenn Takara | |
Bug Description
How to repeat:
Start a 3-node cluster.
export PXC=/bigdisk/
export GALERA=
export PATH=$PXC:$PATH
./bin/mysqld --basedir=. --datadir=./data1 --socket=
./bin/mysqld --basedir=. --datadir=./data2 --socket=
./bin/mysqld --basedir=. --datadir=./data3 --socket=
Make sure the cluster works and data is replicated.
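A quick way to confirm this before breaking the cluster is to check the wsrep status on every member; a minimal sketch, assuming the nodes were started with socket paths /tmp/pxc1.sock, /tmp/pxc2.sock and /tmp/pxc3.sock (the real paths are truncated in the commands above):

```bash
# Hypothetical socket paths; substitute whatever --socket= values the nodes were started with.
for SOCK in /tmp/pxc1.sock /tmp/pxc2.sock /tmp/pxc3.sock; do
    # Every member should report wsrep_cluster_size = 3 and state "Synced".
    mysql -uroot -S "$SOCK" -e \
        "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_local_state_comment');"
done
```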
Stop node1.
Create a table and load a lot of data into it:
USE test;
DROP TABLE IF EXISTS `joinit`;
CREATE TABLE `joinit` (
`i` int(11) NOT NULL AUTO_INCREMENT,
`s` varchar(64) DEFAULT NULL,
`t` time NOT NULL,
`g` int(11) NOT NULL,
PRIMARY KEY (`i`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO joinit VALUES (NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )));
INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
...
INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
mysql> select count(*) from joinit;
+----------+
| count(*) |
+----------+
| 67108864 |
+----------+
1 row in set (28.93 sec)
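The elided INSERT ... SELECT statements simply keep doubling the table: 67108864 rows is 2^26, i.e. the single seed row doubled 26 times. A loop such as the following produces the same load (a sketch; the socket path and the node the load runs on are assumptions):

```bash
# Hypothetical socket path of one of the surviving nodes; 1 seed row doubled 26 times = 2^26 rows.
SOCK=/tmp/pxc2.sock
mysql -uroot -S "$SOCK" test -e \
    "INSERT INTO joinit VALUES (NULL, uuid(), time(now()), (FLOOR(1 + RAND()*60)));"
for _ in $(seq 1 26); do
    mysql -uroot -S "$SOCK" test -e \
        "INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR(1 + RAND()*60)) FROM joinit;"
done
```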
On one of the nodes (node 2) run OPTIMIZE TABLE joinit;
Wait a few seconds and start the node that was stopped earlier.
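Spelled out, these two steps look like this (a sketch; the socket paths are assumptions, and the OPTIMIZE is backgrounded because it runs for a while on a table this size):

```bash
# Hypothetical socket paths. OPTIMIZE TABLE rebuilds the large table on node2 ...
mysql -uroot -S /tmp/pxc2.sock test -e "OPTIMIZE TABLE joinit;" &
sleep 5
# ... and the restarted node1 (same command line as before) now has to request an SST.
./bin/mysqld --basedir=. --datadir=./data1 --socket=/tmp/pxc1.sock &
```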
This node will try to initiate an SST from node3, which will fail with an error:
(error log lines truncated)
$ tail -n 45 data3/innobacku
170729 06:26:04 >> log scanned up to (8909890959)
170729 06:26:05 >> log scanned up to (8909891005)
170729 06:26:06 >> log scanned up to (8909891051)
170729 06:26:07 >> log scanned up to (8909891113)
170729 06:26:08 >> log scanned up to (8909891159)
170729 06:26:09 >> log scanned up to (8909891205)
InnoDB: Last flushed lsn: 8901930471 load_index lsn 8909891205
[FATAL] InnoDB: An optimized(without redo logging) DDLoperation has been performed. All modified pages may not have been flushed to the disk yet.
PXB will not be able take a consistent backup. Retry the backup operation
2017-07-29 06:26:10 0x7fc17076e700 InnoDB: Assertion failure in thread 140468792256256 in file ut0ut.cc line 916
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://
InnoDB: about forcing recovery.
10:26:10 UTC - xtrabackup got signal 6 ;
This could be because you hit a bug or data is corrupted.
This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x10000
(xtrabackup and /lib/x86_... stack frames truncated)
Please report a bug at https:/
Node1 will not retry the SST, but will stall:
$ tail -n20 data1/bm-
be949bc4,0
}
)
2017-07-
view ((empty))
(remaining log lines truncated)
But mysqld will still be running:
$ ps -ef | grep sveta | grep mysqld | grep pxc1
sveta.s+ 46217 26371 0 Jul29 pts/206 00:00:00 ./bin/mysqld --basedir=. --datadir=./data1 --socket=
Suggested fix: retry the SST after a failure, or at least stop the joiner node so that a failover script can restart it.
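Until such a retry exists, an external watchdog is one way to implement the "stop the joiner so a failover script can restart it" part; a rough sketch (the socket path, datadir, PID-file location and timeout are all assumptions, not part of PXC):

```bash
#!/bin/bash
# Hypothetical watchdog: if the joiner never reaches Synced within a deadline,
# stop it so an external failover/restart mechanism can try the SST again.
SOCK=/tmp/pxc1.sock           # assumed joiner socket
DATADIR=./data1               # assumed joiner datadir (default pid file lives here)
DEADLINE=$((SECONDS + 1800))  # assumed 30-minute budget for the SST

while (( SECONDS < DEADLINE )); do
    # The joiner may refuse SQL connections while waiting; an empty STATE just means "not Synced yet".
    STATE=$(mysql -uroot -S "$SOCK" -N -e \
        "SHOW STATUS LIKE 'wsrep_local_state_comment';" 2>/dev/null | awk '{print $2}')
    [[ "$STATE" == "Synced" ]] && exit 0   # joiner recovered; nothing to do
    sleep 10
done

# Deadline passed and the node never synced: stop the stuck joiner.
PID=$(cat "$DATADIR"/*.pid 2>/dev/null)
[[ -n "$PID" ]] && kill "$PID"
```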
Kenn Takara (kenn-takara) wrote (#1):
Kenn Takara (kenn-takara) wrote (#2):
The bug doesn't happen with PXC 5.6.
Krunal Bauskar (krunal-bauskar) wrote (#3):
commit 6ea7d1786122ebb
Author: Kenn Takara <email address hidden>
Date: Thu Aug 10 13:49:09 2017 -1000
Fix for PXC-810 PXC-843
PXC-810: PXC 5.7: segfault in GCommConn::close in galera_
PXC-843: Joiner hangs after failed SST
Issue:
If an SST fails due to a DML (such as OPTIMIZE or ALTER TABLE) that is run while the SST
is in progress, then the joiner node can hang (PXC-843). This is due to the applier
thread picking a transaction off the recv_q and attempting to apply it, even though
the thread has been killed.
This exposed PXC-810, which would then happen while testing for PXC-843.
Solution:
For PXC-843, if the SST has failed, then we just drop any transactions that have
been received. If the SST has failed, the node should be shutting down anyway.
For PXC-810, we do an additional check within the critical section to avoid the
null pointer dereference.
Shahriyar Rzayev (rzayev-sehriyar) wrote (#4):
Percona now uses JIRA for bug reports, so this bug report has been migrated to: https:/
I have verified this with PXC 5.7.18.
However, the steps at the end are slightly different:
Start node 1 up, and as it starts the SST, run the 'optimize table joinit' on node 2.
This will also cause PXB to crash. (I also did it with 16M rows.)
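That re-ordering can be scripted the same way as before; a sketch with assumed socket paths, where the optimized DDL is issued right after the joiner is started:

```bash
# Hypothetical socket paths. Start the joiner first, then, while the SST is running,
# trigger the redo-skipping DDL on node2.
./bin/mysqld --basedir=. --datadir=./data1 --socket=/tmp/pxc1.sock &
sleep 2   # give node1 time to join the group and request the SST
mysql -uroot -S /tmp/pxc2.sock test -e "OPTIMIZE TABLE joinit;"
```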