IST Fails with sst script error
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Status tracked in 5.6 | | |
5.5 | New | Undecided | Raghavendra D Prabhu |
5.6 | Fix Released | Undecided | Raghavendra D Prabhu |
Bug Description
This problem happens intermittently. Once it happens on a given joiner, I can't force it to IST; the node must SST. I can reproduce it within a few attempts by stopping and restarting a node while a workload runs on the rest of the cluster.
This is on CentOS 7 with the latest packages:
[root@node3 mysql]# rpm -qa | grep -i percona
percona-
percona-
percona-
Percona-
Percona-
Percona-
Percona-
Percona-
Log:
2015-04-08 17:20:46 4417 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 3128)
2015-04-08 17:20:46 4417 [Note] WSREP: State transfer required:
Group state: b9a78dfb-
Local state: b9a78dfb-
2015-04-08 17:20:46 4417 [Note] WSREP: New cluster view: global state: b9a78dfb-
ry, number of nodes: 3, my index: 2, protocol version 3
2015-04-08 17:20:46 4417 [Warning] WSREP: Gap in state sequence. Need state transfer.
2015-04-08 17:20:46 4417 [Note] WSREP: Running: 'wsrep_
cret' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '4417' '' '
WSREP_SST: [INFO] Streaming with xbstream (20150408 17:20:47.133)
WSREP_SST: [INFO] Using socat as streamer (20150408 17:20:47.136)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/
2015-04-08 17:20:47 4417 [Note] WSREP: Prepared SST request: xtrabackup-
2015-04-08 17:20:47 4417 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-04-08 17:20:47 4417 [Note] WSREP: REPL Protocols: 7 (3, 2)
2015-04-08 17:20:47 4417 [Note] WSREP: Service thread queue flushed.
2015-04-08 17:20:47 4417 [Note] WSREP: Assign initial position for certification: 3128, protocol version: 3
2015-04-08 17:20:47 4417 [Note] WSREP: Service thread queue flushed.
2015-04-08 17:20:47 4417 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:
2015-04-08 17:20:47 4417 [Note] WSREP: Member 2.0 (node3) requested state transfer from '*any*'. Selected 1.0 (node1)(SYNCED) as donor.
2015-04-08 17:20:47 4417 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3131)
2015-04-08 17:20:47 4417 [Note] WSREP: Requesting state transfer: success, donor: 1
2015-04-08 17:20:47 4417 [Note] WSREP: 1.0 (node1): State transfer to 2.0 (node3) complete.
2015-04-08 17:20:47 4417 [Note] WSREP: Member 1.0 (node1) synced with group.
WSREP_SST: [ERROR] Removing /tmp/tmp.
WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20150408 17:20:47.592)
/usr//bin/
cat: /tmp/tmp.
WSREP_SST: [INFO] Galera co-ords from recovery: (20150408 17:20:47.595)
cat: /tmp/tmp.
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20150408 17:20:47.597)
2015-04-08 17:20:47 4417 [ERROR] WSREP: Process completed with error: wsrep_sst_
2015-04-08 17:20:47 4417 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-04-08 17:20:47 4417 [ERROR] WSREP: SST failed: 1 (Operation not permitted)
2015-04-08 17:20:47 4417 [ERROR] Aborting
Yes, this can happen.
--- wsrep_sst_xtrabackup-v2-1.sh 2015-04-08 23:23:20.295067447 +0530
+++ wsrep_sst_xtrabackup-v2.sh 2015-04-08 23:36:03.431729274 +0530
@@ -537,6 +537,11 @@
local checkf=$4
local ltcmd
+ if [[ ! -d ${dir} ]];then
+ # This indicates that IST is in progress
+ return
+ fi
+
pushd ${dir} 1>/dev/null
set +e
@@ -838,12 +843,6 @@ FILE="${STATDIR}/${INFO_FILE}"
MAGIC_
recv_joiner $STATDIR "${stagemsg}-gtid" $stimeout 1
- if [[ -d ${DATA}/.sst ]];then
- wsrep_log_info "WARNING: Stale temporary SST directory: ${DATA}/.sst from previous state transfer"
- fi
- mkdir -p ${DATA}/.sst
- (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
- jpid=$!
if ! ps -p ${WSREP_SST_OPT_PARENT} &>/dev/null
then
@@ -853,6 +852,13 @@
if [ ! -r "${STATDIR}/${IST_FILE}" ]
wsrep_log_info "Proceeding with SST"
then
+ if [[ -d ${DATA}/.sst ]];then
+ wsrep_log_info "WARNING: Stale temporary SST directory: ${DATA}/.sst from previous state transfer"
+ fi
+ mkdir -p ${DATA}/.sst
+ (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
+ jpid=$!
+
@@ -984,11 +990,7 @@
wsrep_log_error "Check ${DATA}/innobackup.move.log for details"
fi
-
wsrep_log_info "${IST_FILE} received from donor: Running IST"
else
- # || true if it has already exited
- kill $jpid || true
- rm -rf $DATA/.sst
fi
The patch above should fix it.
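For readers who want to see the two ideas in the patch in isolation, here is a minimal sketch. It is not the real script: `recv_demo` and `prepare_staging` are hypothetical names invented for illustration, and the actual receive logic is replaced by `echo`. It only shows (1) returning early when the staging directory does not exist, instead of failing on `pushd`, and (2) creating the staging directory only when no IST marker file is readable.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for recv_joiner: only the directory guard is modelled.
recv_demo() {
    local dir=$1
    if [[ ! -d ${dir} ]]; then
        # No staging directory: treat this as IST in progress and do nothing.
        echo "skip"
        return 0
    fi
    pushd "${dir}" 1>/dev/null || return 1
    echo "receive"
    popd 1>/dev/null || return 1
}

# Hypothetical helper: create the staging directory only when no IST marker
# file is readable, mirroring where the patch moves the mkdir.
prepare_staging() {
    local statdir=$1 ist_file=$2 staging=$3
    if [ ! -r "${statdir}/${ist_file}" ]; then
        mkdir -p "${staging}"
        echo "sst"
    else
        echo "ist"
    fi
}
```

With this shape, a joiner that receives the IST marker never creates the staging directory, and a receive call against that missing directory returns quietly rather than exiting with an error.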