IST Fails with sst script error

Bug #1441762 reported by Jay Janssen
48
This bug affects 11 people
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
New
Undecided
Raghavendra D Prabhu
5.6
Fix Released
Undecided
Raghavendra D Prabhu

Bug Description

This problem happens intermittently. Once it happens on a given joiner, I can't force it to IST, the node must SST. I can reproduce this with a few attempts at stopping and restarting a node while workload runs on the rest of the cluster.

This is on Centos 7 with the latest packages:

[root@node3 mysql]# rpm -qa | grep -i percona
percona-toolkit-2.2.13-1.noarch
percona-xtrabackup-2.2.10-1.el7.x86_64
percona-release-0.1-3.noarch
Percona-XtraDB-Cluster-shared-56-5.6.22-25.8.978.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.22-25.8.978.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.9-1.3494.rhel7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.22-25.8.978.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.22-25.8.978.el7.x86_64

Log:

2015-04-08 17:20:46 4417 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 3128)
2015-04-08 17:20:46 4417 [Note] WSREP: State transfer required:
        Group state: b9a78dfb-de0f-11e4-9826-eb80bdd55bc5:3128
        Local state: b9a78dfb-de0f-11e4-9826-eb80bdd55bc5:705
2015-04-08 17:20:46 4417 [Note] WSREP: New cluster view: global state: b9a78dfb-de0f-11e4-9826-eb80bdd55bc5:3128, view# 9: Prima
ry, number of nodes: 3, my index: 2, protocol version 3
2015-04-08 17:20:46 4417 [Warning] WSREP: Gap in state sequence. Need state transfer.
2015-04-08 17:20:46 4417 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.70.4' --auth 'sst:se
cret' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '4417' '' '
WSREP_SST: [INFO] Streaming with xbstream (20150408 17:20:47.133)
WSREP_SST: [INFO] Using socat as streamer (20150408 17:20:47.136)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20150408 17:20:47.138)
2015-04-08 17:20:47 4417 [Note] WSREP: Prepared SST request: xtrabackup-v2|192.168.70.4:4444/xtrabackup_sst//1
2015-04-08 17:20:47 4417 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-04-08 17:20:47 4417 [Note] WSREP: REPL Protocols: 7 (3, 2)
2015-04-08 17:20:47 4417 [Note] WSREP: Service thread queue flushed.
2015-04-08 17:20:47 4417 [Note] WSREP: Assign initial position for certification: 3128, protocol version: 3
2015-04-08 17:20:47 4417 [Note] WSREP: Service thread queue flushed.
2015-04-08 17:20:47 4417 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.70.4:4568
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20150408 17:20:47.176)
2015-04-08 17:20:47 4417 [Note] WSREP: Member 2.0 (node3) requested state transfer from '*any*'. Selected 1.0 (node1)(SYNCED) as donor.
2015-04-08 17:20:47 4417 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3131)
2015-04-08 17:20:47 4417 [Note] WSREP: Requesting state transfer: success, donor: 1
2015-04-08 17:20:47 4417 [Note] WSREP: 1.0 (node1): State transfer to 2.0 (node3) complete.
2015-04-08 17:20:47 4417 [Note] WSREP: Member 1.0 (node1) synced with group.
WSREP_SST: [ERROR] Removing /tmp/tmp.vOO1WbEZ4X/xtrabackup_galera_info file due to signal (20150408 17:20:47.590)
WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20150408 17:20:47.592)
/usr//bin/wsrep_sst_xtrabackup-v2: line 540: pushd: /var/lib/mysql//.sst: No such file or directory
cat: /tmp/tmp.vOO1WbEZ4X/xtrabackup_galera_info: No such file or directory
WSREP_SST: [INFO] Galera co-ords from recovery: (20150408 17:20:47.595)
cat: /tmp/tmp.vOO1WbEZ4X/xtrabackup_galera_info: No such file or directory
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20150408 17:20:47.597)
2015-04-08 17:20:47 4417 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.70.4' --auth 'sst:secret' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '4417' '' : 1 (Operation not permitted)
2015-04-08 17:20:47 4417 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-04-08 17:20:47 4417 [ERROR] WSREP: SST failed: 1 (Operation not permitted)
2015-04-08 17:20:47 4417 [ERROR] Aborting

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Yes, this can happen.

--- wsrep_sst_xtrabackup-v2-1.sh 2015-04-08 23:23:20.295067447 +0530
+++ wsrep_sst_xtrabackup-v2.sh 2015-04-08 23:36:03.431729274 +0530
@@ -537,6 +537,11 @@
     local checkf=$4
     local ltcmd

+ if [[ ! -d ${dir} ]];then
+ # This indicates that IST is in progress
+ return
+ fi
+
     pushd ${dir} 1>/dev/null
     set +e

@@ -838,12 +843,6 @@
     MAGIC_FILE="${STATDIR}/${INFO_FILE}"
     recv_joiner $STATDIR "${stagemsg}-gtid" $stimeout 1

- if [[ -d ${DATA}/.sst ]];then
- wsrep_log_info "WARNING: Stale temporary SST directory: ${DATA}/.sst from previous state transfer"
- fi
- mkdir -p ${DATA}/.sst
- (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
- jpid=$!

     if ! ps -p ${WSREP_SST_OPT_PARENT} &>/dev/null
     then
@@ -853,6 +852,13 @@

     if [ ! -r "${STATDIR}/${IST_FILE}" ]
     then
+ if [[ -d ${DATA}/.sst ]];then
+ wsrep_log_info "WARNING: Stale temporary SST directory: ${DATA}/.sst from previous state transfer"
+ fi
+ mkdir -p ${DATA}/.sst
+ (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
+ jpid=$!
+
         wsrep_log_info "Proceeding with SST"

@@ -984,11 +990,7 @@
             wsrep_log_error "Check ${DATA}/innobackup.move.log for details"
         fi

-
     else
- # || true if it has already exited
- kill $jpid || true
- rm -rf $DATA/.sst
         wsrep_log_info "${IST_FILE} received from donor: Running IST"
     fi

should fix it.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Jay,

Actually, this shouldn't happen given that

        kill $jpid || true
        rm -rf $DATA/.sst

ie. the directory is removed after the background process is killed.

" WSREP_SST: [ERROR] Removing /tmp/tmp.vOO1WbEZ4X/xtrabackup_galera_info file due to signal (20150408 17:20:47.590)" is actually causing the failure.

Any reason a signal may be being sent? (or generated)

Revision history for this message
Jay Janssen (jay-janssen) wrote :

I'm not doing anything beyond just restarting the node via systemd. There is sysbench workload on one of the other nodes, that's it.

Revision history for this message
Michael Wang (xw73) wrote :
Download full text (12.2 KiB)

I had the same or similar error which happens consistently, my error log is copied below.

I am very curious about the error on the missing "/var/lib/mysql//.sst" because it was created inside the code:

    mkdir -p ${DATA}/.sst
    (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
    jpid=$!

so I added a few lines to track how it disappeared, and the log shows:

WSREP_SST: [INFO] xxxxxxxx (20150519 15:13:53.850)
WSREP_SST: [INFO] 0 drwxrwx--x. 2 mysql mysql 6 May 19 15:13 /var/lib/mysql//.sst (20150519 15:13:53.855)
WSREP_SST: [INFO] xxxxxxxx (20150519 15:13:53.858)
WSREP_SST: [ERROR] Removing /tmp/tmp.jkaFe9krxm/xtrabackup_galera_info file due to signal (20150519 15:13:53.871)
WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20150519 15:13:53.874)
/usr//bin/wsrep_sst_xtrabackup-v2: line 540: pushd: /var/lib/mysql//.sst: No such file or directory

that is, the directory was indeed created, but it was removed by the program itself, something triggered the "rm" inside the code I assume. No one send any signal to the program, could be due to some "internal" signal related to child process.

But anyway, I applied the patch and the problem seems to be gone. Applying the patch is difficult process because when you copy the patch with a mouse, the space at the end of line is not copied and that causes problems. For example:

if [[ -d ${DATA}/.sst ]];then $

where $ indicates the end of line, and there should be a space.

PS: The error log

150519 15:13:41 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150519 15:13:41 mysqld_safe Skipping wsrep-recover for 05d1ca52-fd85-11e4-a7bd-12d874519759:1641 pair
150519 15:13:41 mysqld_safe Assigning 05d1ca52-fd85-11e4-a7bd-12d874519759:1641 to wsrep_start_position
2015-05-19 15:13:41 0 [Note] WSREP: wsrep_start_position var submitted: '05d1ca52-fd85-11e4-a7bd-12d874519759:1641'
2015-05-19 15:13:41 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2015-05-19 15:13:41 1077 [Warning] No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=node1-bin' to avoid this problem.
2015-05-19 15:13:41 1077 [Note] WSREP: Read nil XID from storage engines, skipping position init
2015-05-19 15:13:41 1077 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
2015-05-19 15:13:41 1077 [Note] WSREP: wsrep_load(): Galera 3.9(r93aca2d) by Codership Oy <email address hidden> loaded successfully.
2015-05-19 15:13:41 1077 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2015-05-19 15:13:41 1077 [Note] WSREP: Found saved state: 05d1ca52-fd85-11e4-a7bd-12d874519759:1641
2015-05-19 15:13:41 1077 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.70.2; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_...

Revision history for this message
Stefan Schlesinger (sts-2) wrote :

I can confirm the patch from Raghavendra Prabhu is working for us as well! Thanks!

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1816

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.