IST Fails with sst script error

Bug #1441762 reported by Jay Janssen on 2015-04-08
48
This bug affects 11 people
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
New
Undecided
Raghavendra D Prabhu
5.6
Fix Released
Undecided
Raghavendra D Prabhu

Bug Description

This problem happens intermittently. Once it happens on a given joiner, I can't force it to IST, the node must SST. I can reproduce this with a few attempts at stopping and restarting a node while workload runs on the rest of the cluster.

This is on Centos 7 with the latest packages:

[root@node3 mysql]# rpm -qa | grep -i percona
percona-toolkit-2.2.13-1.noarch
percona-xtrabackup-2.2.10-1.el7.x86_64
percona-release-0.1-3.noarch
Percona-XtraDB-Cluster-shared-56-5.6.22-25.8.978.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.22-25.8.978.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.9-1.3494.rhel7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.22-25.8.978.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.22-25.8.978.el7.x86_64

Log:

2015-04-08 17:20:46 4417 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 3128)
2015-04-08 17:20:46 4417 [Note] WSREP: State transfer required:
        Group state: b9a78dfb-de0f-11e4-9826-eb80bdd55bc5:3128
        Local state: b9a78dfb-de0f-11e4-9826-eb80bdd55bc5:705
2015-04-08 17:20:46 4417 [Note] WSREP: New cluster view: global state: b9a78dfb-de0f-11e4-9826-eb80bdd55bc5:3128, view# 9: Prima
ry, number of nodes: 3, my index: 2, protocol version 3
2015-04-08 17:20:46 4417 [Warning] WSREP: Gap in state sequence. Need state transfer.
2015-04-08 17:20:46 4417 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.70.4' --auth 'sst:se
cret' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '4417' '' '
WSREP_SST: [INFO] Streaming with xbstream (20150408 17:20:47.133)
WSREP_SST: [INFO] Using socat as streamer (20150408 17:20:47.136)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20150408 17:20:47.138)
2015-04-08 17:20:47 4417 [Note] WSREP: Prepared SST request: xtrabackup-v2|192.168.70.4:4444/xtrabackup_sst//1
2015-04-08 17:20:47 4417 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-04-08 17:20:47 4417 [Note] WSREP: REPL Protocols: 7 (3, 2)
2015-04-08 17:20:47 4417 [Note] WSREP: Service thread queue flushed.
2015-04-08 17:20:47 4417 [Note] WSREP: Assign initial position for certification: 3128, protocol version: 3
2015-04-08 17:20:47 4417 [Note] WSREP: Service thread queue flushed.
2015-04-08 17:20:47 4417 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.70.4:4568
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20150408 17:20:47.176)
2015-04-08 17:20:47 4417 [Note] WSREP: Member 2.0 (node3) requested state transfer from '*any*'. Selected 1.0 (node1)(SYNCED) as donor.
2015-04-08 17:20:47 4417 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3131)
2015-04-08 17:20:47 4417 [Note] WSREP: Requesting state transfer: success, donor: 1
2015-04-08 17:20:47 4417 [Note] WSREP: 1.0 (node1): State transfer to 2.0 (node3) complete.
2015-04-08 17:20:47 4417 [Note] WSREP: Member 1.0 (node1) synced with group.
WSREP_SST: [ERROR] Removing /tmp/tmp.vOO1WbEZ4X/xtrabackup_galera_info file due to signal (20150408 17:20:47.590)
WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20150408 17:20:47.592)
/usr//bin/wsrep_sst_xtrabackup-v2: line 540: pushd: /var/lib/mysql//.sst: No such file or directory
cat: /tmp/tmp.vOO1WbEZ4X/xtrabackup_galera_info: No such file or directory
WSREP_SST: [INFO] Galera co-ords from recovery: (20150408 17:20:47.595)
cat: /tmp/tmp.vOO1WbEZ4X/xtrabackup_galera_info: No such file or directory
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20150408 17:20:47.597)
2015-04-08 17:20:47 4417 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.70.4' --auth 'sst:secret' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '4417' '' : 1 (Operation not permitted)
2015-04-08 17:20:47 4417 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-04-08 17:20:47 4417 [ERROR] WSREP: SST failed: 1 (Operation not permitted)
2015-04-08 17:20:47 4417 [ERROR] Aborting

Yes, this can happen.

--- wsrep_sst_xtrabackup-v2-1.sh 2015-04-08 23:23:20.295067447 +0530
+++ wsrep_sst_xtrabackup-v2.sh 2015-04-08 23:36:03.431729274 +0530
@@ -537,6 +537,11 @@
     local checkf=$4
     local ltcmd

+ if [[ ! -d ${dir} ]];then
+ # This indicates that IST is in progress
+ return
+ fi
+
     pushd ${dir} 1>/dev/null
     set +e

@@ -838,12 +843,6 @@
     MAGIC_FILE="${STATDIR}/${INFO_FILE}"
     recv_joiner $STATDIR "${stagemsg}-gtid" $stimeout 1

- if [[ -d ${DATA}/.sst ]];then
- wsrep_log_info "WARNING: Stale temporary SST directory: ${DATA}/.sst from previous state transfer"
- fi
- mkdir -p ${DATA}/.sst
- (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
- jpid=$!

     if ! ps -p ${WSREP_SST_OPT_PARENT} &>/dev/null
     then
@@ -853,6 +852,13 @@

     if [ ! -r "${STATDIR}/${IST_FILE}" ]
     then
+ if [[ -d ${DATA}/.sst ]];then
+ wsrep_log_info "WARNING: Stale temporary SST directory: ${DATA}/.sst from previous state transfer"
+ fi
+ mkdir -p ${DATA}/.sst
+ (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
+ jpid=$!
+
         wsrep_log_info "Proceeding with SST"

@@ -984,11 +990,7 @@
             wsrep_log_error "Check ${DATA}/innobackup.move.log for details"
         fi

-
     else
- # || true if it has already exited
- kill $jpid || true
- rm -rf $DATA/.sst
         wsrep_log_info "${IST_FILE} received from donor: Running IST"
     fi

should fix it.

@Jay,

Actually, this shouldn't happen given that

        kill $jpid || true
        rm -rf $DATA/.sst

ie. the directory is removed after the background process is killed.

" WSREP_SST: [ERROR] Removing /tmp/tmp.vOO1WbEZ4X/xtrabackup_galera_info file due to signal (20150408 17:20:47.590)" is actually causing the failure.

Any reason a signal may be being sent? (or generated)

Jay Janssen (jay-janssen) wrote :

I'm not doing anything beyond just restarting the node via systemd. There is sysbench workload on one of the other nodes, that's it.

Michael Wang (xw73) wrote :
Download full text (12.2 KiB)

I had the same or similar error which happens consistently, my error log is copied below.

I am very curious about the error on the missing "/var/lib/mysql//.sst" because it was created inside the code:

    mkdir -p ${DATA}/.sst
    (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
    jpid=$!

so I added a few lines to track how it disappeared, and the log shows:

WSREP_SST: [INFO] xxxxxxxx (20150519 15:13:53.850)
WSREP_SST: [INFO] 0 drwxrwx--x. 2 mysql mysql 6 May 19 15:13 /var/lib/mysql//.sst (20150519 15:13:53.855)
WSREP_SST: [INFO] xxxxxxxx (20150519 15:13:53.858)
WSREP_SST: [ERROR] Removing /tmp/tmp.jkaFe9krxm/xtrabackup_galera_info file due to signal (20150519 15:13:53.871)
WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20150519 15:13:53.874)
/usr//bin/wsrep_sst_xtrabackup-v2: line 540: pushd: /var/lib/mysql//.sst: No such file or directory

that is, the directory was indeed created, but it was removed by the program itself, something triggered the "rm" inside the code I assume. No one send any signal to the program, could be due to some "internal" signal related to child process.

But anyway, I applied the patch and the problem seems to be gone. Applying the patch is difficult process because when you copy the patch with a mouse, the space at the end of line is not copied and that causes problems. For example:

if [[ -d ${DATA}/.sst ]];then $

where $ indicates the end of line, and there should be a space.

PS: The error log

150519 15:13:41 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150519 15:13:41 mysqld_safe Skipping wsrep-recover for 05d1ca52-fd85-11e4-a7bd-12d874519759:1641 pair
150519 15:13:41 mysqld_safe Assigning 05d1ca52-fd85-11e4-a7bd-12d874519759:1641 to wsrep_start_position
2015-05-19 15:13:41 0 [Note] WSREP: wsrep_start_position var submitted: '05d1ca52-fd85-11e4-a7bd-12d874519759:1641'
2015-05-19 15:13:41 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2015-05-19 15:13:41 1077 [Warning] No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=node1-bin' to avoid this problem.
2015-05-19 15:13:41 1077 [Note] WSREP: Read nil XID from storage engines, skipping position init
2015-05-19 15:13:41 1077 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
2015-05-19 15:13:41 1077 [Note] WSREP: wsrep_load(): Galera 3.9(r93aca2d) by Codership Oy <email address hidden> loaded successfully.
2015-05-19 15:13:41 1077 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2015-05-19 15:13:41 1077 [Note] WSREP: Found saved state: 05d1ca52-fd85-11e4-a7bd-12d874519759:1641
2015-05-19 15:13:41 1077 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.70.2; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_...

Stefan Schlesinger (sts-2) wrote :

I can confirm the patch from Raghavendra Prabhu is working for us as well! Thanks!

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1816

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers