wsrep_sst_xtrabackup-v2 fails to clean a stale .sst temporary directory

Bug #1573159 reported by Andrew Garner
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Invalid
Undecided
Unassigned

Bug Description

If SST previously failed partway through and there are stale files in the "${DATA}/.sst" directory all future SST attempts will fail until administrative action is taken to clean these files.

As I understand, xtrabackup-v2 supports a "cpat" option whose purpose is to details the files that should be preserved when cleaning the datadir and that ".sst" is explicitly included. If ".sst" is removed from that regex, SST fails when trying to write into a non-existent ".sst" directory (although it is then cleaned).

I am not sure if this behavior is intentional, but having some option for xtrabackup-v2 to do the cleaning itself would be convenient rather than forcing administrative action after a failure. I suppose the existing behavior is in place for troubleshooting or avoiding certain data loss scenarios.

My initial thoughts at a fix involved allow manually updating the cpat variable to exclude .sst via configuration and simply ensuring this directory is (re)created before proceeding. Perhaps a change like this:

--- a/wsrep_sst_xtrabackup-v2.sh
+++ b/wsrep_sst_xtrabackup-v2.sh
@@ -861,15 +861,16 @@ then
             wsrep_log_info "WARNING: Stale temporary SST directory: ${DATA}/.sst from previous state transfer. Removing"
             rm -rf ${DATA}/.sst
         fi
- mkdir -p ${DATA}/.sst
- (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
- jpid=$!
- wsrep_log_info "Proceeding with SST"

         wsrep_log_info "Cleaning the existing datadir and innodb-data/log directories"
         find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1 -regex $cpat -prune -o -exec rm -rfv {} 1>&2 \+

+ mkdir -p ${DATA}/.sst
+ (recv_joiner $DATA/.sst "${stagemsg}-SST" 0 0) &
+ jpid=$!
+ wsrep_log_info "Proceeding with SST"
+
         tempdir=$(parse_cnf mysqld log-bin "")
         if [[ -n ${tempdir:-} ]];then
             binlog_dir=$(dirname $tempdir)

Alternatively, maybe just cleaning out the contents of .sst (perhaps via some opt-in option) when a stale .sst is detected would be somewhat cleaner.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I put a duplicate here, although I'm not sure the ./ibdata1 path may be related to the .sst dir

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can also reproduce the bug even if I use
[sst]
sst-special-dirs = 0

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can reproduce this bug with a 100 rate with the steps:

* Run mysqld binary for a joiner node manually:
/usr/sbin/mysqld --verbose --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --open-files-limit=102400 --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306
* kill -STOP `pgrep xbstream`
* touch /var/lib/mysql/ibdata1
* kill -CONT `pgrep xbstream`

The log example:
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20160427 12:07:35.834)
2016-04-27 12:07:38 29076 [Note] WSREP: (217919ed, 'tcp://10.10.10.2:4567') turning message relay requesting off
Aborted
root@n1:/usr/bin# xbstream: Can't create/write to file './ibdata1' (Errcode: 17 - File exists)
xbstream: failed to create file.
2016/04/27 12:07:47 socat[29351] E write(1, 0x13713d0, 8192): Broken pipe
WSREP_SST: [INFO] NOTE: Joiner-Recv-SST took 12 seconds (20160427 12:07:47.444)

So that's a race with the ibdata1 file, something else creates it while streaming's in progress.
Inotify tools gave no insights to me *what* creates it :(

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Long story short.
First, use innodb-data-home-dir = /var/lib/mysql. Then [sst] sst-special-dirs, if set, will not be ingored and $speciald starts working as expected and cleanup the stale .sst dir.
And then, with sst-special-dirs=1 defaults, there is no more a race with ibdata1 as it is placed to the /var/lib/mysql/.sst

But this
922 if $speciald -eq 1 ;then
923 MAGIC_FILE="${TDATA}/${INFO_FILE}"
924 set +e
925 rm $TDATA/innobackup.prepare.log $TDATA/innobackup.move.log
926 set -e
refers to wrong DATA/TDATA paths, and fails SST

The patch:
--- a/wsrep_sst_xtrabackup-v2
+++ b/wsrep_sst_xtrabackup-v2
@@ -921,9 +921,7 @@ then

         if $speciald -eq 1 ;then
             MAGIC_FILE="${TDATA}/${INFO_FILE}"
- set +e
- rm $TDATA/innobackup.prepare.log $TDATA/innobackup.move.log
- set -e
+ rm -f {$TDATA,$DATA}/innobackup.prepare.log {$TDATA,$DATA}/innobackup.move.log
             wsrep_log_info "Moving the backup to ${TDATA}"
             timeit "Xtrabackup move stage" "$INNOMOVE"
             if $? -eq 0 ;then

Changed in percona-xtradb-cluster:
status: New → Confirmed
assignee: nobody → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As far as I can see here https://github.com/percona/percona-xtradb-cluster/blob/0aac2f1a946395a95deb5cf541f6f5eb9016657b/scripts/wsrep_sst_xtrabackup-v2.sh#L334-336, the 1st part of the issue is no more exist. But maybe it makes sens to apply the patch just as well

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Changed in percona-xtradb-cluster:
assignee: Bogdan Dobrelya (bogdando) → nobody
Revision history for this message
Andrew Garner (muzazzi) wrote :

My original complaint is totally invalid - apologies for the noise. I was using a stale xtrabackup-v2 from a mariadb distribution due to my own sloppiness in testing.

Obviously PXC is doing this already for quite some time:

https://github.com/percona/percona-xtradb-cluster/blame/5.6/scripts/wsrep_sst_xtrabackup-v2.sh#L861

I can't say I've seen the problem Bogdan refers to here which is probably a different issue than what I reported.

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

Thanks Andrew. Will close this bug.

Changed in percona-xtradb-cluster:
status: Confirmed → Invalid
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1899

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.