wsrep_sst_rsync fails silently on joiner when rsync server port is already taken

Bug #1099783 reported by Alex Yurchenko on 2013-01-15
This bug affects 1 person
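| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| MySQL patches by Codership (status tracked in 5.6) | | | | |
| 5.5 | | Medium | Yan Zhang | |
| 5.6 | | Medium | Yan Zhang | |
| Percona XtraDB Cluster (status tracked in 5.6) | | | | |
| 5.5 | | Undecided | Unassigned | |
| 5.6 | | Undecided | Unassigned | |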

Bug Description

The worst part is that it does not even fail: it pretends to be working, so the SST request is prepared as if rsync were listening on the specified port.

Changed in codership-mysql:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → Medium
status: New → Confirmed
Yan Zhang (yan.zhang) wrote :

Actually, I ran into this problem the first time I tried Galera, using 'scripts/command.sh start'. I configured three Galera instances named home{0,1,2}. Instance home0 is started first, then home{1,2} are started simultaneously and both request state via SST. Usually the home2 instance fails with the following message:

cat: /dev/shm/galera2/mysql/var//rsync_sst.pid: No such file or directory

There is also this loop, which can run forever:

    until check_pid_and_port $RSYNC_PID $RSYNC_PORT
    do
        sleep 0.2
    done

where it does

    [ -r "$pid_file" ] && ps -p $(cat $pid_file) >/dev/null 2>&1

Yan Zhang (yan.zhang) wrote :

Maybe we could set a timeout to avoid the infinite loop:

```
    timeout=200 # 200 iterations * 0.2 s = 40 secs.
    until check_pid_and_port $RSYNC_PID $RSYNC_PORT
    do
        sleep 0.2
        timeout=$((timeout-1))
        # -eq is the portable numeric comparison; == is not defined for [ in POSIX sh
        if [ "$timeout" -eq 0 ]
        then
            wsrep_log_error "rsync daemon may have failed already."
            exit 255 # unknown error. maybe port has been taken
        fi
    done
```

Alex Yurchenko (ayurchen) wrote :

This is one possibility. But you don't want to wait even 40 seconds if the port is already busy.

I'd suggest enhancing the check_pid_and_port() function to exit right away with an appropriate error message if it finds the port busy.
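The fail-fast idea can be sketched roughly as follows. This is only an illustration, not the code that was committed: `classify_port_owner`, `wait_for_rsync`, and the probe argument are hypothetical names, and the sketch assumes the caller can obtain the PIDs listening on the port (for example from `lsof -t -i TCP:$port -s TCP:LISTEN`). The probe is passed in as a command so the logic can be tried without a running rsync.

```shell
# Hypothetical helper: decide who owns the rsync port.
# $1 = PID expected from the rsync pid file
# $2 = whitespace-separated PIDs currently listening on the port (may be empty)
# Prints "ready" (our rsync listens), "waiting" (nobody listens yet)
# or "busy" (only other processes listen).
classify_port_owner()
{
    expected_pid=$1
    listener_pids=$2
    if [ -z "$listener_pids" ]; then
        echo waiting
        return 0
    fi
    for pid in $listener_pids; do
        if [ "$pid" = "$expected_pid" ]; then
            echo ready
            return 0
        fi
    done
    echo busy
}

# Poll until our rsync owns the port, giving up at once if it is taken.
# $1 = expected PID, $2 = max attempts, $3 = command printing listener PIDs
# (assumed hook; a real script might wrap lsof here).
# Returns 0 when ready, 1 on timeout, 2 when another process holds the port.
wait_for_rsync()
{
    expected_pid=$1
    tries=$2
    probe=$3
    while [ "$tries" -gt 0 ]; do
        case $(classify_port_owner "$expected_pid" "$($probe)") in
            ready) return 0 ;;
            busy)
                echo "rsync port is already taken by another process" >&2
                return 2 ;;
        esac
        sleep 0.2
        tries=$((tries - 1))
    done
    echo "timed out waiting for rsync daemon" >&2
    return 1
}
```

With this shape the 40-second timeout still guards against a daemon that never starts, while a busy port is reported on the first poll.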

Yan Zhang (yan.zhang) wrote :

netstat -nltp sometimes cannot output the PID/program name, for example:
```
tcp 0 0 0.0.0.0:10033 0.0.0.0:* LISTEN -
tcp6 0 0 :::10033 :::* LISTEN -
```

So we can't tell whether the port has been taken by this rsync process or by something else.

lsof seems much more reliable, so we changed to lsof on all platforms:
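To make the limitation concrete, here is a small sketch that extracts the listener PID from netstat-style lines read on stdin. The function name `listener_pid_from_netstat` is made up for this example, and the column layout is assumed to match the output quoted above (local address in field 4, state in field 6, PID/program in field 7).

```shell
# Hypothetical parser for `netstat -nltp` output supplied on stdin.
# $1 = port to look for.
# Prints the PID of each listener on that port, or "unknown" when netstat
# shows "-" instead of PID/program name.
listener_pid_from_netstat()
{
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") {
            if ($7 == "-" || $7 == "") {
                print "unknown"
            } else {
                split($7, owner, "/")   # "1234/rsync" -> "1234"
                print owner[1]
            }
        }'
}
```

For the two lines quoted above the function can only print "unknown", so a comparison against the PID in the rsync pid file is not possible.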

http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3980
http://bazaar.launchpad.net/~codership/codership-mysql/5.6/revision/4075

On the CentOS Jenkins host, the following change was required:

------------------------------------------------------------
revno: 724
committer: Raghavendra D Prabhu <email address hidden>
branch nick: trunk-25
timestamp: Mon 2014-04-21 14:16:22 +0530
message:
  Add the PATH for lsof
diff:
=== modified file 'scripts/wsrep_sst_rsync.sh'
--- scripts/wsrep_sst_rsync.sh 2014-04-20 17:19:52 +0000
+++ scripts/wsrep_sst_rsync.sh 2014-04-21 08:46:22 +0000
@@ -25,6 +25,9 @@

 . $(dirname $0)/wsrep_sst_common

+# Setting the path for lsof
+export PATH="/usr/sbin:/sbin:$PATH"
+
 cleanup_joiner()
 {
     wsrep_log_info "Joiner cleanup."

Otherwise, wsrep_sst_rsync was just hanging when lsof (which is /usr/sbin/lsof on CentOS) was not found.
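A hang of this kind could also be turned into an immediate error by checking for the tool up front. The following is a minimal sketch under that assumption; `require_tool` is a hypothetical helper, not something the shipped script defines, and it writes to stderr with `echo` rather than `wsrep_log_error` so that it runs on its own.

```shell
# Same PATH extension as in the patch above, so sbin directories are searched.
PATH="/usr/sbin:/sbin:$PATH"
export PATH

# Hypothetical pre-flight check: return non-zero if a required tool is missing.
require_tool()
{
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "required tool '$1' not found in PATH ($PATH)" >&2
        return 1
    fi
}
```

A script could then run `require_tool lsof || exit 2` before entering the wait loop; the exit code here is arbitrary.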
