wsrep_sst_rsync fails silently on joiner when rsync server port is already taken

Bug #1099783 reported by Alex Yurchenko
This bug affects 1 person
Affects                     Status        Importance  Assigned to  Milestone

MySQL patches by Codership (status tracked in 5.6)
  5.5                       Fix Released  Medium      Yan Zhang
  5.6                       Fix Released  Medium      Yan Zhang

Percona XtraDB Cluster, moved to https://jira.percona.com/projects/PXC (status tracked in 5.6)
  5.5                       Fix Released  Undecided   Unassigned
  5.6                       Fix Released  Undecided   Unassigned

Bug Description

The worst part is that it does not even fail: it pretends to be working, so the SST request is prepared as if rsync were listening on the specified port.

Changed in codership-mysql:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → Medium
status: New → Confirmed
Yan Zhang (yan.zhang) wrote :

Actually, I hit this problem the first time I tried Galera, using 'scripts/command.sh start'. I configured three Galera instances named home{0,1,2}. Instance home0 is started first; then home{1,2} are started simultaneously and both request a state transfer (SST). Usually the home2 instance failed with the following message:

cat: /dev/shm/galera2/mysql/var//rsync_sst.pid: No such file or directory

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

There is also this loop, which can run forever:

    until check_pid_and_port $RSYNC_PID $RSYNC_PORT
    do
        sleep 0.2
    done

where it does

    [ -r "$pid_file" ] && ps -p $(cat $pid_file) >/dev/null 2>&1

Yan Zhang (yan.zhang) wrote :

Maybe we could set a timeout to avoid the infinite loop:

```
    timeout=200 # 200 polls * 0.2 s = 40 secs.
    until check_pid_and_port $RSYNC_PID $RSYNC_PORT
    do
        sleep 0.2
        timeout=$((timeout-1))
        # -eq is the portable numeric test; == inside [ ] is a bashism
        if [ $timeout -eq 0 ]
        then
            wsrep_log_error "rsync daemon may have failed already."
            exit 255 # unknown error. maybe port has been taken
        fi
    done
```

Alex Yurchenko (ayurchen) wrote :

This is one possibility. But you don't want to wait even 40 seconds if the port is already busy.

I'd suggest enhancing the check_pid_and_port() function to exit right away with an appropriate error message if it finds the port busy.

Yan Zhang (yan.zhang) wrote :

netstat -nltp sometimes cannot output the pid/progname, e.g.:
```
tcp 0 0 0.0.0.0:10033 0.0.0.0:* LISTEN -
tcp6 0 0 :::10033 :::* LISTEN -
```

so we can't detect whether the port has been taken by this rsync process or not.

lsof seems much more reliable, so we changed to lsof for all platforms:

http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3980
http://bazaar.launchpad.net/~codership/codership-mysql/5.6/revision/4075
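
For illustration only, and not the code from the revisions linked above: a minimal sketch of how an lsof-based check could tell "our rsync is listening" from "nothing is listening yet" and "another process owns the port", so that the wait loop can abort at once in the last case and still time out otherwise. The names listener_pids, check_port_owner and wait_for_rsync are hypothetical. The sketch assumes an lsof that supports -sTCP:LISTEN and -t, and enough privileges to see the process that owns the socket.

```shell
# Hypothetical helper: print the PIDs listening on TCP port $1, one per line.
# Assumes lsof is on PATH. lsof exits non-zero when nothing matches,
# so the status is masked and an empty output means "no visible listener".
listener_pids()
{
    lsof -nP -iTCP:"$1" -sTCP:LISTEN -t 2>/dev/null || true
}

# Returns 0 if PID $1 is listening on port $2,
#         1 if nobody is listening yet (keep waiting),
#         2 if only other processes hold the port (give up).
check_port_owner()
{
    cpo_pids=$(listener_pids "$2")
    [ -z "$cpo_pids" ] && return 1
    for cpo_pid in $cpo_pids
    do
        [ "$cpo_pid" = "$1" ] && return 0
    done
    return 2
}

# Poll until PID $1 listens on port $2; give up after $3 polls (default 200).
wait_for_rsync()
{
    wfr_left=${3:-200}
    while [ "$wfr_left" -gt 0 ]
    do
        wfr_rc=0
        check_port_owner "$1" "$2" || wfr_rc=$?
        case $wfr_rc in
            0) return 0 ;;
            2) echo "port $2 is taken by another process" >&2
               return 16 ;; # EBUSY
        esac
        sleep 0.2
        wfr_left=$((wfr_left-1))
    done
    echo "rsync did not start listening on port $2" >&2
    return 255
}
```

In the joiner script this might be called as `wait_for_rsync "$RSYNC_REAL_PID" "$RSYNC_PORT" || exit $?`, where RSYNC_REAL_PID is an assumed variable holding the PID of the rsync daemon just started. Note that without sufficient privileges lsof may not show listeners owned by other users, in which case a busy port looks like "nobody is listening yet" and only the timeout ends the wait.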

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

On CentOS Jenkins, this was required:

------------------------------------------------------------
revno: 724
committer: Raghavendra D Prabhu <email address hidden>
branch nick: trunk-25
timestamp: Mon 2014-04-21 14:16:22 +0530
message:
  Add the PATH for lsof
diff:
=== modified file 'scripts/wsrep_sst_rsync.sh'
--- scripts/wsrep_sst_rsync.sh 2014-04-20 17:19:52 +0000
+++ scripts/wsrep_sst_rsync.sh 2014-04-21 08:46:22 +0000
@@ -25,6 +25,9 @@

 . $(dirname $0)/wsrep_sst_common

+# Setting the path for lsof
+export PATH="/usr/sbin:/sbin:$PATH"
+
 cleanup_joiner()
 {
     wsrep_log_info "Joiner cleanup."

Otherwise, wsrep_sst_rsync was just hanging when lsof (which is /usr/sbin/lsof on CentOS) was not found.
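
As a complement to extending PATH, the script could also fail fast when a required tool is missing instead of hanging in the wait loop. The following is only a sketch under that assumption; require_tool is a hypothetical helper, not part of wsrep_sst_rsync, and a real script would presumably log through wsrep_log_error rather than echo.

```shell
# Hypothetical guard: return 0 if command $1 is found in PATH,
# otherwise print a diagnostic to stderr and return 2 (ENOENT).
require_tool()
{
    if ! command -v "$1" >/dev/null 2>&1
    then
        echo "'$1' not found in PATH ($PATH)" >&2
        return 2
    fi
    return 0
}
```

Calling `require_tool lsof || exit $?` before the wait loop would turn a missing lsof into an immediate, visible error.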

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1282
