wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't work when 'timeout -k' is unsupported

Bug #1409710 reported by Kenny Gryp on 2015-01-12
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
Fix Released
Undecided
Raghavendra D Prabhu
5.6
Fix Released
Undecided
Raghavendra D Prabhu

Bug Description

wsrep_xtrabackup_sst-v2 has an option called sst-initial-timeout http://www.percona.com/doc/percona-xtradb-cluster/5.6/manual/xtrabackup_sst.html#sst-initial-timeout

It is handled in the code like this::

```
recv_joiner()
{
...
    if [[ $tmt -gt 0 && -x `which timeout` ]];then
        if timeout --help | grep -q -- '-k';then
            ltcmd="timeout -k $(( tmt+10 )) $tmt $tcmd"
        else
            ltcmd="timeout $tmt $tcmd"
        fi
        timeit "$msg" "$ltcmd | $strmcmd; RC=( "\${PIPESTATUS[@]}" )"
    else
        timeit "$msg" "$tcmd | $strmcmd; RC=( "\${PIPESTATUS[@]}" )"
    fi
....
}
```

RHEL 6.6 for example does not support the '-k' flag.

So the script falls back to just using 'timeout '

But I have seen that if 'timeout' is being used, that the timeout does not work. 'socat' command keeps on running forever....

The only way that I can get it to work is by adding timeout:

            ltcmd="timeout -s9 $tmt $tcmd"

That sends the KILL signal upon timeout, which IMHO is the same as 'timeout -k'

I've had this with mariadb galera cluster 10.0.14 and 10.0.15. but PXC behavior will be the same.

Related branches

Kenny Gryp (gryp) on 2015-01-12
summary: - wsrep_xtrabackup_sst-v2 timeout doesn't kill
+ wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't kill when 'timeout
+ -k' is unsupported
summary: - wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't kill when 'timeout
+ wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't work when 'timeout
-k' is unsupported

Just checked on Oracle 6.6 (I don't have a RH or Centos) and it is confirmed, it doesn't have -k parameter:

Mandatory arguments to long options are mandatory for short options too.
  -s, --signal=SIGNAL
                   specify the signal to be sent on timeout.
                   SIGNAL may be a name like `HUP' or a number.
                   See `kill -l` for a list of signals
      --help display this help and exit
      --version output version information and exit

So our script is not compatible with some distributions.

# timeout --version
timeout (GNU coreutils) 8.4

Changed in percona-xtradb-cluster:
status: New → Confirmed
tags: added: i49877

Same on CentOS 6.6. if somebody cares:

[openxs@centos bzr2]$ timeout --help
Usage: timeout [OPTION] NUMBER[SUFFIX] COMMAND [ARG]...
  or: timeout [OPTION]
Start COMMAND, and kill it if still running after NUMBER seconds.
SUFFIX may be `s' for seconds (the default), `m' for minutes,
`h' for hours or `d' for days.

Mandatory arguments to long options are mandatory for short options too.
  -s, --signal=SIGNAL
                   specify the signal to be sent on timeout.
                   SIGNAL may be a name like `HUP' or a number.
                   See `kill -l` for a list of signals
      --help display this help and exit
      --version output version information and exit

If the command times out, then exit with status 124. Otherwise, exit
with the status of COMMAND. If no signal is specified, send the TERM
signal upon timeout. The TERM signal kills any process that does not
block or catch that signal. For other processes, it may be necessary to
use the KILL (9) signal, since this signal cannot be caught.

Report timeout bugs to <email address hidden>
GNU coreutils home page: <http://www.gnu.org/software/coreutils/>
General help using GNU software: <http://www.gnu.org/gethelp/>
For complete documentation, run: info coreutils 'timeout invocation'
[openxs@centos bzr2]$ cat /etc/issue
CentOS release 6.6 (Final)
Kernel \r on an \m

Alexey Kopytov (akopytov) wrote :

To the bug verification team:

the problem described in this report is not about the "-k" switch not being supported by the timeout utility. This case is handled in wsrep_xtrabackup_sst-v2 as you can see from the original report.

The problem being described here is that under some (unknown) circumstances the socat utility is not properly terminated after a timeout when started like this:

timeout $tmt $tcmd

where $tcmd contains socat + some command line arguments.

This is what needs to be verified.

Muhammad Irfan (muhammad-irfan) wrote :

[mysqld]
# SST method
wsrep_sst_method=xtrabackup-v2
[SST]
sst-initial-timeout=1

mysql> show global variables like '%version%';
+-------------------------+---------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------+---------------------------------------------------------------------------------------------------+
| innodb_version | 5.6.20-68.0 |
| protocol_version| 10|
| slave_type_conversions | |
| version | 5.6.20-68.0-56 |
| version_comment | Percona XtraDB Cluster (GPL), Release rel68.0, Revision 886, WSREP version 25.7, wsrep_25.7.r4126 |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
+-------------------------+---------------------------------------------------------------------------------------------------+

[root@centos3 mysql]# iptables -A INPUT -p tcp --destination-port 4444 -j DROP && iptables -A OUTPUT -p tcp --destination-port 4444 -j DROP

[root@centos3 mysql]# /etc/init.d/mysql start
Starting MySQL (Percona XtraDB Cluster)....State transfer in progress, setting sleep higher
....... ERROR! The server quit without updating PID file (/var/lib/mysql/centos3.bm.int.percona.com.pid).
 ERROR! MySQL (Percona XtraDB Cluster) server startup failed!

[root@centos3 ~]# iptables -F

[root@centos3 mysql]# ps -ef | grep mysql
mysql 3618 1 0 02:27 pts/0 00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 10.0.3.172 --auth irfan:irfan --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --parent 3604
mysql 3833 3618 0 02:27 pts/0 00:00:00 timeout 1 socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql 3834 3618 0 02:27 pts/0 00:00:00 xbstream -x
mysql 3835 3833 0 02:27 pts/0 00:00:00 socat -u TCP-LISTEN:4444,reuseaddr stdio
root 3875 1855 0 02:30 pts/0 00:00:00 grep mysql

While timeout -s9 works, it is not a replacement. Also, we want to avoid sending SIGKILL to ensure cleanup takes place. Does the process not die eventually even with SIGTERM?

Able to replicate with #4.

David Bennett (dbpercona) wrote :

Testing script based on Muhammad's comment #4 method

David Bennett (dbpercona) wrote :

Reproduced on stock MariaDB 10.0.16 on Centos 6.6:

# ./test_xtrabackup-v2-script-lp1409710.sh
--- os and db version ---
CentOS release 6.6 (Final)
Linux centos6-6 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed Jan 28 21:11:36 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
version 10.0.16-MariaDB-wsrep-log
version-comment MariaDB Server, wsrep_25.10.r4144
version-compile-machine x86_64
version-compile-os Linux
version-malloc-library bundled jemalloc
--- pertinent config ---
[mysqld]
datadir=/var/lib/mysql
[galera]
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.122.114,192.168.122.115,192.168.122.116
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth=root:
#wsrep_slave_threads=1
--- simulate SST timeout ---
MySQL server PID file could not be found! [FAILED]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
--- blocking inbound SST on port 4444 ---
--- start node [mysqld] ---
Starting MySQL...SST in progress, setting sleep higher.....[FAILED]
--- looking for hung processes ---
mysql 9913 1 0 21:37 pts/0 00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 192.168.122.116 --auth root: --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --parent 9903
mysql 10127 9913 0 21:37 pts/0 00:00:00 timeout 1 socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql 10128 9913 0 21:37 pts/0 00:00:00 xbstream -x
mysql 10129 10127 0 21:37 pts/0 00:00:00 socat -u TCP-LISTEN:4444,reuseaddr stdio

David Bennett (dbpercona) wrote :

This is related to https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1399175

The cause of the problem isn't a incorrect timeout parameter, but the use of the posix_spawn() function to launch SST scripts in the WSREP code. This function masks SIGALRM which is used to terminate child processes on failure via timeout. There is a fix pending for PXC as well as a fix from Codership for their patches. There is no report or fix for MariaDB at this time.

Switching to rsync SST may not fix the issue as the rsync SST script also uses timeout to terminate child processes on failure.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1788

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers