Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't work when 'timeout -k' is unsupported

Bug #1409710 reported by Kenny Gryp on 2015-01-12

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.5	Fix Released	Undecided	Raghavendra D Prabhu	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.5.41-25.11
5.6	Fix Released	Undecided	Raghavendra D Prabhu	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.6.22-25.8

Bug Description

wsrep_xtrabackup_sst-v2 has an option called sst-initial-timeout http://www.percona.com/doc/percona-xtradb-cluster/5.6/manual/xtrabackup_sst.html#sst-initial-timeout

It is handled in the code like this::

```
recv_joiner()
{
...
    if [[ $tmt -gt 0 && -x `which timeout` ]];then
        if timeout --help | grep -q -- '-k';then
            ltcmd="timeout -k $(( tmt+10 )) $tmt $tcmd"
        else
            ltcmd="timeout $tmt $tcmd"
        fi
        timeit "$msg" "$ltcmd | $strmcmd; RC=( "\${PIPESTATUS[@]}" )"
    else
        timeit "$msg" "$tcmd | $strmcmd; RC=( "\${PIPESTATUS[@]}" )"
    fi
....
}
```

RHEL 6.6 for example does not support the '-k' flag.

So the script falls back to just using 'timeout '

But I have seen that if 'timeout' is being used, that the timeout does not work. 'socat' command keeps on running forever....

The only way that I can get it to work is by adding timeout:

ltcmd="timeout -s9 $tmt $tcmd"

That sends the KILL signal upon timeout, which IMHO is the same as 'timeout -k'

I've had this with mariadb galera cluster 10.0.14 and 10.0.15. but PXC behavior will be the same.

Tags:

Related branches

lp:~percona-core/percona-xtradb-cluster/bug-1409710-5.6

Merged into lp:percona-xtradb-cluster

Alexey Kopytov (community): Approve on 2015-02-05

lp:~percona-core/percona-xtradb-cluster/bug-1409710-5.5

Rejected for merging into lp:percona-xtradb-cluster/5.5

Alexey Kopytov (community): Needs Resubmitting on 2015-02-23

Kenny Gryp (gryp) on 2015-01-12

summary:	- wsrep_xtrabackup_sst-v2 timeout doesn't kill + wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't kill when 'timeout + -k' is unsupported
summary:	- wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't kill when 'timeout + wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't work when 'timeout -k' is unsupported

Revision history for this message

Miguel Angel Nieto (miguelangelnieto) wrote on 2015-01-13:

Just checked on Oracle 6.6 (I don't have a RH or Centos) and it is confirmed, it doesn't have -k parameter:

Mandatory arguments to long options are mandatory for short options too.
  -s, --signal=SIGNAL
                   specify the signal to be sent on timeout.
                   SIGNAL may be a name like `HUP' or a number.
                   See `kill -l` for a list of signals
      --help display this help and exit
      --version output version information and exit

So our script is not compatible with some distributions.

# timeout --version
timeout (GNU coreutils) 8.4

Changed in percona-xtradb-cluster:
status:	New → Confirmed
tags:	added: i49877

Revision history for this message

Valerii Kravchuk (valerii-kravchuk) wrote on 2015-01-13:

Same on CentOS 6.6. if somebody cares:

[openxs@centos bzr2]$ timeout --help
Usage: timeout [OPTION] NUMBER[SUFFIX] COMMAND [ARG]...
or: timeout [OPTION]
Start COMMAND, and kill it if still running after NUMBER seconds.
SUFFIX may be `s' for seconds (the default), `m' for minutes,
`h' for hours or `d' for days.

If the command times out, then exit with status 124. Otherwise, exit
with the status of COMMAND. If no signal is specified, send the TERM
signal upon timeout. The TERM signal kills any process that does not
block or catch that signal. For other processes, it may be necessary to
use the KILL (9) signal, since this signal cannot be caught.

Report timeout bugs to <email address hidden>
GNU coreutils home page: <http://www.gnu.org/software/coreutils/>
General help using GNU software: <http://www.gnu.org/gethelp/>
For complete documentation, run: info coreutils 'timeout invocation'
[openxs@centos bzr2]$ cat /etc/issue
CentOS release 6.6 (Final)
Kernel \r on an \m

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2015-01-13:

To the bug verification team:

the problem described in this report is not about the "-k" switch not being supported by the timeout utility. This case is handled in wsrep_xtrabackup_sst-v2 as you can see from the original report.

The problem being described here is that under some (unknown) circumstances the socat utility is not properly terminated after a timeout when started like this:

timeout $tmt $tcmd

where $tcmd contains socat + some command line arguments.

This is what needs to be verified.

Revision history for this message

Muhammad Irfan (muhammad-irfan) wrote on 2015-01-23:

[mysqld]
# SST method
wsrep_sst_method=xtrabackup-v2
[SST]
sst-initial-timeout=1

[root@centos3 mysql]# iptables -A INPUT -p tcp --destination-port 4444 -j DROP && iptables -A OUTPUT -p tcp --destination-port 4444 -j DROP

[root@centos3 mysql]# /etc/init.d/mysql start
Starting MySQL (Percona XtraDB Cluster)....State transfer in progress, setting sleep higher
....... ERROR! The server quit without updating PID file (/var/lib/mysql/centos3.bm.int.percona.com.pid).
ERROR! MySQL (Percona XtraDB Cluster) server startup failed!

[root@centos3 ~]# iptables -F

[root@centos3 mysql]# ps -ef | grep mysql
mysql 3618 1 0 02:27 pts/0 00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 10.0.3.172 --auth irfan:irfan --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --parent 3604
mysql 3833 3618 0 02:27 pts/0 00:00:00 timeout 1 socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql 3834 3618 0 02:27 pts/0 00:00:00 xbstream -x
mysql 3835 3833 0 02:27 pts/0 00:00:00 socat -u TCP-LISTEN:4444,reuseaddr stdio
root 3875 1855 0 02:30 pts/0 00:00:00 grep mysql

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2015-01-25:

While timeout -s9 works, it is not a replacement. Also, we want to avoid sending SIGKILL to ensure cleanup takes place. Does the process not die eventually even with SIGTERM?

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2015-02-05:

Able to replicate with #4.

Revision history for this message

David Bennett (dbpercona) wrote on 2015-02-06:

Reproduce this bug on Centos 6 / MariaDB 10 Edit (821 bytes, text/x-sh)

Testing script based on Muhammad's comment #4 method

Revision history for this message

David Bennett (dbpercona) wrote on 2015-02-06:

Reproduced on stock MariaDB 10.0.16 on Centos 6.6:

# ./test_xtrabackup-v2-script-lp1409710.sh
--- os and db version ---
CentOS release 6.6 (Final)
Linux centos6-6 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed Jan 28 21:11:36 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
version 10.0.16-MariaDB-wsrep-log
version-comment MariaDB Server, wsrep_25.10.r4144
version-compile-machine x86_64
version-compile-os Linux
version-malloc-library bundled jemalloc
--- pertinent config ---
[mysqld]
datadir=/var/lib/mysql
[galera]
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.122.114,192.168.122.115,192.168.122.116
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth=root:
#wsrep_slave_threads=1
--- simulate SST timeout ---
MySQL server PID file could not be found! [FAILED]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
--- blocking inbound SST on port 4444 ---
--- start node [mysqld] ---
Starting MySQL...SST in progress, setting sleep higher.....[FAILED]
--- looking for hung processes ---
mysql 9913 1 0 21:37 pts/0 00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 192.168.122.116 --auth root: --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --parent 9903
mysql 10127 9913 0 21:37 pts/0 00:00:00 timeout 1 socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql 10128 9913 0 21:37 pts/0 00:00:00 xbstream -x
mysql 10129 10127 0 21:37 pts/0 00:00:00 socat -u TCP-LISTEN:4444,reuseaddr stdio

Revision history for this message

David Bennett (dbpercona) wrote on 2015-02-10:

This is related to https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1399175

The cause of the problem isn't a incorrect timeout parameter, but the use of the posix_spawn() function to launch SST scripts in the WSREP code. This function masks SIGALRM which is used to terminate child processes on failure via timeout. There is a fix pending for PXC as well as a fix from Codership for their patches. There is no report or fix for MariaDB at this time.

Switching to rsync SST may not fix the issue as the rsync SST script also uses timeout to terminate child processes on failure.

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

#10

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1788