wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't work when 'timeout -k' is unsupported
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Status tracked in 5.6 | | |
5.5 | Fix Released | Undecided | Raghavendra D Prabhu |
5.6 | Fix Released | Undecided | Raghavendra D Prabhu |
Bug Description
wsrep_xtrabacku
It is handled in the code like this:
```
recv_joiner()
{
...
if [[ $tmt -gt 0 && -x `which timeout` ]];then
if timeout --help | grep -q -- '-k';then
else
fi
timeit "$msg" "$ltcmd | $strmcmd; RC=( "\${PIPESTATUS[@]}" )"
else
timeit "$msg" "$tcmd | $strmcmd; RC=( "\${PIPESTATUS[@]}" )"
fi
....
}
```
RHEL 6.6 for example does not support the '-k' flag.
So the script falls back to just using 'timeout '.
But I have seen that when plain 'timeout' is used, the timeout does not work: the 'socat' command keeps running forever.
The only way that I can get it to work is by adding timeout:
That sends the KILL signal upon timeout, which IMHO is the same as 'timeout -k'
I've seen this with MariaDB Galera Cluster 10.0.14 and 10.0.15, but PXC behavior will be the same.
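The fallback described above can be sketched as a small shell function. This is a minimal illustration, not the script's actual code: `timeout_prefix` is a hypothetical helper, the help text is passed in as an argument so both branches can be exercised on any system, and the 10-second grace period before KILL is an assumed value.

```shell
# Hypothetical helper (not part of wsrep_sst_xtrabackup-v2.sh).
# $1: timeout in seconds; $2: output of `timeout --help`.
# Prints the timeout prefix to put in front of the streaming command.
timeout_prefix() {
    if printf '%s\n' "$2" | grep -q -- '-k'; then
        # -k is supported: TERM at the deadline, KILL after a grace period.
        # The 10-second grace period is an assumption for this sketch.
        echo "timeout -k $(( $1 + 10 )) $1"
    else
        # -k is not supported: send KILL directly at the deadline.
        echo "timeout -s KILL $1"
    fi
}
```

It could be used as `ltcmd="$(timeout_prefix "$tmt" "$(timeout --help)") $tcmd"`. Note that the KILL branch gives the child no chance to clean up, which is the trade-off discussed later in this report.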
Related branches
- Alexey Kopytov (community): Approve on 2015-02-05
  Diff: 12 lines (+1/-1), 1 file modified: scripts/wsrep_sst_xtrabackup-v2.sh (+1/-1)
- Alexey Kopytov (community): Resubmit on 2015-02-23
  Diff: 12 lines (+1/-1), 1 file modified: scripts/wsrep_sst_xtrabackup-v2.sh (+1/-1)
summary:
- wsrep_xtrabackup_sst-v2 timeout doesn't kill
+ wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't kill when 'timeout -k' is unsupported
summary:
- wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't kill when 'timeout -k' is unsupported
+ wsrep_xtrabackup_sst-v2 sst-initial-timeout doesn't work when 'timeout -k' is unsupported
Changed in percona-xtradb-cluster:
status: New → Confirmed
tags: added: i49877
Same on CentOS 6.6, if anybody cares:
[openxs@centos bzr2]$ timeout --help
Usage: timeout [OPTION] NUMBER[SUFFIX] COMMAND [ARG]...
or: timeout [OPTION]
Start COMMAND, and kill it if still running after NUMBER seconds.
SUFFIX may be `s' for seconds (the default), `m' for minutes,
`h' for hours or `d' for days.
Mandatory arguments to long options are mandatory for short options too.
-s, --signal=SIGNAL
--help display this help and exit
--version output version information and exit
If the command times out, then exit with status 124. Otherwise, exit
with the status of COMMAND. If no signal is specified, send the TERM
signal upon timeout. The TERM signal kills any process that does not
block or catch that signal. For other processes, it may be necessary to
use the KILL (9) signal, since this signal cannot be caught.
Report timeout bugs to <email address hidden>
GNU coreutils home page: <http://
General help using GNU software: <http://
For complete documentation, run: info coreutils 'timeout invocation'
[openxs@centos bzr2]$ cat /etc/issue
CentOS release 6.6 (Final)
Kernel \r on an \m
Alexey Kopytov (akopytov) wrote : | #3 |
To the bug verification team:
the problem described in this report is not about the "-k" switch not being supported by the timeout utility. This case is handled in wsrep_xtrabacku
The problem being described here is that under some (unknown) circumstances the socat utility is not properly terminated after a timeout when started like this:
timeout $tmt $tcmd
where $tcmd contains socat + some command line arguments.
This is what needs to be verified.
Muhammad Irfan (muhammad-irfan) wrote : | #4 |
[mysqld]
# SST method
wsrep_sst_
[SST]
sst-initial-
mysql> show global variables like '%version%';
+------
| Variable_name | Value |
+------
| innodb_version | 5.6.20-68.0 |
| protocol_version| 10|
| slave_type_
| version | 5.6.20-68.0-56 |
| version_comment | Percona XtraDB Cluster (GPL), Release rel68.0, Revision 886, WSREP version 25.7, wsrep_25.7.r4126 |
| version_
| version_compile_os | Linux |
+------
[root@centos3 mysql]# iptables -A INPUT -p tcp --destination-port 4444 -j DROP && iptables -A OUTPUT -p tcp --destination-port 4444 -j DROP
[root@centos3 mysql]# /etc/init.d/mysql start
Starting MySQL (Percona XtraDB Cluster)....State transfer in progress, setting sleep higher
....... ERROR! The server quit without updating PID file (/var/lib/
ERROR! MySQL (Percona XtraDB Cluster) server startup failed!
[root@centos3 ~]# iptables -F
[root@centos3 mysql]# ps -ef | grep mysql
mysql 3618 1 0 02:27 pts/0 00:00:00 /bin/bash -ue /usr//bin/
mysql 3833 3618 0 02:27 pts/0 00:00:00 timeout 1 socat -u TCP-LISTEN:
mysql 3834 3618 0 02:27 pts/0 00:00:00 xbstream -x
mysql 3835 3833 0 02:27 pts/0 00:00:00 socat -u TCP-LISTEN:
root 3875 1855 0 02:30 pts/0 00:00:00 grep mysql
While timeout -s9 works, it is not a replacement for 'timeout -k'. We also want to avoid sending SIGKILL, so that cleanup can take place. Does the process not die eventually even with SIGTERM?
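On systems where timeout lacks -k, the TERM-then-KILL behaviour could in principle be emulated in the script itself, which keeps the chance for cleanup. The following is only a sketch under stated assumptions: `run_with_grace` is a hypothetical helper that is not taken from the SST script or from the fix, and it runs a single command rather than the socat pipeline.

```shell
# Hypothetical helper: run a command, send TERM after $1 seconds and
# KILL $2 seconds later if it is still alive. Not taken from the SST script.
run_with_grace() {
    tmt=$1
    grace=$2
    shift 2
    "$@" &
    pid=$!
    # Watchdog subshell; its output is discarded so it never holds the
    # caller's stdout or stderr open.
    (
        sleep "$tmt"
        kill -TERM "$pid" 2>/dev/null || exit 0   # command already gone
        sleep "$grace"
        kill -KILL "$pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog=$!
    rc=0
    wait "$pid" || rc=$?
    # The command has finished, so the watchdog is no longer needed.
    kill "$watchdog" 2>/dev/null || true
    return "$rc"
}
```

For example, `run_with_grace 1 1 sleep 5` returns a non-zero status after about one second, because `sleep` is terminated by TERM. The variables are global for portability; a real script would probably scope them.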
Able to reproduce using the steps in comment #4.
David Bennett (dbpercona) wrote : | #7 |
Testing script based on Muhammad's comment #4 method
David Bennett (dbpercona) wrote : | #8 |
Reproduced on stock MariaDB 10.0.16 on CentOS 6.6:
# ./test_
--- os and db version ---
CentOS release 6.6 (Final)
Linux centos6-6 2.6.32-
version 10.0.16-
version-comment MariaDB Server, wsrep_25.10.r4144
version-
version-compile-os Linux
version-
--- pertinent config ---
[mysqld]
datadir=
[galera]
wsrep_provider=
wsrep_cluster_
wsrep_sst_
wsrep_sst_
#wsrep_
--- simulate SST timeout ---
MySQL server PID file could not be found! [FAILED]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
--- blocking inbound SST on port 4444 ---
--- start node [mysqld] ---
Starting MySQL...SST in progress, setting sleep higher.....[FAILED]
--- looking for hung processes ---
mysql 9913 1 0 21:37 pts/0 00:00:00 /bin/bash -ue /usr//bin/
mysql 10127 9913 0 21:37 pts/0 00:00:00 timeout 1 socat -u TCP-LISTEN:
mysql 10128 9913 0 21:37 pts/0 00:00:00 xbstream -x
mysql 10129 10127 0 21:37 pts/0 00:00:00 socat -u TCP-LISTEN:
David Bennett (dbpercona) wrote : | #9 |
This is related to https:/
The cause of the problem isn't an incorrect timeout parameter, but the use of the posix_spawn() function to launch SST scripts in the WSREP code. This function masks SIGALRM, which timeout uses to terminate child processes on failure. There is a fix pending for PXC as well as a fix from Codership for their patches. There is no report or fix for MariaDB at this time.
Switching to rsync SST may not fix the issue as the rsync SST script also uses timeout to terminate child processes on failure.
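One way to check whether a script was started with SIGALRM blocked is to inspect the blocked-signal mask that Linux exposes in /proc. This is a small sketch that only observes the condition and is not the fix. It assumes Linux, where SIGALRM is signal 14 (bit 0x2000 of the mask) and `SigBlk` is printed as a hex string; `sigalrm_blocked` and `current_sigblk` are hypothetical helper names.

```shell
# Hypothetical helper: succeed if the given SigBlk hex mask has SIGALRM set.
# Assumes Linux, where SIGALRM is signal 14, i.e. bit 13 (0x2000), and a
# mask of at least four hex digits as printed in /proc/<pid>/status.
sigalrm_blocked() {
    mask=$1
    low=${mask#"${mask%????}"}   # keep only the last four hex digits
    [ $(( 0x$low & 0x2000 )) -ne 0 ]
}

# Hypothetical helper: print the blocked-signal mask of the current shell.
current_sigblk() {
    awk '/^SigBlk:/ { print $2 }' "/proc/$$/status"
}
```

Placed near the top of an SST script, `if sigalrm_blocked "$(current_sigblk)"; then echo "SIGALRM is blocked" >&2; fi` would log the condition described in this comment.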
Percona now uses JIRA for bug reports so this bug report is migrated to: https:/
Just checked on Oracle 6.6 (I don't have RHEL or CentOS) and it is confirmed: timeout there doesn't have the -k parameter:
Mandatory arguments to long options are mandatory for short options too.
-s, --signal=SIGNAL
specify the signal to be sent on timeout.
SIGNAL may be a name like `HUP' or a number.
See `kill -l` for a list of signals
--help display this help and exit
--version output version information and exit
So our script is not compatible with some distributions.
# timeout --version
timeout (GNU coreutils) 8.4