sst_donor_thread stuck/tables_flushed file not created
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL patches by Codership | New | Undecided | Unassigned |
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Status tracked in 5.6 | | |
5.5 | Incomplete | Undecided | Unassigned |
5.6 | Incomplete | Undecided | Unassigned |
Bug Description
In the seesaw test, the donor got stuck at this point in the log:
2013-10-31 15:44:19 11500 [Note] WSREP: Provider paused at a13fb7c4-
2013-10-31 15:44:19 11500 [Note] WSREP: Tables flushed.
It appeared that sst_donor_thread was waiting for input from the wsrep_sst_rsync script:
#0 0x00007f00405c48cd in read () at ../sysdeps/
#1 0x00007f0040558ff8 in _IO_new_
#2 0x00007f004055a03e in _IO_default_uflow (fp=0x7effd4008190) at genops.c:440
#3 0x00007f004054e18a in _IO_getline_info (fp=0x7effd4008190, buf=0x7efff8ff8dc0 "flush tables", n=127, delim=10,
extract_
#4 0x00007f004054d06b in _IO_fgets (buf=0x7efff8ff8dc0 "flush tables", n=<optimized out>, fp=0x7effd4008190) at iofgets.c:58
#5 0x000000000062ba0e in my_fgets (buf=0x7efff8ff8dc0 "flush tables", buf_len=128, stream=
at /home/teemu/
#6 0x000000000062dd47 in sst_donor_thread (a=0x7f003c0d6040)
at /home/teemu/
#7 0x00007f00410b6e9a in start_thread (arg=0x7efff8ff
#8 0x00007f00405d1ccd in clone () at ../sysdeps/
#9 0x0000000000000000 in ?? ()
(gdb) f 6
#6 0x000000000062dd47 in sst_donor_thread (a=0x7f003c0d6040)
at /home/teemu/
859 out= my_fgets (out_buf, out_len, proc.pipe());
(gdb) p locked
$14 = true
In other words, wsrep_sst_rsync had written "flush tables" to the pipe.
The process list indicates that wsrep_sst_rsync was waiting for the "tables_flushed" file to be created:
11500 pts/19 tl 1:07 /run/shm/
14848 pts/19 S 0:00 \_ sh -c wsrep_sst_rsync --role 'donor' --address 'gw:10013/
14850 pts/19 S 0:52 \_ /bin/bash -ue /run/shm/
28488 pts/19 S 0:00 \_ sleep 0.2
The sleep corresponds to the lines
# wait for tables flushed and state ID written to the file
while [ ! -r "$FLUSHED" ] && ! grep -q ':' "$FLUSHED" >/dev/null 2>&1
do
sleep 0.2
done
in the script.
So either "tables_flushed" was never created (which seems unlikely to happen without an error message in the log), or the file was somehow deleted before the script saw it.
A related issue is that this loop can run forever under some circumstances, so it would be prudent to add a timeout after which the script bails out (if FTWRL is not possible at all).
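A bounded version of the wait loop could look roughly like the sketch below. It is not the actual wsrep_sst_rsync code: the function name wait_for_flushed, its arguments, and the 30-second default are hypothetical and chosen only for illustration. The sketch also assumes the intent stated in the script's comment, namely to wait until the file is both readable and contains a ':' (the state ID), which is slightly stricter than the quoted condition.

```shell
#!/bin/bash
# Hypothetical helper (not part of wsrep_sst_rsync): wait until the file
# given as $1 is readable and contains ':', giving up after roughly $2
# seconds (assumed default: 30).
wait_for_flushed() {
    local flushed="$1"
    local timeout="${2:-30}"
    local polls=$(( timeout * 5 ))   # one poll every 0.2 s
    local i=0

    until [ -r "$flushed" ] && grep -q ':' "$flushed" 2>/dev/null
    do
        if [ "$i" -ge "$polls" ]; then
            echo "timed out after ${timeout}s waiting for $flushed" >&2
            return 1
        fi
        sleep 0.2
        # Plain assignment keeps this safe under 'bash -ue'.
        i=$(( i + 1 ))
    done
    return 0
}
```

A caller could then treat a non-zero return as an SST failure, for example by reporting the error and exiting instead of polling forever. Exiting would close the script's end of the pipe, so the read in sst_donor_thread should then see end-of-file rather than block indefinitely.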