Seg fault when "FLUSH LOGS" simultaneously on both Master and Slave
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona Server (moved to https://jira.percona.com/projects/PS) | New | Undecided | Unassigned |
Bug Description
We have a master-slave setup, and a cron job executes "FLUSH LOGS" on both servers every 5 minutes. Due to poor locking, a race condition exists that can occasionally crash the server with a segmentation fault.
This is on Percona Server 5.5.27-28.1-log.
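For reference, the cron entry is equivalent to something like this (user, password, and the mysql path are placeholders; the host varies per server):
*/5 * * * * mysql -uuser -ppass -e "FLUSH LOGS"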
How to reproduce:
1) Set up a master-slave environment. Do some dummy inserts on Master in a loop (the dummy_test table is assumed to exist; see the setup sketch after these steps):
i=1; while true; do echo $i; mysql -uuser -ppass -A -Dtest -hmaster -e "insert into dummy_test(id, value) values ($i, 'abc')"; i=$(( i+1 )); done
2) Set up a loop running "FLUSH LOGS" on Master
while true; do mysql -uuser -ppass -A -Dtest -hmaster -e "FLUSH LOGS"; done
3) Set up another loop running "FLUSH LOGS" on Slave
while true; do mysql -uuser -ppass -A -Dtest -hslave -e "FLUSH LOGS"; done
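The insert loop in step 1 assumes a dummy_test table; a minimal sketch of a matching schema, inferred from the insert statement (column types are an assumption):
mysql -uuser -ppass -A -Dtest -hmaster -e "CREATE TABLE IF NOT EXISTS dummy_test (id INT PRIMARY KEY, value VARCHAR(16))"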
Details:
The full stack trace is attached. The two important threads are:
*Thread 527 (Thread 8937)*:
#0 0x00000039f1e0d4c4 in __lll_lock_wait () from /lib64/
#1 0x00000039f1e08e1a in _L_lock_1034 () from /lib64/
#2 0x00000039f1e08cdc in pthread_mutex_lock () from /lib64/
#3 0x000000000070c5fb in inline_
src_
src_line=4316) at /usr/src/
#4 0x0000000000714994 in MYSQL_BIN_
#5 0x0000000000526b0c in rotate_relay_log (mi=0x2abbec011df0) at /usr/src/
#6 0x00000000005295af in queue_event (mi=0x2abbec011df0, buf=0x2abc42542721 "\205\346\
at /usr/src/
#7 0x000000000052f1f8 in handle_slave_io (arg=0x2abbec01
#8 0x00000000008fb0d9 in pfs_spawn_thread (arg=0x2abbec00
#9 0x00000039f1e0673d in start_thread () from /lib64/
#10 0x00000039f12d44bd in clone () from /lib64/libc.so.6
*Thread 1 (Thread 0x539e3940 (LWP 22442))*:
#0 0x00000039f1e0bd02 in pthread_kill () from /lib64/
#1 0x00000000006839ae in handle_fatal_signal (sig=11) at /usr/src/
#2 <signal handler called>
#3 0x00002abc94450d60 in ?? ()
#4 0x00000000007135fb in MYSQL_BIN_LOG::open (this=0x2abbec0
new_name=<value optimized out>, io_cache_
at /usr/src/
#5 0x0000000000714aea in MYSQL_BIN_
#6 0x0000000000526b0c in rotate_relay_log (mi=0x2abbec011df0) at /usr/src/
#7 0x0000000000645b81 in reload_
at /usr/src/
#8 0x00000000005874d2 in mysql_execute_
#9 0x000000000058b913 in mysql_parse (thd=0x2abc6055
at /usr/src/
#10 0x000000000058cf72 in dispatch_command (command=COM_QUERY, thd=0x2abc605516e0, packet=
at /usr/src/
#11 0x00000000006273b7 in do_handle_
#12 0x00000000006274f1 in handle_
#13 0x00000000008fb0d9 in pfs_spawn_thread (arg=0x2abbf147
#14 0x00000039f1e0673d in start_thread () from /lib64/
#15 0x00000039f12d44bd in clone () from /lib64/libc.so.6
(gdb) p *description_
$3 = {
<Start_
<Log_event> = {
_
log_pos = 0,
temp_buf = 0x0,
when = 0,
exec_time = 0,
data_written = 0,
server_id = 1016589,
flags = 0,
cache_type = 0,
slave_
thd = 0x0
},
members of Start_log_event_v3:
created = 0,
binlog_version = 3,
server_version = "4.0\000\
dont_
},
members of Format_
common_header_len = 19 '\023',
number_
post_header_len = 0x2abc880b79b0 "8\v",
server_
event_
}
In thread 527, process_io_rotate() in sql/slave.cc has a lock on active_
The race condition is such that thread 1 may manage to squeak past the check "if (description_
In another run the same race ended in an abort rather than a segfault: thread 1 hit __cxa_pure_virtual, consistent with calling a virtual method on a Format_description_log_event that was being destroyed under it:
#0 0x00002abcf777c285 in raise () from /lib64/libc.so.6
#1 0x00002abcf777dd30 in abort () from /lib64/libc.so.6
#2 0x00002abcf7775706 in __assert_fail () from /lib64/libc.so.6
#3 0x00000000007afbe3 in __cxa_pure_virtual () at /usr/src/
#4 0x0000000000723018 in Log_event:
at /usr/src/
#5 0x0000000000725546 in Format_
at /usr/src/
#6 0x00000000007135fb in MYSQL_BIN_LOG::open (this=0x2abbdb4
new_name=<value optimized out>, io_cache_
at /usr/src/
#7 0x0000000000714aea in MYSQL_BIN_
#8 0x0000000000526b0c in rotate_relay_log (mi=0x2abbdb4e66d0) at /usr/src/
#9 0x0000000000645b81 in reload_
at /usr/src/
#10 0x00000000005874d2 in mysql_execute_
#11 0x000000000058b913 in mysql_parse (thd=0x2abbe000
at /usr/src/
#12 0x000000000058cf72 in dispatch_command (command=COM_QUERY, thd=0x2abbe000bb00, packet=
at /usr/src/
#13 0x00000000006273b7 in do_handle_
#14 0x00000000006274f1 in handle_
#15 0x00000000008fb0d9 in pfs_spawn_thread (arg=0x2abbdbdc
#16 0x00002abcf6a6c77d in start_thread () from /lib64/
#17 0x00002abcf781f25d in clone () from /lib64/libc.so.6
(gdb) p *description_
$10 = {
<Start_
<Log_event> = {
_
log_pos = 0,
temp_buf = 0x0,
when = 0,
exec_time = 0,
data_written = 103,
server_id = 10165122,
flags = 0,
cache_type = 0,
slave_
thd = 0x0
},
members of Start_log_event_v3:
created = 0,
binlog_version = 3,
server_version = "4.0\000\
dont_
},
members of Format_
common_header_len = 19 '\023',
number_
post_header_len = 0x2abbe0000a20 "8\v",
server_
event_
}
Workaround:
For our use case, we only need the binary logs flushed every 5 minutes, so we are using "FLUSH LOCAL BINARY LOGS" instead of "FLUSH LOGS". This avoids the relay-log rotation path where the race occurs, and LOCAL keeps the statement out of the binary log so it is not replicated to the slave.
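The cron job's statement then becomes (same placeholder credentials as in the reproduction above):
mysql -uuser -ppass -e "FLUSH LOCAL BINARY LOGS"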
A patch with the one necessary lock is attached.
Changed in percona-server:
assignee: nobody → Muhammad Irfan (muhammad-irfan)
Changed in percona-server:
assignee: Muhammad Irfan (muhammad-irfan) → nobody
The full backtrace for one of the crashes.