Seg fault when "FLUSH LOGS" simultaneously on both Master and Slave

Bug #1091072 reported by Keyur
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
New
Undecided
Unassigned

Bug Description

We have a master-slave setup and a cron executes "FLUSH LOGS" on both every 5 minutes. Due to poor locking, a race condition exists that can crash the server with a seg fault occasionally.

This is on Percona 5.5.27-28.1-log

How to reproduce:
1) Set up a master-slave environment. Do some dummy inserts on Master in a loop:
    i=1; while(true); do echo $i; mysql -uuser -ppass -A -Dtest -hmaster -e "insert into dummy_test(id, value) values ($i, 'abc')"; i=$(( i+1 )); done;
2) Set up a loop running "FLUSH LOGS" on Master
    while(true); mysql -uuser -ppass -A -Dtest -hmaster -e "FLUSH LOGS"; done;
3) Set up another loop running "FLUSH LOGS" on Slave
    while(true); mysql -uuser -ppass -A -Dtest -hslave -e "FLUSH LOGS"; done;

Details:
The full stack trace is attached. The 2 important threads are:
*Thread 527 (Thread 8937)*:
#0 0x00000039f1e0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000039f1e08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000039f1e08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000070c5fb in inline_mysql_mutex_lock (that=0x2abbec013610,
    src_file=0xa3e618 "/home/jenkins/workspace/percona-server-5.5-rpms/label_exp/centos5-64/target/BUILD/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/log.cc",
    src_line=4316) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/include/mysql/psi/mysql_thread.h:617
#4 0x0000000000714994 in MYSQL_BIN_LOG::new_file_impl (this=0x2abbec013608, need_lock=true) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/log.cc:4316
#5 0x0000000000526b0c in rotate_relay_log (mi=0x2abbec011df0) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/slave.cc:4997
#6 0x00000000005295af in queue_event (mi=0x2abbec011df0, buf=0x2abc42542721 "\205\346\313P\004\016\203\017", event_len=45)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/slave.cc:3954
#7 0x000000000052f1f8 in handle_slave_io (arg=0x2abbec011df0) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/slave.cc:3018
#8 0x00000000008fb0d9 in pfs_spawn_thread (arg=0x2abbec009d80) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/storage/perfschema/pfs.cc:1015
#9 0x00000039f1e0673d in start_thread () from /lib64/libpthread.so.0
#10 0x00000039f12d44bd in clone () from /lib64/libc.so.6

*Thread 1 (Thread 0x539e3940 (LWP 22442))*:
#0 0x00000039f1e0bd02 in pthread_kill () from /lib64/libpthread.so.0
#1 0x00000000006839ae in handle_fatal_signal (sig=11) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/signal_handler.cc:249
#2 <signal handler called>
#3 0x00002abc94450d60 in ?? ()
#4 0x00000000007135fb in MYSQL_BIN_LOG::open (this=0x2abbec013608, log_name=0x2abc942dc630 "/var/lib/mysql/dbslave-relay", log_type_arg=<value optimized out>,
    new_name=<value optimized out>, io_cache_type_arg=<value optimized out>, no_auto_events_arg=false, max_size_arg=1073741824, null_created_arg=true, need_mutex=false)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/log.cc:3181
#5 0x0000000000714aea in MYSQL_BIN_LOG::new_file_impl (this=0x2abbec013608, need_lock=true) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/log.cc:4406
#6 0x0000000000526b0c in rotate_relay_log (mi=0x2abbec011df0) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/slave.cc:4997
#7 0x0000000000645b81 in reload_acl_and_cache (thd=0x2abc605516e0, options=16130, tables=0x0, write_to_binlog=0x539e2600)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_reload.cc:159
#8 0x00000000005874d2 in mysql_execute_command (thd=0x2abc605516e0) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_parse.cc:3804
#9 0x000000000058b913 in mysql_parse (thd=0x2abc605516e0, rawbuf=<value optimized out>, length=1616196272, parser_state=0x539e2e00)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_parse.cc:5811
#10 0x000000000058cf72 in dispatch_command (command=COM_QUERY, thd=0x2abc605516e0, packet=0x2abc3e4a5910 "FLUSH LOGS", packet_length=1402875624)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_parse.cc:1061
#11 0x00000000006273b7 in do_handle_one_connection (thd_arg=<value optimized out>) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_connect.cc:1484
#12 0x00000000006274f1 in handle_one_connection (arg=0x2abc15d36d20) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_connect.cc:1391
#13 0x00000000008fb0d9 in pfs_spawn_thread (arg=0x2abbf1479070) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/storage/perfschema/pfs.cc:1015
#14 0x00000039f1e0673d in start_thread () from /lib64/libpthread.so.0
#15 0x00000039f12d44bd in clone () from /lib64/libc.so.6

(gdb) p *description_event_for_queue
$3 = {
  <Start_log_event_v3> = {
    <Log_event> = {
      _vptr.Log_event = 0xf9acd0,
      log_pos = 0,
      temp_buf = 0x0,
      when = 0,
      exec_time = 0,
      data_written = 0,
      server_id = 1016589,
      flags = 0,
      cache_type = 0,
      slave_exec_mode = 0,
      thd = 0x0
    },
    members of Start_log_event_v3:
    created = 0,
    binlog_version = 3,
    server_version = "4.0\000\062\067-28.1-log", '\000' <repeats 34 times>,
    dont_set_created = false
  },
  members of Format_description_log_event:
  common_header_len = 19 '\023',
  number_of_event_types = 14 '\016',
  post_header_len = 0x2abc880b79b0 "8\v",
  server_version_split = "\004\000",
  event_type_permutation = 0x0
}

In thread 527 process_io_rotate() in sql/slave.cc has a lock on active_mi->data_lock and deletes and creates a new description_event_for_queue of bin-log-type 3. At the same time, thread 1 is reading description_event_for_queue WITHOUT a lock on active_mi->data_lock.

The race condition is such that thread 1 may manage to squeak past the check "if (description_event_for_queue && description_event_for_queue->binlog_version>=4)" in sql/log.cc (MYSQL_BIN_LOG::open() ) but when it comes to description_event_for_queue->write(&log_file), description_event_for_queue is being deleted by thread 527. Most times, this crashes hard. Along with the example above, we've seen crashes like these too:

#0 0x00002abcf777c285 in raise () from /lib64/libc.so.6
#1 0x00002abcf777dd30 in abort () from /lib64/libc.so.6
#2 0x00002abcf7775706 in __assert_fail () from /lib64/libc.so.6
#3 0x00000000007afbe3 in __cxa_pure_virtual () at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/mysys/my_new.cc:52
#4 0x0000000000723018 in Log_event::write_header (this=0x2abbe002aec0, file=0x2abbdb4e8208, event_data_length=<value optimized out>)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/log_event.cc:999
#5 0x0000000000725546 in Format_description_log_event::write (this=0x2abbe002aec0, file=0x2abbdb4e8208)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/log_event.cc:4214
#6 0x00000000007135fb in MYSQL_BIN_LOG::open (this=0x2abbdb4e7ee8, log_name=0x2abbe0055ce0 "/var/lib/mysql/dbslave-relay", log_type_arg=<value optimized out>,
    new_name=<value optimized out>, io_cache_type_arg=<value optimized out>, no_auto_events_arg=false, max_size_arg=1073741824, null_created_arg=true, need_mutex=false)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/log.cc:3181
#7 0x0000000000714aea in MYSQL_BIN_LOG::new_file_impl (this=0x2abbdb4e7ee8, need_lock=true) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/log.cc:4406
#8 0x0000000000526b0c in rotate_relay_log (mi=0x2abbdb4e66d0) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/slave.cc:4997
#9 0x0000000000645b81 in reload_acl_and_cache (thd=0x2abbe000bb00, options=16130, tables=0x0, write_to_binlog=0x51645600)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_reload.cc:159
#10 0x00000000005874d2 in mysql_execute_command (thd=0x2abbe000bb00) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_parse.cc:3804
#11 0x000000000058b913 in mysql_parse (thd=0x2abbe000bb00, rawbuf=<value optimized out>, length=3758151376, parser_state=0x51645e00)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_parse.cc:5811
#12 0x000000000058cf72 in dispatch_command (command=COM_QUERY, thd=0x2abbe000bb00, packet=0x2abbe0056170 "FLUSH LOGS", packet_length=1365532392)
    at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_parse.cc:1061
#13 0x00000000006273b7 in do_handle_one_connection (thd_arg=<value optimized out>) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_connect.cc:1484
#14 0x00000000006274f1 in handle_one_connection (arg=0x2abbdb5b90f0) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/sql/sql_connect.cc:1391
#15 0x00000000008fb0d9 in pfs_spawn_thread (arg=0x2abbdbdcadf0) at /usr/src/debug/Percona-Server-5.5.27-rel28.1/Percona-Server-5.5.27-rel28.1/storage/perfschema/pfs.cc:1015
#16 0x00002abcf6a6c77d in start_thread () from /lib64/libpthread.so.0
#17 0x00002abcf781f25d in clone () from /lib64/libc.so.6

(gdb) p *description_event_for_queue
$10 = {
  <Start_log_event_v3> = {
    <Log_event> = {
      _vptr.Log_event = 0xf9acd0,
      log_pos = 0,
      temp_buf = 0x0,
      when = 0,
      exec_time = 0,
      data_written = 103,
      server_id = 10165122,
      flags = 0,
      cache_type = 0,
      slave_exec_mode = 0,
      thd = 0x0
    },
    members of Start_log_event_v3:
    created = 0,
    binlog_version = 3,
    server_version = "4.0\000\062\067-28.1-log", '\000' <repeats 34 times>,
    dont_set_created = false
  },
  members of Format_description_log_event:
  common_header_len = 19 '\023',
  number_of_event_types = 14 '\016',
  post_header_len = 0x2abbe0000a20 "8\v",
  server_version_split = "\004\000",
  event_type_permutation = 0x0
}

Workaround:
For our use case, we only need the BIN logs flushed every 5 minutes. We're using using "FLUSH LOCAL BINARY LOGS" instead of "FLUSH LOGS".

A patch with the one necessary lock is attached.

Revision history for this message
Keyur (keyurdg) wrote :
Revision history for this message
Keyur (keyurdg) wrote :

The full backtrace for one of the crashes.

Changed in percona-server:
assignee: nobody → Muhammad Irfan (muhammad-irfan)
Revision history for this message
Muhammad Irfan (muhammad-irfan) wrote :

Keyur, this requires more info in order to reproduce. Can you please provide my.cnf file from master/slave
We need to check if seg fault occurs with any specific my.cnf parameter.

Changed in percona-server:
status: New → Incomplete
Changed in percona-server:
assignee: Muhammad Irfan (muhammad-irfan) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Percona Server because there has been no activity for 60 days.]

Changed in percona-server:
status: Incomplete → Expired
Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

It seems same problem still happens on the intermediate master with 5.5.48-37.8 when FLUSH LOGS is executed there on regular basis to get a chance to save/sync binary logs:

pure virtual method called
terminate called without an active exception
22:40:01 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=45
max_threads=20002
thread_count=38
connection_count=38
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 43781150 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x5ec2a000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f1585086e70 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0x7ce92c]
/usr/sbin/mysqld(handle_fatal_signal+0x4a1)[0x69ac21]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f1a66e71340]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39)[0x7f1a65846cc9]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f1a6584a0d8]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f1a65f3b535]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7f1a65f396d6]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7f1a65f39703]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5f1bf)[0x7f1a65f3a1bf]
/usr/sbin/mysqld(_ZN9Log_event12write_headerEP11st_io_cachem+0x81)[0x73c1c1]
/usr/sbin/mysqld(_ZN28Format_description_log_event5writeEP11st_io_cache+0xeb)[0x73caab]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG4openEPKc13enum_log_typeS1_10cache_typebmbb+0x354)[0x72c5e4]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG13new_file_implEb+0x503)[0x72f4d3]
/usr/sbin/mysqld(_Z16rotate_relay_logP11Master_info+0x2c)[0x53dd6c]
/usr/sbin/mysqld(_Z20reload_acl_and_cacheP3THDmP10TABLE_LISTPi+0x848)[0x65a798]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1772)[0x5a6a02]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x13a)[0x6451fa]
/usr/sbin/mysqld(handle_one_connection+0x4a)[0x6452ea]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f1a66e69182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f1a6590a47d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 4099705
Status: NOT_KILLED

Please, take a look at the patch provided and consider implementing it.

There is nothing special in my.cnf (other than some binary and relay logs filtering).

Changed in percona-server:
status: Expired → New
Revision history for this message
Keyur (keyurdg) wrote :

Sorry for not following up earlier. There's nothing special in my my.cnf either, just the default stuff. The bug is trivial to reproduce. Please see the original description.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-2857

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.