mysqld crashes when shuts down under load

Bug #404302 reported by Alex Yurchenko
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Fix Released
High
Seppo Jaakola
Stable
Invalid
Undecided
Unassigned
Trunk
Fix Released
High
Seppo Jaakola

Bug Description

3-node cluster, sqlgen load:

./sqlgen --user=test --pswd=testpass --host=127.0.0.1 --port=3307 --tables=10 --rows=1000 --create=1 --duration=1200

When trying to shut down a node, I get the following:

090724 22:57:12 [Note] gcs_group.c:gcs_group_handle_sync_msg():545: Member 1 (bernhard) synced with group.
090724 22:59:28 [Note] gcs_group.c:gcs_group_handle_sync_msg():545: Member 0 (home) synced with group.
090724 23:01:22 [Note] Got signal 15 to shutdown mysqld
090724 23:01:22 [Note] /home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld: Normal shutdown

090724 23:01:22 [Note] Event Scheduler: Purging the queue. 0 events
090724 23:01:22 [Note] WSREP: rollbacker thread exiting
mysqld: sql_class.h:1178: uint Diagnostics_area::sql_errno() const: Assertion `m_status == DA_ERROR' failed.
090724 23:01:22 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=8384512
read_buffer_size=131072
max_used_connections=7
max_threads=151
threads_connected=3
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 337751 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x9e2a9c0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x960923ac thread_stack 0x30000
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld(my_print_stacktrace+0x26)[0x85b7a8f]
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld(handle_segfault+0x2c0)[0x824d9d0]
[0xb7f3f400]
/lib/tls/i686/cmov/libc.so.6(abort+0x188)[0xb7d4e098]
/lib/tls/i686/cmov/libc.so.6(__assert_fail+0xee)[0xb7d455ce]
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld(_ZNK16Diagnostics_area9sql_errnoEv+0x38)[0x823ba2c]
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0x395)[0x8354415]
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld(_ZN9Log_event11apply_eventEPK14Relay_log_info+0x1f)[0x8270825]
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld[0x825ff0e]
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld(_Z17wsrep_bf_apply_cbPvP16wsrep_apply_data+0x83)[0x826fa83]
/home/alex/codership/mysql-5.1.36-2848,944M/galera/lib/libmmgalera.so[0x9733bb21]
/home/alex/codership/mysql-5.1.36-2848,944M/galera/lib/libmmgalera.so[0x9733c70f]
/home/alex/codership/mysql-5.1.36-2848,944M/galera/lib/libmmgalera.so[0x9733d00e]
/home/alex/codership/mysql-5.1.36-2848,944M/galera/lib/libmmgalera.so[0x9733d58e]
/home/alex/codership/mysql-5.1.36-2848,944M/galera/lib/libmmgalera.so[0x9733f1cf]
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld(_Z25wsrep_replication_processP3THD+0x90)[0x8258a00]
/home/alex/codership/mysql-5.1.36-2848,944M/mysql/libexec/mysqld(start_wsrep_THD+0x3b7)[0x824ee8f]
/lib/tls/i686/cmov/libpthread.so.0[0xb7f164ff]
/lib/tls/i686/cmov/libc.so.6(clone+0x5e)[0xb7e0549e]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at (nil) is an invalid pointer
thd->thread_id=1
thd->killed=KILL_CONNECTION

Related branches

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Here is stack trace of the crash:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0xa6d2cb90 (LWP 4033)]
0x0056a402 in __kernel_vsyscall ()
(gdb) bt
#0 0x0056a402 in __kernel_vsyscall ()
#1 0x00976fd0 in raise () from /lib/i686/nosegneg/libc.so.6
#2 0x009789b1 in abort () from /lib/i686/nosegneg/libc.so.6
#3 0x0097039b in __assert_fail () from /lib/i686/nosegneg/libc.so.6
#4 0x0823b35a in Diagnostics_area::sql_errno (this=0xab97d58)
    at sql_class.h:1178
#5 0x0835551e in Rows_log_event::do_apply_event (this=0xa6224310,
    rli=0xaba7518) at log_event.cc:7220
#6 0x082705d3 in Log_event::apply_event (this=0xa6224310, rli=0xaba7518)
    at log_event.h:1058
#7 0x0825f110 in wsrep_bf_apply_rbr (ctx=0xab96fa0,
    rbr_buf=0xa6231330 "^\226xJ\023", buf_len=130) at sql_parse.cc:8114
#8 0x0826f7a7 in wsrep_bf_apply_cb (ctx=0xab96fa0, data=0xa6d2bf44)
    at sql_parse.cc:8177
#9 0x006d695f in ?? ()
   from /home/galera/mysql-5.1.36-2855/galera/lib/libmmgalera.so

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :
Download full text (39.7 KiB)

simple_open_n_lock_tables(thd, rli->tables_to_lock) failed, here is the contents of
the THD after the call:

(gdb) p *thd
$1 = {<Statement> = {<ilink> = {_vptr.ilink = 0x862d488, prev = 0xb112508,
      next = 0x881c920}, <Query_arena> = {_vptr.Query_arena = 0x862d4a0,
      free_list = 0x0, mem_root = 0xab989d0, is_backup_arena = false,
      is_reprepared = false, state = Query_arena::CONVENTIONAL_EXECUTION},
    id = 0, mark_used_columns = MARK_COLUMNS_READ, name = {str = 0x0,
      length = 2779096485}, lex = 0xab9811c, query = 0x0, query_length = 0,
    cursor = 0x0, db = 0x0, db_length = 0}, <Open_tables_state> = {
    m_reprepare_observer = 0x0, open_tables = 0x0, temporary_tables = 0x0,
    handler_tables = 0x0, derived_tables = 0x0, lock = 0x0,
    locked_tables = 0x0, extra_lock = 0x0, prelocked_mode = NON_PRELOCKED,
    version = 1, current_tablenr = 0, state_flags = 0}, rli_fake = 0x0,
  static DEFAULT_WHERE = 0x862bc0e "field list", net = {vio = 0x0,
    buff = 0xaba14d8 '�' <repeats 200 times>...,
    buff_end = 0xaba54d8 "�������h4z\025",
    write_pos = 0xaba14d8 '�' <repeats 200 times>...,
    read_pos = 0xaba14d8 '�' <repeats 200 times>..., fd = -1515870811,
    remain_in_buf = 0, length = 2779096485, buf_length = 2779096485,
    where_b = 0, max_packet = 16384, max_packet_size = 1048576, pkt_nr = 0,
    compress_pkt_nr = 0, write_timeout = 60, read_timeout = 30,
    retry_count = 10, fcntl = -1515870811, return_status = 0x0,
    reading_or_writing = 0 '\0', save_char = -91 '�', unused0 = -91 '�',
    unused = -91 '�', compress = 0 '\0', unused1 = -91 '�',
    query_cache_query = 0x0, last_errno = 0, error = 0 '\0',
---Type <return> to continue, or q <return> to quit---
    unused2 = -91 '�', return_errno = 0 '\0',
    last_error = "\000", '�' <repeats 511 times>, sqlstate = "������",
    extension = 0xa5a5a5a5}, warn_root = {free = 0xa6231aa8, used = 0x0,
    pre_alloc = 0x0, min_malloc = 32, block_size = 1992, block_num = 5,
    first_block_usage = 0,
    error_handler = 0x81eb8d8 <sql_alloc_error_handler>},
  protocol = 0xab972ac, protocol_text = {<Protocol> = {
      _vptr.Protocol = 0x862e8a8, thd = 0xab96fa0, packet = 0xab9731c,
      convert = 0xab97330, field_pos = 2779096485, field_types = 0x0,
      field_count = 2779096485}, <No data fields>},
  protocol_binary = {<Protocol> = {_vptr.Protocol = 0x862e828,
      thd = 0xab96fa0, packet = 0xab9731c, convert = 0xab97330,
      field_pos = 2779096485, field_types = 0x0, field_count = 2779096485},
    bit_fields = 2779096485}, user_vars = {key_offset = 0, key_length = 0,
    blength = 1, records = 0, flags = 0, array = {
      buffer = 0xab98e90 '�' <repeats 128 times>, "h4z\025a", elements = 0,
      max_element = 16, alloc_increment = 32, size_of_element = 8},
    get_key = 0x822ea00 <get_var_key>, free = 0x822f836 <free_user_var>,
    charset = 0x87f46e0}, packet = {Ptr = 0x0, str_length = 0,
    Alloced_length = 0, alloced = false, str_charset = 0x8814520},
  convert_buffer = {Ptr = 0x0, str_length = 0, Alloced_length = 0,
    alloced = false, str_charset = 0x8814520}, remote = {sin_family = 42405,
    sin_port = 42405, s...

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

...and here are corresponding error log entries:

090804 23:13:18 [Note] /home/galera/mysql-5.1.36-2855/mysql/libexec/mysqld: Norm
al shutdown

090804 23:13:21 [ERROR] Got error 110 from pthread_cond_timedwait
090804 23:13:22 [Note] Event Scheduler: Purging the queue. 0 events
090804 23:13:22 [ERROR] Slave SQL: Error 'Server shutdown in progress' on openin
g tables, Error_code: 1053
090804 23:13:22 [Note] WSREP: rollbacker thread exiting
090804 23:13:22 [Warning] RBR event 2 apply warning: 1053
mysqld: sql_class.h:1178: uint Diagnostics_area::sql_errno() const: Assertion `m
_status == DA_ERROR' failed.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

It appears, that slave applier thread does not treat errors from applying correctly. It should propagate (at least severe) error codes to wsrep recv processor, which should then decide to stop.

Also, KILL_CONNECTION should be checked before applying.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

A fix was committed to enable slave applier thread graceful termination when KILL_CONNECTION is thrown. This fix requires changes in replication provider side (galera library) as well.

However, after the fix for slave applier logic, the crash can still occasionally happen.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Changeset #2861 circumvents this assert. This is safe, because the applier thread was
bailing out from open&lock tables processing due to KILL_CONNECTION signal and no
SQL error has happened.

Also fix for Bug: #413272, modifies shutdown sequence so that slave applier will not be
interrupted anymore.

Testing with trunk version has not bumped into this issue again.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

This is fixed in development tree. This tree is oblosolete

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.