Crash during sysbench with debug build

Bug #1189430 reported by Raghavendra D Prabhu on 2013-06-10
This bug affects 2 people
Affects / Status / Importance / Assigned to:

MySQL patches by Codership (status tracked in 5.6)
  5.5: Undecided, Seppo Jaakola
  5.6: Undecided, Unassigned

Percona Server (moved to https://jira.percona.com/projects/PS)
  Incomplete, Undecided, Unassigned
  5.1: Invalid, Undecided, Unassigned
  5.5: Invalid, Undecided, Unassigned
  5.6: Incomplete, Undecided, Unassigned

Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC; status tracked in 5.6)
  5.5: Won't Fix, High, Raghavendra D Prabhu
  5.6: Fix Committed, High, Unassigned

Bug Description

The node taking the writes crashes with the following stack trace:

0x80d9aa my_print_stacktrace + 44
0x6cdb5b handle_fatal_signal + 783
0x7f21a54ab0e0 _end + -1540472296
0x7f21a3c5a1c9 _end + -1565969663
0x7f21a3c5b5c8 _end + -1565964544
0x8772d2 row_sel_store_mysql_rec + 242
0x87c565 row_search_for_mysql + 2698
0x827fec ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function) + 762
0x6d7c72 handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function) + 62
0x6ce20e handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function) + 78
0x5d3af7 join_read_const(st_join_table*) + 142
0x5d3d20 join_read_const_table(st_join_table*, st_position*) + 303
0x5d84e9 make_join_statistics(JOIN*, TABLE_LIST*, Item*, st_dynamic_array*) + 3731
0x5db989 JOIN::optimize() + 2237
0x5e0420 _Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select + 1582
0x5e06b6 handle_select(THD*, LEX*, select_result*, unsigned long) + 379
0x5949be execute_sqlcom_select(THD*, TABLE_LIST*) + 786
0x59c142 mysql_execute_command(THD*) + 2026
0x5a4551 mysql_parse(THD*, char*, unsigned int, Parser_state*) + 522
0x5a51c3 wsrep_mysql_parse(THD*, char*, unsigned int, Parser_state*) + 278
0x5a6199 dispatch_command(enum_server_command, THD*, char*, unsigned int) + 2658
0x5a7eba do_command(THD*) + 1728
0x654a2a do_handle_one_connection(THD*) + 323
0x654c5e handle_one_connection + 73
0x7f21a54a3dd2 _end + -1540501750
0x7f21a3d0acdd _end + -1565245931

Full log here: http://paste.wnohang.net/3c1529

Note: This is a UNIV_DEBUG-only crash; release builds shouldn't crash for this. However, it needs to be checked why this may be happening.

a) Adding Percona Server as well, since it seems to be related to it.

b) It seems the assert was removed in http://bugs.mysql.com/bug.php?id=55626 but added back in http://bugs.mysql.com/bug.php?id=56680.

c) Also note, this is a hard-to-trigger bug.

a) Sysbench cmdline:

Prepare:
sysbench --test=./oltp.lua --db-driver=mysql --mysql-engine-trx=yes --mysql-table-engine=innodb --mysql-socket=/pxc/datadir/pxc.sock --mysql-user=root --mysql-password=test --oltp-table-size=30000 prepare

Run:

sysbench --test=./oltp.lua --db-driver=mysql --mysql-engine-trx=yes --mysql-table-engine=innodb --mysql-socket=/pxc/datadir/pxc.sock --mysql-user=root --mysql-password=test --oltp-table-size=30000 --num-threads=32 --init-rng=on --max-requests=0 --oltp-auto-inc=off --max-time=3000 --max-requests=100000 run
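Since the crash needs 30-40 minutes of load to show up, it helps to drive the run from a small retry loop that restarts the workload until the server dies. A hedged sketch — run_workload here is a stand-in for the real sysbench invocation above (and/or a liveness check on mysqld), and the attempt cap is arbitrary:

```shell
#!/bin/sh
# Rerun the workload until it fails (non-zero exit), up to MAX_RUNS attempts.
# run_workload is a stand-in: in practice it would invoke the sysbench "run"
# command above and report failure when mysqld has crashed.
MAX_RUNS=5
i=1
run_workload() {
    # stand-in: pretend the server "crashes" on the 3rd attempt
    [ "$1" -lt 3 ]
}
while [ "$i" -le "$MAX_RUNS" ]; do
    if ! run_workload "$i"; then
        echo "crash reproduced on run $i"
        break
    fi
    i=$((i + 1))
done
```

With the stand-in above this prints "crash reproduced on run 3"; with the real command the loop simply keeps sysbench going until the assertion fires.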

b) I wonder if this is related to https://bugs.launchpad.net/percona-server/+bug/841621; however, that should have been fixed in 5.5.31 (I am running this on a merge of PS 5.5.31).

I have been able to crash it under gdb. Attached are the bt and 'thread apply all bt full' outputs (the latter is really comprehensive). It also has the log.

However, after analyzing the crash, it looks like:

#2 0x00000000008772d2 in row_sel_store_mysql_rec (
     mysql_rec=mysql_rec@entry=0x7fff8800ff10 "\377\344:", prebuilt=prebuilt@entry=0x7fff880107d8,
     rec=rec@entry=0x7fffd42fa6af "", rec_clust=rec_clust@entry=0, offsets=0x7fffac0efdb0)
     at /media/Tintin/Work/code/percona-xtradb-cluster/release-5.5.31/Percona-Server/storage/innobase/row/row0sel.c:2724
 #3 0x000000000087c565 in row_search_for_mysql (buf=0x7fff8800ff10 "\377\344:", mode=2,
     prebuilt=0x7fff880107d8, match_mode=1, direction=0)

rec is empty in #2.

a) row_sel_try_search_shortcut_for_mysql, called by row_search_for_mysql, does:

	if (rec_get_deleted_flag(rec, dict_table_is_comp(index->table))) {

		return(SEL_EXHAUSTED);
	}

b) row_search_for_mysql also checks: ut_ad(!rec_get_deleted_flag(rec, comp));

   switch (row_sel_try_search_shortcut_for_mysql(
     &rec, prebuilt, &offsets, &heap,
     &mtr)) {
   case SEL_FOUND:
#ifdef UNIV_SEARCH_DEBUG
    ut_a(0 == cmp_dtuple_rec(search_tuple,
        rec, offsets));
#endif
    /* At this point, rec is protected by
    a page latch that was acquired by
    row_sel_try_search_shortcut_for_mysql().
    The latch will not be released until
    mtr_commit(&mtr). */
    ut_ad(!rec_get_deleted_flag(rec, comp));

    if (!row_sel_store_mysql_rec(buf, prebuilt,
            rec, FALSE,
            offsets)) {

c) Finally row_sel_store_mysql_rec also checks ut_ad(!rec_get_deleted_flag(rec, rec_offs_comp(offsets))):

 ut_ad(prebuilt->mysql_template);
 ut_ad(prebuilt->default_rec);
 ut_ad(rec_offs_validate(rec, NULL, offsets));
 ut_ad(!rec_get_deleted_flag(rec, rec_offs_comp(offsets)));

=============================================================================
The interesting part is that ut_ad(!rec_get_deleted_flag(rec, comp)) in (b) is not triggered; rather, (c) after it is triggered.

Also, note that this is not that hard to trigger, but it requires some time (30-40 minutes of sysbench).


Some more info:

(gdb) frame 3
#3 0x000000000087c565 in row_search_for_mysql (buf=0x7fff8800ff10 "\377\344:", mode=2,
    prebuilt=0x7fff880107d8, match_mode=1, direction=0)
    at /media/Tintin/Work/code/percona-xtradb-cluster/release-5.5.31/Percona-Server/storage/innobase/row/row0sel.c:3738
3738 if (!row_sel_store_mysql_rec(buf, prebuilt,

(gdb) info locals
index = 0x7fff98016748
search_tuple = 0x7fff88010af8
pcur = 0x7fff88010848
trx = 0x7fff0c045b48
clust_index = <optimized out>
rec = 0x7fffd42fa6af ""
result_rec = <optimized out>
clust_rec = <optimized out>
err = 10
unique_search = 1
unique_search_from_clust_index = 1
mtr_has_extra_clust_latch = 0
moves_up = 0
set_also_gap_locks = 1
did_semi_consistent_read = 0
i = <optimized out>
should_release = <optimized out>
next_offs = <optimized out>
same_user_rec = 4
mtr = {
  state = 12231,
  memo = {
    heap = 0x0,
    used = 0,
    data = "*&\203\000\000\000\000\000\026Ь\000\000\000\000\000\230\346\254\000\000\000\000\000\n\000\000\200\000\000\000\000\060\a\017\254\377\177\000\000\000\000\000\000\000\000\000\000\240\t\000\f\377\177\000\000\300\003\017\254\377\177\000\000\240\t\000\f\377\177\000\000\320\003\017\254\377\177\000\000\210\066\017\254\377\177\000\000p\001\017\254\377\177\000\000\271C\250\000\000\000\000\000\260\003\017\254\377\177\000\000\240\t\000\f\377\177\000\000\300\003\017\254\377\177\000\000\240\t\000\f\377\177\000\000`\004\017\254\377\177\000\000\210\066\017\254\377\177\000\000\260\001\017\254\377\177\000\000\271C\250\000\000\000\000\000dQ\000\f\377\177\000\000\240\t\000\f\377\177\000\000\000\004\017\254\377\177\000\000\256e\250\000\000\000\000\000\250"...,
    base = {
      count = 140733394717088,
      start = 0x7fffac0f0550,
      end = 0xa865ae <_db_return_+134>
    },
    list = {
      prev = 0x7fffac0f05b0,
      next = 0xf0
    },
    buf_end = 0,
    magic_n = 375767
  },
  log = {
    heap = 0x0,
    used = 0,
    data = "\220\005\017\254\377\177\000\000\256e\250\000\000\000\000\000\220\003\017\254\377\177\000\000\240\t\000\f\377\177\000\000\360\005\017\254\377\177\000\000\210\066\017\254\377\177\000\000\220\003\017\254\377\177\000\000\271C\250\000\000\000\000\000\240\003\017\254\377\177\000\000\240\t\000\f\377\177\000\000\340\005\017\254\377\177\000\000\256e\250\000\000\000\000\000\360\005\017\254\377\177\000\000\240\t\000\f\377\177\000\000\360\006\017\254\377\177\000\000\210\066\017\254\377\177\000\000\340\003\017\254\377\177\000\000\271C\250\000\000\000\000\000`\006\017\254\377\177\000\000\240\t\000\f\377\177\000\000p\006\017\254\377\177\000\000\210\066\017\254\377\177\000\000\020\004\017\254\377\177\000\000\271C\250\000\000\000\000\000P\006\017\254\377\177\000\000"...,
    base = {
      count = 0,
      start = 0x7fff88010b48,
      end = 0x7fff980169a0
    },
    list = {
      prev = 0x0,
      next = 0x7fffac0f0600
    },
    buf_end = 0,
    magic_n = 375767
  },
  inside_ibuf = 0,
  modifications = 0,
  made_dirty = 0,
  n_log_recs = 0,
  n_freed_pages = 0,
  log_mode = 21,
  start_lsn = 140735475145064,
  end_lsn = 1,
  magic_n = 54551
}
heap = 0x0
offsets_ = {100, 6, 140736753280687, 140735743616840, 2147483655, 4, 1...


Config change made for the last run (which crashed): innodb_change_buffering=inserts. It still crashed, so this setting may not be related. Full config: http://paste.wnohang.net/5c1d2b

Changed in percona-xtradb-cluster:
milestone: none → 5.5.31-25
summary: - Crash during sysbench
+ Crash during sysbench with debug build

I've tried for a long time to crash debug builds of upstream MySQL 5.5.32 and Percona Server 5.6.11 on Fedora and Debian 7, but with no luck. How many CPUs/cores do you have on the box where this crash/assertion failure is repeatable?

I need to test the crash with 5.5.32 builds; however, note that this happens after a high uptime with sysbench (on a 2-node cluster). My laptop has 8 CPUs, but I don't think that matters in this case. Also, there is no need to test 5.6.11 or the 5.6 series (since PXC is not on 5.6 yet).

Pasting the log here (since jenkins rotates):

https://gist.github.com/f74c5772887f437d0bad

Another crash Log:

https://gist.github.com/9bf075eb67812acb2e4e

This time (after the latest fixes), it doesn't crash as much, but here it crashed at a different but related location.

Since there is no debug-build-based sysbench job for PS, I will need to run that locally and/or on jenkins to investigate further.

Not seeing the crash anymore: http://jenkins.percona.com/job/PXC-5.5-msysbench/33/

Need a few more runs to verify before closing.

The issue seems to reproduce only when innodb_flush_method is the default and not O_DIRECT. This is likely because the tests run on a VM: the VM doesn't handle fsync correctly, which can happen due to dual caching on the VM and on the host. So this shouldn't occur normally.

Hence, closing the issue for now (as Fix Committed, since a merge fix landed there).

Changed in percona-xtradb-cluster:
assignee: nobody → Raghavendra D Prabhu (raghavendra-prabhu)
status: New → Fix Committed
Alex Yurchenko (ayurchen) wrote :

Got almost identical stack trace with a trivial SELECT, default innodb_flush_method, no VM and datadir in /dev/shm:

#5 0x00007fa2d806d698 in __GI_abort () at abort.c:90
#6 0x00000000009c110d in row_sel_store_mysql_rec (mysql_rec=0x7fa2c0020110 "\341\003", prebuilt=0x7fa2c0020958, rec=0x7fa2a44c00dc "\200", rec_clust=0, offsets=0x7fa2c80e9370)
    at /home/ayurchen/codership/mysql/5.5/storage/innobase/row/row0sel.c:2724
#7 0x00000000009c2b38 in row_search_for_mysql (buf=0x7fa2c0020110 "\341\003", mode=2, prebuilt=0x7fa2c0020958, match_mode=1, direction=0)
    at /home/ayurchen/codership/mysql/5.5/storage/innobase/row/row0sel.c:3701
#8 0x000000000097d366 in ha_innobase::index_read (this=0x7fa2c001e6c0, buf=0x7fa2c0020110 "\341\003", key_ptr=0x7fa270006538 "\002", key_len=4, find_flag=HA_READ_KEY_EXACT)
    at /home/ayurchen/codership/mysql/5.5/storage/innobase/handler/ha_innodb.cc:6485
#9 0x0000000000777568 in handler::index_read_map (this=0x7fa2c001e6c0, buf=0x7fa2c0020110 "\341\003", key=0x7fa270006538 "\002", keypart_map=1, find_flag=HA_READ_KEY_EXACT)
    at /home/ayurchen/codership/mysql/5.5/sql/handler.h:1585
#10 0x0000000000775a79 in handler::index_read_idx_map (this=0x7fa2c001e6c0, buf=0x7fa2c0020110 "\341\003", index=0, key=0x7fa270006538 "\002", keypart_map=1, find_flag=HA_READ_KEY_EXACT)
    at /home/ayurchen/codership/mysql/5.5/sql/handler.cc:4858
#11 0x000000000063ee9a in join_read_const (tab=0x7fa270006060) at /home/ayurchen/codership/mysql/5.5/sql/sql_select.cc:12278
#12 0x000000000063ea2e in join_read_const_table (tab=0x7fa270006060, pos=0x7fa27005f960) at /home/ayurchen/codership/mysql/5.5/sql/sql_select.cc:12181
#13 0x00000000006281f4 in make_join_statistics (join=0x7fa27005f8d0, tables_arg=0x7fa270004db0, conds=0x7fa270005e50, keyuse_array=0x7fa270060e88)
    at /home/ayurchen/codership/mysql/5.5/sql/sql_select.cc:3002
#14 0x000000000062116d in JOIN::optimize (this=0x7fa27005f8d0) at /home/ayurchen/codership/mysql/5.5/sql/sql_select.cc:1057
#15 0x0000000000626c4f in mysql_select (thd=0x20121c0, rref_pointer_array=0x2014580, tables=0x7fa270004db0, wild_num=1, fields=..., conds=0x7fa270005468, og_num=0, order=0x0, group=0x0,
    having=0x0, proc_param=0x0, select_options=2148797184, result=0x7fa270005668, unit=0x2013d78, select_lex=0x2014398) at /home/ayurchen/codership/mysql/5.5/sql/sql_select.cc:2590
#16 0x000000000061ed4c in handle_select (thd=0x20121c0, lex=0x2013cc8, result=0x7fa270005668, setup_tables_done_option=0) at /home/ayurchen/codership/mysql/5.5/sql/sql_select.cc:297
#17 0x00000000005f8900 in execute_sqlcom_select (thd=0x20121c0, all_tables=0x7fa270004db0) at /home/ayurchen/codership/mysql/5.5/sql/sql_parse.cc:4985
#18 0x00000000005f087a in mysql_execute_command (thd=0x20121c0) at /home/ayurchen/codership/mysql/5.5/sql/sql_parse.cc:2450
#19 0x00000000005fb895 in mysql_parse (thd=0x20121c0, rawbuf=0x7fa270004bc0 "SELECT * FROM comm02 WHERE p = 2", length=32, parser_state=0x7fa2c80ec2a0)
    at /home/ayurchen/codership/mysql/5.5/sql/sql_parse.cc:6208
#20 0x00000000005fb0d9 in wsrep_mysql_parse (thd=0x20121c0, rawbuf=0x7fa270004bc0 "SELECT * FROM comm02 WHERE p = 2", length=32, parser_...


Changed in codership-mysql:
assignee: nobody → Seppo Jaakola (seppo-jaakola)
milestone: none → 5.5.32-23.7.6
status: New → Confirmed

I saw a crash with O_DIRECT; switching this to "In Progress". Need to investigate further whether it is due to the device or a real bug.

Changed in percona-xtradb-cluster:
status: Fix Committed → In Progress

I tested with a PS debug build and so far I have not been able to reproduce this (I have to test further with jenkins to confirm), so the issue may be in the wsrep patches (unless this is also not reproducible with vanilla MySQL).

Changed in percona-xtradb-cluster:
milestone: 5.5.33-23.7.6 → none
Changed in codership-mysql:
milestone: 5.5.33-23.7.6 → 5.5.34-24.9
Alex Yurchenko (ayurchen) wrote :

Just hit it again

Teemu Ollakka (teemu-ollakka) wrote :

Still happens with codership-mysql 5.5-23 rev 3931.

Teemu Ollakka (teemu-ollakka) wrote :

And with codership-mysql 5.5 rev 3906.

Teemu Ollakka (teemu-ollakka) wrote :

Reproduced this with a ROLLBACK-heavy pure DML load against a single node.

From a multimaster pure DML run, helgrind errors indicate that there might be a possibility of a race:

==7868== Possible data race during read of size 1 at 0x1483C0A8 by thread #32
==7868== Locks held: none
==7868== at 0x85132F: rec_get_deleted_flag (mach0data.ic:54)
==7868== by 0x85AF05: row_search_for_mysql (row0sel.c:4360)
==7868== by 0x81ED5C: ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int) (ha_innodb.cc:6812)
==7868== by 0x81EEAC: ha_innobase::rnd_next(unsigned char*) (ha_innodb.cc:7006)
==7868== by 0x7A2286: rr_sequential(READ_RECORD*) (records.cc:461)
==7868== by 0x5D0E0E: sub_select(JOIN*, st_join_table*, bool) (sql_select.cc:11801)
==7868== by 0x5D9BEC: do_select(JOIN*, List<Item>*, TABLE*, Procedure*) (sql_select.cc:11562)
==7868== by 0x5EDB08: JOIN::exec() (sql_select.cc:2385)
==7868== by 0x5E7FE6: mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*) (sql_select.cc:2604)
==7868== by 0x5EE219: handle_select(THD*, LEX*, select_result*, unsigned long) (sql_select.cc:297)
==7868== by 0x59AF7C: execute_sqlcom_select(THD*, TABLE_LIST*) (sql_parse.cc:5040)
==7868== by 0x5A36DE: mysql_execute_command(THD*) (sql_parse.cc:2475)
==7868== by 0x5AB83A: mysql_parse(THD*, char*, unsigned int, Parser_state*) (sql_parse.cc:6266)
==7868== by 0x5ABA90: wsrep_mysql_parse(THD*, char*, unsigned int, Parser_state*) (sql_parse.cc:6099)
==7868== by 0x5AD0F8: dispatch_command(enum_server_command, THD*, char*, unsigned int) (sql_parse.cc:1214)
==7868== by 0x5AF918: do_command(THD*) (sql_parse.cc:869)
==7868== by 0x658320: do_handle_one_connection(THD*) (sql_connect.cc:887)
==7868== by 0x6584DC: handle_one_connection (sql_connect.cc:799)
==7868== by 0x4C2DC3D: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7868== by 0x5899E99: start_thread (pthread_create.c:308)
==7868== by 0x67D7CCC: clone (clone.S:112)
==7868==
==7868== This conflicts with a previous write of size 1 by thread #29
==7868== Locks held: none
==7868== at 0x86160B: row_upd_rec_in_place (mach0data.ic:41)
==7868== by 0x8BF796: btr_cur_update_in_place (btr0cur.c:1919)
==7868== by 0x8BFDBA: btr_cur_optimistic_update (btr0cur.c:2023)
==7868== by 0x9C0D38: row_undo_mod_clust_low (row0umod.c:103)
==7868== by 0x9C23C9: row_undo_mod_clust (row0umod.c:247)
==7868== by 0x9C3694: row_undo_mod (row0umod.c:891)
==7868== by 0x9B8B6A: row_undo_step (row0undo.c:306)
==7868== by 0x9A6905: que_run_threads (que0que.c:1245)

Thread #29's stack trace was not complete, but there is a high possibility that it was still doing the same rollback as reported in the previous error:

==7868== Possible data race during read of size 1 at 0x14118C98 by thread #29
==7868== Locks held: none
==7868== at 0x8DD996: buf_page_get_gen (buf0buf.ic:1000)
==7868== by 0x89698C: trx_sys_update_mysql_binlog_offset (trx0sys.ic:119)
==7868== by 0x89A7B2: trx_write_serialisation_history (trx0trx.c:893)
==...


Changed in codership-mysql:
milestone: 5.5.34-25.9 → 5.5.34-25.10
Changed in codership-mysql:
status: Confirmed → Opinion
status: Opinion → Confirmed
Changed in codership-mysql:
milestone: 5.5.37-25.10 → 5.5.37-25.11
no longer affects: codership-mysql

Haven't seen this issue for a while now. Will close the bug and re-visit if we see a recurrence.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-968

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-2971
