Percona Server with XtraDB

InnoDB with shm buffer pool crashes after recovery

Reported by Vadim Tkachenko on 2010-09-27
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Server
High
Unassigned

Bug Description

I do next steps:

- kill mysqld;
- remove shm segment
- start mysqld ( it creates new shm segment)
- InnoDB performs recovery

when recovery finished and we start load on server , InnoDB crashes.

There is bt from gdb and core file.

I have core file and binary.

Thread 1 (Thread 10616):
#0 mem_heap_create_block (heap=0x2acb641be400, n=1048576, type=0,
    file_name=0x2acb641be408 <Address 0x2acb641be408 out of bounds>, line=528) at mem/mem0mem.c:357
#1 0x0000000000707a36 in mem_heap_add_block (heap=0x2acb641be400, n=1048576) at mem/mem0mem.c:446
#2 0x0000000000707db8 in mem_heap_alloc (heap=0x8, data=0x2ac65e4c942c, len=18446744073709546452) at ./include/mem0mem.ic:186
#3 mem_heap_dup (heap=0x8, data=0x2ac65e4c942c, len=18446744073709546452) at mem/mem0mem.c:125
#4 0x000000000075544e in trx_undo_rec_copy (roll_ptr=..., heap=0x2acb641be400) at ./include/trx0rec.ic:110
#5 trx_undo_get_undo_rec_low (roll_ptr=..., heap=0x2acb641be400) at trx/trx0rec.c:1360
#6 0x00000000007554da in trx_undo_get_undo_rec (roll_ptr=..., trx_id=..., undo_rec=0x599f1768, heap=0x2acb641be400)
    at trx/trx0rec.c:1398
#7 0x00000000007573ce in trx_undo_prev_version_build (index_rec=0x2ac65e53a79a <Address 0x2ac65e53a79a out of bounds>,
    index_mtr=<value optimized out>, rec=0x2ac65e53a79a <Address 0x2ac65e53a79a out of bounds>, index=0x2acb6800a7f8,
    offsets=0x599f1ea0, heap=0x2acb641be400, old_vers=0x599f1828) at trx/trx0rec.c:1483
#8 0x000000000074741b in row_vers_build_for_consistent_read (rec=0x2ac65e53a79a <Address 0x2ac65e53a79a out of bounds>,
    mtr=0x599f1a00, index=0x2acb6800a7f8, offsets=0x599f21d0, view=0x2acb680d2468, offset_heap=0x599f21d8,
    in_heap=0x2acb64100e80, old_vers=0x599f21c8) at row/row0vers.c:559
#9 0x000000000073e821 in row_search_for_mysql (buf=0x2acb6808d2a0 <Address 0x2acb6808d2a0 out of bounds>, mode=2,
    prebuilt=0x2acb6809fb08, match_mode=1, direction=0) at row/row0sel.c:4247
#10 0x00000000006e5165 in ha_innobase::index_read (this=0x2acb6808ce80,
    buf=0x2acb6808d2a0 <Address 0x2acb6808d2a0 out of bounds>, key_ptr=<value optimized out>, key_len=<value optimized out>,
    find_flag=HA_READ_KEY_EXACT) at handler/ha_innodb.cc:5801
#11 0x0000000000656ee9 in handler::index_read_idx_map (this=0x2acb6808ce80,
    buf=0x2acb6808d2a0 <Address 0x2acb6808d2a0 out of bounds>, index=<value optimized out>,
    key=0x2acb1f0eec58 <Address 0x2acb1f0eec58 out of bounds>, keypart_map=3, find_flag=HA_READ_KEY_EXACT) at handler.cc:4428
#12 0x00000000005e2897 in join_read_const (tab=0x2acb1f0ee9f0) at sql_select.cc:11909
#13 0x00000000005eb9dd in join_read_const_table (tab=0x2acb1f0ee9f0, pos=0x2acb1f0ecf60) at sql_select.cc:11812
#14 0x00000000005ec3f6 in make_join_statistics (join=0x2acb1f0ecec8, tables_arg=0x2acb641b5eb8, conds=0x2acb1f0ecd20,
---Type <return> to continue, or q <return> to quit---
    keyuse_array=0x2acb1f0ee488) at sql_select.cc:2933
#15 0x00000000005ede07 in JOIN::optimize (this=0x2acb1f0ecec8) at sql_select.cc:1035
#16 0x00000000005f7101 in mysql_select (thd=0x2acb6804c950, rref_pointer_array=0x2acb641b53f8, tables=0x2acb641b5eb8,
    wild_num=0, fields=..., conds=0x2acb1f0ecd20, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0,
    select_options=2416724480, result=0x2acb641b9568, unit=0x2acb641b4e00, select_lex=0x2acb641b5228) at sql_select.cc:2518
#17 0x00000000005f7b55 in handle_select (thd=0x2acb6804c950, lex=0x2acb641b4d60, result=0x2acb641b9568,
    setup_tables_done_option=0) at sql_select.cc:280
#18 0x00000000005864b1 in execute_sqlcom_select (thd=0x2acb6804c950, all_tables=0x2acb641b5eb8) at sql_parse.cc:5181
#19 0x0000000000589721 in mysql_execute_command (thd=0x2acb6804c950) at sql_parse.cc:2362
#20 0x0000000000603520 in Prepared_statement::execute (this=0x2acb641b49f0, expanded_query=<value optimized out>,
    open_cursor=false) at sql_prepare.cc:3879
#21 0x00000000006056f4 in Prepared_statement::execute_loop (this=0x2acb641b49f0, expanded_query=0x599f4180, open_cursor=false,
    packet=<value optimized out>, packet_end=<value optimized out>) at sql_prepare.cc:3554
#22 0x0000000000605a09 in mysqld_stmt_execute (thd=0x2acb6804c950,
    packet_arg=0x2acb6804fcf1 <Address 0x2acb6804fcf1 out of bounds>, packet_length=23) at sql_prepare.cc:2601
#23 0x000000000058fc61 in dispatch_command (command=COM_STMT_EXECUTE, thd=0x2acb6804c950,
    packet=0x2acb6804fcf1 <Address 0x2acb6804fcf1 out of bounds>, packet_length=0) at sql_parse.cc:1231
#24 0x0000000000590e30 in do_command (thd=0x2acb6804c950) at sql_parse.cc:901
#25 0x0000000000582d17 in handle_one_connection (arg=0x2acb6804c950) at sql_connect.cc:1738
#26 0x0000003d2760673d in ?? ()
#27 0x0000000000000000 in ?? ()

Changed in percona-server:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Yasufumi Kinoshita (yasufumi-kinoshita)
Vadim Tkachenko (vadim-tk) wrote :
Download full text (5.2 KiB)

Issue is repeated very well.

Second time I've got message:

thd: 0x2acb5c09b890
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
InnoDB: Error: trying to access update undo rec field 1895425 in index "PRIMARY" of table "tpcc"."warehouse"
InnoDB: but index has only 11 fields
InnoDB: Submit a detailed bug report to http://bugs.mysql.com
InnoDB: Run also CHECK TABLE "tpcc"."warehouse"
InnoDB: n_fields = 4, i = 1, ptr 0x2acb40e42911
InnoDB: Error: trying to access update undo rec for table tpcc/warehouse
InnoDB: but the table id in the undo record is wrong
InnoDB: Submit a detailed bug report to http://bugs.mysql.com
InnoDB: Run also CHECK TABLE tpcc/warehouse
InnoDB: table tpcc/warehouse, index PRIMARY, n_uniq 1
InnoDB: undo rec address 0x2acb40e428f8, type 12 cmpl_info 1
InnoDB: undo rec table id 0 66, index table id 0 49

and stacktrace:

(gdb) bt
#0 mem_heap_create_block (heap=0x2acb4c106950, n=1048576, type=0,
    file_name=0x2acb4c106958 <Address 0x2acb4c106958 out of bounds>, line=528) at mem/mem0mem.c:357
#1 0x0000000000707a36 in mem_heap_add_block (heap=0x2acb4c106950, n=1048576) at mem/mem0mem.c:446
#2 0x0000000000707db8 in mem_heap_alloc (heap=0x8, data=0x2ac6546faca6, len=18446744073709540186) at ./include/mem0mem.ic:186
#3 mem_heap_dup (heap=0x8, data=0x2ac6546faca6, len=18446744073709540186) at mem/mem0mem.c:125
#4 0x000000000075544e in trx_undo_rec_copy (roll_ptr=..., heap=0x2acb4c106950) at ./include/trx0rec.ic:110
#5 trx_undo_get_undo_rec_low (roll_ptr=..., heap=0x2acb4c106950) at trx/trx0rec.c:1360
#6 0x00000000007554da in trx_undo_get_undo_rec (roll_ptr=..., trx_id=..., undo_rec=0x59c71768, heap=0x2acb4c106950)
    at trx/trx0rec.c:1398
#7 0x00000000007573ce in trx_undo_prev_version_build (index_rec=0x2ac65e270880 <Address 0x2ac65e270880 out of bounds>,
    index_mtr=<value optimized out>, rec=0x2ac65e270880 <Address 0x2ac65e270880 out of bounds>, index=0x2aca9c02c788,
    offsets=0x59c71ea0, heap=0x2acb4c106950, old_vers=0x59c71828) at trx/trx0rec.c:1483
#8 0x000000000074741b in row_vers_build_for_consistent_read (rec=0x2ac65e270880 <Address 0x2ac65e270880 out of bounds>,
    mtr=0x59c71a00, index=0x2aca9c02c788, offsets=0x59c721d0, view=0x2acb54015aa8, offset_heap=0x59c721d8,
    in_heap=0x2acb4c106800, old_vers=0x59c721c8) at row/row0vers.c:559
#9 0x000000000073e821 in row_search_for_mysql (buf=0x2acb40dad4c0 <Address 0x2acb40dad4c0 out of bounds>, mode=2,
    prebuilt=0x2acb40dade48, match_mode=1, direction=0) at row/row0sel.c:4247
#10 0x00000000006e5165 in ha_innobase::index_read (this=0x2acb40dad0a0,
    buf=0x2acb40dad4c0 <Address 0x2acb40dad4c0 out of bounds>, key_ptr=<value optimized out>, key_len=<value optimized out>,
    find_flag=HA_READ_KEY_EXACT) at handler/ha_innodb.cc:5801
#11 0x0000000000656ee9 in handler::index_read_idx_map (this=0x2acb40dad0a0,
    buf=0x2acb40dad4c0 <Address 0x2acb40dad4c0 out of bounds>, index=<value optimized out>,
    key=0x2acb406500a0 <Address 0x2acb406500a0 out of bounds>, keypart_map=1, find_flag=HA_READ_KEY_EXACT) at handler.cc:4428
#12 0x00000000005e289...

Read more...

If the fix would not be effective, it might be the problem of the server.
The crash is caused by malloc() returning NULL pointer.

Changed in percona-server:
status: Confirmed → Fix Committed
Vadim Tkachenko (vadim-tk) wrote :

I still have crash after performing recovery.

Vadim Tkachenko (vadim-tk) wrote :
Download full text (4.3 KiB)

Program terminated with signal 11, Segmentation fault.
#0 mem_heap_create_block (heap=0x2acaac0eb9c0, n=100704, type=0, file_name=0x2acaac0eb9c8 "0vers.c", line=528)
    at mem/mem0mem.c:357
357 mem/mem0mem.c: No such file or directory.
        in mem/mem0mem.c
(gdb) bt
#0 mem_heap_create_block (heap=0x2acaac0eb9c0, n=100704, type=0, file_name=0x2acaac0eb9c8 "0vers.c", line=528)
    at mem/mem0mem.c:357
#1 0x0000000000707a46 in mem_heap_add_block (heap=0x2acaac0eb9c0, n=100704) at mem/mem0mem.c:446
#2 0x0000000000707dc8 in mem_heap_alloc (heap=0x8, data=0x2ac96edb5397, len=18446744073709546601) at ./include/mem0mem.ic:186
#3 mem_heap_dup (heap=0x8, data=0x2ac96edb5397, len=18446744073709546601) at mem/mem0mem.c:125
#4 0x000000000075545e in trx_undo_rec_copy (roll_ptr=..., heap=0x2acaac0eb9c0) at ./include/trx0rec.ic:110
#5 trx_undo_get_undo_rec_low (roll_ptr=..., heap=0x2acaac0eb9c0) at trx/trx0rec.c:1360
#6 0x00000000007554ea in trx_undo_get_undo_rec (roll_ptr=..., trx_id=..., undo_rec=0x4182b768, heap=0x2acaac0eb9c0)
    at trx/trx0rec.c:1398
#7 0x00000000007573de in trx_undo_prev_version_build (index_rec=0x2ac96e76af66 "\201e\203", index_mtr=<value optimized out>,
    rec=0x2ac96e76af66 "\201e\203", index=0x2acaa7f65728, offsets=0x4182bea0, heap=0x2acaac0eb9c0, old_vers=0x4182b828)
    at trx/trx0rec.c:1483
#8 0x000000000074742b in row_vers_build_for_consistent_read (rec=0x2ac96e76af66 "\201e\203", mtr=0x4182ba00,
    index=0x2acaa7f65728, offsets=0x4182c1d0, view=0x2acaa8025798, offset_heap=0x4182c1d8, in_heap=0x2acaac0eb870,
    old_vers=0x4182c1c8) at row/row0vers.c:559
#9 0x000000000073e831 in row_search_for_mysql (buf=0x2acaa7f727f0 "\320'\367\247\312*", mode=2, prebuilt=0x2acaa7f73428,
    match_mode=1, direction=0) at row/row0sel.c:4247
#10 0x00000000006e5175 in ha_innobase::index_read (this=0x2acaa7f723d0, buf=0x2acaa7f727f0 "\320'\367\247\312*",
    key_ptr=<value optimized out>, key_len=<value optimized out>, find_flag=HA_READ_KEY_EXACT) at handler/ha_innodb.cc:5803
#11 0x0000000000656ee9 in handler::index_read_idx_map (this=0x2acaa7f723d0, buf=0x2acaa7f727f0 "\320'\367\247\312*",
    index=<value optimized out>, key=0x2acaa8042828 "e\001\003", keypart_map=3, find_flag=HA_READ_KEY_EXACT) at handler.cc:4428
#12 0x00000000005e2897 in join_read_const (tab=0x2acaa80425c0) at sql_select.cc:11909
#13 0x00000000005eb9dd in join_read_const_table (tab=0x2acaa80425c0, pos=0x2acaa8040b30) at sql_select.cc:11812
#14 0x00000000005ec3f6 in make_join_statistics (join=0x2acaa8040a98, tables_arg=0x2acaac0dc9d0, conds=0x2acaa80408f0,
    keyuse_array=0x2acaa8042058) at sql_select.cc:2933
#15 0x00000000005ede07 in JOIN::optimize (this=0x2acaa8040a98) at sql_select.cc:1035
#16 0x00000000005f7101 in mysql_select (thd=0x2acaa7ea6010, rref_pointer_array=0x2acaac0dbf08, tables=0x2acaac0dc9d0,
    wild_num=0, fields=..., conds=0x2acaa80408f0, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0,
    select_options=2416724480, result=0x2acaac0ddd08, unit=0x2acaac0db910, select_lex=0x2acaac0dbd38) at sql_select.cc:2518
#17 0x00000000005f7b55 in handle_select (thd=0x2acaa7ea6010, lex=0x2acaac0db870, result=0x2acaac0ddd08,
    ...

Read more...

Vadim Tkachenko (vadim-tk) wrote :

it seems it can damage recovery area, as I am getting crash now even after clean start / shutdown

Vadim Tkachenko (vadim-tk) wrote :

another stack
#0 0x0000003d26e30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003d26e30265 in raise () from /lib64/libc.so.6
#1 0x0000003d26e31d10 in abort () from /lib64/libc.so.6
#2 0x0000000000731187 in row_mysql_handle_errors (new_err=0x5a75f1e8, trx=0x2acb501cab98, thr=0x2acb680fc348, savept=0x0)
    at row/row0mysql.c:582
#3 0x000000000073d473 in row_search_for_mysql (buf=0x2acb680f1430 "\020\024\017h\313*", mode=2, prebuilt=0x2acb680f1db8,
    match_mode=1, direction=0) at row/row0sel.c:4589
#4 0x00000000006e5175 in ha_innobase::index_read (this=0x2acb680f1010, buf=0x2acb680f1430 "\020\024\017h\313*",
    key_ptr=<value optimized out>, key_len=<value optimized out>, find_flag=HA_READ_KEY_EXACT) at handler/ha_innodb.cc:5803
#5 0x0000000000656ee9 in handler::index_read_idx_map (this=0x2acb680f1010, buf=0x2acb680f1430 "\020\024\017h\313*",
    index=<value optimized out>, key=0x2acab000a520 "6\003", keypart_map=1, find_flag=HA_READ_KEY_EXACT) at handler.cc:4428
#6 0x00000000005e2897 in join_read_const (tab=0x2acb680f8128) at sql_select.cc:11909
#7 0x00000000005eb9dd in join_read_const_table (tab=0x2acb680f8128, pos=0x2acab00087d0) at sql_select.cc:11812
#8 0x00000000005ec3f6 in make_join_statistics (join=0x2acab0008710, tables_arg=0x2acb506f7af8, conds=0x2acab0008540,
    keyuse_array=0x2acab0009cd0) at sql_select.cc:2933
#9 0x00000000005ede07 in JOIN::optimize (this=0x2acab0008710) at sql_select.cc:1035
#10 0x00000000005f7101 in mysql_select (thd=0x2acaae7a2030, rref_pointer_array=0x2acb506f6c98, tables=0x2acb506f7af8,
    wild_num=0, fields=..., conds=0x2acab0008540, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0,
    select_options=2416724480, result=0x2acb506f9c80, unit=0x2acb506f66a0, select_lex=0x2acb506f6ac8) at sql_select.cc:2518
#11 0x00000000005f7b55 in handle_select (thd=0x2acaae7a2030, lex=0x2acb506f6600, result=0x2acb506f9c80,
    setup_tables_done_option=0) at sql_select.cc:280
#12 0x00000000005864b1 in execute_sqlcom_select (thd=0x2acaae7a2030, all_tables=0x2acb506f7af8) at sql_parse.cc:5181
#13 0x0000000000589721 in mysql_execute_command (thd=0x2acaae7a2030) at sql_parse.cc:2362
#14 0x0000000000603520 in Prepared_statement::execute (this=0x2acb506ebff0, expanded_query=<value optimized out>,
    open_cursor=false) at sql_prepare.cc:3879
#15 0x00000000006056f4 in Prepared_statement::execute_loop (this=0x2acb506ebff0, expanded_query=0x5a761180, open_cursor=false,
    packet=<value optimized out>, packet_end=<value optimized out>) at sql_prepare.cc:3554
#16 0x0000000000605a09 in mysqld_stmt_execute (thd=0x2acaae7a2030, packet_arg=0x2acb504ec021 "\001", packet_length=29)
    at sql_prepare.cc:2601
#17 0x000000000058fc61 in dispatch_command (command=COM_STMT_EXECUTE, thd=0x2acaae7a2030, packet=0x2acb504ec021 "\001",
    packet_length=6) at sql_parse.cc:1231
#18 0x0000000000590e30 in do_command (thd=0x2acaae7a2030) at sql_parse.cc:901
#19 0x0000000000582d17 in handle_one_connection (arg=0x2acaae7a2030) at sql_connect.cc:1738
#20 0x0000003d2760673d in start_thread () from /lib64/libpthread.so.0
#21 0x0000003d26ed3d1d in clone () from /lib64/libc.so.6

Vadim Tkachenko (vadim-tk) wrote :

It crashes for me with 100% if I kill mysqld and start after that.

I think we need to research more to understand what's wrong.

Changed in percona-server:
status: Fix Committed → Confirmed

I found it may not be related to buffer_pool_shm.
buffer_pool_shm is just a trigger to elicit the problem.

Always the recovery process seems to have problem.
Using "MALLOC_CHECK_" environment says error always at the environment.

I continue to analyze.

It seems specific problem only at the server.
And it seems to be caused by multi sessions' load.
(single thread load seems to safe)

I currently suspect
"Hardware bug" of the server
which like "breakable cache coherency" and cannot keep atomic operations.

The work area of malloc() seems to be broken by the multi-threads' load.

In the end,
Percona-Server and XtraDB must be innocent.

BTW,
if the server has the hardware bug about cache coherency between cores/HT,
The hangup or irregular concurrent access about mutex/lock should happen.

How about the other hangup problems of the server?

Changed in percona-server:
status: Confirmed → Invalid

I couldn't reach to prove the hardware bug of the server.
So, it may be problem of malloc() at the server.
(e.g. not thread safe)

But I don't understand why.

It might be problem of MALLOC_CHECK_

Changed in percona-server:
status: Invalid → Incomplete

I cannot reproduce now, with correct operation.
(And I executed "yum update glibc" also before the retesting)

The malloc() may return NULL pointer, when the free memory is not enough.
If you still meet the crash, I suspect you have not deleted the shm with another key.

The another type of crash may still exist.
I will fix it soon. It should be the last fix for this bug.

Changed in percona-server:
status: Incomplete → Confirmed
Changed in percona-server:
status: Confirmed → Fix Committed
Changed in percona-server:
milestone: none → 5.1.50-12.1
Changed in percona-server:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers