Percona Server with XtraDB

Valgrind: free in buf_page_get_gen (Invalid read in buf_flush_batch / buf_flush_list) + free in buf_page_get_gen (Invalid read in buf_flush_page_and_try_neighbors)

Reported by Roel Van de Paar on 2012-12-05
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Server
Status tracked in 5.6
5.1
High
Laurynas Biveinis
5.5
High
Laurynas Biveinis
5.6
High
Laurynas Biveinis

Bug Description

121125 7:05:01 [Note] /data/ssd/qa/Percona-Server-5.5.28-rel29.1-351.Linux.x86_64/bin/mysqld: ready for connections.
Version: '5.5.28-rel29.1-log' socket: '/data/ssd/qa/94/current1_1/tmp/master.sock' port: 13000 Percona Server with XtraDB (GPL), Release rel29.1, Revision 351
==11657== Thread 16:
==11657== Invalid read of size 8
==11657== at 0x87E0C6: buf_flush_batch (buf0flu.c:1770)
==11657== by 0x87E3FF: buf_flush_list (buf0flu.c:2047)
==11657== by 0x823CC1: srv_master_thread (srv0srv.c:3692)
==11657== by 0x38E260677C: start_thread (pthread_create.c:301)
==11657== by 0x38E16D3C1C: clone (in /lib64/libc-2.5.so)
==11657== Address 0x152e01a8 is 56 bytes inside a block of size 152 free'd
==11657== at 0x4A072BA: free (vg_replace_malloc.c:446)
==11657== by 0x87446A: buf_page_get_gen (buf0buf.ic:886)
==11657== by 0x85ABD1: btr_free_externally_stored_field (btr0cur.c:4981)
==11657== by 0x863723: btr_cur_pessimistic_delete (btr0cur.c:5078)
==11657== by 0x926276: row_undo_ins (row0uins.c:117)
==11657== by 0x91EC13: row_undo_step (row0undo.c:320)
==11657== by 0x913049: que_run_threads (que0que.c:1257)
==11657== by 0x839CB8: trx_general_rollback_for_mysql (trx0roll.c:97)
==11657== by 0x8077F0: row_mysql_handle_errors (row0mysql.c:593)
==11657== by 0x80892B: row_insert_for_mysql (row0mysql.c:1264)
==11657== by 0x7EFA89: ha_innobase::write_row(unsigned char*) (ha_innodb.cc:6006)
==11657== by 0x69B7E1: handler::ha_write_row(unsigned char*) (handler.cc:5359)
==11657== by 0x5673D0: write_record(THD*, TABLE*, st_copy_info*) (sql_insert.cc:1788)
==11657== by 0x56E463: mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool) (sql_insert.cc:956)
==11657== by 0x57EE0B: mysql_execute_command(THD*) (sql_parse.cc:3065)
==11657== by 0x583112: mysql_parse(THD*, char*, unsigned int, Parser_state*) (sql_parse.cc:5811)
==11657==
==11657== Invalid read of size 1
==11657== at 0x87D8FB: buf_flush_page_and_try_neighbors (buf0buf.ic:77)
==11657== by 0x87E008: buf_flush_batch (buf0flu.c:1738)
==11657== by 0x87E3FF: buf_flush_list (buf0flu.c:2047)
==11657== by 0x823CC1: srv_master_thread (srv0srv.c:3692)
==11657== by 0x38E260677C: start_thread (pthread_create.c:301)
==11657== by 0x38E16D3C1C: clone (in /lib64/libc-2.5.so)
==11657== Address 0x152e017b is 11 bytes inside a block of size 152 free'd
==11657== at 0x4A072BA: free (vg_replace_malloc.c:446)
==11657== by 0x87446A: buf_page_get_gen (buf0buf.ic:886)
==11657== by 0x85ABD1: btr_free_externally_stored_field (btr0cur.c:4981)
==11657== by 0x863723: btr_cur_pessimistic_delete (btr0cur.c:5078)
==11657== by 0x926276: row_undo_ins (row0uins.c:117)
==11657== by 0x91EC13: row_undo_step (row0undo.c:320)
==11657== by 0x913049: que_run_threads (que0que.c:1257)
==11657== by 0x839CB8: trx_general_rollback_for_mysql (trx0roll.c:97)
==11657== by 0x8077F0: row_mysql_handle_errors (row0mysql.c:593)
==11657== by 0x80892B: row_insert_for_mysql (row0mysql.c:1264)
==11657== by 0x7EFA89: ha_innobase::write_row(unsigned char*) (ha_innodb.cc:6006)
==11657== by 0x69B7E1: handler::ha_write_row(unsigned char*) (handler.cc:5359)
==11657== by 0x5673D0: write_record(THD*, TABLE*, st_copy_info*) (sql_insert.cc:1788)
==11657== by 0x56E463: mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool) (sql_insert.cc:956)
==11657== by 0x57EE0B: mysql_execute_command(THD*) (sql_parse.cc:3065)
==11657== by 0x583112: mysql_parse(THD*, char*, unsigned int, Parser_state*) (sql_parse.cc:5811)
==11657==
==11657== Invalid read of size 1
[...]
[........ numerous occurrences of Invalid read of size 1 or 4 or 8 in buf_flush_page_and_try_neighbors as above ........]
[...]
==11657==
==11657== HEAP SUMMARY:
==11657== in use at exit: 1,560 bytes in 5 blocks
==11657== total heap usage: 3,162,238 allocs, 3,162,233 frees, 32,397,098,285 bytes allocated
==11657==
==11657== LEAK SUMMARY:
==11657== definitely lost: 0 bytes in 0 blocks
==11657== indirectly lost: 0 bytes in 0 blocks
==11657== possibly lost: 0 bytes in 0 blocks
==11657== still reachable: 0 bytes in 0 blocks
==11657== suppressed: 1,560 bytes in 5 blocks
==11657==
==11657== For counts of detected and suppressed errors, rerun with: -v
==11657== ERROR SUMMARY: 12 errors from 12 contexts (suppressed: 4 from 4)

Roel Van de Paar (roel11) wrote :

This bug is a split from bug 1083058

Roel Van de Paar (roel11) wrote :
Download full text (3.9 KiB)

In Debug + Valgrind, the sever asserts with:

130101 11:04:32 InnoDB: Assertion failure in thread 395134720 in file buf0flu.c line 482
InnoDB: Failing assertion: bpage->in_flush_list

Just after giving errors like these:

==19340== Invalid read of size 8
==19340== at 0x8E9F6B: buf_flush_page_and_try_neighbors (buf0flu.c:480)
==19340== by 0x8EC87E: buf_flush_batch (buf0flu.c:1740)
==19340== by 0x8ED41D: buf_flush_list (buf0flu.c:2047)
==19340== by 0x8734D6: srv_master_thread (srv0srv.c:3512)
==19340== by 0x3DA1807D13: start_thread (in /usr/lib64/libpthread-2.15.so)
==19340== by 0x3DA14F168C: clone (in /usr/lib64/libc-2.15.so)
==19340== Address 0x4c5da00 is 128 bytes inside a block of size 192 free'd
==19340== at 0x4A06666: free (vg_replace_malloc.c:446)
==19340== by 0x8A9543: ut_free (ut0mem.c:221)
==19340== by 0x8E1E78: buf_page_get_gen (buf0buf.ic:886)
==19340== by 0x8E386B: buf_page_create (buf0buf.c:3760)
==19340== by 0x940759: fsp_page_create (fsp0fsp.c:1546)
==19340== by 0x941899: fseg_alloc_free_page_low (fsp0fsp.c:2833)
==19340== by 0x9426E2: fseg_alloc_free_page_general (fsp0fsp.c:2907)
==19340== by 0x8B05FF: btr_page_alloc (btr0btr.c:993)
==19340== by 0x8C5918: btr_store_big_rec_extern_fields (btr0cur.c:4509)
==19340== by 0x9CA4B3: row_ins_index_entry_low (row0ins.c:2237)
==19340== by 0x9CDE87: row_ins_index_entry (row0ins.c:2297)
==19340== by 0x9CEBCD: row_ins_step (row0ins.c:2373)
==19340== by 0x84014F: row_insert_for_mysql (row0mysql.c:1255)
==19340== by 0x81CDDB: ha_innobase::write_row(unsigned char*) (ha_innodb.cc:6061)
==19340== by 0x6BA555: handler::ha_write_row(unsigned char*) (handler.cc:5395)
==19340== by 0x585B2E: write_record(THD*, TABLE*, st_copy_info*) (sql_insert.cc:1788)

Backtrace:

Thread 5 (LWP 19421):
+bt
#0 0x0000003da180c60c in pthread_kill () from /lib64/libpthread.so.0
#1 0x00000000007dbe18 in my_write_core (sig=6) at /ssd/BT-16724-xtradb-bmp-requests-5.5-BUILD-TREE/Percona-Server-5.5.28-rel29.1/mysys/stacktrace.c:433
#2 0x00000000006b1a68 in handle_fatal_signal (sig=6) at /ssd/BT-16724-xtradb-bmp-requests-5.5-BUILD-TREE/Percona-Server-5.5.28-rel29.1/sql/signal_handler.cc:249
#3 <signal handler called>
#4 0x0000003da1435935 in raise () from /lib64/libc.so.6
#5 0x0000003da14370e8 in abort () from /lib64/libc.so.6
#6 0x00000000008e9fab in buf_flush_ready_for_flush (flush_type=BUF_FLUSH_LIST, bpage=0x4c5d980) at /ssd/BT-16724-xtradb-bmp-requests-5.5-BUILD-TREE/Percona-Server-5.5.28-rel29.1/storage/innobase/buf/buf0flu.c:482
#7 buf_flush_page_and_try_neighbors (bpage=bpage@entry=0x4c5d980, flush_type=flush_type@entry=BUF_FLUSH_LIST, n_to_flush=n_to_flush@entry=22, count=count@entry=0x178d3668) at /ssd/BT-16724-xtradb-bmp-requests-5.5-BUILD-TREE/Percona-Server-5.5.28-rel29.1/storage/innobase/buf/buf0flu.c:1587
#8 0x00000000008ec87f in buf_flush_flush_list_batch (lsn_limit=277030465, min_n=22, buf_pool=0x5dc0c28) at /ssd/BT-16724-xtradb-bmp-requests-5.5-BUILD-TREE/Percona-Server-5.5.28-rel29.1/storage/innobase/buf/buf0flu.c:1740
#9 buf_flush_batch (buf_pool=buf_pool@entry=0x5dc0c28, flush_type=flush_type@entry=BUF_FLUSH_LIST, min_n=min_n@entry=22, l...

Read more...

Roel Van de Paar (roel11) wrote :

Server still asserts w/ --innodb_adaptive_flushing_method=native

Roel Van de Paar (roel11) wrote :

Server *does not* assert w/ --innodb_adaptive_flushing_method=0

Roel Van de Paar (roel11) wrote :

Was able to assert server w/ --innodb_adaptive_flushing_method=0 also now. This option may or may not affect reproducibility level.

Roel Van de Paar (roel11) wrote :

Discussed w/ Laurynas. This bug looks now unrelated to bug 1083058. Laurynas doing code analysis. Roel has testcase which semi-reliably reproduces the sporadic issue. Issue is single thread reproducible, but executing same sql in many threads helps w/ reproducibility.

Ovais Tariq (ovais-tariq) wrote :

Roel,

I see that you only tried different flushing methods by changing the parameter --innodb_adaptive_flushing_method to different values. It would be worthwhile to test with the adaptive_flushing itself disabled, that might mean a different code path all together. I would also note that the customer who was hitting the same bug, has not have the bug seen after disabling adaptive_flushing

Roel Van de Paar (roel11) wrote :

Correction on #6:
--innodb_adaptive_flushing_method=native is the same as ...-method=0 (And server asserts on both)
When testing with: --innodb_adaptive_flushing=0, the server no longer asserts.
So, bug 1083058 may still be related to this bug.

Roel -

Is 5.1 affected?

I have a theory based on instrumented runs.

We need two threads, one at buf_page_get_gen and another at
buf_flush_flush_list_batch, trying to work on the same dirty
compressed (i.e. state = BUF_BLOCK_ZIP_DIRTY) page.

1) buf_flush_flush_list_batch acquires flush_list_mutex, gets a
   page from the flush list, releases flush_list_mutex.
2) buf_page_get_gen attempts to read the same page, acquires
   zip_mutex, gets the dirty compressed page descriptor, allocates
   an uncompressed page, and replaces the descriptor with the page in
   the flush_list by a buf_flush_relocate_on_flush_list() call. That
   call acquires flush_list_mutex, sets in_flush_list = FALSE; but
   leaves oldest_modification alone, releases the flush_list_mutex.
   Caller then invalidates the descriptor memory, releases zip_mutex,
   frees its memory.
3) buf_flush_flush_list_batch attempts to read from the deallocated
   compressed page descriptor. Assuming that the memory was not
   overwritten since deallocation, the code reads oldest_modification
   != 0 and then trips on !in_flush_list.

Roel Van de Paar (roel11) wrote :

Laurynas, is this related to Bug 934377 - InnoDB: Failing assertion: (buf_pool->flush_list).count > 0 ?

It could be. The backtrace of that bug seems to be a bit removed from the underlying cause, so hard to say ATM.

Ovais Tariq (ovais-tariq) wrote :

Laurynas,

What about the case when compressed tablespace is not being used. I suppose dirty read from the flush_list is still done nevertheless, is that correct?

Roel Van de Paar (roel11) wrote :

5.1 Testing has revealed no assert, nor any related Valgrind issues.
Several rounds of patch>testing have resulted in a prototype patch being available which no longer asserts, no Valgrind's'

Ovais -

Yes, that's bug 1083058. For non-compressed tables the current bug does not trigger.

Roel Van de Paar (roel11) wrote :

Assert in buf_page_get_gen in bug 1100178

Roel Van de Paar (roel11) wrote :
Download full text (8.9 KiB)

Assert now also seen in 5.1

130117 9:24:09 InnoDB: Assertion failure in thread 386553600 in file buf/buf0flu.c line 407
InnoDB: Failing assertion: bpage->in_flush_list

Directly after a Valgrind warning

==7461== Thread 16:
==7461== Invalid read of size 1
==7461== at 0x97724D: buf_page_get_state (buf0buf.ic:176)
==7461== by 0x9775C8: buf_page_get_mutex (buf0buf.ic:319)
==7461== by 0x977638: buf_page_get_mutex_enter (buf0buf.ic:347)
==7461== by 0x984865: buf_flush_batch (buf0flu.c:1470)
==7461== by 0x9354CB: srv_master_thread (srv0srv.c:3108)
==7461== by 0x3DA1807D13: start_thread (in /usr/lib64/libpthread-2.15.so)
==7461== by 0x3DA14F168C: clone (in /usr/lib64/libc-2.15.so)
==7461== Address 0x19ff03f8 is 8 bytes inside a block of size 192 free'd
==7461== at 0x4A06666: free (vg_replace_malloc.c:446)
==7461== by 0x9577E8: ut_free (ut0mem.c:227)
==7461== by 0x9780AE: buf_page_free_descriptor (buf0buf.ic:798)
==7461== by 0x97C87F: buf_page_get_gen (buf0buf.c:2079)
==7461== by 0x97EA44: buf_page_create (buf0buf.c:2972)
==7461== by 0x9B898E: fsp_page_create (fsp0fsp.c:1608)
==7461== by 0x9BBD9E: fseg_alloc_free_page_low (fsp0fsp.c:2899)
==7461== by 0x9BBF18: fseg_alloc_free_page_general (fsp0fsp.c:2973)
==7461== by 0x95BF31: btr_page_alloc_low (btr0btr.c:970)
==7461== by 0x95BFA0: btr_page_alloc (btr0btr.c:1005)
==7461== by 0x96C505: btr_store_big_rec_extern_fields (btr0cur.c:4382)
==7461== by 0x911BA7: row_ins_index_entry_low (row0ins.c:2246)
==7461== by 0x911CCA: row_ins_index_entry (row0ins.c:2297)
==7461== by 0x911FF9: row_ins_index_entry_step (row0ins.c:2382)
==7461== by 0x9122C3: row_ins (row0ins.c:2514)
==7461== by 0x912594: row_ins_step (row0ins.c:2624)
==7461==
==7461== Invalid read of size 1
==7461== at 0x97724D: buf_page_get_state (buf0buf.ic:176)
==7461== by 0x9775C8: buf_page_get_mutex (buf0buf.ic:319)
==7461== by 0x981F8D: buf_flush_ready_for_flush (buf0flu.c:402)
==7461== by 0x984881: buf_flush_batch (buf0flu.c:1476)
==7461== by 0x9354CB: srv_master_thread (srv0srv.c:3108)
==7461== by 0x3DA1807D13: start_thread (in /usr/lib64/libpthread-2.15.so)
==7461== by 0x3DA14F168C: clone (in /usr/lib64/libc-2.15.so)
==7461== Address 0x19ff03f8 is 8 bytes inside a block of size 192 free'd
==7461== at 0x4A06666: free (vg_replace_malloc.c:446)
==7461== by 0x9577E8: ut_free (ut0mem.c:227)
==7461== by 0x9780AE: buf_page_free_descriptor (buf0buf.ic:798)
==7461== by 0x97C87F: buf_page_get_gen (buf0buf.c:2079)
==7461== by 0x97EA44: buf_page_create (buf0buf.c:2972)
==7461== by 0x9B898E: fsp_page_create (fsp0fsp.c:1608)
==7461== by 0x9BBD9E: fseg_alloc_free_page_low (fsp0fsp.c:2899)
==7461== by 0x9BBF18: fseg_alloc_free_page_general (fsp0fsp.c:2973)
==7461== by 0x95BF31: btr_page_alloc_low (btr0btr.c:970)
==7461== by 0x95BFA0: btr_page_alloc (btr0btr.c:1005)
==7461== by 0x96C505: btr_store_big_rec_extern_fields (btr0cur.c:4382)
==7461== by 0x911BA7: row_ins_index_entry_low (row0ins.c:2246)
==7461== by 0x911CCA: row_ins_index_entry (row0ins.c:2297)
==7461== by 0x911FF9: row_ins_index_entry_step (row0ins.c:2382)
==7461== ...

Read more...

Roel Van de Paar (roel11) wrote :
Roel Van de Paar (roel11) wrote :

Assert seen again in QA-16274-5.5 tree at revid Percona-Server-5.5.28-rel29.3-416-debug.Linux.x86_64, this time aborting in buf_flush_try_yield. Attaching full bt set + error log.

Roel Van de Paar (roel11) wrote :
Roel Van de Paar (roel11) wrote :

If we should split this issue into a new bug for clarity etc, please let me know and I'll be happy to arrange.

Roel -

The last error is indeed unrelated and should be split to a new bug.

Roel Van de Paar (roel11) wrote :

Split to bug 1110102 from comment 20 up to 24.

tags: added: xtradb
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers