percona-projects-qa

Valgrind warning/crash in heap_scan in mysql-55-eb-blobs

Bug #789244 reported by Philip Stoev on 2011-05-27

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	percona-projects-qa	Fix Released	Undecided	Alexey Kopytov	percona-projects-qa 5.5.13-eb "eb"

Bug Description

The attached test case produces (among others), the following 2 warnings:

==15968== Invalid read of size 1
==15968== at 0x8602215: heap_scan (hp_scan.c:78)
==15968== by 0x85FB587: ha_heap::rnd_next(unsigned char*) (ha_heap.cc:387)
==15968== by 0x8437799: rr_sequential(READ_RECORD*) (records.cc:455)
==15968== by 0x8291C04: mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*) (sql_update.cc:644)
==15968== by 0x8202CC4: mysql_execute_command(THD*) (sql_parse.cc:2662)
==15968== by 0x820A979: mysql_parse(THD*, char*, unsigned int, Parser_state*) (sql_parse.cc:5503)
==15968== by 0x81FF049: dispatch_command(enum_server_command, THD*, char*, unsigned int) (sql_parse.cc:1034)
==15968== by 0x81FE537: do_command(THD*) (sql_parse.cc:771)
==15968== by 0x82CCF38: do_handle_one_connection(THD*) (sql_connect.cc:776)
==15968== by 0x82CCBFB: handle_one_connection (sql_connect.cc:724)
==15968== by 0x84D9D33: pfs_spawn_thread (pfs.cc:1015)
==15968== by 0x821918: start_thread (in /lib/libpthread-2.12.1.so)
==15968== by 0x76ACCD: clone (in /lib/libc-2.12.1.so)
==15968== Address 0xeb9760c is 4 bytes after a block of size 129,992 alloc'd
==15968== at 0x4005BDC: malloc (vg_replace_malloc.c:195)
==15968== by 0x84A52F8: my_malloc (my_malloc.c:38)
==15968== by 0x86033F3: hp_get_new_block (hp_block.c:79)
==15968== by 0x8601475: hp_allocate_one_chunk (hp_dspace.c:388)
==15968== by 0x8601197: hp_allocate_variable_chunkset (hp_dspace.c:240)
==15968== by 0x86012CC: hp_allocate_chunkset (hp_dspace.c:310)
==15968== by 0x8602985: heap_write (hp_write.c:56)
==15968== by 0x85FAE3B: ha_heap::write_row(unsigned char*) (ha_heap.cc:246)
==15968== by 0x8355ACC: handler::ha_write_row(unsigned char*) (handler.cc:4782)
==15968== by 0x81ED5FB: write_record(THD*, TABLE*, st_copy_info*) (sql_insert.cc:1734)
==15968== by 0x81EB83D: mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool) (sql_insert.cc:928)
==15968== by 0x82031C7: mysql_execute_command(THD*) (sql_parse.cc:2787)
==15968== by 0x820A979: mysql_parse(THD*, char*, unsigned int, Parser_state*) (sql_parse.cc:5503)
==15968== by 0x81FF049: dispatch_command(enum_server_command, THD*, char*, unsigned int) (sql_parse.cc:1034)
==15968== by 0x81FE537: do_command(THD*) (sql_parse.cc:771)
==15968== by 0x82CCF38: do_handle_one_connection(THD*) (sql_connect.cc:776)

==15968== Conditional jump or move depends on uninitialised value(s)
==15968== at 0x860221A: heap_scan (hp_scan.c:78)
==15968== by 0x85FB587: ha_heap::rnd_next(unsigned char*) (ha_heap.cc:387)
==15968== by 0x8437799: rr_sequential(READ_RECORD*) (records.cc:455)
==15968== by 0x8291C04: mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*) (sql_update.cc:644)
==15968== by 0x8202CC4: mysql_execute_command(THD*) (sql_parse.cc:2662)
==15968== by 0x820A979: mysql_parse(THD*, char*, unsigned int, Parser_state*) (sql_parse.cc:5503)
==15968== by 0x81FF049: dispatch_command(enum_server_command, THD*, char*, unsigned int) (sql_parse.cc:1034)
==15968== by 0x81FE537: do_command(THD*) (sql_parse.cc:771)
==15968== by 0x82CCF38: do_handle_one_connection(THD*) (sql_connect.cc:776)
==15968== by 0x82CCBFB: handle_one_connection (sql_connect.cc:724)
==15968== by 0x84D9D33: pfs_spawn_thread (pfs.cc:1015)
==15968== by 0x821918: start_thread (in /lib/libpthread-2.12.1.so)
==15968== by 0x76ACCD: clone (in /lib/libc-2.12.1.so)

Unfortunately a 6-hour effort to simplify only reduced the test case to 675 lines (out if initial 30K).

Revision history for this message

Philip Stoev (pstoev-askmonty) wrote on 2011-05-27:

bug789244.test Edit (218.1 KiB, text/plain)

Partially simplifed test case. Needs to be run with --max_heap_table_size=1G , --valgrind and --secure-file-priv set to a directory containing the lp:randgen repository.

Revision history for this message

Philip Stoev (pstoev-askmonty) wrote on 2011-05-27:

bzr version-info
revision-id: <email address hidden>
date: 2011-05-27 13:09:02 +0400
build-date: 2011-05-27 21:13:59 +0300
revno: 3475
branch-nick: mysql-55-eb-blobs

Philip Stoev (pstoev-askmonty) on 2011-05-27

summary:

- Valgrind warning in heap_scan in mysql-55-eb-blobs
+ Valgrind warning/crash in heap_scan in mysql-55-eb-blobs

Alexey Kopytov (akopytov) on 2011-05-28

Changed in percona-projects-qa:
milestone:	none → 5.5.13-eb

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2011-05-29:

Could not repeat it using the attached test case on CentOS 5.6 with valgrind 3.5.0. Here's what I did:

- downloaded the attached bug789244.test to mysql-test/t.

- fixed the absolute path names to my randgen branch:

perl -pi -e 's|/home/philips/bzr/randgen-heap|/home/alexey.kopytov/src/launchpad/randgen|g' t/bug789244.test

- started the test like this:

./mtr --valgrind --mysqld=--secure-file-priv=/home/alexey.kopytov/src/launchpad/randgen --mysqld=--max_heap_table_size=1G bug789244

The output:

main.bug789244 [ fail ]
Test ended at 2011-05-28 22:23:16

CURRENT_TEST: main.bug789244
==21464== Memcheck, a memory error detector
==21464== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==21464== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==21464== Command: /mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/client//mysqltest --defaults-file=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/var/my.cnf --silent --tmpdir=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/var/tmp --character-sets-dir=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/sql/share/charsets --logdir=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/var/log --plugin_dir=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/plugin/auth --database=test --timer-file=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/var/log/timer --test-file=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/t/bug789244.test --tail-lines=20
==21464==
mysqltest: The test didn't produce any output
==21464==
==21464== HEAP SUMMARY:
==21464== in use at exit: 1,560 bytes in 5 blocks
==21464== total heap usage: 2,204 allocs, 2,199 frees, 1,247,814 bytes allocated
==21464==
==21464== LEAK SUMMARY:
==21464== definitely lost: 0 bytes in 0 blocks
==21464== indirectly lost: 0 bytes in 0 blocks
==21464== possibly lost: 0 bytes in 0 blocks
==21464== still reachable: 0 bytes in 0 blocks
==21464== suppressed: 1,560 bytes in 5 blocks
==21464==
==21464== For counts of detected and suppressed errors, rerun with: -v
==21464== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)

I tried both revision 3475 and the top-most one.

Going to ask someone to try repeating it on another machine on Monday.

Could not repeat it using the attached test case on CentOS 5.6 with valgrind 3.5.0. Here's what I did:

- downloaded the attached bug789244.test to mysql-test/t.

- fixed the absolute path names to my randgen branch:

perl -pi -e 's|/home/philips/bzr/randgen-heap|/home/alexey.kopytov/src/launchpad/randgen|g' t/bug789244.test

- started the test like this:

./mtr --valgrind --mysqld=--secure-file-priv=/home/alexey.kopytov/src/launchpad/randgen --mysqld=--max_heap_table_size=1G  bug789244

The output:

main.bug789244                           [ fail ]
        Test ended at 2011-05-28 22:23:16

CURRENT_TEST: main.bug789244
==21464== Memcheck, a memory error detector
==21464== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==21464== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==21464== Command: /mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/client//mysqltest --defaults-file=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/var/my.cnf --silent --tmpdir=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/var/tmp --character-sets-dir=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/sql/share/charsets --logdir=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/var/log --plugin_dir=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/plugin/auth --database=test --timer-file=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/var/log/timer --test-file=/mnt/data/home/alexey.kopytov/src/launchpad/mysql-55-eb-blobs/mysql-test/t/bug789244.test --tail-lines=20
==21464== 
mysqltest: The test didn't produce any output
==21464== 
==21464== HEAP SUMMARY:
==21464==     in use at exit: 1,560 bytes in 5 blocks
==21464==   total heap usage: 2,204 allocs, 2,199 frees, 1,247,814 bytes allocated
==21464== 
==21464== LEAK SUMMARY:
==21464==    definitely lost: 0 bytes in 0 blocks
==21464==    indirectly lost: 0 bytes in 0 blocks
==21464==      possibly lost: 0 bytes in 0 blocks
==21464==    still reachable: 0 bytes in 0 blocks
==21464==         suppressed: 1,560 bytes in 5 blocks
==21464== 
==21464== For counts of detected and suppressed errors, rerun with: -v
==21464== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)

I tried both revision 3475 and the top-most one.

Going to ask someone to try repeating it on another machine on Monday.

Revision history for this message

Philip Stoev (pstoev-askmonty) wrote on 2011-05-29:

Confirmed that bug is still prevent. Provided Alexey with login credentials for a machine where the bug is repeatable in at least 50% of valgrind runs.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2011-05-30:

For the record not repeatable on my local machine (Ubuntu 10.10, Valgrind 3.6.0)

Revision history for this message

Philip Stoev (pstoev-askmonty) wrote on 2011-05-30:

Repeatable for me on the following 2 machines:

fedora core 14 32 bit, gcc 4.4.5 20101112 (Red Hat 4.4.5-2), valgrind-3.5.0
,
fedora core 13 64 bit, ) 4.4.4 20100503 (Red Hat 4.4.4-2),valgrind-3.5.0

both when compiling with ./BUILD/compile-pentium-debug-no-ndb and with ./BUILD/compile-pentium-valgrind-no-ndb

the test may need to be run several times in order to display the warning and/or crash.

Alexey Kopytov (akopytov) on 2011-05-30

Changed in percona-projects-qa:
assignee:	nobody → Alexey Kopytov (akopytov)
status:	New → In Progress

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2011-05-30:

This bug is similar to #783451 and in the sense that the fix for bug #783451 was incomplete, though it could handle this bug as well. The proposed solution, however, is quite different and covers both bug #783451 and this one.

heap_scan() tries to optimize searching for the next chunk by keeping track of the current block (i.e. by storing the maximum chunk number for the current block in info->next_block) and only incrementing the current chunk pointer by the chunk size, thus relying on the assumption that all chunks within a single block are stored contiguously.

As in bug #783451, the root cause is that with heap state changing during the scan operation (which may be the case for UPDATE on a HEAP table using the dynamic row format), the invariant of info->next_block always being the maximum available chunk number within the current block does not necessarily hold.

In this particular case, when scanning the last block, info->next_block was assigned the current maximum chunk number (share->recordspace.chunk_count). The following sequence of events then led to access to uninitialized memory:

- in the middle of scanning the last block, UPDATE had to allocate a number of new chunks *and* a new block, because there were not enough space for a part of the new chunks in the last block.

- when reaching the info->next_block threshold, heap_scan() updated it with the new share->recordspace.chunk_count and called hp_find_record() to update info->current_ptr. The important thing is that current_ptr returned by hp_find_record() was still pointing into the same block, because part of the newly allocated chunks are still in the same block. Also note that at this step info->next_block becomes greater than the maximum available chunk number in the current block.

- after processing the last chunk from the current block, heap_scan() should have switched to the newly allocated block, but instead it happily continued past the current block, because the current chunk number (pos) was still less than info->next_block.

The proposed solution is to remove the dubious optimization with keeping track of the current block and just incrementing the pointer when in the same block.

Instead, we can always call hp_find_record() (which is a wrapper over hp_find_block), that always provides accurate information, no matter if the heap state has been changed since the last call or not. It is sufficiently fast to not introduce any measurable performance impact (just 3 loops iterations in the worst case with a few arithmetic operations). This will also fix bug #783451.

- in the middle of scanning the last block, UPDATE had to allocate a number of new chunks *and* a new block, because there were not enough space for a part of the new chunks in the last block.

The proposed solution is to remove the dubious optimization with keeping track of the current block and just incrementing the pointer when in the same block.

Alexey Kopytov (akopytov) on 2011-05-30