Second valgrind warning /crash in hp_process_field_data_to_chunkset with an out-of-memory situation

Bug #790828 reported by Philip Stoev on 2011-05-31
This bug affects 1 person
Affects: percona-projects-qa | Importance: Low | Assigned to: Alexey Kopytov

Bug Description

When executing an RQG stress test under valgrind, memory consumption grew suddenly (most likely due to trying to insert too many 2MB blobs in a table) and the following was produced in the server error log file:

110531 17:12:08 [ERROR] /home/philips/bzr/mysql-55-eb/sql/mysqld: Out of memory (Needed 129872 bytes)
==16380== Thread 19:
==16380== Invalid write of size 1
==16380== at 0x4007634: memcpy (mc_replace_strmem.c:497)
==16380== by 0x8617123: hp_process_field_data_to_chunkset (hp_record.c:173)
==16380== by 0x861733D: hp_process_record_data_to_chunkset (hp_record.c:276)
==16380== by 0x86173C4: hp_copy_record_data_to_chunkset (hp_record.c:306)
==16380== by 0x8618172: heap_update (hp_update.c:66)
==16380== by 0x860FEB8: ha_heap::update_row(unsigned char const*, unsigned char*) (ha_heap.cc:265)
==16380== by 0x835A24A: handler::ha_update_row(unsigned char const*, unsigned char*) (handler.cc:4806)
==16380== by 0x8293F8F: mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool, unsigned long long*, unsigned long long*) (sql_update.cc:713)
==16380== by 0x8204368: mysql_execute_command(THD*) (sql_parse.cc:2662)
==16380== by 0x820C025: mysql_parse(THD*, char*, unsigned int, Parser_state*) (sql_parse.cc:5503)
==16380== by 0x82006ED: dispatch_command(enum_server_command, THD*, char*, unsigned int) (sql_parse.cc:1034)
==16380== by 0x81FFBDB: do_command(THD*) (sql_parse.cc:771)
==16380== by 0x82D03B8: do_handle_one_connection(THD*) (sql_connect.cc:776)
==16380== by 0x82D007B: handle_one_connection (sql_connect.cc:724)
==16380== by 0x821918: start_thread (in /lib/libpthread-2.12.1.so)
==16380== by 0x76ACCD: clone (in /lib/libc-2.12.1.so)
==16380== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==16380==

I interpret this to mean that a certain memory operation could not be completed, returned 0, and this 0 was subsequently used by the heap storage engine. A cursory code inspection showed that the return values of most memory management calls are checked, but not all of them.

I can provide a test case for this bug, however a code inspection may be the best way to fix this situation.

The core and the binary are available if needed, both locally and remotely; the compressed size is 2 GB.

Philip Stoev (pstoev-askmonty) wrote :

mysql bzr version-info
revision-id: <email address hidden>
date: 2011-05-31 11:33:25 +0300
build-date: 2011-05-31 21:44:55 +0300
revno: 3483
branch-nick: mysql-55-eb

RQG bzr version-info
revision-id: <email address hidden>
date: 2011-05-31 14:18:45 +0200
build-date: 2011-05-31 21:45:08 +0300
revno: 809
branch-nick: randgen-heap

RQG command line:

perl runall.pl --queries=100000000 --validator=None --queries=100M --mysqld=--log-output=file --seed=time --mysqld=--max_heap_table_size=3Gb --threads=2 --grammar=conf/engines/heap/heap_ddl_multi.yy --basedir1=/home/philips/bzr/mysql-55-eb --valgrind --duration=21600

description: updated
Changed in percona-projects-qa:
milestone: none → 5.5.13-eb
Alexey Kopytov (akopytov) wrote :

Code inspection has not revealed any code paths that might lead to a NULL pointer dereference. Manual tests of inserting BLOBs while emulating OOM in a debugger show correct behavior: the "out of memory" error is returned to the client.

I'm now running the randgen test with the reported command line. The test has been running for ~2 hours so far with no errors.

Alexey Kopytov (akopytov) wrote :

Setting to low importance, since it doesn't look like a showstopper to me.

The bug seems to be valid, even though it's not yet clear what exactly leads to that state. There are lots of OOM bugs in the server, so a loaded server will likely fail under an OOM condition anyway, if not in HEAP then elsewhere.

The workaround is to set max_heap_table_size appropriately.

Changed in percona-projects-qa:
assignee: nobody → Alexey Kopytov (akopytov)
importance: Undecided → Low
Philip Stoev (pstoev-askmonty) wrote :

I do agree that there are other places in the server where OOM is not handled properly.

This particular test shows a failure when an UPDATE statement tries to update all records in a table to the largest blob from randgen's data directory. So if there are any unguarded paths, they should be in the UPDATE path rather than the INSERT path.

Alexey Kopytov (akopytov) wrote :

The randgen test has completed in 7 hours with no errors.

Philip Stoev (pstoev-askmonty) wrote :

Ok, let me know if you ever want to work on this bug and I will provide a resource-constrained VM where the problem is reproducible in 2 hours or so. I will not be doing any further testing that causes OOM.
