mysql process crashes after setting innodb_dict_size

Bug #758788 reported by Eugene Shelepov
This bug affects 11 people
Affects                  Status        Importance  Assigned to              Milestone
Percona Server           Invalid       Undecided   Unassigned
  (moved to https://jira.percona.com/projects/PS)
5.1                      Fix Released  High        George Ormond Lorch III
5.5                      Fix Released  High        George Ormond Lorch III
5.6                      Invalid       Undecided   Unassigned

Bug Description

Hello,

We're running Percona 5.1.54-rel12.5. Because of some memory usage issues I've tried to set innodb_dict_size limitation dynamically. In about twenty minutes after implementing it server crashed. The logs for the period contain:

Version: '5.1.54-rel12.5-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona Server (GPL), 12.5, Revision 188
110404 3:44:23 [Note] Slave SQL thread initialized, starting replication in log 'bin-log.000016' at position 26626, relay log './relay-log.000087' position: 5337
110404 3:44:23 [Note] Slave I/O thread: connected to master 'rep@1.2.3.4:3306',replication started in log 'bin-log.000016' at position 26626
110406 11:29:05 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=131072
max_used_connections=374
max_threads=10000
threads_connected=359
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4014129 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x2aafe4307550
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x4ee520f0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x87333e]
/usr/sbin/mysqld(handle_segfault+0x322)[0x5b2332]
/lib64/libpthread.so.0[0x337140eb10]
/usr/sbin/mysqld[0x779b2b]
/usr/sbin/mysqld[0x779ecb]
/usr/sbin/mysqld[0x77ff1c]
/usr/sbin/mysqld[0x73344e]
/usr/sbin/mysqld(_ZN7handler12ha_write_rowEPh+0x53)[0x695ae3]
/usr/sbin/mysqld(_Z12write_recordP3THDP8st_tableP12st_copy_info+0x5d)[0x63286d]
/usr/sbin/mysqld(_Z12mysql_insertP3THDP10TABLE_LISTR4ListI4ItemERS3_IS5_ES6_S6_15enum_duplicatesb+0xb36)[0x6367e6]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2538)[0x5c2ed8]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjPPKc+0x549)[0x5c6f89]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0xa31)[0x5c79c1]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x120)[0x5c8650]
/usr/sbin/mysqld(handle_one_connection+0x7a2)[0x5ba562]
/lib64/libpthread.so.0[0x337140673d]
/lib64/libc.so.6(clone+0x6d)[0x33708d3f6d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x20e62d90 = INSERT INTO `report_event` (`logged_at`, `event_type_id`, `object_id`, `subject_id`, `subtype`, `store_id`) VALUES ('2011-04-06 16:29:03', '1', '259', '780', '1', '1')
thd->thread_id=807150
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
110406 11:29:06 mysqld_safe Number of processes running now: 0
110406 11:29:06 mysqld_safe mysqld restarted
110406 11:29:06 [Note] Flashcache bypass: disabled
110406 11:29:06 [Note] Flashcache setup error is : ioctl failed

110406 11:29:06 [Note] Plugin 'FEDERATED' is disabled.
InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins
InnoDB: Compressed tables use zlib 1.2.3
110406 11:29:06 InnoDB: Initializing buffer pool, size = 9.8G
110406 11:29:07 InnoDB: Completed initialization of buffer pool
110406 11:29:07 InnoDB: highest supported file format is Barracuda.
InnoDB: Log scan progressed past the checkpoint lsn 40048057101
110406 11:29:07 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.

The crash was repeating again and again. After setting innodb_dict_size back to 0, the situation stabilized.
The problem is reproducible.

Looks like a bug.

---
Regards,
Eugene Shelepov

Revision history for this message
Stewart Smith (stewart) wrote :

Does look like a bug - to help us construct a test case, could you give an indication of how many tables you have (and what kind of schemas they are) and what value you set innodb_dict_size to?

visibility: public → private
visibility: private → public
Revision history for this message
Stewart Smith (stewart) wrote :

Are you able to provide us with any more information so we can reproduce this? How many tables do you have? What was innodb_dict_size set to? We'll need this information to help construct a test case.

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Eugene Shelepov (eugene-shelepov) wrote :

Hello,

We have a lot of small databases: about 7 thousand, with about 300 tables in each of them. Open tables are limited to 100,000.

When MySQL crashed, innodb_dict_size was set to 4 GB.
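
For reference, the limit and the later workaround would be applied at runtime roughly as follows (a sketch only; the variable name innodb_dict_size_limit is assumed from later comments in this report):

SET GLOBAL innodb_dict_size_limit = 4294967296;  -- cap the dictionary cache at 4 GB, as described above
SET GLOBAL innodb_dict_size_limit = 0;           -- setting it back to 0 removes the limit (the workaround that stabilized the server)

The BUFFER POOL AND MEMORY excerpt below is from SHOW ENGINE INNODB STATUS output: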

BUFFER POOL AND MEMORY
----------------------
Total memory allocated 37632000000; in additional pool allocated 0
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 1326803568 (581048824 + 745754744)
    Page hash 36316328
    Dictionary cache 23604963972 (145264208 + 23459699764)
    File system 83536 (82672 + 864)
    Lock system 90822072 (90789784 + 32288)
    Recovery system 0 (0 + 0)
    Threads 408112 (406936 + 1176)
Dictionary memory allocated 23459699764
Buffer pool size 2239999
Buffer pool size, bytes 36700143616
Free buffers 0
Database pages 2194481
Old database pages 810052
Modified db pages 866
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 20778188, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 20983317, created 3975656, written 43405370
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s
LRU len: 2194481, unzip_LRU len: 0
I/O sum[1143]:cur[0], unzip sum[0]:cur[0]

Please let me know if any extra details are needed. We can also try to reproduce the issue and collect any needed data.

---
Regards,
Eugene Shelepov

Revision history for this message
Stewart Smith (stewart) wrote :

Thanks for the further info. Have you tried upgrading to the latest Percona Server at all? I haven't looked through the fixed bugs list, but that is the version we'll attempt to reproduce the bug against.

Changed in percona-server:
status: Incomplete → New
Stewart Smith (stewart)
Changed in percona-server:
assignee: nobody → Yasufumi Kinoshita (yasufumi-kinoshita)
Revision history for this message
Yasufumi Kinoshita (yasufumi-kinoshita) wrote :

Currently, I have no clue where in the source code to look.

At a minimum, I need a symbol-resolved stack trace of the crash to continue.

Thank you.

Revision history for this message
Stewart Smith (stewart) wrote :

If you are using CentOS or RedHat with gdb 7.0.1 you may be hitting a GDB bug in resolving symbols. We recommend enabling core dumps and using GDB 7.2 (or later) to be able to get a backtrace that is usable for debugging.

Revision history for this message
Andrii Kryzhyk (andrii.kryzhyk) wrote :

Hi!
I tried to reproduce this issue on the latest version of Percona Server 5.1, and the issue is reproducible.
I set the innodb_dict_size_limit option to 128M and ran siege against the application frontend.
In a separate console I watched the 'Dictionary memory allocated' value, and once it grew to 128M the mysqld process crashed.
I've obtained a backtrace from the core file and attached it to this bug.
Please look at it and reply if you need any additional information.
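
A minimal sketch of the monitoring loop described above (the value is assumed from the description; 'Dictionary memory allocated' is the line shown in the SHOW ENGINE INNODB STATUS output quoted elsewhere in this report):

SET GLOBAL innodb_dict_size_limit = 134217728;  -- 128M, as in the test above
SHOW ENGINE INNODB STATUS\G                     -- rerun periodically under load and watch 'Dictionary memory allocated'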

Revision history for this message
Yasufumi Kinoshita (yasufumi-kinoshita) wrote :

It might be a potential bug in the InnoDB Plugin that causes a race condition.
I have reported it to bugs.mysql.com:

http://bugs.mysql.com/bug.php?id=62221

I will fix it soon.

Changed in percona-server:
status: New → Confirmed
importance: Undecided → High
Changed in percona-server:
status: Confirmed → Fix Committed
Revision history for this message
Vadim Tkachenko (vadim-tk) wrote :

Valentine, can we make a reproducible test case?

tags: added: cr i17875
Revision history for this message
Eugene Shelepov (eugene-shelepov) wrote :

Hello,

We've rebuilt 5.1.58 with the provided patch, but the problem is still reproducible.
When we set innodb_dict_size_limit to 100M, the server still crashes.

Changed in percona-server:
status: Fix Committed → In Progress
Revision history for this message
Yasufumi Kinoshita (yasufumi-kinoshita) wrote :

I have rewritten the fix to be more defensive.

lp:~percona-dev/percona-server/5.1-bug758788-trial2

It may be an overly defensive fix, but I don't have a reproduction procedure yet.
Please test the branch and let us know whether it is effective.
We will merge it after your verification.

Revision history for this message
Stewart Smith (stewart) wrote :

Using dbqp to run randgen from: lp:~stewart/dbqp/innodb_dict_size (and soon to be lp:dbqp trunk)

./dbqp.py --libeatmydata --default-server-type=mysql --basedir=/home/stewart/percona-server/5.5/Percona-Server --suite=randgen_basic innodbDictSizeLimit_test --gdb

With this we are able to reproduce, almost instantly, a crash that only happens with the innodb-dict-size option set.

Revision history for this message
Yasufumi Kinoshita (yasufumi-kinoshita) wrote :

Stewart,

You seem to be testing 5.5.

The fix candidate is lp:~percona-dev/percona-server/5.1-bug758788-trial2,
which is based on 5.1.

Do you mean you can also reproduce the crash using that branch?

Revision history for this message
Stewart Smith (stewart) wrote :

Managed to reproduce a few more crashes:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffecfa0700 (LWP 13305)]
0x00000000007b3b2f in rec_get_offsets_func (rec=0x7ffff4564470 "\200", index=0x15983e8, offsets=0x7fffecf9e770, n_fields=2, heap=0x7fffecf9edc0, file=0x9f3c6c "page/page0cur.c", line=378) at rem/rem0rec.c:546
546 if (dict_table_is_comp(index->table)) {

#0 0x00000000007b3b2f in rec_get_offsets_func (rec=0x7ffff4564470 "\200",
    index=0x15983e8, offsets=0x7fffecf9e770, n_fields=2, heap=0x7fffecf9edc0,
    file=0x9f3c6c "page/page0cur.c", line=378) at rem/rem0rec.c:546
#1 0x00000000007a08bf in page_cur_search_with_match (block=0x7ffff2690078,
    index=0x15983e8, tuple=0x13f1e18, mode=4,
    iup_matched_fields=0x7fffecf9f218, iup_matched_bytes=0x7fffecf9f220,
    ilow_matched_fields=0x7fffecf9f228, ilow_matched_bytes=0x7fffecf9f230,
    cursor=0x7fffecf9f2d8) at page/page0cur.c:376
#2 0x0000000000804597 in btr_cur_search_to_nth_level (index=0x15983e8,
    level=0, tuple=0x13f1e18, mode=4, latch_mode=2, cursor=0x7fffecf9f2d0,
    has_search_latch=0, file=0x9f9abe "row/row0row.c", line=771,
    mtr=0x7fffecf9f3c0) at btr/btr0cur.c:665
#3 0x00000000007c7e10 in btr_pcur_open_func (mtr=<optimized out>, line=771,
    file=0x9f9abe "row/row0row.c", cursor=0x7fffecf9f2d0,
    latch_mode=<optimized out>, mode=4, tuple=0x13f1e18, index=<optimized out>)
    at ./include/btr0pcur.ic:498
#4 row_search_index_entry (index=<optimized out>, entry=0x13f1e18,
    mode=<optimized out>, pcur=<optimized out>, mtr=<optimized out>)
    at row/row0row.c:771
#5 0x00000000007c5b24 in row_purge_remove_sec_if_poss_low (node=0x138fb88,
    index=0x15983e8, entry=0x13f1e18, mode=2) at row/row0purge.c:253
#6 0x00000000007c690b in row_purge_remove_sec_if_poss (entry=0x13f1e18,
    index=0x15983e8, node=0x138fb88) at row/row0purge.c:328

#7 row_purge_del_mark (node=0x138fb88) at row/row0purge.c:375
#8 row_purge (thr=0x138fa00, node=0x138fb88) at row/row0purge.c:648
#9 row_purge_step (thr=0x138fa00) at row/row0purge.c:688
#10 0x00000000007b167d in que_thr_step (thr=0x138fa00) at que/que0que.c:1260
#11 que_run_threads_low (thr=0x138fa00) at que/que0que.c:1320
#12 que_run_threads (thr=0x138fa00) at que/que0que.c:1357
#13 0x00000000007e2ffc in trx_purge () at trx/trx0purge.c:1201
#14 0x00000000007da109 in srv_purge_thread (arg=<optimized out>)
    at srv/srv0srv.c:3513
#15 0x00007ffff7bc6efc in start_thread (arg=0x7fffecfa0700)
    at pthread_create.c:304
#16 0x00007ffff6b1989d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#17 0x0000000000000000 in ?? ()

and...

#0 0x00007ffff6a6e3a5 in __GI_raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007ffff6a71b0b in __GI_abort () at abort.c:92
#2 0x00000000007dd68f in rw_lock_free (lock=0x1433100) at sync/sync0rw.c:307
#3 0x000000000082e4d4 in dict_index_remove_from_cache (table=0x142c048,
    index=0x1433078) at dict/dict0dict.c:1825
#4 0x000000000082e5e0 in dict_table_remove_from_cache (table=0x142c048)
    at dict/dict0dict.c:1227
#5 0x000000000082e935 in dict_table_LRU_trim (self=0x7fffe8049478)
    at di...

Read more...

Revision history for this message
Stewart Smith (stewart) wrote :

Those crashes were with the above branch, by the way.

Revision history for this message
Yasufumi Kinoshita (yasufumi-kinoshita) wrote :

The case seems to be a race condition between using an index entry in the dictionary and removing that index entry.
It might be caused by loading an index entry and deleting the same index entry at the same time,
which can happen when the LRU list of table entries in the dictionary is very short.
If the LRU list is kept at a certain length, the crash should not occur, though the implementation is optimistic.
(Otherwise, we would need to rewrite all of the InnoDB source code that uses dictionary index entries to use a new mutex.)

The test script sets the innodb_dict_size_limit value to 200k (200KB).
That may be a crazily tiny value.

So, I suggest changing the minimum value of innodb_dict_size_limit from 0 to 100M.
100M should be enough to keep the dictionary LRU list at a safe length.

Revision history for this message
Stewart Smith (stewart) wrote :

Yes, 200k is a crazy value. It was basically set that low to try to trigger any possible problems.

Revision history for this message
Stewart Smith (stewart) wrote :

As discussed on IRC: setting a higher dict_size_limit (e.g. 10MB) avoids the possible race by ensuring that the list is never too small. I ran tests against it and have so far been unable to reproduce a crash with the 5.1 patch. It would be great to see a merge proposal for 5.1 and the same for 5.5.

Revision history for this message
Valentine Gostev (longbow) wrote :

Stewart, is the proposal for 5.5 ready? I see that only 5.1 is committed.

Revision history for this message
Tim (tchadwick-b) wrote :

Hi All,

I am responsible for a set of MySQL instances that are playing the application-level shard game. We have well over 100K tables per instance that are partitioned, so we're really talking 600K+ tables with a dictionary cache over 6GB.

We've seen instances of corruption in a handful of tables upon restart, and the errors point to the table definitions being corrupt. After seeing that the innodb_dict_size_limit instrumentation was available, and knowing that we were well past the usual size of a large dictionary, I was excited to see if reducing the dictionary cache could eliminate this issue. However, before I attempted to implement the feature I noticed this bug and did considerable testing using the random query generator, probing harsh transaction and ALTER conditions around the 100M threshold suggested in this bug report. I was able to confirm the instrumentation of the dictionary cache and did not see a crash. I also note that I was able to repeat the behavior Yasufumi reports using very small innodb_dict_size_limit values.

The production environment in which I next implemented the feature is a site mirror of the production "primary" (not a replication architecture - the sites operate independently), and so it is a production environment that can suffer downtime, but only briefly so as to maintain failover for the primary site.

The summary is that the functionality fell over.

Here are the initial / material details for 2 machines, denoted as 01 and 02:

# 01
mysql> SHOW ENGINE INNODB STATUS;
    Dictionary cache 6434650431 (212499088 + 6222151343)
Dictionary memory allocated 6222151343

#02
mysql> SHOW ENGINE INNODB STATUS;
    Dictionary cache 5662601749 (212499088 + 5450102661)
Dictionary memory allocated 5450102661

I began by setting the dictionary size limit to 5GB.

# 01 and 02
mysql> SET GLOBAL innodb_dict_size_limit=5368709120;

The cache on 01 illustrated the expected behavior and slowly (ever so slowly...) began to evict cache entries.

# 01
mysql> SHOW ENGINE INNODB STATUS;
    Dictionary cache 6432416970 (212499088 + 6219917882)
Dictionary memory allocated 6219917882
1 row in set (0.00 sec)

mysql> SHOW ENGINE INNODB STATUS;
    Dictionary cache 6431708659 (212499088 + 6219209571)
Dictionary memory allocated 6219209571
1 row in set (0.00 sec)

mysql> SHOW ENGINE INNODB STATUS;
    Dictionary cache 6429897331 (212499088 + 6217398243)
Dictionary memory allocated 6217398243
1 row in set (0.00 sec)

The cache on 02 is a different story: it did not change at all. Please note that table_open_cache was 50k initially, but was reduced to 20k to help the algorithm identify candidates for dictionary cache eviction. Of course, while this was happening all activity on the db was suspended and no tables were in active use. Also, although 50k might seem low for this number of tables, the tables are in a single tablespace and not in file-per-table mode. At this scale, using multiple buffer pool instances was found to be most performant in that regard, but thoughts on that are certainly appreciated.
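
As a sketch, the table_open_cache reduction mentioned above would look like this (assuming the variable can be changed at runtime, as it is dynamic in these versions):

SET GLOBAL table_open_cache = 20000;            -- reduced from 50k to encourage dictionary cache eviction
SHOW GLOBAL VARIABLES LIKE 'table_open_cache';  -- confirm the new value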

While I was prodding 02 in order to determine what might be causing the innodb_dict_size_limit to not force dicti...

Read more...

tags: added: xtradb
Revision history for this message
George Ormond Lorch III (gl-az) wrote :

While looking into this issue, I managed to hit several crashes on both PS 5.1 and 5.5 trunks (5.1.73-14.12 revno 595 and 5.5.35-33.1 revno 607). The test was quite simple: fire up a server instance with the following options:
  innodb_file_format=barracuda
  innodb_file_per_table=1
  innodb_dict_size_limit=0

Start RQG with the percona grammar in a loop against the server instance. In each case, this can run for hours without a problem.
Then set innodb_dict_size_limit to just about any value; I used 1000, 10000, 100000 and 500000. The actual value didn't seem to have any impact on the failure as long as it was > 0. Within minutes, both the 5.1 and 5.5 servers would crash.
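
As a sketch, the trigger step looks like this (values taken from the description above; the RQG loop keeps running while the variable is changed):

SET GLOBAL innodb_dict_size_limit = 1000;       -- 10000, 100000 and 500000 behaved the same; any value > 0 crashed the server within minutes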

A very cursory analysis of all of the core files revealed that there seems to be something 'unprotected', as all the issues seemed to relate to using an already-freed record, releasing a lock too many times, etc...

The next few comments will contain info on the various crashes.

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

PS 5.1 crash:
InnoDB: Assertion failure in thread 140323328083712 in file lock/lock0lock.c line 3799
InnoDB: Failing assertion: (table->locks).count > 0

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

PS 5.1 crash:
mysqld got signal 11

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

PS 5.1 crash:
mysqld got signal 6

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

PS 5.5 crash:
InnoDB: Assertion failure in thread 140332332128000 in file sync0rw.c line 319
InnoDB: Failing assertion: lock->lock_word == X_LOCK_DECR

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

PS 5.5 crash, different call stack, same location as above:
InnoDB: Assertion failure in thread 140376792467200 in file sync0rw.c line 319
InnoDB: Failing assertion: lock->lock_word == X_LOCK_DECR

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

A little more info on one of the 5.1 assertions that I can reproduce nearly at will:
InnoDB: Assertion failure in thread 140323328083712 in file lock/lock0lock.c line 3799
InnoDB: Failing assertion: (table->locks).count > 0

What seems to be happening is that a FLUSH TABLES is closing an InnoDB dict table that still has locks on it. This causes the dict table handle count to be decremented to 0 even though a transaction still has an IX lock on the table.

Along comes the dictionary trimmer, sees the table with no open handles, and removes it from the cache. Later, when the transaction is cleaning up, it tries to remove the (now invalid) table from its lock list, and all kinds of interesting things can happen...

I hacked an assertion into row_prebuilt_free, just before it decrements the table handle count, to assert if there is a matching trx-to-table lock in the table lock list, and it fires with the following bt full:

#6 0x000000000092708b in row_prebuilt_free (prebuilt=0x7f67d8019608, dict_locked=0) at row/row0mysql.c:798
        lock = 0x7f67cc01d2f8
        i = 8
#7 0x00000000008b695c in ha_innobase::close (this=0x7f67d800edf0) at handler/ha_innodb.cc:4571
        _db_func_ = 0xbd9308 "closefrm"
        _db_file_ = 0xbd8e17 "table.cc"
        dbug_violation_helper = {_entered = true}
        thd = 0x2742b60
        _db_level_ = 10
        _db_framep_ = 0x7f67e80008c0
#8 0x00000000006cd55b in closefrm (table=0x7f67d8009ad0, free_share=true) at table.cc:1999
        _db_func_ = 0xbd597d "intern_close_table"
        _db_file_ = 0xbd5741 "sql_base.cc"
        dbug_violation_helper = {_entered = true}
        error = 0
        _db_level_ = 9
        _db_framep_ = 0x7f67e80009d0
#9 0x00000000006b3ac5 in intern_close_table (table=0x7f67d8009ad0) at sql_base.cc:789
        _db_func_ = 0xbd59b0 "free_cache_entry"
        _db_file_ = 0xbd5741 "sql_base.cc"
        dbug_violation_helper = {_entered = true}
        _db_level_ = 8
        _db_framep_ = 0x7f6828083328
#10 0x00000000006b3b7c in free_cache_entry (table=0x7f67d8009ad0) at sql_base.cc:811
        _db_func_ = 0xc72c57 "my_hash_delete"
        _db_file_ = 0xc72bd0 "hash.c"
        dbug_violation_helper = {_entered = true}
        __PRETTY_FUNCTION__ = "void free_cache_entry(TABLE*)"
        _db_level_ = 7
        _db_framep_ = 0x7f67f8e80c30
#11 0x0000000000a5fe4b in my_hash_delete (hash=0x10bc060, record=0x7f67d8009ad0 "\220\243") at hash.c:549
        blength = 16
        pos2 = 6
        pos_hashnr = 1386574718
        lastpos_hashnr = 1948990622
        idx = 4294967295
        empty_index = 9
        data = 0x1bd85f0
        lastpos = 0x1bd8690
        gpos = 0x1bd85f0
        pos = 0x1bd8650
        pos3 = 0x1bd8650
        empty = 0x1bd8680
        _db_func_ = 0xbd59f5 "close_cached_tables"
        _db_file_ = 0xbd5741 "sql_base.cc"
        _db_level_ = 6
        _db_framep_ = 0xa7ab9c
#12 0x00000000006b3de9 in close_cached_tables (thd=0x2742b60, tables=0x0, have_lock=false, wait_for_refresh=true, wait_for_placeholders=false) at sql_base.cc:876
        _db_func_ = 0xb82151 "mysql_execute_command"
        _db_file_ = 0xb81b40 "sql_parse.cc"
        dbug_violation_helper = {...

Read more...

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

Just some more info, logging it here in case I forget, or for anyone following. I can easily hit the assertion I added above, and with that assertion removed I can 100% recreate an error situation by executing the following SQL script:

SET GLOBAL innodb_dict_size_limit=1;
SET LOCAL AUTOCOMMIT=OFF;
BEGIN;
CREATE TABLE test.t1 (c1 INT) ENGINE=INNODB;
CREATE TABLE test.t2 (c1 INT) ENGINE=INNODB;
INSERT INTO test.t1(c1) VALUES(1);
FLUSH TABLES;
INSERT INTO test.t2(c1) VALUES(1);
COMMIT;

The 'FLUSH TABLES' command above allows the eviction of the dict_table for t1 in the dict cache even though there is a lock held against it and referenced from the transaction lock list. When the COMMIT comes along and tries to clean up the transaction lock list, the reference to t1 that was locked during the first insert is now completely invalid and can cause various assertions since it is now pointing to potentially reclaimed memory.

From digging, there seems to be a disconnect between what mysql thinks of as an 'unused table' (where it pulls the list of TABLES that it can/should flush) and what XtraDB believes are used. There seems to be no way to fail to 'close' the table at the handler interface within XtraDB so somehow I need to figure out why MySQL thinks this table is 'unused' when in fact it is most certainly being used. Stay tuned...

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

OK, in sql_base.cc there is a list of TABLE* unused_tables. This is a list of tables that are currently not being used by an executing statement. Once the statement execution is complete, regardless of the overall transaction state, any tables used by that statement are placed back into this unused tables list. This list is what the FLUSH TABLES statement uses as a source of tables that it needs to flush. This translates directly into a handler::close call on the table even if the table is still included within an active transaction. So by the end of the INSERT...test.t1 statement above, mysql has determined/placed test.t1 into the unused list, thus thinking it is OK to flush that table. The FLUSH TABLES comes along then, walks the unused_tables list and closes all the tables within that list, causing XtraDB to close the dict_table within its cache, causing the dangling references within the transaction lock lists.

Revision history for this message
George Ormond Lorch III (gl-az) wrote :

After a little hint from Laurynas, some clarification of the relationship between unused_tables, dict_cache handle counts, and lock lists, and a comparison with the upstream 5.6 implementation (which is similar but different), I think I have a fix that clears most of these issues in 5.1 by checking whether a dict_table has any locks or pinned indexes before allowing it to be evicted from the cache.

I am still chasing down one issue that is rather hard to hit compared to these; if I can't get it soon I will just post this fix and continue trying to finish off this last crash.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-6
