innodb main thread locks itself in &dict_operation_lock
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
Status: New
Importance: Undecided
Assigned to: Unassigned
Bug Description
Yesterday our master got a severe stall which ended in InnoDB committing suicide with an assertion:
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
2015-09-26 20:54:46 7efda0ca3700 InnoDB: Assertion failure in thread 139627789432576 in file srv0srv.cc line 2128
The InnoDB engine status dump (attached to this bug) shows the entire InnoDB subsystem locked, with the main thread as the blocker, which seems stuck in state "enforcing dict cache limit".
Looking at the error log, what puzzles me is the following (139627810412288 is the main thread):
--Thread 139627810412288 has waited at srv0srv.cc line 2596 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x134c460 '&dict_operation_lock'
a writer (thread id 139627810412288) has reserved it in mode wait exclusive
number of readers 7, waiters flag 1, lock_word: fffffffffffffff9
Last time read locked in file row0ins.cc line 1803
Last time write locked in file /mnt/workspace/
From the above it seems that the blocker was blocked by itself. This seems quite odd!
The thread already owns &dict_operation_lock, yet it is the one reported as waiting on it.
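As a side note on reading these semaphore lines: the lock_word looks like a 64-bit two's-complement value that goes negative while a writer sits in wait_ex, and its magnitude seems to match the number of readers still holding the latch. A quick way to decode it from the mysql client (just a convenience query, nothing specific to this bug):

SELECT CAST(0xfffffffffffffff9 AS SIGNED) AS lock_word_signed;
-- returns -7, which matches "number of readers 7" above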
Running Percona Server 5.6.25-73.1-log. my.cnf and error log are attached.
Additional info which may help troubleshooting:
At the time the problem occurred, an online schema change was running and about to complete.
The table being altered was a very large (1 billion rows) partitioned one. It is likely that the OSC tool was about to execute, or was already executing, the atomic rename of tables, though this is just speculation, as the operation never completed and is therefore not logged in the binary logs or anywhere else.
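For reference, assuming a pt-online-schema-change-style workflow (the logs do not confirm which tool or statement was actually running), the cut-over is a single atomic RENAME TABLE that swaps the shadow table into place; the table and trigger names below are made up purely for illustration:

-- hypothetical names; the real tool generates its own _new/_old suffixes
RENAME TABLE orders TO _orders_old,
             _orders_new TO orders;
-- followed by cleanup of the old copy and the OSC triggers
DROP TABLE _orders_old;
DROP TRIGGER IF EXISTS pt_osc_orders_ins;

As far as we understand, that RENAME is DDL and needs the InnoDB data dictionary latch (&dict_operation_lock) in exclusive mode, which would line up with everything piling up on that latch.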
Thanks
Rick
Today we retried the online schema change operation, after trimming down the partitioned table to just 3 days' worth of data (approx. 50M rows). The server stalled again when the OSC was about to complete, in exactly the same way:
--Thread 140222643808000 has waited at dict0stats.cc line 2385 for 23.000 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x134c460 '&dict_operation_lock'
a writer (thread id 140222643808000) has reserved it in mode wait exclusive
number of readers 34, waiters flag 1, lock_word: ffffffffffffffde
Last time read locked in file row0undo.cc line 298
Last time write locked in file /mnt/workspace/percona-server-5.6-redhat-binary/label_exp/centos6-64/rpmbuild/BUILD/percona-server-5.6.25-73.1/storage/innobase/dict/dict0stats.cc line 2385
We suspect that the problem is tied to a conflict between the rows inserted into the current partition of the _new table by the OSC copy process and the rows inserted by the insert trigger. Everything is smooth until the end, when the row IDs copied from the original table approach the value of the current (auto-increment) row ID being inserted by the trigger.
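To illustrate the suspicion, this is roughly how a pt-online-schema-change-style setup writes into the shadow table from two directions at once; the table, trigger, and column names are hypothetical and not taken from our schema:

CREATE TABLE _orders_new LIKE orders;

-- OSC insert trigger: every new row on the original table is mirrored
-- into the shadow table with its already-assigned id
CREATE TRIGGER pt_osc_orders_ins AFTER INSERT ON orders
FOR EACH ROW
  REPLACE INTO _orders_new (id, payload) VALUES (NEW.id, NEW.payload);

-- background copy job: chunks of old rows are inserted with explicit ids;
-- near the end of the copy these ids approach the shadow table's own
-- AUTO_INCREMENT counter, which the trigger keeps pushing forward
INSERT LOW_PRIORITY IGNORE INTO _orders_new (id, payload)
SELECT id, payload FROM orders WHERE id BETWEEN 999000000 AND 999010000;

Because both paths insert explicit id values, each statement goes through the AUTO_INCREMENT bookkeeping, which would be consistent with the innobase_lock_autoinc / innobase_set_max_autoinc frames in the stacks below.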
This time we were able to run a pt-pmp:
647 pthread_cond_wait,os_cond_wait(os0sync.cc:214),os_event_wait_low(os0sync.cc:214),sync_array_wait_event(sync0arr.cc:433),rw_lock_s_lock_spin(sync0rw.cc:550),rw_lock_s_lock_func(sync0rw.ic:426),pfs_rw_lock_s_lock_func(sync0rw.ic:426),row_mysql_freeze_data_dictionary_func(sync0rw.ic:426),row_undo(row0undo.cc:298),row_undo_step(row0undo.cc:298),que_thr_step(que0que.cc:1087),que_run_threads_low(que0que.cc:1087),que_run_threads(que0que.cc:1087),trx_rollback_to_savepoint_low(trx0roll.cc:112),trx_rollback_to_savepoint(trx0roll.cc:154),row_mysql_handle_errors(row0mysql.cc:651),row_lock_table_autoinc_for_mysql(row0mysql.cc:651),ha_innobase::innobase_lock_autoinc(ha_innodb.cc:7072),innobase_set_max_autoinc(ha_innodb.cc:7123),ha_innobase::write_row(ha_innodb.cc:7123),handler::ha_write_row(handler.cc:7679),ha_partition::write_row(ha_partition.cc:4010),handler::ha_write_row(handler.cc:7679),write_record(sql_insert.cc:1680),mysql_insert(sql_insert.cc:1085),mysql_execute_command(sql_parse.cc:3892),audit_mysql_execute_command,sp_instr_stmt::exec_core(sp_instr.cc:937),sp_lex_instr::reset_lex_and_exec_core(sp_instr.cc:391),sp_lex_instr::validate_lex_and_execute_core(sp_instr.cc:642),sp_instr_stmt::execute(sp_instr.cc:825),sp_head::execute(sp_head.cc:648),sp_head::execute_trigger(sp_head.cc:895),Table_triggers_list::process_triggers(sql_trigger.cc:2184),write_record(sql_insert.cc:1962),mysql_insert(sql_insert.cc:1085),mysql_execute_command(sql_parse.cc:3892),audit_mysql_execute_command,mysql_parse(sql_parse.cc:6950),dispatch_command(sql_parse.cc:1442),do_handle_one_connection(sql_connect.cc:1532),handle_one_connection(sql_connect.cc:1443),start_thread(libpthread.so.0),clone(libc.so.6)
151 poll(libc.so.6),vio_io_wait(viosocket.c:1136),vio_socket_io_wait(viosocket.c:104),vio_read(viosocket.c:159),net_read_raw_loop(net_serv.cc:669),net_read_packet_header(net_serv.cc:751),net_read_packet(net_serv.cc:751),my_net_read(net_serv.cc:751),do_command(sql_parse.cc:987),do_handle_one_connection(sql_connect.cc:1532),handle_one_connection(sql_conn...