MySQL aborts due to a semaphore wait of over 600 seconds, triggered by DROP TABLE SQL

Bug #1074255 reported by Hui Liu
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

This is the first time Percona Server 5.5(.18) has hit this bug in our production environment. Detailed information can be found in the attached alert.log.

The basic scenario is:
Tables xxx_001/0064 are dropped sequentially. Most of the tables are dropped successfully, but near the end of the task mysqld aborts, because several DROP TABLE statements have waited on a semaphore for more than 600 seconds.
The data load starts after the batch drop task; while the drops are blocked, these LOAD statements can be seen in the transaction list, but they are unrelated to this issue.
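
A rough sketch of the workload shape, with placeholder table names (not the real xxx_001..0064 set):

-- Batch drop task, issued sequentially (placeholder names):
DROP TABLE IF EXISTS t_001;
DROP TABLE IF EXISTS t_002;
-- ... and so on through the rest of the batch ...

-- Load job started afterwards; while the drops are blocked, these show up
-- in the transaction list but are unrelated to the abort:
LOAD DATA INFILE '/tmp/load_001.csv' INTO TABLE t_load_001;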

Tags: xtradb
Revision history for this message
Hui Liu (hickey) wrote :

Here are the threads shown in the semaphore wait output:
1331538240 waits for row0purge.c:680 for 241s on &dict_operation_lock, owned by 1383319872
1321048384 waits for buf0flu.c:1335 for 1s on &block->lock, owned by 1383319872 (1321048384 is main thread)
1368525120 waits for fsp0fsp.c:3144 for 231s on &dict_sys->mutex
1354787136 waits for dict0dict.c:742 for 23s on &dict_sys->mutex

I think the key threads are 1331538240, 1383319872 and 1392122.
1331538240 triggered the problem, and I am not sure whether it is the purge thread. It waited on dict_operation_lock in row_mysql_freeze_data_dictionary for over 600s.

1383319872 owns dict_operation_lock and never released it. Judging from the "last time write locked" info, the lock was last taken in row_mysql_lock_data_dictionary called from row_drop_table_for_mysql, which should be the DDL thread, yet the DDL thread is 1392122.
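
For reference, the waits above come from the SEMAPHORES section of the InnoDB status output, and the blocked DROP TABLE plus the unrelated LOAD statements show up in the transaction list (assuming the INFORMATION_SCHEMA.INNODB_TRX view is available in this build):

-- Semaphore waits and transaction list in one report:
SHOW ENGINE INNODB STATUS\G

-- Blocked DROP TABLE and the unrelated LOAD statements:
SELECT trx_id, trx_state, trx_started, trx_query
  FROM information_schema.INNODB_TRX;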

Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Please send your my.cnf file content. I wonder whether the table you drop is a partitioned or compressed InnoDB table.

See also http://bugs.mysql.com/bug.php?id=41158 for (probably) similar cases in the past (which should be solved in recent 5.5.x).

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Hui Liu (hickey) wrote :

Here is the my.cnf used in the production environment.

Yes, the dropped tables are compressed tables, and no partitioned tables are used.

  PRIMARY KEY (`id`),
  KEY `ind_tip_appapireport_period_app` (`appid`,`servicename`,`timestamp`),
  KEY `ind_tip_appapireport_period_name` (`servicename`,`timestamp`),
  KEY `ind_timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=gbk ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4

It seems not to be the same issue as #41158, since these tables are of normal size; could it be related to the compressed tables? We have little experience with compressed tables in production.
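
For reference, a minimal compressed table of the same shape would be created roughly like this (the table name and column list are only an illustration, not our real schema; innodb_file_per_table and the Barracuda file format are the usual prerequisites for ROW_FORMAT=COMPRESSED):

SET GLOBAL innodb_file_per_table = 1;
SET GLOBAL innodb_file_format = Barracuda;

CREATE TABLE tip_appapireport_demo (
  `id` BIGINT NOT NULL AUTO_INCREMENT,
  `appid` INT NOT NULL,
  `servicename` VARCHAR(64) NOT NULL,
  `timestamp` DATETIME NOT NULL,
  PRIMARY KEY (`id`),
  KEY `ind_timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=gbk ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;

DROP TABLE tip_appapireport_demo;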

Any clue? Thanks~

Hui Liu (hickey)
Changed in percona-server:
status: Incomplete → New
Revision history for this message
yinfeng (yinfeng-zwx) wrote :

I tried to reproduce this bug. I used a modified sysbench to create some compressed tables and then drop them. I found another long semaphore wait:

--Thread 139878966376192 has waited at buf0flu.c line 1454 for 303.00 seconds the semaphore:
S-lock on RW-latch at 0x44d4158 '&buf_pool->page_hash_latch'
a writer (thread id 139878934730496) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0buf.c line 2482

The purge thread was waiting for the user thread executing 'drop table' to release buf_pool->page_hash_latch.

I guess that if the buffer pool or the compressed table is big enough, a real crash may happen (even though the user thread is working normally, freeing compressed pages from page_hash).

But the server was actually still active. While the aim of crashing the server is to avoid a real hang, I think a friendlier way of detecting a genuine hang is needed.

Besides, crashing the server is very rude and may well lead to data inconsistency. I think we could add an option to forbid this behavior (this should be very easy).

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@yinfeng, @hickey,

Can you try to reproduce this on the latest Percona Server (5.5.28-29.1)? Many DROP TABLE-related performance regressions have been fixed (both in PS and upstream, upstream specifically after 5.5.23).

Revision history for this message
yinfeng (yinfeng-zwx) wrote :

Hi Raghavendra D Prabhu,
thanks for your reply.

I tested PS 5.5.28 and it really improved the performance of 'drop table' (from minutes to seconds).

Using innodb_lazy_drop_table should be a workaround too, because with it the 'drop table' thread does not call buf_LRU_invalidate_tablespace, which otherwise does time-consuming work like this:

#0 0x00000000008a8c2d in buf_buddy_free_low ()
#1 0x00000000008150f4 in buf_LRU_block_remove_hashed_page ()
#2 0x0000000000817def in buf_LRU_invalidate_tablespace ()
#3 0x0000000000847cab in fil_delete_tablespace ()
#4 0x00000000007a606d in row_drop_table_for_mysql ()
#5 0x00000000007918ae in ha_innobase::delete_table(char const*) ()
#6 0x0000000000694c0e in ha_delete_table(THD*, handlerton*, char const*, char const*, char const*, bool) ()
#7 0x0000000000602e55 in mysql_rm_table_no_locks(THD*, TABLE_LIST*, bool, bool, bool, bool) ()
#8 0x00000000006034ad in mysql_rm_table(THD*, TABLE_LIST*, char, char) ()
#9 0x00000000005abd29 in mysql_execute_command(THD*) ()
#10 0x00000000005afaa8 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#11 0x00000000005b12c7 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#12 0x00000000006355ff in do_handle_one_connection(THD*) ()
#13 0x0000000000635700 in handle_one_connection ()
#14 0x0000000000900da6 in pfs_spawn_thread ()
#15 0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0
#16 0x00000038888e68ed in clone () from /lib64/libc.so.6
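
A minimal sketch of the workaround, assuming the XtraDB variable innodb_lazy_drop_table is present and dynamic in the running build:

-- Enable lazy tablespace invalidation on DROP TABLE:
SET GLOBAL innodb_lazy_drop_table = 1;
-- or persist it in my.cnf under [mysqld]: innodb_lazy_drop_table = 1
SHOW GLOBAL VARIABLES LIKE 'innodb_lazy_drop_table';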

I also got this backtrace even with innodb_lazy_drop_table=1:

#0 0x00000000008a8c2d in buf_buddy_free_low ()
#1 0x0000000000815234 in buf_LRU_block_remove_hashed_page ()
#2 0x0000000000815fb3 in buf_LRU_free_block ()
#3 0x0000000000817143 in buf_LRU_search_and_free_block ()
#4 0x00000000008187bc in buf_LRU_get_free_block ()
#5 0x000000000080bf0e in buf_page_get_gen ()
#6 0x00000000007fdc58 in btr_search_drop_page_hash_when_freed ()
#7 0x000000000084c489 in fseg_free_extent ()
#8 0x000000000084d284 in fseg_free_step ()
#9 0x00000000007ec4a9 in btr_free_but_not_root ()
#10 0x000000000081d50e in dict_drop_index_tree ()
#11 0x00000000007b6035 in row_upd_clust_step ()
#12 0x00000000007b69de in row_upd_step ()
#13 0x0000000000899b88 in que_run_threads ()
#14 0x000000000089a286 in que_eval_sql ()
#15 0x00000000007a5ce9 in row_drop_table_for_mysql ()
#16 0x00000000007918ae in ha_innobase::delete_table(char const*) ()

Both backtraces go through the same call path: buf_LRU_block_remove_hashed_page -> buf_buddy_free_low.

So I am not sure whether this bug is related to http://bugs.mysql.com/bug.php?id=59727, which is still open.

tags: added: xtradb
Changed in percona-server:
assignee: nobody → Valerii Kravchuk (valerii-kravchuk)
Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Do you still see this problem with recent PS 5.5.30-30.2?

Changed in percona-server:
status: New → Incomplete
Changed in percona-server:
assignee: Valerii Kravchuk (valerii-kravchuk) → nobody
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-2826
