Percona Server with XtraDB

Server crashing after upgrade (semaphore wait)

Reported by Ivor B. on 2012-06-01
This bug affects 7 people
Affects          Importance   Assigned to
Percona Server   Critical     Laurynas Biveinis
5.1              Undecided    Unassigned
5.5              Critical     Laurynas Biveinis

Bug Description

After upgrading to the latest version (Percona-Server-server-55-5.5.23-rel25.3.240.rhel5, 64-bit), the Percona server seems to crash at random intervals.

The logs show:

--Thread 1264978240 has waited at buf0flu.c line 1896 for 271.00 seconds the semaphore:
Mutex at 0xb1b7728 '&buf_pool->mutex', lock var 1
waiters flag 1
InnoDB: Warning: a long semaphore wait:
--Thread 1322592576 has waited at trx0trx.c line 441 for 271.00 seconds the semaphore:
Mutex at 0x9cbaa48 '&kernel_mutex', lock var 1
waiters flag 1
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
InnoDB: ###### Diagnostic info printed to the standard error stream

The only thing that helps it recover is killing and restarting the mysqld process..

Ivor B. (ivor.b) wrote :

I've further narrowed down the cause of the semaphore lock. It happens when removing a large number of records from an InnoDB table and then immediately issuing 'OPTIMIZE TABLE tblname'.. The server will hang.. Tested and confirmed on both master and slave (both running Percona).

PavelVD (pdobryakov) wrote :

We also encountered this problem when renaming or deleting large tables and when updating a large amount of data (simultaneously in multiple threads).

Ivor B. (ivor.b) wrote :

It's getting worse for me now:

- I just created a new table (MyISAM, no records)
- Converted it to InnoDB
- Removed some columns (ALTER TABLE)

This caused the entire server to go dead.

Two days ago I had already performed the following actions, but it seems they didn't help:

- Made MySQLdump for all data
- Cleared almost the entire /var/lib/mysql dir (except for the mysql database itself)
- Recreated database, imported data.

Attached are the last error message and my my.cnf files

Ivor B. (ivor.b) wrote :

Error log attached..

Ivor -

Do you experience the same issue if you start with an empty database and then create a new MyISAM table, convert it to InnoDB, and do the ALTER TABLE?

Ivor B. (ivor.b) wrote :

There are no problems when I repeat the actions on an entirely new database..

Stewart Smith (stewart) on 2012-06-14
Changed in percona-server:
importance: Undecided → High

Is there any other workload executing on the server in parallel? If yes, does it include Nagios or other monitoring queries against INFORMATION_SCHEMA.TEMPORARY_TABLES or GLOBAL_TEMPORARY_TABLES? Could you test with that workload stopped to see if the issue disappears?

Ivor B. (ivor.b) wrote :

Hello,

Yes, the server normally has a medium workload. No monitoring software, but there are regular queries on the database requiring the use of temporary tables for sorting and such.

I just performed the same ALTER TABLE operation without any other clients active and didn't run into any problems..

Ivor -

Thanks. I was referring not to the queries that use temp tables for their execution, but the queries that select from INFORMATION_SCHEMA.TEMPORARY_TABLES or INFORMATION_SCHEMA.GLOBAL_TEMPORARY_TABLES. Does your parallel workload include such queries?

For Percona Server 5.5.25, about to be released shortly, we have fixed one bug that is slightly similar to yours but requires the aforementioned queries to trigger (bug 951588).

In case you don't query these I_S tables, it'd be great if you could narrow down the parallel workload to the specific queries that cause the ALTER TABLE to hang. In that case, complete stacktraces for all threads would be helpful too.

Ivor B. (ivor.b) wrote :

Hello Laurynas,

I'm pretty sure I don't have anything querying those I_S tables..

I don't think bug 951588 is related, but I will see what the update brings.

It's hard to narrow it down to a specific query.. For example, I was just removing some databases (all with similar structure and data); 5 went without problems, and removing number 6 crashed the server..

Can you point me to somewhere I can learn how to produce the requested stacktrace?

Ivor -

Yes, I understand that in this case it might be quite hard to narrow it down to a specific query. As for the stacktraces, the server should have GDB installed; then, once mysqld hangs, issue

gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld)

or, if you have GDB 6.3 or older,

(echo "set pagination 0"; echo "thread apply all bt"; echo "quit"; cat /dev/zero ) | gdb -p $(pidof mysqld)

Ivor B. (ivor.b) wrote :

Hi,

I managed to catch another lockup of the server and was able to take a stacktrace; you can find it enclosed.

Alexey Kopytov (akopytov) wrote :

This appears to be a regression introduced upstream with the fix for http://bugs.mysql.com/64284.

The deadlock occurs due to a lock order violation:

Background flushing is waiting on the buffer pool mutex while holding the page mutex:

Thread 1007 (Thread 0x47e5d940 (LWP 5098)):
#0 0x000000352cc0af59 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000008bd241 in os_cond_wait (event=0xef56db0, reset_sig_count=939) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/os/os0sync.c:207
#2 os_event_wait_low (event=0xef56db0, reset_sig_count=939) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/os/os0sync.c:609
#3 0x00000000007f4a9f in sync_array_wait_event (arr=0xda078c0, index=9) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/sync/sync0arr.c:458
#4 0x00000000007f5ccc in mutex_spin_wait (mutex=0xef56838, file_name=0xa5a6b0 "/home/jenkins/workspace/percona-server-5.5-rpms/label_exp/centos5-64/target/BUILD/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/buf/buf0flu.c", line=1290) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/sync/sync0sync.c:653
#5 0x0000000000844410 in mutex_enter_func (bpage=0x430a, flush_type=BUF_FLUSH_LIST, n_to_flush=7, count=0x47e5c8a8) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/include/sync0sync.ic:222
#6 pfs_mutex_enter_func (bpage=0x430a, flush_type=BUF_FLUSH_LIST, n_to_flush=7, count=0x47e5c8a8) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/include/sync0sync.ic:251
#7 buf_flush_page (bpage=0x430a, flush_type=BUF_FLUSH_LIST, n_to_flush=7, count=0x47e5c8a8) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/buf/buf0flu.c:1290
#8 buf_flush_try_neighbors (bpage=0x430a, flush_type=BUF_FLUSH_LIST, n_to_flush=7, count=0x47e5c8a8) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/buf/buf0flu.c:1520
#9 buf_flush_page_and_try_neighbors (bpage=0x430a, flush_type=BUF_FLUSH_LIST, n_to_flush=7, count=0x47e5c8a8) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/buf/buf0flu.c:1607
#10 0x0000000000844d57 in buf_flush_flush_list_batch (buf_pool=0xef56838, flush_type=<value optimized out>, min_n=7, lsn_limit=644671213472) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/buf/buf0flu.c:1740
#11 buf_flush_batch (buf_pool=0xef56838, flush_type=<value optimized out>, min_n=7, lsn_limit=644671213472) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/buf/buf0flu.c:1830
#12 0x000000000084550d in buf_flush_list (min_n=7, lsn_limit=644671213472) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-Server-5.5.24-rel26.0/storage/innobase/buf/buf0flu.c:2056
#13 0x00000000007ecbe4 in srv_master_thread (arg=<value optimized out>) at /usr/src/debug/Percona-Server-5.5.24-rel26.0/Percona-S...

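Schematically, the lock order violation described above amounts to two code paths acquiring the same pair of mutexes in opposite orders. The following minimal sketch illustrates only that general pattern with plain pthread mutexes; pool_mutex, block_mutex, flush_path and drop_path are invented stand-ins, not InnoDB identifiers, and the real call paths are far more involved.

/*
 * Two threads take the same two mutexes in opposite orders.
 * Build: gcc -pthread -o lockorder lockorder.c
 * This program is expected to hang eventually -- the hang is the
 * deadlock being illustrated.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pool_mutex  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t block_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Path A (like the flush batch above): block mutex first, then pool mutex. */
static void *flush_path(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&block_mutex);
        pthread_mutex_lock(&pool_mutex);    /* waits here ...            */
        pthread_mutex_unlock(&pool_mutex);
        pthread_mutex_unlock(&block_mutex);
    }
    return NULL;
}

/* Path B (like the fast DROP TABLE code): pool mutex first, then block mutex. */
static void *drop_path(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&pool_mutex);
        pthread_mutex_lock(&block_mutex);   /* ... while this waits here */
        pthread_mutex_unlock(&block_mutex);
        pthread_mutex_unlock(&pool_mutex);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, flush_path, NULL);
    pthread_create(&b, NULL, drop_path, NULL);
    pthread_join(a, NULL);                  /* never returns once deadlocked */
    pthread_join(b, NULL);
    return 0;
}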

Alexey Kopytov (akopytov) wrote :

Ivor,

I'm going to report this bug upstream.

Can you check if using innodb_lazy_drop_table=1 fixes the problem? That should use a different code path from the one introduced with the upstream fix.

Ivor B. (ivor.b) wrote :

Thank you.. I've added that configuration option and will report back about my findings..

Hi,

I believe I may also have come across this bug under Percona-Server-5.5.24-rel26.0. It occurred when an ALTER TABLE ... was issued on a small database while the server was under a heavy load of UPDATE and DELETE queries on compressed InnoDB tables in a large (>150 GB) database. The server locked up; an extract from the log is attached. SHOW PROCESSLIST indicated that the ALTER TABLE thread was stuck in 'rename result table', and SHOW ENGINE INNODB STATUS would hang.

I may be able to try to replicate this with innodb_lazy_drop_table=1 this weekend. I haven't been able to replicate it on a development installation, but the DB size there is far smaller and the load is tiny in comparison. Unfortunately the development install is on a VM on a laptop - we don't have the ability to replicate the live configuration/workload for testing.

DT

Ivor B. (ivor.b) wrote :

For the last two weeks, after setting innodb_lazy_drop_table=1, I haven't had a single lockup.. not even under high load..

Thank you!

Chehai Wu (wuchehai) wrote :

We ran into a similar issue recently. Our Percona Server 5.5.24-rel26.0 server with innodb_lazy_drop_table disabled just hung forever.

There was a single-thread deadlock in the error log:

InnoDB: Warning: a long semaphore wait:
--Thread 1258592576 has waited at buf0buf.c line 2529 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x10296528 '&buf_pool->page_hash_latch'
a writer (thread id 1258592576) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.c line 1481
Last time write locked in file /home/jenkins/workspace/percona-server-5.5-binaries/label_exp/centos5-64/Percona-Server-5.5.23-rel25.3/storage/innobase/buf/buf0lru.c line 629

I ran pstack against the running Percona instance. Thread 1258592576 (0x4b049940) had the following stack trace:

Thread 23 (Thread 0x4b049940 (LWP 21020)):
#0 0x0000003af320aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000008f4527 in os_event_wait_low ()
#2 0x0000000000825352 in sync_array_wait_event ()
#3 0x0000000000825737 in rw_lock_s_lock_spin ()
#4 0x000000000086ca82 in buf_page_get_gen ()
#5 0x0000000000863983 in btr_search_drop_page_hash_when_freed ()
#6 0x000000000087bc3c in buf_LRU_flush_or_remove_pages ()
#7 0x00000000008b138b in fil_delete_tablespace ()
#8 0x00000000008b153d in fil_discard_tablespace ()
#9 0x0000000000808323 in row_truncate_table_for_mysql ()
#10 0x00000000007edc56 in ha_innobase::truncate() ()
#11 0x00000000007b971b in Truncate_statement::handler_truncate(THD*, TABLE_LIST*, bool) ()
#12 0x00000000007b9edd in Truncate_statement::truncate_table(THD*, TABLE_LIST*) ()
#13 0x00000000007b9f7e in Truncate_statement::execute(THD*) ()
#14 0x000000000057eab8 in mysql_execute_command(THD*) ()
#15 0x0000000000580d13 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#16 0x00000000007554db in Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned int) ()
#17 0x0000000000518192 in apply_event_and_update_pos(Log_event*, THD*, Relay_log_info*) ()
#18 0x00000000005216b6 in exec_relay_log_event(THD*, Relay_log_info*) ()
#19 0x00000000005229c9 in handle_slave_sql ()
#20 0x0000003af320673d in start_thread () from /lib64/libpthread.so.0
#21 0x0000003af2ad44bd in clone () from /lib64/libc.so.6

After reading the Percona Server source, I can see that a single-thread deadlock is possible if innodb_lazy_drop_table is disabled:

- buf_LRU_flush_or_remove_pages() calls buf_LRU_remove_all_pages() (the stack frame for buf_LRU_remove_all_pages() is, I believe, optimized away).
- buf_LRU_remove_all_pages() executes rw_lock_x_lock(&buf_pool->page_hash_latch), which locks page_hash_latch in exclusive mode.
- buf_LRU_remove_all_pages() calls btr_search_drop_page_hash_when_freed().
- btr_search_drop_page_hash_when_freed() calls buf_page_get_gen().
- buf_page_get_gen() executes rw_lock_s_lock(&buf_pool->page_hash_latch), which locks page_hash_latch in shared mode.
- Deadlock!

If innodb_lazy_drop_table is enabled, the execution path is very different and it is unlikely to have a deadlock.
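
The semaphore-wait log above shows the same thing directly: the waiting thread and the exclusive holder of page_hash_latch have the same thread id (1258592576). Below is a minimal standalone sketch of this self-deadlock pattern, using a POSIX rw-lock in place of InnoDB's own rw-latch; the variable name is borrowed from the log, and everything else is invented for illustration and is not InnoDB code.

/*
 * A single thread takes an rw-lock in exclusive (write) mode and then
 * requests it again in shared (read) mode, mirroring the
 * buf_LRU_remove_all_pages() -> btr_search_drop_page_hash_when_freed()
 * -> buf_page_get_gen() path outlined above. With a non-recursive
 * latch the second request can never be granted, so the thread waits
 * on itself. (POSIX leaves this case undefined; with glibc it
 * typically just blocks forever, which matches the hang in the log.)
 * Build: gcc -pthread -o selfdeadlock selfdeadlock.c
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t page_hash_latch = PTHREAD_RWLOCK_INITIALIZER;

int main(void)
{
    /* "buf_LRU_remove_all_pages()": take the latch in X (write) mode. */
    pthread_rwlock_wrlock(&page_hash_latch);
    puts("holding the latch in exclusive mode");

    /* "buf_page_get_gen()": the same thread now asks for S (read) mode.
     * The request waits for the exclusive holder -- which is this thread. */
    puts("requesting shared mode from the same thread...");
    pthread_rwlock_rdlock(&page_hash_latch);    /* never returns */

    puts("never reached");
    return 0;
}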

@Chehai Wu,

I have written about the deadlock in the non-lazy drop table path here -- https://bugs.launchpad.net/percona-server/+bug/1026926

Also, as noted in a comment on that other bug, it is a regression introduced in Percona Server (due to the latch ordering with respect to buf_pool->page_hash_latch); it is not present upstream.

Chehai Wu (wuchehai) wrote :

@Raghavendra D Prabhu

Thanks! Hopefully the bug can be fixed soon. I think the workaround is to enable innodb_lazy_drop_table

Yes, that is the 'workaround' since it doesn't use that code path.

--
Raghavendra D Prabhu

Alexey Kopytov (akopytov) wrote :

It's a critical regression introduced by incorrectly merging the upstream fix for http://bugs.mysql.com/bug.php?id=61188

Changed in percona-server:
importance: High → Critical
tags: added: regression

Stewart Smith (stewart) wrote :

Do we think this is a duplicate of https://bugs.launchpad.net/percona-server/+bug/1026926 ?

Stewart Smith (stewart) wrote :

Marking as a duplicate as it sounds really similar, and logs here indicate that it is likely the same issue. Please reopen if we discover it isn't.

Alexey Kopytov (akopytov) wrote :

This bug is not a duplicate of bug #1026926, as seen from the logs provided by the reporter and my analysis in https://bugs.launchpad.net/percona-server/+bug/1007268/comments/13

Alexey Kopytov (akopytov) wrote :

The deadlock here happens due to a wrong lock order of a block mutex and the buffer pool mutex, used by flushing and the new fast DROP table code. The problem in bug #1026926 appears to be around page_hash_latch.

This is not an upstream issue: the upstream buf_flush_page() already holds both mutexes, both acquired in buf_flush_try_neighbors(). And buf_flush_yield() already has the buffer pool mutex and acquires the block one. In both cases the order is the same.
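
To make the contrast concrete, here is the counterpart of the earlier lock-order sketch with both paths following a single agreed acquisition order, which is what the upstream code described here does. Again plain pthread mutexes and invented names, purely illustrative and not InnoDB code.

/*
 * Both code paths take the two mutexes in the same order, so one
 * thread simply waits for the other to release them; no cycle, no
 * deadlock. Build: gcc -pthread -o sameorder sameorder.c
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pool_mutex  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t block_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *name)
{
    for (int i = 0; i < 1000000; i++) {
        /* Agreed order: pool mutex first, block mutex second. */
        pthread_mutex_lock(&pool_mutex);
        pthread_mutex_lock(&block_mutex);
        pthread_mutex_unlock(&block_mutex);
        pthread_mutex_unlock(&pool_mutex);
    }
    printf("%s finished\n", (const char *) name);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, "flush path");
    pthread_create(&b, NULL, worker, "drop path");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no deadlock: both paths used the same acquisition order");
    return 0;
}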

After fixing bug 1038225, buf_flush_yield() uses the LRU list mutex instead of the buffer pool mutex, eliminating this deadlock. Thus I consider this to be a duplicate of the already-fixed bug 1038225.

The fix will appear in the next release, 5.5.27-28.1; please let us know if you still experience any issues with it. Thanks!

Omega8.cc (g-admin-omega8-cc) wrote :

It just crashed with the same error while running the latest Percona-Server-5.5.27-rel28.0, even with innodb_lazy_drop_table set to 1, so it is not a working workaround: https://gist.github.com/3486469

These symptoms match those of bug 1040735, a different bug whose cause we are trying to find.

Could you provide the full GDB stacktraces at the time of the deadlock? You'd need to install the Percona Server debug symbols and GDB, if not already installed, and then run

gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld)

Please add this info to bug 1040735. Thanks!
