Percona Server with XtraDB

InnoDB: Failing assertion: mutex_own(&buf_pool->LRU_list_mutex) in file buf0lru.c line 859

Reported by Roel Van de Paar on 2013-01-24
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Server
Status tracked in 5.6
5.1
High
Laurynas Biveinis
5.5
High
Laurynas Biveinis
5.6
Undecided
Unassigned

Bug Description

130124 7:34:12 InnoDB: Assertion failure in thread 139816180811520 in file buf0lru.c line 859
InnoDB: Failing assertion: mutex_own(&buf_pool->LRU_list_mutex)

Thread 1 (LWP 15772):
+bt
#0 0x0000003da180c60c in pthread_kill () from /lib64/libpthread.so.0
#1 0x00000000007de838 in my_write_core (sig=6) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/mysys/stacktrace.c:433
#2 0x00000000006b2e14 in handle_fatal_signal (sig=6) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/sql/signal_handler.cc:249
#3 <signal handler called>
#4 0x0000003da1435935 in raise () from /lib64/libc.so.6
#5 0x0000003da14370e8 in abort () from /lib64/libc.so.6
#6 0x00000000008f339a in buf_LRU_insert_zip_clean (bpage=bpage@entry=0x7f297043d380) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/storage/innobase/buf/buf0lru.c:859
#7 0x00000000008eb0c9 in buf_flush_remove (bpage=bpage@entry=0x7f297043d380) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/storage/innobase/buf/buf0flu.c:538
#8 0x00000000008ecb5e in buf_flush_ready_for_flush (flush_type=BUF_FLUSH_LIST, bpage=0x7f297043d380) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/storage/innobase/buf/buf0flu.c:487
#9 buf_flush_page_and_try_neighbors (bpage=bpage@entry=0x7f297043d380, flush_type=flush_type@entry=BUF_FLUSH_LIST, n_to_flush=n_to_flush@entry=200, count=count
@entry=0x7f297dca7668) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/storage/innobase/buf/buf0flu.c:1587
#10 0x00000000008ef41f in buf_flush_flush_list_batch (lsn_limit=18446744073709551615, min_n=200, buf_pool=0x2a2ee48) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/storage/innobase/buf/buf0flu.c:1740
#11 buf_flush_batch (buf_pool=buf_pool@entry=0x2a2ee48, flush_type=flush_type@entry=BUF_FLUSH_LIST, min_n=min_n@entry=200, lsn_limit=lsn_limit@entry=18446744073
709551615) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/storage/innobase/buf/buf0flu.c:1830
#12 0x00000000008effbe in buf_flush_list (min_n=<optimized out>, lsn_limit=18446744073709551615, lsn_limit@entry=1) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/storage/innobase/buf/buf0flu.c:2047
#13 0x0000000000876587 in srv_master_thread (arg=<optimized out>) at /ssd/5.5/Percona-Server-5.5.28-rel29.3/storage/innobase/srv/srv0srv.c:3694
#14 0x0000003da1807d14 in start_thread () from /lib64/libpthread.so.0
#15 0x0000003da14f168d in clone () from /lib64/libc.so.6

Found in trunk @ Percona-Server-5.5.28-rel29.3-409-debug.Linux.x86_64

Roel Van de Paar (roel11) wrote :

A debug-only issue: buf_flush_remove needs to hold the LRU list mutex for the flush list flushes for BUF_BLOCK_ZIP_DIRTY pages. Due to mutex locking order, the way to fix this is to lock LRU in buf_flush_flush_list_batch #if UNIV_DEBUG || UNIV_BUF_DEBUG.

Likewise for 5.1 (with buf_flush_batch instead of buf_flush_flush_list_batch).

Roel Van de Paar (roel11) wrote :

(Bug is single threaded reproducible)

Great, the single-threadeness fits the analysis.

Roel Van de Paar (roel11) wrote :
Roel Van de Paar (roel11) wrote :

Ignore #7, unrelated to bug 1038225

tags: added: xtradb
George Ormond Lorch III (gl-az) wrote :

I hit this today while trying to reproduce 1086700 against PS 5.5 trunk revno 466

Laurynas, if I understood your suggestion, it doesn't work. I tried a few things and my final changes still caused assertions on startup for a fresh server:

=== modified file 'Percona-Server/storage/innobase/buf/buf0flu.c'
--- Percona-Server/storage/innobase/buf/buf0flu.c 2013-02-18 04:48:10 +0000
+++ Percona-Server/storage/innobase/buf/buf0flu.c 2013-03-13 22:36:14 +0000
@@ -1757,11 +1757,22 @@
   /* The list may change during the flushing and we cannot
   safely preserve within this function a pointer to a
   block in the list! */
- while (bpage != NULL
- && len > 0
- && !buf_flush_page_and_try_neighbors(
+ while (bpage != NULL && len > 0 ) {
+#if UNIV_DEBUG || UNIV_BUF_DEBUG
+ ibool have_LRU_list_mutex =
+ mutex_own(&buf_pool->LRU_list_mutex);
+
+ if (!have_LRU_list_mutex)
+ mutex_enter(&buf_pool->LRU_list_mutex);
+#endif
+ if (buf_flush_page_and_try_neighbors(
     bpage, BUF_FLUSH_LIST, min_n, &count)) {
-
+ break;
+ }
+#if UNIV_DEBUG || UNIV_BUF_DEBUG
+ if (!have_LRU_list_mutex)
+ mutex_exit(&buf_pool->LRU_list_mutex);
+#endif
    buf_flush_list_mutex_enter(buf_pool);

    /* If we are here that means that buf_pool->mutex

Initializing server instance: id[1] bindir[/data/bin/ps-5.5-PS-81-debug]
Creating database...done.
Starting server instance 1 .....ERROR: failed to start.
130313 15:36:36 [Note] Plugin 'FEDERATED' is disabled.
130313 15:36:36 InnoDB: !!!!!!!! UNIV_DEBUG switched on !!!!!!!!!
130313 15:36:36 InnoDB: The InnoDB memory heap is disabled
130313 15:36:36 InnoDB: Mutexes and rw_locks use GCC atomic builtins
130313 15:36:36 InnoDB: Compressed tables use zlib 1.2.3
130313 15:36:36 InnoDB: Using Linux native AIO
130313 15:36:36 InnoDB: Initializing buffer pool, size = 4.0G
130313 15:36:36 InnoDB: Completed initialization of buffer pool
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
130313 15:36:36 InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
130313 15:36:37 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
130313 15:36:37 InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
130313 15:36:37 InnoDB: Assertion failure in thread 140222067590944 in file buf0flu.c line 1139
InnoDB: Failing assertion: !mutex_own(&buf_pool->LRU_list_mutex)

George -

For one thing,
+ ibool have_LRU_list_mutex =
+ mutex_own(&buf_pool->LRU_list_mutex);
+
+ if (!have_LRU_list_mutex)
+ mutex_enter(&buf_pool->LRU_list_mutex);

is a race condition.

I worked on this bug couple of weeks ago, and it appears that, while the fix idea is simple, the code reorganization to achieve it is not. If this bug is blocking further work for a more critical bug, I'd disable the failing UNIV_DEBUG site and proceed.

I see three ways to fix this.

1) Acquire LRU list mutex in advance (i.e. before the flush list mutex) for debug builds where needed in order to satisfy the current code. But this comes close to unsplitting the buffer pool mutex split, meaning that the debug builds would become less useful in diagnosing any issues with the split.

2) In debug builds, buffer-fix the block, release the mutexes before the buf_LRU_insert_zip_clean() call in buf_flush_remove() and reacquire them again, this time with LRU mutex acquired too, undo this after the buf_LRU_insert_zip_clean() call.

3) Do not use zip_clean list in debug builds neither.

I am deciding between 2) and 3).

Option 2) becomes quite complicated because buf_flush_remove() may be called with different mutexes being held from different contexts (and even more so after the bug 1086680 fix). Option 3) starts looking more attractive. The zip_clean list maintenance does not seem to enforce anything vital.

The above reasoning re. option 2) applies to 5.1 more than to 5.5. In 5.5 it may be a feasible option.

In 5.5 all the current buf_flush_remove() callers hold the LRU list mutex. This bug only occurred when buf_flush_ready_for_flush() called buf_flush_remove(), and this has been fixed with the lazy drop table feature removal.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers