Small buffer pool size causes XtraDB buffer flush thread to spin at 100% cpu

Bug #1433432 reported by Jan Lindström
This bug affects 2 people
Affects (Status / Importance / Assigned to):
Percona Server (moved to https://jira.percona.com/projects/PS): Fix Released / Medium / Nickolay Ihalainen
5.1: Invalid / Undecided / Unassigned
5.5: Invalid / Undecided / Unassigned
5.6: Fix Released / Medium / Nickolay Ihalainen

Bug Description

Small innodb_buffer_pool_size causes mysqld to spin at 100% CPU.

Here is how I could repeat:

Start server with:

sql/mysqld --no-defaults --language=$(pwd)/sql/share/english --basedir=$(pwd) --datadir=$X --socket=$(pwd)/$X/mysql.sock --port=3310 --innodb-buffer-pool-size=5M

Create some dirty pages to trigger flush thread:

create table t1 (a int primary key auto_increment, b varchar (1000));
insert into t1 (b) values (repeat("a", 999));
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;
insert into t1 (b) select b from t1;

This causes the mysqld process to appear at 100% CPU in `top`.

The problem seems to be related to heuristics with hard-coded constants that
assume a certain minimum pool size. In page_cleaner_adapt_lru_sleep_time(),
if the number of free pages is less than @@innodb_lru_scan_depth / 20 (which
works out to 51 by default), the thread never sleeps between attempts to flush
pages. The page flusher probably has another heuristic that refuses to flush
more than some percentage of the total pages, or something similar.

So with a small buffer pool we end up with two heuristics fighting each other,
and the result is 100% CPU spinning.
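
For illustration, here is a minimal C++ sketch of how that kind of heuristic can spin; the names, structure, and the backoff detail are only assumptions based on the description above, not the actual page_cleaner_adapt_lru_sleep_time() source:

// Illustrative sketch only: how a "flush furiously when the free list is
// low" heuristic of the kind described above can degenerate into a busy
// loop. Names and structure are assumptions based on this report.
#include <cstdint>

static const std::uint64_t lru_scan_depth = 1024;  // default innodb_lru_scan_depth

// Returns the sleep time (ms) before the next LRU flush iteration.
std::uint64_t adapt_lru_sleep_time(std::uint64_t free_list_len,
                                   std::uint64_t prev_sleep_ms)
{
    // 1024 / 20 == 51: below this many free pages, never sleep at all.
    if (free_list_len < lru_scan_depth / 20) {
        return 0;  // 0 ms sleep -> the thread spins if flushing makes no progress
    }
    // Otherwise back off gradually toward a 1 s sleep (details omitted).
    return prev_sleep_ms < 1000 ? prev_sleep_ms + 50 : 1000;
}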

It could take a few more INSERTs to get there, so if it's not reproducible with the SQL above, just keep inserting.

Tested with Percona Server 5.6.22-72.0.

Revision history for this message
Jan Lindström (jplindst) wrote :
description: updated
tags: added: xtradb
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Confirmed 100% CPU with either page cleaner, LRU manager, or both threads.

We have a small buffer pool whose useful size keeps decreasing (until it hits ER_LOCK_TABLE_FULL, as noted in MDEV-7758 for InnoDB). Both the LRU and flush list threads have heuristics to flush furiously if either the checkpoint age is close to log capacity or the free lists are empty. In this case both conditions become true, and at the same time the flushing attempted by those threads does not result in anything actually being flushed. InnoDB is not affected because its threads sleep 1 s if the last iteration did not flush anything.

I am not sure what the best fix is here. We could sleep if the last iteration did not flush anything, but this needs further consideration. We could also sleep if the buffer pool is approaching ER_LOCK_TABLE_FULL.

There are some immediately fixable things: 1) for an active server, n_flushed is only added to, and is reset only by the idle-server flush, resulting in wrong counter values (typo: += instead of =). 2) The page cleaner heuristics will return without any flushing if the LSN did not advance since the last heuristic adjustment. This situation must be made incompatible with a 0 ms cleaner sleep time.
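
As a rough illustration of the "sleep if the last iteration did not flush" idea, here is a hedged C++ sketch; the loop shape and helper names are assumptions, not the actual buf_flush_lru_manager_thread() code:

// Hedged sketch of the "sleep if the last iteration flushed nothing" idea.
// Loop structure and helpers are illustrative assumptions.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

std::atomic<bool> shutdown_requested{false};

// Hypothetical helper: performs one LRU flush batch, returns pages freed.
std::uint64_t do_one_lru_flush_batch()
{
    return 0;  // stub for illustration
}

void lru_manager_loop_sketch()
{
    while (!shutdown_requested.load()) {
        std::uint64_t flushed = do_one_lru_flush_batch();

        std::uint64_t sleep_ms;
        if (flushed == 0) {
            // Nothing could be flushed (e.g. the only dirty pages are
            // latched by user threads): back off instead of spinning,
            // as stock InnoDB does with its 1 s idle sleep.
            sleep_ms = 1000;
        } else {
            // Progress was made; an adaptive heuristic may shorten the sleep.
            sleep_ms = 0;
        }

        if (sleep_ms > 0) {
            std::this_thread::sleep_for(std::chrono::milliseconds(sleep_ms));
        }
    }
}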

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

The typo "1) for active server n_flushed is being added to, and reset only by idle server flush, resulting in wrong counter values (typo += instead of =)." has been split off to bug 1491435. As that does not fully fix the current bug, leaving it open.

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

I found a deadlock; here is my analysis.

T1 (query thread): runs a write-heavy workload until the free buffers are exhausted and it needs a free buffer to complete its current action.

#0 0x00007f99e0d99743 in select () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000c8dc8a in os_thread_sleep (tm=4096) at /opt/projects/codebase/5.6/storage/innobase/os/os0thread.cc:285
#2 0x0000000000e0a738 in buf_LRU_get_free_block (buf_pool=0x317bc18) at /opt/projects/codebase/5.6/storage/innobase/buf/buf0lru.cc:1403
#3 0x0000000000df2700 in buf_page_create (space=6, offset=7623, zip_size=0, mtr=0x7f99dc1054a0) at /opt/projects/codebase/5.6/storage/innobase/buf/buf0buf.cc:3961

While T1 is waiting for a free block, the srv_empty_free_list_algorithm logic kicks in and causes T1 to suspend.
T1 now relies on the LRU manager thread (buf_flush_lru_manager_thread) and the page cleaner thread (buf_flush_page_cleaner_thread) to make a free buffer available.
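
For context, the wait in frames #0-#2 above looks roughly like the following loop; this is a simplified sketch inferred from the backtrace, using placeholder types and a hypothetical helper, not the verbatim buf0lru.cc code:

// Simplified sketch of the wait seen in the backtrace above.
// Types and the helper are placeholders, inferred from the trace.
#include <chrono>
#include <thread>

struct buf_pool_t;   // placeholder for the real buffer pool type
struct buf_block_t;  // placeholder for a buffer pool block

// Hypothetical helper: returns a block from the free list, or nullptr.
buf_block_t* try_take_free_block(buf_pool_t* /*buf_pool*/)
{
    return nullptr;  // stub for illustration (free list always empty here)
}

buf_block_t* get_free_block_sketch(buf_pool_t* buf_pool)
{
    for (;;) {
        if (buf_block_t* block = try_take_free_block(buf_pool)) {
            return block;  // got a free page frame
        }
        // No free block: the query thread (T1) parks here and relies on the
        // LRU manager / page cleaner threads to refill the free list --
        // this is the os_thread_sleep(tm=4096) in frame #1 (microseconds).
        std::this_thread::sleep_for(std::chrono::microseconds(4096));
    }
}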

----

The page cleaner thread schedules flushing of the flush list by scanning it (buf_flush_page).
The first dirty page to flush is space=x, offset=0. This page was made dirty by T1 as part of the write-heavy workload.
The page cleaner now needs to obtain an S-lock on this page but fails, because T1 already holds an X-lock on it as part of the insert.

DEADLOCK....

* The page cleaner thread can't proceed until T1 releases the X-lock, so the flush cannot complete.

* T1 can't release the lock until the page cleaner thread makes a free buffer available. [BTW, T1 is in the middle of a pessimistic insert.]

* And what about the LRU manager thread?
It is unable to evict pages once the LRU list falls below the hard-coded limit of 256 pages (meant to avoid thrashing), so although the LRU manager thread is active, it has effectively gone into a passive mode and just hogs the CPU (see the source comment quoted below and the sketch after it).

/** If LRU list of a buf_pool is less than this size then LRU eviction
should not happen. This is because when we do LRU flushing we also put
the blocks on free list. If LRU list is very small then we can end up
in thrashing. */
#define BUF_LRU_MIN_LEN 256
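
To make that "passive mode" concrete, here is a hedged C++ sketch of the check implied by the comment above; only BUF_LRU_MIN_LEN and its meaning come from the quoted source, the surrounding logic and helper are illustrative assumptions:

// Hedged sketch: why the LRU manager goes "passive" with a 5M pool.
#include <cstdint>

#define BUF_LRU_MIN_LEN 256

// Hypothetical helper: current length of the buffer pool's LRU list.
std::uint64_t lru_list_length();

// Hypothetical single LRU flush iteration; returns pages freed.
std::uint64_t lru_flush_iteration()
{
    if (lru_list_length() < BUF_LRU_MIN_LEN) {
        // A 5M pool holds only ~320 16K pages, so once the LRU list drops
        // below 256 the eviction scan is skipped entirely: the thread keeps
        // iterating (with a 0 ms sleep, see above) without freeing anything.
        return 0;
    }
    // ... otherwise scan the LRU tail and flush/evict pages (omitted) ...
    return 1;
}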

---------------

I used the same test case as quoted in the bug, just increased the number of inserts by a few more (18 to be precise).

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :
tags: added: lru-flusher
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-1616
