Add hard timeouts to page cleaner flushes

Bug #1232101 reported by Laurynas Biveinis on 2013-09-27
Affects: MySQL Server; Percona Server
Status tracked in 5.7: Fix Released
Assigned to: Laurynas Biveinis

Bug Description

Copy of

[27 Sep 15:51] Laurynas Biveinis

Currently the page cleaner thread is designed as follows, if I understand correctly:
- in a loop:
  - do LRU tail flushing, controlled by the max LRU scan depth;
  - do flush list flushing, controlled by I/O capacity;
  - sleep for the remainder of the 1-second interval.

Now the proper tuning (correct me if I'm wrong) would strive to minimize the sleep time and utilize storage as close to capacity as possible.

But the flushing time is determined not only by the I/O variable settings, but also by how quickly the cleaner can get access to shared resources such as the buffer pool mutexes. This means that a page cleaner iteration, even with properly tuned I/O settings, can exceed 1 second. Under very heavy load (I/O-bound sysbench, 512 threads on 32 cores) one page cleaner iteration might take as long as minutes. This is bad for several reasons:
- if LRU flushing is taking that long, flush list flushing is not happening and the query threads go to sync preflush;
- if flush list flushing is taking that long, LRU tail flushing is not happening and the query threads go to single-page LRU flushes;
- moreover, the page cleaner and adaptive flushing heuristics are designed to be updated roughly every second, and will not get a very exact idea of what's going on if invoked once in several minutes.

How to repeat:
Benchmarks, code analysis

Suggested fix:
A partial solution would be to add a hard timeout, i.e. limit the LRU tail flush to 1 second, checked after each batch, and limit flush list flushing to 1 second as well (the exact value of the constant is debatable). It looks like it's more important to re-check the heuristics and to alternate periodically between the two flush types than to allow either of them to complete fully.

This is not a complete solution for the case of multiple buffer pool instances, because with a timeout, some instances may receive no flushing at all. This will be reported separately.

tags: added: innodb

Percona now uses JIRA for bug reports so this bug report is migrated to:
