I started looking at this again after lyarwood mentioned it in the nova meeting today. Looking at the logs/mysql/error.txt of some successful grenade runs, there are a lot of messages like this regarding aborted connections:

2020-03-12T19:07:34.435762Z 4 [Note] Aborted connection 4 to db: 'keystone' user: 'root' host: 'localhost' (Got an error reading communication packets)

so those look like a red herring.

The mysql logs aren't indexed by our logstash, so looking at a few by hand, there seems to be a consistent pattern in the failed jobs that is not present in the successful jobs:

2020-03-11T12:09:58.384142Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18540ms. The settings might not be optimal. (flushed=200 and evicted=0, during the time.)
2020-03-11T11:40:53.524707Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4382ms. The settings might not be optimal. (flushed=3 and evicted=0, during the time.)
2020-03-11T11:41:05.482158Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8976ms. The settings might not be optimal. (flushed=44 and evicted=0, during the time.)
2020-03-11T11:41:41.406597Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4915ms. The settings might not be optimal. (flushed=200 and evicted=0, during the time.)
2020-03-11T10:37:04.469735Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5434ms. The settings might not be optimal. (flushed=5 and evicted=0, during the time.)

I googled this and learned the page cleaner is a periodic task in the mysql server that flushes dirty pages to disk every second [1]. From the stackoverflow answer:

"Once per second, the page cleaner scans the buffer pool for dirty pages to flush from the buffer pool to disk. The warning you saw shows that it has lots of dirty pages to flush, and it takes over 4 seconds to flush a batch of them to disk, when it should complete that work in under 1 second. In other words, it's biting off more than it can chew."

They go on to say that this is exacerbated on machines with slow disks, since slow I/O also causes the page cleaning to fall behind. The person who asked the question solved their issue by setting innodb_lru_scan_depth=256 so the page cleaner processes smaller chunks at a time (the default is 1024). The person who answered noted that this only helps if the page cleaner can keep up with the average rate at which new dirty pages are created; if it cannot, the flushing rate is automatically increased once innodb_max_dirty_pages_pct is exceeded and the page cleaner warnings may show up all over again. They say:

"Another solution would be to put MySQL on a server with faster disks. You need an I/O system that can handle the throughput demanded by your page flushing. If you see this warning all the time under average traffic, you might be trying to do too many write queries on this MySQL server. It might be time to scale out, and split the writes over multiple MySQL instances, each with their own disk system."

This again seems to point back at slow nodes. I'm trying out a DNM devstack patch [2] to set innodb_lru_scan_depth=256 (the equivalent my.cnf change is sketched at the bottom of this comment) and will keep rechecking the DNM nova change [3] to see if it has any effect on the failures.

[1] https://stackoverflow.com/questions/41134785/how-to-solve-mysql-warning-innodb-page-cleaner-1000ms-intended-loop-took-xxx
[2] https://review.opendev.org/712805
[3] https://review.opendev.org/701478
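For reference, I don't have the exact diff of [2] in front of me, but the tuning it applies amounts to something like the following my.cnf override (the file path and [mysqld] section placement are just illustrative of how devstack might drop it in; innodb_lru_scan_depth itself is a real InnoDB variable whose default is 1024):

# e.g. a drop-in config file read by mysqld (exact path depends on how devstack configures mysql)
[mysqld]
# Make the page cleaner flush smaller batches per 1-second loop (default 1024)
innodb_lru_scan_depth = 256

The same variable is dynamic, so on an already-running node it can also be changed without a restart with "SET GLOBAL innodb_lru_scan_depth = 256;" and checked with "SHOW GLOBAL VARIABLES LIKE 'innodb_lru_scan_depth';", which is handy for poking at a held node.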