Developed a theory through code analysis. It's a race condition in accessing the page I/O fix state, which results in two threads attempting to remove the same block from the flush list.
One thread is performing an LRU or flush list flush. Let's say it is preempted in buf_flush_page() for a certain page, between the ut_ad(buf_flush_ready_for_flush(bpage, flush_type)) and buf_page_set_io_fix(bpage, BUF_IO_WRITE) calls. This thread holds the buffer pool and block mutexes, but holds neither the LRU nor the flush list mutex.
Another thread is performing a DROP or TRUNCATE for a tablespace and arrives at buf_flush_or_remove_page() for the same page. It holds the LRU and flush list mutexes, but holds neither the buffer pool nor the block mutex. It reads the block's I/O fix state as BUF_IO_NONE and proceeds to remove the block from the flush list.
Meanwhile the first thread proceeds with flushing the same block, so the flush I/O completion routine is unable to remove the flushed block from the flush list: the second thread has already removed it.
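The interleaving above can be replayed deterministically in a small single-threaded model. This is only a sketch: the struct, enum, and function names below are hypothetical stand-ins, not the InnoDB definitions, and the mutexes are represented only by comments noting which fields they would protect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model; these are NOT the InnoDB types. */
enum io_fix { IO_NONE, IO_WRITE };

struct page_model {
    enum io_fix io_fix;        /* assumed to be protected by the block mutex */
    bool        in_flush_list; /* assumed to be protected by the flush list mutex */
};

/* Flushing thread, step 1: the readiness check (models the ut_ad() check). */
static bool flusher_ready_check(const struct page_model *p)
{
    return p->in_flush_list && p->io_fix == IO_NONE;
}

/* Flushing thread, step 2: mark the page as being written. */
static void flusher_set_io_fix(struct page_model *p)
{
    p->io_fix = IO_WRITE;
}

/* DROP/TRUNCATE thread: reads io_fix without the block mutex (a dirty
 * read), then removes the page from the flush list if it looks idle. */
static bool dropper_remove_if_idle(struct page_model *p)
{
    if (p->io_fix != IO_NONE) {
        return false;
    }
    p->in_flush_list = false;
    return true;
}

/* Write completion: reports whether the page was still on the flush list. */
static bool flusher_complete_write(struct page_model *p)
{
    bool was_listed = p->in_flush_list;
    p->in_flush_list = false;
    p->io_fix = IO_NONE;
    return was_listed;
}

/* Replays the problematic ordering. Returns true if the completion step
 * found the page on the flush list, false if it had already been removed. */
static bool replay_race(void)
{
    struct page_model p = { IO_NONE, true };

    assert(flusher_ready_check(&p)); /* flusher: check passes             */
    dropper_remove_if_idle(&p);      /* dropper: runs inside the window   */
    flusher_set_io_fix(&p);          /* flusher: resumes, sets I/O fix    */
    return flusher_complete_write(&p);
}
```

In this model replay_race() returns false, which corresponds to the completion routine not finding the block on the flush list. If the dropper's read happened after flusher_set_io_fix(), dropper_remove_if_idle() would refuse to remove the page.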
The fix would be to avoid dirty (unprotected) reads of the I/O fix state in buf_flush_or_remove_pages(). It's a recent regression introduced by merging the upstream DROP TABLE performance improvements (5.5.27 or so).
Will review the rest of the stack traces in this and other bugs and will make a tentative fix for testing.