Comment 5 for bug 1709889

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-08-15 09:26 EDT-------
Here is what I am observing, and why I think "cfq" may not (yet) be a good choice for the default I/O scheduler.

The test exerciser, HTX (https://github.com/open-power/HTX - POWER arch only), puts CFQ under stress during certain cycles. I set the debug timeout threshold for I/O completion to 60 seconds (on timeout, debugging is printed, then io_schedule() is called, after which more debugging is printed). I/O delays seem to vary continuously throughout the range, but using a timeout lower than 60 seconds just produced too much output.

During certain cycles, where about 1 million I/Os per hour are being performed on each disk, we see timeouts being triggered. Essentially, the timeout happens because CFQ has not even submitted the I/O to SCSI yet. Earlier debugging showed that once the I/O actually gets submitted to SCSI, it completes promptly.

Without patch 5be6b75610ce, these I/Os could (sometimes) take an hour or more to get submitted to SCSI. With the patch, that delay seems to max out at around 110 seconds, which is a great improvement but still indicates a problem.

I typically see about 400-500 I/Os trip the 60-second timeout during a given ~2 hour cycle (estimated 4 million I/Os total), so it is not a huge percentage. However, the I/Os affected seem to be related, possibly by process or thread, and so this could be detrimental to an application. Note, the number of I/Os taking between 30 and 60 seconds is not known, but is expected to be much higher. Even 30 seconds may be an undesirable number. It's not clear just how CFQ chooses the delay value and what overrides it.
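
To put "not a huge percentage" into numbers, here is a small sketch of the arithmetic using only the figures quoted above (400-500 timeouts out of an estimated 4 million I/Os). The function name is just for illustration, and the total is the estimate from this comment, not a measured count.

```python
def timeout_fraction(timed_out: int, total: int) -> float:
    """Return the fraction of I/Os that tripped the timeout."""
    if total <= 0:
        raise ValueError("total must be positive")
    return timed_out / total


# Estimated total I/Os in a ~2 hour cycle, as quoted in the comment.
ESTIMATED_TOTAL_IOS = 4_000_000

# Low and high ends of the observed range of 60-second timeouts.
for tripped in (400, 500):
    pct = 100 * timeout_fraction(tripped, ESTIMATED_TOTAL_IOS)
    print(f"{tripped} of {ESTIMATED_TOTAL_IOS}: {pct:.4f}%")
```

That works out to roughly 0.01% of I/Os, which is why the concern here is the clustering of the affected I/Os rather than the raw rate.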

On a run with the scheduler set to "deadline", I never see any I/Os trip the 60-second timeout.
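
For anyone repeating the comparison, a minimal sketch of how one might confirm which scheduler is active on a disk before a run. It assumes the usual sysfs layout, where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. The device name "sda" in the comment is only a placeholder, and the helper names are made up for this example.

```python
from pathlib import Path


def parse_active_scheduler(text: str) -> str:
    """Extract the bracketed (active) scheduler from sysfs-style text.

    Example input: "noop deadline [cfq]"
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler found in: {text!r}")


def active_scheduler(device: str) -> str:
    """Read the active I/O scheduler for a block device (Linux only)."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return parse_active_scheduler(path.read_text())


# Parsing a sample string, so this runs without touching sysfs.
print(parse_active_scheduler("noop deadline [cfq]"))

# On a real system (placeholder device name):
#   print(active_scheduler("sda"))
```

Writing a scheduler name such as "deadline" to the same sysfs file as root switches the scheduler for that device at runtime.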

I think this shows undesirable behavior in CFQ, possibly a bug, and that it should not be the default scheduler - especially for servers. Is there some evidence that shows CFQ to be better than deadline in general?