Apropos "So possibly the FileReader methods should rather be called readAsync (overlapped/AIO) and readUnbuffered (FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE). Is buffered reading even needed there? When would that work if the unbuffered read fails?": An unbuffered read might depend on buffer alignment, e.g., in NetBSD: "To meet the alignment requirements for direct I/O, file offset, the length of the I/O and the address of the buffer in memory must all be multiples of DEV_BSIZE (512 bytes)." A buffered read doesn't have that requirement, so might succeed where an unbuffered read fails. There'd still be some value in a buffered fallback, then. A file reading framework for hashing or uploading might ideally: - be asynchronous, e.g., OVERLAPPED on Windows, io_uring on Linux, and kqueue() on BSDs/macOS; - and avoid pointlessly filling disk cache with file contents that will likely not be read again within a relevant timeframe, while still allowing the OS to pre-fetch/readahead cache for blocking calls: (1) FILE_FLAG_SEQUENTIAL_SCAN for non-OVERLAPPED, because readahead caching's valuable for blocking operations, and FILE_FLAG_NO_BUFFERING for OVERLAPPED on Windows, per https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew, which states "When FILE_FLAG_NO_BUFFERING is combined with FILE_FLAG_OVERLAPPED, the flags give maximum asynchronous performance, because the I/O does not rely on the synchronous operations of the memory manager. However, some I/O operations take more time, because data is not being held in the cache." (2a) POSIX_FADV_SEQUENTIAL potentially combined with POSIX_FADV_DONTNEED for already-read files/sections of file and O_DIRECT/F_NOCACHE when using io_uring/aio/kqueue for similar reasons as FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED work together in Windows; or (2b) with a different speed/memory usage tradeoff than (2a), using FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE regardless of whether the reads are asynchronous. This is slightly slower in your benchmarks, likely being hurt by disabling readahead/pre-fetch. Still, it's possible that FILE_FLAG_SEQUENTIAL_SCAN clobbers the OS disk caches when one pushes 20GB of hashed files through it. If it's fast enough, and explicitly in async cases, FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE might probably on balance be better, but does create buffer alignment requirements on various platforms, which might require more testing, and/or a buffered fallback. As you suggest, I'd be curious too about how well FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE works with high-latency devices, such as HDDs and slower networks; fast networks with SSDs on the other end should have lower end-to-end latency than typical HDDs (~10ms). https://github.com/facebook/rocksdb/issues/1032#issuecomment-196874121 suggests that rocksdb does (2a), going through the file sequentially and then explicitly dropping cache behind itself, or did in 2016 at least. I couldn't verify from their current code that they still do that reading, though they definitely do it when writing files they're not going to read back: they write the file, then call with POSIX_FADV_DONTNEED afterwards. Asynchronous file I/O is useful here partly because it allows for avoiding file caching with fewer other tradeoffs, especially in AirDC++, where one might have multiple hashing threads. On Windows, DC++, at least, has long used: size_t FileReader::readDirect(const string& file, const DataCallback& callback) { ... 
On Windows, DC++, at least, has long captured the asynchronous case already:

    size_t FileReader::readDirect(const string& file, const DataCallback& callback) {
        ...
        auto tmp = ::CreateFile(tfile.c_str(), GENERIC_READ, FILE_SHARE_READ, nullptr,
                                OPEN_EXISTING,
                                FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED | FILE_FLAG_POSIX_SEMANTICS,
                                nullptr);

As far as I understand, neither DC++ nor AirDC++ reads files asynchronously on any non-Windows platform, so neither captures the better case (1) there. That's likely out of scope for near-term DC++ development, but since AirDC++ does support non-Windows platforms, it might prove worthwhile there.

So, it looks reasonable to have DC++ and AirDC++ prefer (1)-style behavior on each supported platform, and to use one or both of (2a) and (2b), depending on the speed/fragility hit from (2a). If buffers can reliably be aligned and sized to match the direct-I/O requirements, and that doesn't cause too much trouble with higher-latency I/O devices, it's probably the more elegant option overall; a sketch of such an aligned open with a buffered fallback follows.
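On the alignment/fallback point, a minimal sketch, assuming the 512-byte DEV_BSIZE figure quoted from NetBSD above; the constants and names are illustrative, and nothing here is existing DC++/AirDC++ code:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #include <cstddef>

    constexpr size_t kDirectAlign = 512;  // DEV_BSIZE on NetBSD; varies by platform
    constexpr size_t kBufSize = 1 << 20;  // must stay a multiple of kDirectAlign

    struct ReadHandle {
        int fd = -1;
        bool direct = false;  // whether O_DIRECT actually took effect
        void* buf = nullptr;  // aligned for direct I/O either way
    };

    bool openForReading(const char* path, ReadHandle& h) {
    #ifdef O_DIRECT
        h.fd = ::open(path, O_RDONLY | O_DIRECT);
        h.direct = (h.fd != -1);
    #endif
        if (h.fd == -1) {
            // Buffered fallback: succeeds where direct I/O isn't available
            // (unsupported filesystem, platform, or alignment constraints).
            h.fd = ::open(path, O_RDONLY);
        }
        if (h.fd == -1) return false;

        // File offset, I/O length, and buffer address must all be aligned
        // for direct I/O; posix_memalign covers the address requirement.
        if (::posix_memalign(&h.buf, kDirectAlign, kBufSize) != 0) {
            ::close(h.fd);
            h.fd = -1;
            return false;
        }
        return true;
    }

On macOS, where O_DIRECT doesn't exist, the equivalent would be a plain open() followed by fcntl(fd, F_NOCACHE, 1); and if a direct read later fails with EINVAL, reopening without O_DIRECT gives the buffered fallback discussed above.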