FileReader is not thread safe on Linux

Bug #1909861 reported by maksis on 2021-01-02
This bug affects 1 person
Affects: DC++ · Importance: Undecided · Assigned to: Unassigned

Bug Description

FileReader::readMapped currently modifies the global SIGBUS handler in order to catch read errors:

https://sourceforge.net/p/dcplusplus/code/ci/default/tree/dcpp/FileReader.cpp#l289

Since the function can be called concurrently from different threads (currently hashing/queue recheck/SFV check in DC++) and each of them sets and resets the SIGBUS handler, there's a high risk that the application will crash on read errors, as they aren't being handled properly.

More information about the caveats: https://www.sublimetext.com/blog/articles/use-mmap-with-care

These issues are much more likely to happen with AirDC++ as it uses multiple threads for hashing. Read errors caused rather nasty crashes with corrupted stack traces for one user, but luckily he was able to catch the SIGBUS signal with gdb.

I didn't even spend time trying to figure out how to make the mapped reads work properly, as based on my testing, the basic FileReader::readCached function is noticeably faster:

readMapped: 671 files (21.70 GiB) in 9 directories have been hashed in 4 minutes 21 seconds (84.87 MiB/s)
readCached: 671 files (21.70 GiB) in 9 directories have been hashed in 3 minutes 58 seconds (93.08 MiB/s)

FileReader::readMapped is now disabled in AirDC++, as I can't see any benefits from using it. The included setjmp.h header is even causing issues when using clang for compiling on Linux: https://bugs.gentoo.org/731676

cologic (cologic) wrote :

It looks like in DC++ on Windows, the only supported DC++ platform, FileReader::readMapped() has already always fallen through to FileReader::readCached():

  ret = readMapped(file, callback);

  if(ret == READ_FAILED) {
   dcdebug("Reading [full] %s\n", file.c_str());
   ret = readCached(file, callback);
  }

...

#ifdef _WIN32

...

size_t FileReader::readMapped(const string& file, const DataCallback& callback) {
 /** @todo mapped reads can fail on Windows by throwing an exception that may only be caught by
 SEH. MinGW doesn't have that, thus making this method of reading prone to unrecoverable
 failures. disabling this for now should be fine as DC++ always tries overlapped reads first
 (at the moment this file reader is only used in places where overlapped reads make the most
 sense).
 more info:
 <https://msdn.microsoft.com/en-us/library/aa366801(VS.85).aspx>
 <https://stackoverflow.com/q/7244645> */
#if 1
 return READ_FAILED;
#else
...
#endif
}

#else

...

static sigjmp_buf sb_env;

static void sigbus_handler(int signum, siginfo_t* info, void* context) {
 // Jump back to the readMapped which will return error. Apparently truncating
 // a file in Solaris sets si_code to BUS_OBJERR
 if (signum == SIGBUS && (info->si_code == BUS_ADRERR || info->si_code == BUS_OBJERR))
  siglongjmp(sb_env, 1);
}

size_t FileReader::readMapped(const string& filename, const DataCallback& callback) {
  ...
}

#endif

Had this also been the case in AirDC++ before you disabled FileReader::readMapped() entirely, or were you only seeing reports of it from non-Windows systems, such as the Gentoo bug you linked?

Either way, it seems like a safe, reasonable, risk-averse change: since DC++ only targets Windows platforms currently, at best there's a bunch of dead code in one platform-dependent preprocessor branch and some very likely subtly incorrect code in the other. I'm inclined to remove FileReader::readMapped(), exactly as you suggest.

maksis (maksis) wrote :

Yeah this crash only happened for people using the Linux version of AirDC++. FileReader::readMapped has always been disabled on Windows.

I'm not sure if mapped reading even makes sense in cases where the file is only being read sequentially through once (maybe things were different back in the days when the code was written?).

cologic (cologic) wrote :

This patch is relatively aggressive in removing FileReader::readMapped() and all associated infrastructure entirely. I'm not yet certain that's best, but in the absence of enough ongoing development to support something evidently flawed, it seems the prudent course.

Since non-Windows doesn't support readDirect(), which avoids polluting the disk cache, this change means hashing will probably evict disk cache on Linux and other platforms. I'm not sure the benchmarks you've shown address that side effect. The concern isn't slowing down AirDC++ or DC++, it's slowing down everything else, as they have to repopulate disk cache.

To ameliorate this on Linux, one can use O_DIRECT (https://linux.die.net/man/2/open says "Since Linux 2.4.10", which seems sufficient). FreeBSD 4.x, released in 2000, introduced this flag, which means DragonFly BSD would have inherited it. https://man.netbsd.org/NetBSD-9.0/open.2 documents that NetBSD supports it (with an alignment restriction).

https://stackoverflow.com/questions/55281353/use-of-undeclared-identifier-o-direct suggests that this is trickier on macOS. Apparently there's an F_NOCACHE flag which accomplishes a similar goal.

https://man.openbsd.org/amd64/open shows OpenBSD not supporting O_DIRECT. It's not clear to me whether it supports any alternate mechanism.

So for all the typical POSIXy systems except OpenBSD, the safer approach looks like removing readMapped() and, in its place, using those per-OS direct I/O hints. The attached patch doesn't do the latter part, but only the former.

cologic (cologic) wrote :

As far as

"I'm not sure if mapped reading even makes sense in cases where the file is only being read sequentially through once (maybe things were different back in the days when the code was written?)."

One possible explanation is that all of those mechanisms appeared no earlier than 2000 or 2001, and would have taken a while to spread enough to rely on in the early years of DC++, while mmap() was available since the earliest widely used versions of Linux.

maksis (maksis) wrote :

AirDC++ actually has some kind of support for opening files O_DIRECT: https://github.com/airdcpp/airdcpp-windows/blob/b863d8626d95d0ee483572a5139f8f569b558c3f/airdcpp/airdcpp/File.cpp#L380-L394 (BUFFER_NONE isn't currently being used anywhere though when opening files)

FileReader::readCached currently uses the fadvise code path with POSIX_FADV_SEQUENTIAL.

I compiled AirDC++ with a modified FileReader that opens the file with BUFFER_NONE (O_DIRECT) and did a quick benchmark:

O_DIRECT: Hashing finished: 671 files (21.70 GiB) in 9 directories have been hashed in 4 minutes 10 seconds (88.62 MiB/s)
Normal caching (POSIX_FADV_SEQUENTIAL): Hashing finished: 671 files (21.70 GiB) in 9 directories have been hashed in 4 minutes 1 second (92.07 MiB/s)

Test builds for Linux: https://web-builds.airdcpp.net/develop/

So O_DIRECT is slower, but not by much. I still wonder whether the difference would be bigger with different disk setups (e.g. with network disks).

maksis (maksis) wrote :

FileReader::readDirect implementation with O_DIRECT/F_NOCACHE: https://github.com/airdcpp/airdcpp-windows/commit/eb1e075d5a69862c4f2255f9ea2f204bb70921ba

However, I don't think that the main point of the low-level FileReader::readDirect implementation for Windows is skipping the OS buffers (couldn't you use the regular File class with FILE_FLAG_NO_BUFFERING instead?) but rather the use of OVERLAPPED (async I/O). The closest equivalent on Linux would probably be AIO (https://man7.org/linux/man-pages/man7/aio.7.html).

So possibly the FileReader methods should rather be called readAsync (overlapped/AIO) and readUnbuffered (FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE). Is buffered reading even needed there? When would that work if the unbuffered read fails?

cologic (cologic) wrote :

It appears that pre-uring AIO is being supplanted by io_uring (https://unixism.net/2020/04/io-uring-by-example-part-1-introduction/ and https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/). AIO has had persistent critiques since it was introduced (e.g., https://lwn.net/Articles/724198/), and as best I can tell it was tolerated in the absence of anything better until Linux 5.1, which added io_uring.

Asynchronous Linux file I/O is in an unfortunate limbo at the moment, where the ubiquitously supported approach is being displaced quickly enough it's unclear whether it's worth targeting specifically anymore. The usual approach appears to be to use libuv or libevent to abstract over these interfaces, both across Linux kernel versions and operating systems more broadly.

For the BSDs (Open/Free/Net/Dragonfly) and macOS, there's kqueue() with EVFILT_READ, which has been around long enough that it can simply be required as a condition of supporting those OSes. Except for macOS, these are probably quite niche even within the DC userbase, but since kqueue() appears to operate similarly across all of them, a single approach should work.

Therefore, to the extent that readMapped() was meant partly to introduce asynchrony even when readDirect() didn't work, the in-principle-better approach would still seem to be to drop readMapped() and implement readDirect() for non-Windows OSes using io_uring and/or kqueue(), potentially through existing libraries such as libevent or libuv if appropriate.

Strictly in terms of addressing this bug that you've reported for DC++, I probably wouldn't wait for such steps. Given that DC++ does not officially support any non-Windows platform, it seems fine to use, in your terminology, readAsync with a readUnbuffered fallback, without any readMapped().

cologic (cologic) wrote :

Apropos "So possibly the FileReader methods should rather be called readAsync (overlapped/AIO) and readUnbuffered (FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE). Is buffered reading even needed there? When would that work if the unbuffered read fails?":

An unbuffered read might depend on buffer alignment, e.g., in NetBSD:
"To meet the alignment requirements for direct I/O, file offset, the length of the I/O and the address of the buffer in memory must all be multiples of DEV_BSIZE (512 bytes)." A buffered read doesn't have that requirement, so might succeed where an unbuffered read fails. There'd still be some value in a buffered fallback, then.

A file reading framework for hashing or uploading might ideally:
- be asynchronous, e.g., OVERLAPPED on Windows, io_uring on Linux, and kqueue() on BSDs/macOS;
- and avoid pointlessly filling disk cache with file contents that will likely not be read again within a relevant timeframe, while still allowing the OS to pre-fetch/readahead cache for blocking calls:
  (1) FILE_FLAG_SEQUENTIAL_SCAN for non-OVERLAPPED, because readahead caching's valuable for blocking operations, and FILE_FLAG_NO_BUFFERING for OVERLAPPED on Windows, per https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew, which states "When FILE_FLAG_NO_BUFFERING is combined with FILE_FLAG_OVERLAPPED, the flags give maximum asynchronous performance, because the I/O does not rely on the synchronous operations of the memory manager. However, some I/O operations take more time, because data is not being held in the cache."
  (2a) POSIX_FADV_SEQUENTIAL potentially combined with POSIX_FADV_DONTNEED for already-read files/sections of file and O_DIRECT/F_NOCACHE when using io_uring/aio/kqueue for similar reasons as FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED work together in Windows; or
  (2b) with a different speed/memory usage tradeoff than (2a), using FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE regardless of whether the reads are asynchronous. This is slightly slower in your benchmarks, likely being hurt by disabling readahead/pre-fetch. Still, it's possible that FILE_FLAG_SEQUENTIAL_SCAN clobbers the OS disk caches when one pushes 20GB of hashed files through it.

If it's fast enough, and certainly in async cases, FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE would probably on balance be better, but it does create buffer alignment requirements on various platforms, which might require more testing and/or a buffered fallback. As you suggest, I'd be curious too about how well FILE_FLAG_NO_BUFFERING/O_DIRECT/F_NOCACHE works with high-latency devices, such as HDDs and slower networks; fast networks with SSDs on the other end should have lower end-to-end latency than typical HDDs (~10ms).

https://github.com/facebook/rocksdb/issues/1032#issuecomment-196874121 suggests that rocksdb does (2a), going through the file sequentially and then explicitly dropping cache behind itself, or did in 2016 at least. I couldn't verify from their current code that they still do that reading, though they definitely do it when writing files they're not going to read back: they write the file, then call with POSIX_FADV_DONTNEED afterwards.

As...


maksis (maksis) wrote :

I ran some additional tests with different read buffer sizes on different platforms. Cached hashing speed on macOS is CPU-bound and the same most likely applies to Windows results as well (both disks are able to read 1000 MB/s+). 1024 KB buffer size looks quite good based on the tests.

macOS 11 (MacBook Pro, SSD)

256 KB:

Cached: 2 files (20.76 GiB) have been hashed in 59 seconds (357.97 MiB/s)
F_NOCACHE: 2 files (20.76 GiB) have been hashed in 1 minute 54 seconds (186.08 MiB/s)

1024 KB:

Cached: 2 files (20.76 GiB) have been hashed in 1 minute 0 second (354.01 MiB/s)
F_NOCACHE: 2 files (20.76 GiB) have been hashed in 1 minute 17 seconds (275.21 MiB/s)

4096 KB:

Cached: 2 files (20.76 GiB) have been hashed in 59 seconds (355.90 MiB/s)
F_NOCACHE: 2 files (20.76 GiB) have been hashed in 1 minute 6 seconds (320.85 MiB/s)

Ubuntu 20.04 (Celeron N3150, internal HDD)

256 KB:

Cached: 671 files (21.70 GiB) in 9 directories have been hashed in 3 minutes 58 seconds (93.08 MiB/s)
O_DIRECT: 671 files (21.70 GiB) in 9 directories have been hashed in 4 minutes 10 seconds (88.62 MiB/s)

1024 KB:

POSIX_FADV_DONTNEED: 671 files (21.70 GiB) in 9 directories have been hashed in 3 minutes 55 seconds (94.55 MiB/s)
O_DIRECT: 671 files (21.70 GiB) in 9 directories have been hashed in 4 minutes 1 second (91.95 MiB/s)

4096 KB:

POSIX_FADV_DONTNEED: Hashing finished: 671 files (21.70 GiB) in 9 directories have been hashed in 4 minutes 58 seconds (74.42 MiB/s)
O_DIRECT: 671 files (21.70 GiB) in 9 directories have been hashed in 4 minutes 55 seconds (75.24 MiB/s)

Windows 10 (AMD 2700X, SSD)

OVERLAPPED, 256 KB: 191 files (21,36 GiB) have been hashed in 39 seconds (551,41 MiB/s)
OVERLAPPED, 1024 KB: 191 files (21,36 GiB) have been hashed in 37 seconds (577,62 MiB/s)
OVERLAPPED, 4096 KB: 191 files (21,36 GiB) have been hashed in 37 seconds (581,33 MiB/s)

cologic (cologic) wrote :

https://sourceforge.net/p/dcplusplus/code/ci/a484cbace55f671548b370135ac9fa42d5ddfc02/ implements this. As discussed in this bug/issue, it's mostly a no-op for supported Win32-based DC++ platforms, but hopefully the buffer size increase helps hashing speeds in general in a way consistent with your benchmarks for OVERLAPPED. There's a plausible mechanism, certainly.

Do you have any idea why in your Ubuntu 20.04 setup, 4MB buffers performed notably worse than 1MB buffers? It didn't seem to matter, as regards that, whether you used POSIX_FADV_DONTNEED or O_DIRECT, so presumably it's more a function of the buffer size itself.

The 256 KB vs 1 MB/4 MB benchmarks for macOS 11 are fairly dramatic. The cached/F_NOCACHE distinction, at least, comes across quite clearly in them. With 1 MB, F_NOCACHE is at least a reasonable thing to do, even if it's not optimal for AirDC++'s hashing speed itself.

The other use case that F_NOCACHE, O_DIRECT, POSIX_FADV_DONTNEED, etc might hinder is multiple simultaneous uploads of the same file, if AirDC++ uses the same FileReader member functions as for hashing. I'm not sure how likely/common/worth optimizing for that scenario is, but there's something to be said for just letting the OS find a decent compromise, rather than attempt to address all these edge cases in AirDC++ or DC++.

For example, https://www.realworldtech.com/forum/?threadid=197081&curpostid=197969 from Linus Torvalds, discussing POSIX_FADV_DONTNEED, notes that "Linux should already handle the case of "truly only touched once" case fairly well. Those pages should never end up on the active list etc, and should be cheap and easy to re-use. IOW, it's not typically one of the hard cases for the VM." It's possible that it just doesn't need the hints.

If one does want to use POSIX_FADV_DONTNEED, it works on Linux, FreeBSD, and NetBSD. It compiles, but is a no-op, on DragonFlyBSD: https://github.com/DragonFlyBSD/DragonFlyBSD/blob/5a4a4dbaed373c7df0dbbd0638ce487111b28c4e/sys/sys/fcntl.h#L314-L324. OpenBSD doesn't support it at all, that I can see: https://github.com/openbsd/src/search?q=POSIX_FADV_DONTNEED. The moral equivalent in practical usage that I've found on macOS is F_NOCACHE, despite not being semantically identical.

So for DC++ purposes, for this bug, given that DC++ code isn't set up to be easily tested with non-Win32 environments (which can include WINE), and there'd need to be some additional per-OS build testing to do anything fancier on the hinting side for POSIXy systems, just dropping FileReader::readMapped() seemed the most prudent.

Changed in dcplusplus:
status: New → Fix Committed
cologic (cologic) wrote :

https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37 profiles mmap()-based file reading versus what looks like what DC++ calls readCached().

mmap() is significantly faster in those experiments, so while my response here was just to remove the associated code, given DC++'s target platforms and lack of testing on Linux, there's some potential argument for getting mmap(), with all the multithreaded signaling issues, to work in general. The article doesn't compare O_DIRECT/FILE_FLAG_NO_BUFFERING, which should keep copy_user_enhanced_fast_string, or its Windows equivalent, from consuming most of the file-reading time.

So that's another dimension one might measure: CPU usage for reading during hashing. It might not matter enough to decide things, since the TTH probably overwhelms the file read per se, but it's an mmap() use case.

maksis (maksis) wrote :

Looks like the benchmarks that you linked use 4-16 KB block sizes for reading, which is far from optimal when doing regular sequential file reads for larger files.

"Do you have any idea why in your Ubuntu 20.04 setup, 4MB buffers performed notably worse than 1MB buffers?"

I'm not really sure, but the hardware on that system is quite low-end compared to the others. The Celeron N3150 has a 2 MB L2 cache while the Core i5-1038NG7 in the MacBook Pro has a 6 MB cache (and the Ryzen 2700X has even larger caches). Can that make a significant difference when calculating TTHs (or with the file reads)?
