Comment 9 for bug 2062568

Jeff (jeff09) wrote :

The tricky part for me is that the client was changing regularly, so I can't confidently say when the errors started appearing. It's just very suspicious that the problem needs high load (a new host and higher network bandwidth made it more frequent) and uploading to the server. Pure downloading doesn't seem to be a problem, even when cached data is being sent at full bandwidth for minutes.

Moved the server to 24.04, but I've also moved some I/O heavy tasks to it so there would be less need to upload. The client is still on 23.10, and I'm holding back on upgrading it for a few more weeks.

Can't say a whole lot about the current situation, as I'm not uploading much anymore to avoid the issue. I did run into a hang a few days ago but didn't have time to debug it. The server didn't want to restart gracefully, so I ended up hard rebooting.
I believe it was the first time since moving the I/O heavy tasks: I wanted to upload a few hundred GiB of data back to the server, which had been downloaded from there a while ago without problems. Light I/O otherwise doesn't seem to run into this problem. The occasional backup to the server is fine, but that rarely saturates the network and likely fits completely into the page cache almost every time.

A few hopefully helpful points for reproducing the problem:
- As mentioned multiple times, downloading alone seems to be unaffected; uploading is what should be stressed. I suspect that either there's no need to download at the same time, or casual filesystem browsing is a good enough load.
- A fast client with high bandwidth is key. Ran into this issue a couple of times with an older host on 1 Gb/s, but a new, fast host with 2.5 Gb/s made it appear significantly more frequently.
- It likely doesn't matter how the link gets saturated, but I either processed files cached on the server (mixed R/W) or uploaded cached files (a fast SSD should be fine too), meaning that the bottleneck was always the network, at least while the caches were large enough.
- Files were large, so there was no pausing for metadata handling as would happen with small files, and the page cache was often exhausted. The target was a single HDD the majority of the time, which often meant that writes started blocking (a 100-ish MiB/s HDD trying to keep up with close to 250 MiB/s of incoming data). This occasionally made the hosts freeze, as the kernel's background I/O handling is still bad; we just pretend the issue is gone because SSDs are fast enough not to run into it. The freezes while the page cache drains may be good at exposing race conditions.
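For anyone trying to reproduce this, the upload load described in the points above could be approximated with something like the following. It is only a rough sketch, not the workload I actually ran: the function name `stress_write`, the output file name and the chunk size are made up for illustration, and the target directory is assumed to be on the NFS mount. It writes one large file of random (incompressible) data in big sequential chunks, so the network and the server's disk are the limiting factors rather than metadata handling.

```python
import os
import time


def stress_write(target_dir, total_bytes, chunk_bytes=4 * 1024 * 1024):
    """Write total_bytes into one file under target_dir.

    Returns (bytes_written, throughput_in_MiB_per_s).
    target_dir is assumed to be a directory on the NFS mount (hypothetical).
    """
    path = os.path.join(target_dir, "nfs_upload_stress.bin")
    # One random chunk, reused for every write, so generating data is cheap
    # on the client while each chunk stays incompressible on its own.
    chunk = os.urandom(chunk_bytes)
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        # Force the data out of the client's page cache before timing stops.
        os.fsync(f.fileno())
    elapsed = max(time.monotonic() - start, 1e-9)
    return written, written / elapsed / (1024 * 1024)
```

Calling it with a size well above the server's RAM, for example `stress_write("/mnt/nfs", 200 * 1024**3)` where `/mnt/nfs` is a placeholder path, should keep the link saturated long enough for the server's page cache to fill and writes to start blocking.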

It may be more efficient to start by looking for what's causing the "RPC: Could not send backchannel reply error: -110" log spam, which might be related. The lockup may take significant time to catch, while that kernel message showed up quite frequently.
Even now I have plenty of those lines without experiencing issues and without uploading much, mostly just downloading large files.
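To see whether the message frequency follows the load, the lines could be counted per hour with a small script like the one below. This is a sketch under assumptions: it expects each log line to start with an ISO 8601 timestamp followed by a space (as I'd expect from something like `journalctl -k -o short-iso`), and the function name `count_per_hour` and the sample lines are invented for illustration. For what it's worth, errno 110 on Linux is ETIMEDOUT, so the -110 suggests the backchannel reply timed out.

```python
from collections import Counter

NEEDLE = "RPC: Could not send backchannel reply error: -110"


def count_per_hour(lines):
    """Count matching kernel messages, grouped by the hour they were logged.

    Assumes each line begins with a timestamp like 2024-05-01T12:01:10+0200.
    """
    counts = Counter()
    for line in lines:
        if NEEDLE in line:
            stamp = line.split(" ", 1)[0]
            # The first 13 characters are the date plus the hour.
            counts[stamp[:13]] += 1
    return counts


# Made-up sample lines, only to show the expected input shape.
sample = [
    "2024-05-01T12:01:10+0200 server kernel: " + NEEDLE,
    "2024-05-01T12:40:02+0200 server kernel: " + NEEDLE,
    "2024-05-01T13:05:00+0200 server kernel: some other message",
    "2024-05-01T13:59:59+0200 server kernel: " + NEEDLE,
]
per_hour = count_per_hour(sample)
for hour in sorted(per_hour):
    print(hour, per_hour[hour])
```

With real logs, passing an open file or `sys.stdin` instead of `sample` works the same way, since the function only iterates over lines.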

Some extra info which may or may not matter:
- The server hardware is quite weak, with an old 4-core Broadwell CPU, possibly helping to expose race conditions
- All file systems are Btrfs with noatime,discard=async,compress-force=zstd, the latter surely adding more load
- LUKS is used everywhere, also adding some extra load
- There's a Btrfs (on LUKS) image mounted over NFS (with not a whole lot of usage though)