5% reservation for root is inappropriate for large disks/arrays

Bug #1340448 reported by James Troup
This bug affects 2 people
Affects: e2fsprogs (Ubuntu) | Status: Confirmed | Importance: Undecided | Assigned to: Unassigned

Bug Description

mke2fs (and its ext3 and ext4 analogs) still defaults to reserving 5%
of the filesystem for root. With the size of modern disks and arrays
this isn't a terribly sensible default; e.g. if I have a 10 TB array,
mke2fs will reserve 500 GB for root.

Obviously this is both tunable at FS creation time and fixable after
the fact, but I still think we should try to improve the defaults.

Given the size of modern disks, I think it'd make sense to either a)
reserve a smaller percentage (e.g. 1%), or b) reserve n% if the
filesystem is << NNN GB and otherwise reserve NN GB.
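
Purely as an illustration of option b), here is a minimal Python sketch of a percentage-based reserve capped at an absolute size. The 20 GB cap, the 4 KiB block size and the function name are hypothetical choices for the example, not values proposed in this report; the knobs that exist today are mke2fs's -m option at creation time and tune2fs's -m (percentage) or -r (reserved block count) options afterwards.

    def reserved_blocks(fs_size_bytes, block_size=4096,
                        reserve_pct=5.0, cap_bytes=20 * 2**30):
        """Reserve reserve_pct of the filesystem, capped at cap_bytes.

        5.0% matches the current mke2fs default; the 20 GB cap is a purely
        illustrative value, not something proposed in this bug report.
        """
        by_percent = fs_size_bytes * reserve_pct / 100
        return int(min(by_percent, cap_bytes) // block_size)

    # Example: a 10 TB array.  Today the equivalent effect is achieved by
    # hand with `mke2fs -m <pct>` at creation time, or `tune2fs -m <pct>` /
    # `tune2fs -r <blocks>` after the fact.
    print(reserved_blocks(10 * 10**12))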

Launchpad Janitor (janitor) wrote:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in e2fsprogs (Ubuntu):
status: New → Confirmed
Theodore Ts'o (tytso) wrote:

If you try to use more than 95% of the storage, performance will generally suffer -- badly. Now, you may not care for certain use cases; if you are doing backups, you might not worry that much about performance, and you might care a lot more about using the last few bytes of the disk. But changing this default is not something I plan to do upstream.

In addition, for the root file system you really do want to leave the default at 5% so that root can write to critical file systems. And since the vast majority of Ubuntu users are using a single root file system, that implies that for the vast majority of file systems created by Ubuntu, the default is in fact appropriate.

James Troup (elmo) wrote: Re: [Bug 1340448] Re: 5% reservation for root is inappropriate for large disks/arrays

"Theodore Ts'o" <email address hidden> writes:

> If you try to use more than 95% of the storage, performance will
> generally suffer -- badly.

Sorry, but why is that? And do you mean read performance, write
performance or both? And is that a factor of the type of storage
(e.g. spinning disk vs. SSD)?

> In addition, for the root file system you really do want to leave the
> default at 5% so that root can write to critical file systems.

Sure, I didn't suggest eliminating the reservation, just reducing it.
Or is there some reason that root would need at least 5% (as opposed
to 1% or 2%) of disk available to write to that I'm missing?

> And since the vast majority of Ubuntu users are using a single root
> file system, that implies that for the vast majority of file systems
> created by Ubuntu, the default is in fact appropriate.

While that may be true of Ubuntu on the client (desktop, laptop,
tablet, phone, etc.), I'm not sure it's true of server and cloud?

--
James

Theodore Ts'o (tytso) wrote:

As the file system gets more and more full, the free space gets more and more fragmented. This results in disastrous performance hits for files that are written while the file system is nearly full, and you will suffer similarly disastrous performance hits when you later try to read them.

This is of course much more noticeable on HDDs, but even on SSDs, a random write workload is always going to result in greater flash wear and lower performance than a sequential write workload --- particularly on the cheaper flash that you would expect to see in tablets and phones (i.e., eMMC flash) --- and even on some of the crappier desktop SSDs.

Now, if you have an Intel or Samsung SSD, this might not matter as much (although you will still see a performance hit) --- but having mke2fs figure out whether you have a competently implemented flash translation layer, or a spectacularly crappy one (since on phones they calculate the BOM cost down to the hundredth or thousandth of a cent, and that extra 25 cents worth of better FTL license fees and controller memory is Just Too Damn High :-), is just too much complexity to put into mke2fs's program logic. It's better to have a simple, well-defined default, and then, if you know for sure that you have a system where you don't mind highly fragmented files, adjust the root reserve.

As far as the server and the cloud: for the cloud, it won't really matter since guest OSes generally have relatively small amounts of space, and Ubuntu stopped caring about the enterprise server market a long time ago. As for the cloud host OS, there's a heck of a lot more tuning you need to do if you want a high-performing, price/performance-competitive offering, such that adjusting the root reserve is the very least of your problems....

In any case, this is not something I intend to change upstream, either in e2fsprogs or in the Debian package. If the Ubuntu release engineers want to make a change, they are of course free to do so. But I wouldn't recommend it.

Dimitri John Ledkov (xnox) wrote:

This is not the only place where a percentage-only metric is inappropriate across all sizes; another is the default swap size, which is still a simple multiple of RAM, despite extreme cases of RAM >> HDD (e.g. high-memory VMs) and very fast storage (e.g. NVMe drives or RAID arrays sped up with SSD cache layers).

That, too, can eat up a lot of disk space, and at times quite expensive disk space.

Whilst we all agree that performance does degrade when a filesystem is extremely full, I don't buy that 51 GB of free space implies degraded performance on, e.g., a 1 TB RAID1 filesystem with typical file sizes / extents. I'll poke cking to check whether he has performance-degradation results for various filesystems w.r.t. % filled at various maximum sizes.

5% is sensible and appropriate for a wide range of filesystem sizes, but tiny disks may want a larger reserve and, likewise, large disks may want a smaller one. Thus the calculated reserve shouldn't be a flat 5%, but some non-linear distribution or scaled value.

One reasonable distribution algorithm for scaling disk-space limits (in the context of Nagios warning/critical monitoring levels) that I have seen is implemented in the checkmk project: https://mathias-kettner.de/checkmk_filesystems.html#H1:The df magic number . However, the margins there increase too rapidly at small disk sizes, so further tweaking and/or different distributions would be needed.
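
Purely as a sketch of what a non-linear reserve could look like (in the spirit of the magic-number idea above, not its actual formula): a power law where the percentage shrinks as the filesystem grows, so the absolute reserve still increases, just far more slowly than linearly. The reference size and exponent below are arbitrary illustrative parameters.

    def scaled_reserve_pct(fs_size_gb, base_pct=5.0,
                           ref_size_gb=20.0, exponent=0.5):
        """Shrink the reserve percentage as the filesystem grows.

        base_pct applies up to ref_size_gb; above that the percentage
        follows a power law.  ref_size_gb and exponent are arbitrary
        illustrative parameters, not checkmk's actual constants.
        """
        if fs_size_gb <= ref_size_gb:
            return base_pct
        return base_pct * (ref_size_gb / fs_size_gb) ** exponent

    for size_gb in (10, 100, 1000, 10_000):
        pct = scaled_reserve_pct(size_gb)
        print(f"{size_gb:>6} GB -> reserve {pct:.2f}% "
              f"(~{size_gb * pct / 100:.1f} GB)")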

Theodore, would you be open to having a tune option to make the reserve level adaptive rather than a static percentage? (The default would remain static.)

Theodore Ts'o (tytso) wrote:

I just don't think an adaptive algorithm is going to be worth the complexity. It will invariably get it wrong for some workload or another, at which point more people will start kvetching and opening bug reports.

For one thing, it matters a lot what the average size of the files stored on the file system is. For example, if the average file size is 4k, then you can fill it to 100% without suffering any performance degradation. It will also depend on the distribution of the files and whether you just write once and are done, or whether files are constantly being deleted and inserted, potentially with different sizes.

It also matters whether the user cares about performance or capacity. Disks have essentially not changed their average seek time, while doubling in capacity every 18 months or so (with this trend only slowing down in the past year or two). So if you consider the average seeks per GB of capacity, it's been dropping with every single disk generation. As a result, there are probably many applications which are now no longer constrained by disk capacity, but by seek capacity (i.e., by the number of spindles). I've even seen examples of shops that have deliberately not gotten the latest 6TB or 8TB disks, because they are spindle constrained, and so they are better off buying 2TB or 4TB disks.

This last point is why I really don't think it's worthwhile to worry about that last 51GB if you have a 1TB disk. Adding complexity to the calculation just makes it harder for system administrators to understand what is going on, and with the seeks per GB dropping like a stone over the past few decades, for many shops performance is far more important than the value of that 51GB (which in dollar terms, given today's disk prices, is approximately $2.50 USD). In terms of minutes of a system administrator's or a software engineer's time, you've no doubt wasted far more than that in discussing it. :-)
