Comment 7 for bug 1959971

Michael Mikowski (kfocus) wrote: Re: increase /boot partition size

Design Failure Mode Effects Analysis

System: Boot

Potential Failure Mode: /boot partition overfills

Effects of Failure: To return the system to a usable state, the user must have advanced knowledge and an available system recovery disk. The procedure involves disk mounting, chroot, package management, and deep file system knowledge. This is outside the range of most target users.

Severity: 10 - the system is completely unusable until recovered. Recovery is time-consuming and tedious for advanced users, and impossible for less skilled users.

Causes of Failure:
1. Multiple kernel updates between reboots will overfill the standard 705M /boot partition once it holds more than 3 kernels.
2. Users install both generic and lowlatency hwe kernels. They may also have transient kernels when switching to another kernel such as oem.

Preventative Activities:
1. There does not appear to be any tool that guides users in kernel selection or ensures that space is reserved for the latest and penultimate kernel versions.
2. There does not appear to be any testing which prevents installation of a kernel that would over-fill the disk.
3. unattended-upgrades tries to clean images that fill up the /boot disk, but it does not consider disk space. Even when it works (which is a whole other issue), the /boot disk is required to hold 4 images at times, which is not feasible with the current 705M size.
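
The missing space check in item 2 could look roughly like the Python sketch below. It is only an illustration, not how apt or unattended-upgrades behave: the function names and the 200 MiB per-kernel footprint are assumptions made here, and real footprints vary by kernel flavor and initramfs contents.

```python
import shutil

MIB = 1024 * 1024


def boot_has_room(free_bytes: int, per_kernel_bytes: int, kernels_needed: int = 1) -> bool:
    """Return True if the free space covers the kernels about to be installed."""
    return free_bytes >= per_kernel_bytes * kernels_needed


def check_boot(path: str = "/boot", per_kernel_mib: int = 200) -> bool:
    """Check the real partition; per_kernel_mib is an assumed estimate."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return boot_has_room(usage.free, per_kernel_mib * MIB)
```

A package hook could call check_boot() before unpacking a new kernel and refuse to proceed when it returns False, rather than failing midway with a full disk.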

Occurrence: 5 (even users with a single kernel flavor encounter this)

Detection Rating: 5 - Preventative Activities are unlikely to be effective at all times

Risk Priority Number: Severity * Occurrence * Detection = 250
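
As a quick check of the arithmetic, the RPN can be reproduced with a few lines of Python. This is a minimal sketch: the function name is made up here, and the 1 to 10 range check assumes the conventional FMEA rating scale.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Multiply the three FMEA ratings, each assumed to be on a 1-10 scale."""
    ratings = {"severity": severity, "occurrence": occurrence, "detection": detection}
    for name, value in ratings.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# Ratings from this analysis: Severity 10, Occurrence 5, Detection 5.
rpn = risk_priority_number(10, 5, 5)
print(rpn)  # 250
```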

Take Away Points:

* The DFMEA indicates this is a severe issue that should be considered critical.
* When an overfull /boot disk occurs, the effect is catastrophic to the average user, and for many is simply unrecoverable. This can and does drive users to abandon the OS.
* Existing controls to prevent occurrence are inadequate (automated upgrades still allow the disk to overfill when using a single kernel flavor, and do not consider disk space) or completely missing (users are not guided on the issue of kernel management). The volume of forum posts about this issue over the years illustrates that this is a substantial problem.
* This issue goes beyond /boot partition size, but increasing it to handle all possible transient states is required for a complete solution.
* Disk space is cheap these days. On consumer desktop solutions, 2.0GB is a small price to pay to avoid a catastrophic failure which is otherwise unrecoverable for many users.