Issue in Extended Disk Data retrieval (biosdisk: int 13h/service 48h)
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
grub2 (Ubuntu) | Won't Fix | Medium | Guilherme G. Piccoli | | |
Bug Description
We have a user reporting the following issue:
After an upgrade, grub couldn't boot any kernel. The system is not running in UEFI mode, so "grub-pc" is the package used - also it is a HW RAID5 setup (Dell machine). The bootloader itself was able to get loaded, including all its base modules (hence the bootloader could read/write from disk) - also, grub packages were up-to-date and seemed properly installed. The following kernels were present/installed there: 4.4.0-148, 4.4.0-189, 4.4.0-190, 4.4.0-193, 4.4.0-194 .
Attempting to boot the most recent version (-194), we got the following grub error: "error: attempt read-write outside of disk `hd0`" - even dropping to the grub shell and manually trying to load the file vmlinuz-
After booting from a virtual ISO (Ubuntu installer), we managed to "update-grub", "update-initramfs" and "grub-install", not forgetting to "sync" after all these commands. We previously duplicated all initrds, saving them as initrd.
(A) We apt-get removed kernels -189 and -190 (and their initrd backups)
(B) We moved all the remaining vmlinuz/initrd pairs (and their backups) to "/"
(C) We *copied* all of them back to /boot, with the goal of duplicating the files in the filesystem
We double-checked the md5 hashes of all the vmlinuz/initrd pairs and they matched, so the *same files* are present in "/" and "/boot". We also checked vmlinuz-
So the (very odd and interesting) problem is: grub can read some files but not others, even though we know that *all the duplicate files are the same* and have verified their integrity (i.e., the filesystem and the storage controller/disks seem to be healthy). Why? Very similar problems were reported in [0] and [1], with no really good/definitive answer.
HYPOTHESIS:
I think this has to do with the fact that grub *cannot* read some sectors of the underlying disks, not because of disk corruption but because of logical sector accounting/math. Since it's a hardware RAID, I understand that from the Linux perspective it is "seen" as a single device. And even from the grub perspective, it's a single disk (called 'hd0' in grub terminology). But maybe grub is doing some low-level queries to gather physical device information on the underlying disks, and when it does the sector math, it concludes that the "section" to be read is outside of the "available" area of the device, giving us this error. The mentions of "BIOS restrictions" in [0] or [1] could also be considered: the BIOS or even grub could be unable to deal with files outside some "range" in the disk, e.g. for security reasons - although I doubt that, and I lean towards the first theory.
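To make the first theory concrete, here is a minimal sketch of the kind of range check it assumes. This is not grub's actual code: the function name, the 512-byte sector size and all the numbers are hypothetical, and `total_sectors` simply stands for whatever disk size the firmware reports (e.g. through int 13h/service 48h).

```python
SECTOR_SIZE = 512  # assumed logical sector size in bytes (hypothetical)


def read_is_in_bounds(start_lba: int, sector_count: int, total_sectors: int) -> bool:
    """Return True if the range [start_lba, start_lba + sector_count) fits in the disk."""
    if start_lba < 0 or sector_count < 0:
        return False
    return start_lba + sector_count <= total_sectors


# Hypothetical scenario: the reported size is smaller than the real one.
reported_total = 1_000_000   # sectors, as (mis)reported to the bootloader
real_total = 4_000_000       # sectors actually available on the volume

file_near_start = (2_048, 16)       # (start LBA, number of sectors)
file_near_end = (3_500_000, 16)

for name, (lba, count) in (("near start", file_near_start), ("near end", file_near_end)):
    ok = read_is_in_bounds(lba, count, reported_total)
    print(f"{name}: LBA {lba} -> {'readable' if ok else 'outside of disk'}")
```

Under these assumptions, a file placed at a low LBA passes the check while a byte-identical copy placed at a high LBA fails it, which would match the symptom of identical files behaving differently.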
In both theories, it ends up being a restriction on loading a file *depending* on its logical position in the disk. If that is true, it's a very awkward limitation. We suggested that the user collect the following data, to understand the topology of the disk and the logical position (LBA) of the files:
debugfs -R "stat /boot/vmlinuz-
hdparm --fibmap /boot/vmlinuz-
debugfs -R "stat /vmlinuz-
hdparm --fibmap /vmlinuz-
[0] https:/
[1] https:/
After checking the fibmap files from the user, we could verify that the LBAs of the non-working files are very large compared to those of the working ones - a data point reinforcing the theory that GRUB is miscalculating something for files after some LBA offset.
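The comparison above can be automated with a small helper. The sketch below assumes the extent rows printed by `hdparm --fibmap` are four whitespace-separated integers (byte_offset, begin_LBA, end_LBA, sectors); the sample texts and their numbers are made up for illustration and are not the user's data.

```python
import re

# Assumed layout of an extent row: byte_offset  begin_LBA  end_LBA  sectors
EXTENT_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_fibmap(text):
    """Return a list of (begin_lba, end_lba, sectors) tuples found in fibmap output."""
    extents = []
    for line in text.splitlines():
        match = EXTENT_ROW.match(line)
        if match:  # header and banner lines contain non-digits and are skipped
            _offset, begin_lba, end_lba, sectors = map(int, match.groups())
            extents.append((begin_lba, end_lba, sectors))
    return extents


def highest_lba(text):
    """Return the largest end LBA among all extents, or None if there are none."""
    extents = parse_fibmap(text)
    return max(end for _begin, end, _sectors in extents) if extents else None


# Hypothetical samples (file names and LBAs are invented).
WORKING = """\
/boot/example-working:
 byte_offset  begin_LBA    end_LBA    sectors
           0     264192     280575      16384
"""
FAILING = """\
/example-failing:
 byte_offset  begin_LBA    end_LBA    sectors
           0 5368709120 5368725503      16384
"""

for label, sample in (("working", WORKING), ("failing", FAILING)):
    print(label, "highest LBA:", highest_lba(sample))
```

Sorting all the kernel/initrd files by `highest_lba` should show whether the failing ones cluster beyond some offset.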
Managed to reproduce the issue in-house using a Dell machine with such a HW RAID5 setup. The idea is basically to boot in legacy-BIOS mode and duplicate the vmlinuz/initrd pair until it almost fills the HDD - then use fibmap to determine which copies live at the largest LBAs, and keep them; a few of them should be enough. Keep a valid vmlinuz/initrd and the grub config files/modules in a small first "/boot" partition so the machine can always boot, and duplicate the kernel files in a huge "/" partition located after "/boot" on the disk.
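The duplication step could be scripted roughly as follows. This is only a sketch of the idea, not the procedure actually used: the function name, the copy naming scheme and the `min_free_bytes` safety margin are assumptions.

```python
import shutil
from pathlib import Path


def duplicate_until_nearly_full(src, dest_dir, min_free_bytes, max_copies=None):
    """Copy `src` into `dest_dir` repeatedly until free space would drop below
    `min_free_bytes` (or `max_copies` is reached). Returns the created paths."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    size = src.stat().st_size
    if size == 0:
        raise ValueError("source file is empty; copies would never fill the disk")

    copies = []
    while max_copies is None or len(copies) < max_copies:
        free = shutil.disk_usage(dest_dir).free
        if free - size < min_free_bytes:
            break  # keep a safety margin so the filesystem stays usable
        target = dest_dir / f"{src.name}.copy{len(copies):05d}"
        shutil.copyfile(src, target)
        copies.append(target)
    return copies
```

After running something like this against the large "/" partition, the copies with the largest LBAs (as reported by fibmap) are the ones worth keeping for the boot test; the rest can be deleted.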