GRUB error on RHEL 8.5 with custom partitioning

Bug #1970257 reported by Nate Sias
This bug affects 2 people
Affects   Status    Importance   Assigned to   Milestone
MAAS      Invalid   Low          Unassigned
curtin    Invalid   Undecided    Unassigned

Bug Description

I've got an issue where a client is deploying a RHEL 8.5 instance with custom partitioning. After provisioning, the instance won't boot.

We ran through the exercise of manually booting from GRUB, which worked.

Follow-up investigation reveals the error: GRUB calls ($root)/boot/vmlinuz|initrd..., but $root is not a filesystem that contains a /boot folder; it is the partition that gets mounted on /boot.
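
To illustrate the mismatch, here is a minimal shell sketch of how a path seen from the running OS maps to a path relative to the filesystem GRUB uses as $root. The helper name grub_relpath is hypothetical and only for illustration; it is not part of GRUB, MAAS, or curtin.

```shell
# Hypothetical helper: translate an OS path into a path relative to the
# filesystem that contains it, given that filesystem's mount point.
# Usage: grub_relpath <os_path> <mount_point>
grub_relpath() {
    path="$1"
    mountpoint="${2%/}"        # drop one trailing slash: "/boot/" -> "/boot", "/" -> ""
    case "$path" in
        "$mountpoint"/*)
            # The path lives on that filesystem: strip the mount point prefix.
            printf '%s\n' "${path#"$mountpoint"}"
            ;;
        *)
            # The path is not under the mount point: leave it unchanged.
            printf '%s\n' "$path"
            ;;
    esac
}
```

With /boot on its own partition, grub_relpath /boot/vmlinuz-X /boot gives /vmlinuz-X, whereas the generated configuration still asks for /boot/vmlinuz-X on that same partition.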

Layout is defined in MAAS > Machine > storage, and is structured like:
sdb-part1 511.7MB fat32 /boot/efi
sdb-part2 1.99GB xfs /boot
vg00-lvhome .....
...
....
vg00-root 14.99GB xfs /
...

Image was created with default packer-maas settings, using the latest commit (fcc4d5c):
cd packer-maas/rhel8;
sudo -Es CHECKPOINT_DISABLE=1 PACKER_LOG=1 packer build -on-error=abort -force -var 'rhel8_iso_path=/home/maasuser/iso/rhel8.iso' rhel8.json

Issue has been independently reproduced twice outside the client's environment.
I've attached install logs from my local reproducer (less complicated than the client's environment), but I don't think they reveal anything.

Best regards,
~ Nate

Tags: sts
Nate Sias (nate-sias) wrote :
Bill Wear (billwear)
Changed in maas:
status: New → Triaged
importance: Undecided → Low
Björn Tillenius (bjornt) wrote :

I added a curtin task, since I suspect that the problem is there. Maybe some curtin dev could take a look at the logs?

FWIW, I can reproduce this with ext4 as well. So the issue seems to show when you put /boot on a separate partition.

Changed in maas:
milestone: none → next
Michael Hudson-Doyle (mwhudson) wrote :

Hm, can you paste the content of the generated grub.cfg? I don't know anything at all about how that is generated for a rhel install but I know where in the code to start looking...

Nate Sias (nate-sias) wrote :

Hey Michael,

This is /boot/efi/EFI/redhat/grub.cfg
```
#
# DO NOT EDIT THIS FILE
#
# It is automatically generated by grub2-mkconfig using templates
# from /etc/grub.d and settings from /etc/default/grub
#

### BEGIN /etc/grub.d/00_header ###
set pager=1

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
  load_env
fi
if [ "${next_entry}" ] ; then
   set default="${next_entry}"
   set next_entry=
   save_env next_entry
   set boot_once=true
else
   set default="${saved_entry}"
fi

if [ x"${feature_menuentry_id}" = xy ]; then
  menuentry_id_option="--id"
else
  menuentry_id_option=""
fi

export menuentry_id_option

if [ "${prev_saved_entry}" ]; then
  set saved_entry="${prev_saved_entry}"
  save_env saved_entry
  set prev_saved_entry=
  save_env prev_saved_entry
  set boot_once=true
fi

function savedefault {
  if [ -z "${boot_once}" ]; then
    saved_entry="${chosen}"
    save_env saved_entry
  fi
}

function load_video {
  if [ x$feature_all_video_module = xy ]; then
    insmod all_video
  else
    insmod efi_gop
    insmod efi_uga
    insmod ieee1275_fb
    insmod vbe
    insmod vga
    insmod video_bochs
    insmod video_cirrus
  fi
}

terminal_input console
terminal_output console
if [ x$feature_timeout_style = xy ] ; then
  set timeout_style=menu
  set timeout=1
# Fallback normal timeout code in case the timeout_style feature is
# unavailable.
else
  set timeout=1
fi
### END /etc/grub.d/00_header ###

### BEGIN /etc/grub.d/00_tuned ###
set tuned_params=""
set tuned_initrd=""
### END /etc/grub.d/00_tuned ###

### BEGIN /etc/grub.d/01_users ###
if [ -f ${prefix}/user.cfg ]; then
  source ${prefix}/user.cfg
  if [ -n "${GRUB2_PASSWORD}" ]; then
    set superusers="root"
    export superusers
    password_pbkdf2 root ${GRUB2_PASSWORD}
  fi
fi
### END /etc/grub.d/01_users ###

### BEGIN /etc/grub.d/08_fallback_counting ###
insmod increment
# Check if boot_counter exists and boot_success=0 to activate this behaviour.
if [ -n "${boot_counter}" -a "${boot_success}" = "0" ]; then
  # if countdown has ended, choose to boot rollback deployment,
  # i.e. default=1 on OSTree-based systems.
  if [ "${boot_counter}" = "0" -o "${boot_counter}" = "-1" ]; then
    set default=1
    set boot_counter=-1
  # otherwise decrement boot_counter
  else
    decrement boot_counter
  fi
  save_env boot_counter
fi
### END /etc/grub.d/08_fallback_counting ###

### BEGIN /etc/grub.d/10_linux ###
insmod part_gpt
insmod xfs
set root='hd1,gpt2'
if [ x$feature_platform_search_hint = xy ]; then
  search --no-floppy --fs-uuid --set=root --hint-bios=hd1,gpt2 --hint-efi=hd1,gpt2 --hint-baremetal=ahci1,gpt2 976db22f-c024-4b90-aa86-ed26916784b5
else
  search --no-floppy --fs-uuid --set=root 976db22f-c024-4b90-aa86-ed26916784b5
fi
insmod part_gpt
insmod fat
set boot='hd1,gpt1'
if [ x$feature_platform_search_hint = xy ]; then
  search --no-floppy --fs-uuid --set=boot --hint-bios=hd1,gpt1 --hint-efi=hd1,gpt1 --hint-baremetal=ahci1,gpt1 659B-7E29
else
  search --no-floppy --fs-uuid --set=boot 659B-7E29
fi

# This section was generated by a script. Do not modify the generated file - all changes
# will be lost t...
```

Alan Baghumian (alanbach) wrote :

I just spent some time on this and was able to reproduce.

The environment:

1) UEFI-enabled, KVM-based VM, commissioned on MAAS 3.1

2) Used a custom storage commissioning script (attached). This produces the following disk layout:

GPT Partitioned
/dev/vda1 /boot/efi vfat (500M)
/dev/vda2 /boot ext4 (2G)
/dev/vda3 / ext4 (20G)

3) Used packer-maas to generate a custom RHEL 8.5 image (ISO was downloaded today)

4) Commissioned the machine using custom storage layout and performed a deployment

5) The final grub.cfg was generated with the following entries, which caused the boot to fail:

linux ($root)/boot/vmlinuz...
initrd ($root)/boot/initramfs...

Changing the grub.cfg lines to the following (removing /boot/) fixes the issue:

linux ($root)/vmlinuz...
initrd ($root)/initramfs...

6) Released the machine and deployed using default MAAS 3.1 CentOS 7 image, everything worked out fine.

7) Released the machine and deployed using default MAAS 3.1 CentOS 8 image, experienced the exact same issue.

This seems to affect the RHEL/CentOS 8 family.

Please let me know if you'd like me to add more testing scenarios, and I will report back the results.

Alan Baghumian (alanbach) wrote :

Out of curiosity, I took the same machine and storage configuration and deployed the following using the default images:

- Ubuntu 18.04 (Bionic)
- Ubuntu 20.04 (Focal)
- Ubuntu 22.04 (Jammy)

All deployments were successful and booted normally.

Jerzy Husakowski (jhusakowski) wrote :

We'll have another look at this once the curtin analysis is done; at this point it's not yet clear where the issue is.

Changed in maas:
status: Triaged → Incomplete
Changed in maas:
milestone: next → 3.2.0
Ioanna Alifieraki (joalif) wrote :

Most likely this bug occurs because CentOS/RHEL 8 uses the blscfg module to read
the entries for each installed kernel from /boot/loader/entries
(more info: https://fedoraproject.org/wiki/Changes/BootLoaderSpecByDefault).

On a CentOS 8 VM deployed by MAAS, after applying the workaround and booting
the machine:

# ls /boot/loader/entries/
994a7dc087ef40f9bd1f4c6e228f294f-0-rescue.conf 994a7dc087ef40f9bd1f4c6e228f294f-4.18.0-305.25.1.el8_4.x86_64.conf
[root@maas-test-2 entries]# cat /boot/loader/entries/994a7dc087ef40f9bd1f4c6e228f294f-4.18.0-305.25.1.el8_4.x86_64.conf
title CentOS Linux (4.18.0-305.25.1.el8_4.x86_64) 8
version 4.18.0-305.25.1.el8_4.x86_64
linux /boot/vmlinuz-4.18.0-305.25.1.el8_4.x86_64
initrd /boot/initramfs-4.18.0-305.25.1.el8_4.x86_64.img $tuned_initrd
options $kernelopts $tuned_params
id centos-20211103103800-4.18.0-305.25.1.el8_4.x86_64
grub_users $grub_users
grub_arg --unrestricted
grub_class kernel

So the problem is that in this file, the linux and initrd paths are prefixed with '/boot'.
Now the question is: where does this file come from?
These files are either copied from /lib/modules/$kernelver/bls.conf or generated
if this file doesn't exist. (This information comes from commit e96a64edb5320
"grub-switch-to-blscfg: Only fix boot prefix for non-generated BLS files" in
the grub2-2.02-99.el8_4.1 source package.)

I'm not sure yet how this file is created, whether curtin is aware of blscfg, or,
if not, whether it should be made aware of it, but I think we should look in this direction.
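
A minimal sketch of how one might spot affected entries on a deployed machine, assuming the BLS entry format shown above. The function name find_boot_prefixed_entries is hypothetical, and the default directory is the one listed in this comment.

```shell
# Hypothetical helper: print every BLS entry file whose linux or initrd
# path starts with /boot/. The directory defaults to /boot/loader/entries.
# Usage: find_boot_prefixed_entries [entries_dir]
find_boot_prefixed_entries() {
    entries_dir="${1:-/boot/loader/entries}"
    for conf in "$entries_dir"/*.conf; do
        # Skip the unexpanded glob when the directory has no .conf files.
        [ -f "$conf" ] || continue
        if grep -Eq '^(linux|initrd)[[:space:]]+/boot/' "$conf"; then
            printf '%s\n' "$conf"
        fi
    done
}
```

Any file it prints would only be a problem when /boot is a separate filesystem; with /boot as a plain directory on the root filesystem, the /boot/ prefix is what GRUB needs.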

tags: added: sts
Ioanna Alifieraki (joalif) wrote :

Looking further into it, the conf files in /boot/loader/entries/ come from the image used for
deployment.
I confirmed this by unpacking the image, editing the conf file to remove the "/boot" prefix, repacking the image, and deploying again; the deployment was successful.
I guess these files are created while packer-maas builds the image.

I'm trying to come up with a workaround for curtin.
For example, at the point where it configures GRUB, it could check whether /boot
is on a separate partition and, if so, trigger the creation of those config files again.
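
As a rough sketch of the manual edit described here (not curtin code, and not a proposal for how curtin should do it), the "/boot" prefix can be stripped from the linux and initrd lines of a single BLS entry file. The function name strip_boot_prefix is hypothetical, and it should only be applied when /boot is a separate filesystem.

```shell
# Hypothetical helper: rewrite "linux /boot/..." and "initrd /boot/..."
# in one BLS entry file to "linux /..." and "initrd /...".
# All other lines are left untouched.
# Usage: strip_boot_prefix <entry.conf>
strip_boot_prefix() {
    conf="$1"
    tmp="$(mktemp)" || return 1
    # Write the edited copy to a temp file, then copy the contents back so
    # the original file keeps its ownership and permissions.
    sed -E 's#^(linux|initrd)([[:space:]]+)/boot/#\1\2/#' "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Running it on an entry that has no /boot/ prefix leaves the file unchanged, so it is safe to repeat.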

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Alan Baghumian (alanbach) wrote :

This bug was brought to my attention again, and I spent a little time testing it. I re-imported a packer-maas built RHEL 8 image and performed a deployment using the custom partitioning script attached above. This issue is still happening with MAAS 3.3.

After the deployment, the machine fails to boot from GRUB due to the extra /boot/ prefix on the kernel and initrd paths. After manually removing /boot/ and pressing Ctrl+X to boot from the GRUB editor, I was finally able to SSH into the machine as before.

Looking into the state of the machine, I can confirm Jo's findings. GRUB is using the /boot partition (UUID=75adced6-e475-4ef1-8530-de75be3f2707) as its root, so the additional /boot/ in the loader entry files breaks the boot process:

[root@generic-4 boot]# grep -r 75adced6-e475-4ef1-8530-de75be *
efi/EFI/redhat/grub.cfg: search --no-floppy --fs-uuid --set=root --hint='hd0,gpt2' 75adced6-e475-4ef1-8530-de75be3f2707
efi/EFI/redhat/grub.cfg: search --no-floppy --fs-uuid --set=root 75adced6-e475-4ef1-8530-de75be3f2707
grub2/grub.cfg: search --no-floppy --fs-uuid --set=root --hint='hd0,gpt2' 75adced6-e475-4ef1-8530-de75be3f2707
grub2/grub.cfg: search --no-floppy --fs-uuid --set=root 75adced6-e475-4ef1-8530-de75be3f2707

[root@generic-4 boot]# blkid
/dev/vda1: SEC_TYPE="msdos" UUID="7D45-6317" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="9b53237b-e6fe-47b1-be1d-428e3372b53d"
/dev/vda2: UUID="75adced6-e475-4ef1-8530-de75be3f2707" BLOCK_SIZE="512" TYPE="xfs" PARTUUID="cf3052a9-c4d3-4518-ab62-8aa64682f65c"
/dev/vda3: UUID="4b5a8ff1-c332-4afa-93d2-13a5d0e58088" BLOCK_SIZE="512" TYPE="xfs" PARTUUID="a6b2ae3c-c29f-4a37-a210-f1a51a08df23"

[root@generic-4 boot]# cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/vda3 during curtin installation
/dev/disk/by-uuid/4b5a8ff1-c332-4afa-93d2-13a5d0e58088 / xfs defaults 0 1
# /boot/ was on /dev/vda2 during curtin installation
/dev/disk/by-uuid/75adced6-e475-4ef1-8530-de75be3f2707 /boot/ xfs defaults 0 1
# /boot/efi was on /dev/vda1 during curtin installation
/dev/disk/by-uuid/7D45-6317 /boot/efi vfat defaults 0 1
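
A small sketch of how the "is /boot a separate filesystem?" condition could be checked from an fstab like the one above. The function name boot_is_separate is hypothetical, and reading fstab (rather than the live mount table) is an assumption made for illustration.

```shell
# Hypothetical helper: succeed (exit 0) if the given fstab-style file has
# an entry mounted at /boot. A trailing slash ("/boot/") is tolerated.
# Usage: boot_is_separate [fstab_path]
boot_is_separate() {
    fstab_path="${1:-/etc/fstab}"
    awk '
        /^[[:space:]]*#/ { next }    # skip comment lines
        NF < 2           { next }    # skip blank or malformed lines
        {
            mp = $2
            if (mp != "/") sub(/\/+$/, "", mp)   # "/boot/" -> "/boot"
            if (mp == "/boot") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$fstab_path"
}
```

For the fstab shown above it would succeed because of the /boot/ entry; /boot/efi alone does not count.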

Best,
Alan

Alan Baghumian (alanbach) wrote :

The easiest workaround seems to be:

[root@generic-4 ~]# sed -i s/"GRUB_ENABLE_BLSCFG=.*"/"GRUB_ENABLE_BLSCFG=false"/g /etc/default/grub

[root@generic-4 ~]# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
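
The sed command above only changes a GRUB_ENABLE_BLSCFG line that already exists. A slightly more defensive sketch, which also appends the setting when it is missing, could look like this. The function name disable_blscfg is hypothetical, and appending when the line is absent is an assumption about the desired behaviour.

```shell
# Hypothetical helper: force GRUB_ENABLE_BLSCFG=false in a grub defaults
# file. Replaces an existing setting, or appends one if none is present.
# Assumes the file ends with a newline.
# Usage: disable_blscfg [grub_defaults_file]
disable_blscfg() {
    grub_default="${1:-/etc/default/grub}"
    if grep -q '^GRUB_ENABLE_BLSCFG=' "$grub_default"; then
        tmp="$(mktemp)" || return 1
        # Edit a copy, then write the contents back to keep file ownership
        # and permissions as they were.
        sed 's/^GRUB_ENABLE_BLSCFG=.*/GRUB_ENABLE_BLSCFG=false/' "$grub_default" > "$tmp" &&
            cat "$tmp" > "$grub_default"
        status=$?
        rm -f "$tmp"
        return "$status"
    else
        printf '%s\n' 'GRUB_ENABLE_BLSCFG=false' >> "$grub_default"
    fi
}
```

After changing the setting, grub.cfg still has to be regenerated with grub2-mkconfig as shown above.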

Alan Baghumian (alanbach) wrote :

I made a quick change to the packer-maas Kickstart files, turning BLS off:

alan@veloci:~/Canonical/git/packer-maas$ git diff
diff --git a/rhel8/http/rhel8.ks.in b/rhel8/http/rhel8.ks.in
index ff8d6bf..d05bd3f 100644
--- a/rhel8/http/rhel8.ks.in
+++ b/rhel8/http/rhel8.ks.in
@@ -39,6 +39,7 @@ rm -f /etc/sysconfig/network-scripts/ifcfg-[^lo]*
 sed -i 's/^GRUB_TERMINAL=.*/GRUB_TERMINAL_OUTPUT="console"/g' /etc/default/grub
 sed -i '/GRUB_SERIAL_COMMAND="serial"/d' /etc/default/grub
 sed -ri 's/(GRUB_CMDLINE_LINUX=".*)\s+console=ttyS0(.*")/\1\2/' /etc/default/grub
+sed -i s/"GRUB_ENABLE_BLSCFG=.*"/"GRUB_ENABLE_BLSCFG=false"/g /etc/default/grub

 dnf clean all
 %end
diff --git a/rhel9/http/rhel9.ks.in b/rhel9/http/rhel9.ks.in
index ff8d6bf..d05bd3f 100644
--- a/rhel9/http/rhel9.ks.in
+++ b/rhel9/http/rhel9.ks.in
@@ -39,6 +39,7 @@ rm -f /etc/sysconfig/network-scripts/ifcfg-[^lo]*
 sed -i 's/^GRUB_TERMINAL=.*/GRUB_TERMINAL_OUTPUT="console"/g' /etc/default/grub
 sed -i '/GRUB_SERIAL_COMMAND="serial"/d' /etc/default/grub
 sed -ri 's/(GRUB_CMDLINE_LINUX=".*)\s+console=ttyS0(.*")/\1\2/' /etc/default/grub
+sed -i s/"GRUB_ENABLE_BLSCFG=.*"/"GRUB_ENABLE_BLSCFG=false"/g /etc/default/grub

 dnf clean all
 %end

Then downloaded the latest RHEL 9.2 and 8.8 DVD ISOs and created new images for MAAS:

$ maas homelab boot-resources create name='custom/rhel-9.2-20230912-no-bls' title='RHEL 9.2 20230912 No BLS' architecture='amd64/generic' filetype='tgz' base_image='rhel/9' content@=rhel-9.2-20230912-no-bls.tar.gz

$ maas homelab boot-resources create name='custom/rhel-8.8-20230912-no-bls' title='RHEL 8.8 20230912 No BLS' architecture='amd64/generic' filetype='tgz' base_image='rhel/8' content@=rhel-8.8-20230912-no-bls.tar.gz

After that, I performed two deployments (9.2 / 8.8) using the "custom" disk layout and then two extra deployments (9.2 / 8.8) with the default Flat disk layout, and all tests passed without issue.

Best,
Alan

Heather Lemon (hypothetical-lemon) wrote :

This is a problem in packer-maas; I will make a new PR there. Thanks!

Changed in curtin:
status: New → Invalid
Changed in maas:
status: Incomplete → Invalid
Heather Lemon (hypothetical-lemon) wrote :