[CS9/master] unable to provision OC node: timeout on second boot

Bug #1958230 reported by Cédric Jeanneret
This bug affects 1 person
Affects    Status          Importance    Assigned to    Milestone
tripleo    Fix Released    Critical      Unassigned

Bug Description

Note: this is a new bug, discovered after https://bugs.launchpad.net/tripleo/+bug/1957169 was solved.

Hello there,

When trying to provision the OC node using the overcloud-full.qcow2 image, the node fails to reboot properly into the OS.

It hangs in a dracut loop, as if it were waiting for a device that doesn't exist (the UUID listed in the dracut log doesn't exist on the host; I checked, of course).
This UUID might be a leftover from some image build process. Since it's always the same, it's probably related to the CentOS base image itself, the one used to build the OC image:

[ 8.045500] localhost systemd[1]: Reached target Basic System.
[ 137.069543] localhost dracut-initqueue[598]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 137.072352] localhost dracut-initqueue[598]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f4a50018c-f6ea-4bd7-95dc-badc49878c01.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 137.072352] localhost dracut-initqueue[598]: [ -e "/dev/disk/by-uuid/4a50018c-f6ea-4bd7-95dc-badc49878c01" ]
[ 137.072352] localhost dracut-initqueue[598]: fi"
[ 137.082038] localhost dracut-initqueue[598]: Warning: dracut-initqueue: starting timeout scripts

I'll add a log file right after this LP is created.

Cheers,

C.

Cédric Jeanneret (cjeanner) wrote :

Log generated upon boot failure. It comes from /run/initramfs/rdsosreport.txt, which I was able to extract from the VM.
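
A minimal sketch of how the UUID can be pulled out of such a log, assuming the dracut warnings look like the ones quoted in the description (a devexists hook name with \x2f-escaped slashes, or a plain /dev/disk/by-uuid/ path). The function names waited_uuids and report_missing_uuids are hypothetical helpers for illustration, not part of dracut or TripleO:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of dracut or TripleO).

# Print every filesystem UUID that dracut reports it is still waiting for.
# $1: path to a saved console log or rdsosreport.txt
waited_uuids() {
    # Match both the escaped hook name (by-uuid\x2f<uuid>) and the plain
    # path form (by-uuid/<uuid>), then strip the prefix and de-duplicate.
    grep -oE 'by-uuid(\\x2f|/)[0-9A-Fa-f-]+' "$1" \
        | sed -E 's#^by-uuid(\\x2f|/)##' \
        | sort -u
}

# For each UUID found in the log, say whether a matching device node
# exists on the machine where this script runs.
report_missing_uuids() {
    local uuid
    waited_uuids "$1" | while read -r uuid; do
        if [ -e "/dev/disk/by-uuid/${uuid}" ]; then
            echo "present: ${uuid}"
        else
            echo "missing: ${uuid}"
        fi
    done
}
```

Running report_missing_uuids rdsosreport.txt from the emergency shell (or against the guest's disks) should show whether dracut is waiting for a device that is really absent.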

Harald Jensås (harald-jensas) wrote :

Right, the UUID matches the one of the base image used for the build:

$ virt-filesystems -a CentOS-Stream-GenericCloud-9-20220117.1.x86_64.qcow2 --all --long --uuid -h
Name       Type        VFS  Label  MBR  Size  Parent    UUID
/dev/sda1  filesystem  xfs  -      -    7.8G  -         4a50018c-f6ea-4bd7-95dc-badc49878c01
/dev/sda1  partition   -    -      83   7.8G  /dev/sda  -
/dev/sda   device      -    -      -    10G   -         -

That UUID matches what is seen on the console:

[ 137.072352] localhost dracut-initqueue[598]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f4a50018c-f6ea-4bd7-95dc-badc49878c01.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 137.072352] localhost dracut-initqueue[598]: [ -e "/dev/disk/by-uuid/4a50018c-f6ea-4bd7-95dc-badc49878c01" ]

Cédric Jeanneret (cjeanner) wrote :

OK, that UUID is referenced in the kernel boot options, in the grub loader entry:

loader/entries/3096152462f947feae105af3ba3df966-5.14.0-43.el9.x86_64.conf:options root=UUID=4a50018c-f6ea-4bd7-95dc-badc49878c01 ro console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto

Nothing in /etc/grub* shows this UUID though, so most likely the boot "menu" entries were simply not updated.

Running "grep -r UUID /etc/grub*" from within the image disk supports this, since it only returns the new UUID:
grub2.cfg: set kernelopts="root=UUID=81161d9c-88cf-4fc4-bf99-691bdcbcd3dc ro console=ttyS0,115200n8 no_timer_check crashkernel=auto "
grub2-efi.cfg: set kernelopts="root=UUID=81161d9c-88cf-4fc4-bf99-691bdcbcd3dc ro console=ttyS0,115200n8 no_timer_check crashkernel=auto "

Here, we can see the actual UUID it should use in the loader:

[virtuser@builder2 ~]$ virt-filesystems -a workload/lab2-oc0-controller-0.qcow2 --all --long --uuid -h
Name       Type        VFS      Label       MBR  Size  Parent    UUID
/dev/sda1  filesystem  vfat     efi-part    -    200M  -         2180-0667
/dev/sda2  filesystem  iso9660  config-2    -    378K  -         2022-01-18-10-10-26-00
/dev/sda3  filesystem  xfs      img-rootfs  -    2.6G  -         81161d9c-88cf-4fc4-bf99-691bdcbcd3dc
/dev/sda1  partition   -        -           -    200M  /dev/sda  -
/dev/sda2  partition   -        -           -    1.0M  /dev/sda  -
/dev/sda3  partition   -        -           -    99G   /dev/sda  -
/dev/sda   device      -        -           -    100G  -         -
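
The mismatch described in this comment can be checked mechanically. The following is a rough sketch, assuming a BLS entry with an "options ... root=UUID=..." line and a grub config with a kernelopts="root=UUID=..." line, as in the excerpts quoted here. The function names are hypothetical and the file paths are whatever applies inside the mounted image:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Print the root=UUID value from the "options" line of a BLS entry file.
bls_root_uuid() {
    sed -n -E 's/^options[[:space:]].*root=UUID=([0-9A-Fa-f-]+).*/\1/p' "$1"
}

# Print the root=UUID value from the kernelopts line of a grub config file.
grub_kernelopts_uuid() {
    sed -n -E 's/.*kernelopts="[^"]*root=UUID=([0-9A-Fa-f-]+).*/\1/p' "$1"
}

# Succeed only if both files name the same, non-empty root UUID.
# $1: BLS entry file, $2: grub config file
root_uuids_match() {
    local bls grub
    bls=$(bls_root_uuid "$1")
    grub=$(grub_kernelopts_uuid "$2")
    [ -n "$bls" ] && [ "$bls" = "$grub" ]
}
```

With the files from this image, root_uuids_match would fail, because the loader entry still carries the base image UUID while the grub config carries the UUID of the img-rootfs filesystem.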

Cédric Jeanneret (cjeanner) wrote :

Yep, so something is preventing dracut from doing its job properly, or maybe dracut isn't being called the right way:

cat /boot/loader/entries/3096152462f947feae105af3ba3df966-5.14.0-43.el9.x86_64.
title CentOS Stream (5.14.0-43.el9.x86_64) 9
version 5.14.0-43.el9.x86_64
linux /boot/vmlinuz-5.14.0-43.el9.x86_64
initrd /boot/initramfs-5.14.0-43.el9.x86_64.img
options root=UUID=4a50018c-f6ea-4bd7-95dc-badc49878c01 ro console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto
grub_users $grub_users
grub_arg --unrestricted
grub_class centos

Here we can see that old UUID. How are we supposed to regenerate this file? I see mentions of "dracut --force" in DIB, but while searching online I stumbled on this advice:

*before* anything, run `dracut --force --no-hostonly`
*after* everything, run `dracut --force`

Are we missing a step somewhere?

At least, according to the overcloud-full.log, the latter is called properly:
2022-01-17 23:21:18.529 | dib-run-parts 50-dracut-regenerate completed

Mentions of dracut are from:
diskimage_builder/elements/dracut-regenerate/finalise.d/50-dracut-regenerate

Maybe we should add a new elements/dracut-regenerate/pre-install.d in order to get that "--no-hostonly" call?

Cédric Jeanneret (cjeanner) wrote :

That file (boot/loader/entries/*) is used by UEFI. I'm pretty sure we're missing something related to UEFI, again - or it's linked to https://review.opendev.org/c/openstack/diskimage-builder/+/824659 in some weird way.

tags: added: alert
Steve Baker (steve-stevebaker) wrote :

I've proposed this as a fix; it does the same thing for BLS entries as we do for the old /etc/default/grub root device setting:

https://review.opendev.org/c/openstack/diskimage-builder/+/825700
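
To illustrate the idea only (this is not the code from the review above, whose implementation may differ), here is a sketch of rewriting the root= argument on the options line of a BLS entry so that it points at the real root filesystem. The function name set_bls_root_uuid is hypothetical, and the sketch assumes the entry has a single "options" line:

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only; not the diskimage-builder code.

# Replace the root=... argument on the "options" line of a BLS entry.
# $1: BLS entry file, $2: UUID of the real root filesystem
set_bls_root_uuid() {
    local entry=$1 uuid=$2 tmp status
    # Refuse empty values or anything that is not hex digits and dashes,
    # so the value is safe to splice into the sed expression.
    case "$uuid" in
        ''|*[!0-9A-Fa-f-]*) return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    if sed -E "s/^(options[[:space:]]+(.*[[:space:]])?)root=[^[:space:]]+/\1root=UUID=${uuid}/" \
            "$entry" > "$tmp"; then
        # Overwrite in place so the entry keeps its ownership and mode.
        cat "$tmp" > "$entry"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

All other kernel arguments on the options line are left untouched; only the root= token is replaced.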

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)
Changed in tripleo:
status: Triaged → In Progress
Cédric Jeanneret (cjeanner) wrote :

Note: while the overcloud-full.qcow2 is still failing (even built with the latest patches), the overcloud-hardened-uefi-full.qcow2 is working as expected.

Cédric Jeanneret (cjeanner) wrote :

The UEFI part is solved. There are now some other questions about coverage for BIOS/UEFI, but they're mostly unrelated to the initial issue.

Closing this one then.

Cheers,

C.

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/826205
Committed: https://opendev.org/openstack/tripleo-common/commit/29324699829354575e58ea2bac37165943acd6f1
Submitter: "Zuul (22348)"
Branch: master

commit 29324699829354575e58ea2bac37165943acd6f1
Author: Steve Baker <email address hidden>
Date: Tue Jan 25 13:12:25 2022 +1300

    Switch from grub2 to bootloader element for overcloud-full

    A lot of effort has gone into the bootloader element doing the right
    thing in all cases, managing the transition to BLS, and handling
    legacy boot with UEFI boot.

    overcloud-full and other partition images have not received the
    benefit of this work because they use the much simpler grub2 element.
    Now that overcloud-hardened-uefi-full is the default, overcloud-full
    may become harder to support over time. Switching to the bootloader
    element will reduce the support matrix for the boot handling of this
    image.

    This changes fixes an issue where the kernel args root option is
    incorrect, because the bootloader element does a full kernel options
    regeneration.

    Depends-On: https://review.opendev.org/c/openstack/diskimage-builder/+/826976
    Closes-Bug: #1958230
    Change-Id: I9a0936c415485166962194104c7c54ee520e9516

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-common/+/838804

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/838804
Committed: https://opendev.org/openstack/tripleo-common/commit/2bd026b72f46887282840d98d740a04ffb52c1b3
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 2bd026b72f46887282840d98d740a04ffb52c1b3
Author: Steve Baker <email address hidden>
Date: Tue Jan 25 13:12:25 2022 +1300

    Switch from grub2 to bootloader element for overcloud-full

    A lot of effort has gone into the bootloader element doing the right
    thing in all cases, managing the transition to BLS, and handling
    legacy boot with UEFI boot.

    overcloud-full and other partition images have not received the
    benefit of this work because they use the much simpler grub2 element.
    Now that overcloud-hardened-uefi-full is the default, overcloud-full
    may become harder to support over time. Switching to the bootloader
    element will reduce the support matrix for the boot handling of this
    image.

    This changes fixes an issue where the kernel args root option is
    incorrect, because the bootloader element does a full kernel options
    regeneration.

    Depends-On: https://review.opendev.org/c/openstack/diskimage-builder/+/826976
    Closes-Bug: #1958230
    Change-Id: I9a0936c415485166962194104c7c54ee520e9516
    (cherry picked from commit 29324699829354575e58ea2bac37165943acd6f1)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 17.0.0

This issue was fixed in the openstack/tripleo-common 17.0.0 release.
