c9 master overcloud nodes not booting during node provision

Bug #1962783 reported by Rafael Castillo
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Changed in tripleo:
milestone: none → yoga-3
Rafael Castillo (rafaelcastillo) wrote :
summary: - overcloud nodes not booting in quickstart minimal libvirt deploy
+ overcloud nodes not booting during node provision
description: updated
description: updated
Ronelle Landy (rlandy) wrote :
summary: - overcloud nodes not booting during node provision
+ c9 master overcloud nodes not booting during node provision
Ronelle Landy (rlandy) wrote :
Cédric Jeanneret (cjeanner) wrote :

Some additions:
- deploying master as of today
- using OC image from tripleo-ci-testing

node provisioning fails on reboot: we're dropped in the grub rescue. VM console shows:
Booting from Hard Disk.....
error: ../../grub-core/kern/disk.c:236:disk `lvmid/Tt0GRF-hNDS-Qj7Z-P8Jz-p5v2-fV7G-iXrSej/vyvFWp-s01c-042E-cPIy-Kwud-ApFc-fhuFOO' not found.
Entering rescue mode...

The disk layout is as follows:
[CentOS-9 - stack@undercloud overcloud_imgs]$ virt-filesystems -a overcloud-hardened-uefi-full.qcow2 -l --all --uuid
Name              Type        VFS      Label       MBR  Size        Parent     UUID
/dev/sda1         filesystem  vfat     MKFS_ESP    -    16726016    -          7EAB-FF48
/dev/sda2         filesystem  unknown  -           -    8388608     -          -
/dev/vg/lv_audit  filesystem  xfs      fs_audit    -    170557440   -          75c5142c-f2ac-4db2-9767-676044f958f0
/dev/vg/lv_home   filesystem  xfs      fs_home     -    233472000   -          02d347dc-5399-48df-9ea0-538ab5d341a4
/dev/vg/lv_log    filesystem  xfs      fs_log      -    233472000   -          ed11a542-d187-4324-b47e-2ef153721d52
/dev/vg/lv_root   filesystem  xfs      img-rootfs  -    4125097984  -          13e1e4cc-f461-43c9-ad35-5ed179da1070
/dev/vg/lv_srv    filesystem  xfs      fs_srv      -    53116928    -          7d50d59f-c5c6-413b-8cee-0e6043104593
/dev/vg/lv_tmp    filesystem  xfs      fs_tmp      -    233472000   -          f7d86cba-247b-4307-b2fc-7b33ef5373fb
/dev/vg/lv_var    filesystem  xfs      fs_var      -    891166720   -          1d943236-00f9-4a78-838a-1e1f60f31a9e
/dev/vg/lv_audit  lv          -        -           -    176160768   /dev/vg    y3jG40-uTqe-dOdR-Fwqm-a1ds-GpdH-E1yA4i
/dev/vg/lv_home   lv          -        -           -    239075328   /dev/vg    4LzWkW-chp5-D4X0-Gg3j-fpKx-opA2-2XXagY
/dev/vg/lv_log    lv          -        -           -    239075328   /dev/vg    CH9n90-TTnX-GvPR-T7t0-qP32-saJf-Epg53e
/dev/vg/lv_root   lv          -        -           -    4135583744  /dev/vg    vyvFWp-s01c-042E-cPIy-Kwud-ApFc-fhuFOO
/dev/vg/lv_srv    lv          -        -           -    58720256    /dev/vg    3Zv4Sy-pTUV-xgvW-qVmz-81rE-t9iG-P9pdEi
/dev/vg/lv_tmp    lv          -        -           -    239075328   /dev/vg    C34Oew-oeCH-fOBH-KHWC-PMeA-pUeo-25mpgq
/dev/vg/lv_var    lv          -        -           -    897581056   /dev/vg    yncRs3-O7HB-xGdC-4XYy-ym7T-YZWC-NeRYjc
/dev/vg           vg          -        -           -    5997854720  /dev/sda3  0mTWcGjx3EDE58JynSvHs98KrzyVYj8k
/dev/sda3         pv          -        -           -    5997854720  -          i0QG7HgV8HO2ihd1U7X0uao6wRv9cIQ0
/dev/sda1         partition   -        -           -    16777216    /dev/sda   -
/dev/sda2         partition   -        -           -    8388608     /dev/sda   -
/dev/sda3         partition   -        -           -    5999951872  /dev/sda   -
/dev/sda          device      -        -           -    6442450944  -          -
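
To make the link between the console error and the listing explicit, here is a minimal Python sketch. It assumes the grub disk name has the form lvmid/<VG UUID>/<LV UUID>; the helper names parse_lvmid and find_lv are hypothetical, and the UUIDs are copied by hand from the output above. It only looks up which logical volume the error refers to and says nothing about why grub cannot find it.

```python
import re

# Error text copied from the VM console output above.
GRUB_ERROR = (
    "error: ../../grub-core/kern/disk.c:236:disk "
    "`lvmid/Tt0GRF-hNDS-Qj7Z-P8Jz-p5v2-fV7G-iXrSej/"
    "vyvFWp-s01c-042E-cPIy-Kwud-ApFc-fhuFOO' not found."
)

# LV UUIDs copied from the "lv" rows of the virt-filesystems listing above.
LV_UUIDS = {
    "/dev/vg/lv_audit": "y3jG40-uTqe-dOdR-Fwqm-a1ds-GpdH-E1yA4i",
    "/dev/vg/lv_home": "4LzWkW-chp5-D4X0-Gg3j-fpKx-opA2-2XXagY",
    "/dev/vg/lv_log": "CH9n90-TTnX-GvPR-T7t0-qP32-saJf-Epg53e",
    "/dev/vg/lv_root": "vyvFWp-s01c-042E-cPIy-Kwud-ApFc-fhuFOO",
    "/dev/vg/lv_srv": "3Zv4Sy-pTUV-xgvW-qVmz-81rE-t9iG-P9pdEi",
    "/dev/vg/lv_tmp": "C34Oew-oeCH-fOBH-KHWC-PMeA-pUeo-25mpgq",
    "/dev/vg/lv_var": "yncRs3-O7HB-xGdC-4XYy-ym7T-YZWC-NeRYjc",
}


def parse_lvmid(message):
    """Return (vg_uuid, lv_uuid) from an 'lvmid/<vg>/<lv>' name, or None."""
    match = re.search(r"lvmid/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)", message)
    return (match.group(1), match.group(2)) if match else None


def find_lv(lv_uuid, lv_uuids):
    """Return the LV path whose UUID equals lv_uuid, or None if absent."""
    for name, uuid in lv_uuids.items():
        if uuid == lv_uuid:
            return name
    return None


vg_uuid, lv_uuid = parse_lvmid(GRUB_ERROR)
missing_lv = find_lv(lv_uuid, LV_UUIDS)
print(missing_lv)
```

With the values above, the LV UUID in the error resolves to /dev/vg/lv_root, i.e. grub is looking for the root volume.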

The entries in /boot/loader/entries/*.conf look correct:
title CentOS Stream (5.14.0-70.el9.x86_64) 9
version 5.14.0-70.el9.x86_64
linux /boot/vmlinuz-5.14.0-70.el9.x86_64
initrd /boot/initramfs-5.14.0-70.el9.x86_64.img
options root=LABEL=img-rootfs ro console=ttyS0,115200n8 no_timer_check crashkernel=auto console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 cons...
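
For reference, a rough Python sketch of how the root= setting can be read out of a boot loader entry like the one above. The function names parse_entry and root_spec are hypothetical, and the options line is shortened to its first few arguments because the original is truncated in this comment.

```python
# Entry text copied from above; the options line is deliberately shortened
# (assumption: only the leading arguments matter for this illustration).
ENTRY = """\
title CentOS Stream (5.14.0-70.el9.x86_64) 9
version 5.14.0-70.el9.x86_64
linux /boot/vmlinuz-5.14.0-70.el9.x86_64
initrd /boot/initramfs-5.14.0-70.el9.x86_64.img
options root=LABEL=img-rootfs ro console=ttyS0,115200n8 no_timer_check
"""


def parse_entry(text):
    """Split each 'key value' line of a loader entry into a dict."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        entry[key] = value.strip()
    return entry


def root_spec(entry):
    """Return the value of the root= kernel argument, or None if missing."""
    for arg in entry.get("options", "").split():
        if arg.startswith("root="):
            return arg[len("root="):]
    return None


entry = parse_entry(ENTRY)
print(root_spec(entry))
```

The result, LABEL=img-rootfs, is the label that the listing above shows on the /dev/vg/lv_root filesystem.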


Steve Baker (steve-stevebaker) wrote :

(deleting this comment, it looks like it is an overcloud-hardened-uefi-full issue)

Steve Baker (steve-stevebaker) wrote :

jenkins-tripleo-quickstart-promote-master-centos9-current-tripleo-delorean-minimal-33 appears to be overcloud-full with UEFI boot. I'd suggest switching this to BIOS boot until the following changes land, which will ensure the grub config is regenerated and correct for UEFI boot in overcloud-full:
- https://review.opendev.org/c/openstack/diskimage-builder/+/826976, and
- the change which makes the switch: https://review.opendev.org/c/openstack/tripleo-common/+/826205

featureset001, featureset035, and Cedric's lab are all deploying overcloud-hardened-uefi-full, so for now we should assume that is unrelated to the minimal-33 failure. I've successfully downloaded and deployed [1] on local hardware, so I'm not sure yet what the issue is.

[1] https://images.rdoproject.org/centos9/master/rdo_trunk/tripleo-ci-testing/overcloud-hardened-uefi-full.qcow2

Steve Baker (steve-stevebaker) wrote :

Just looking through the jobs: recent master featureset035 runs are either green, or red with node provision succeeding:
https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master
https://review.rdoproject.org/zuul/build/07b71ab34aba4965ae9790cdac5027d7

Recent master featureset001 has one green, one red (provision failed) and one red (provision succeeds):
https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master
I don't think the image is the source of the red (provision failed). "Operation was aborted due to conductor take over"
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/7a0d9e4/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

From this can we conclude that [1] is correct and is capable of booting in CI?

I'm not sure what is happening with Cedric's environment in comment #4, that is a different error message than the other issues.

[1] https://images.rdoproject.org/centos9/master/rdo_trunk/tripleo-ci-testing/overcloud-hardened-uefi-full.qcow2

Cédric Jeanneret (cjeanner) wrote :

Hello Steve,

I gave it another try today - same issue: dropped into grub rescue, with that weird lvm thing. tripleo-lab fetches the OC image here [1]:
https://images.rdoproject.org/centos{{ os_version }}/{{tripleo_repos_branch}}/rdo_trunk/{{ oc_img_line | default('current-tripleo') }}/

in my current case, the final URI is:
https://images.rdoproject.org/centos9/master/rdo_trunk/tripleo-ci-testing/
sooo... not sure why it's failing on my env, really.
If you tell me your UTC schedule, I can manage to be available for a live session on the lab during your morning? It shouldn't be THAT late for me.

Cheers,

C.

[1] https://github.com/cjeanner/tripleo-lab/blob/master/roles/overcloud/tasks/centos-overcloud.yaml
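
As an illustration of how the URL template in this comment resolves, here is a small Python sketch. The function image_base_url is hypothetical and is not taken from tripleo-lab; it mimics the template's default('current-tripleo') by treating None as "variable not set".

```python
def image_base_url(os_version, tripleo_repos_branch, oc_img_line=None):
    """Build the image directory URL following the template above.

    oc_img_line=None stands in for an undefined variable, in which case
    the template's default of 'current-tripleo' is used.
    """
    line = "current-tripleo" if oc_img_line is None else oc_img_line
    return (
        f"https://images.rdoproject.org/centos{os_version}/"
        f"{tripleo_repos_branch}/rdo_trunk/{line}/"
    )


# Values from the case described in this comment.
print(image_base_url(9, "master", "tripleo-ci-testing"))
```

With those values the sketch produces the same final URI as quoted in the comment.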

Steve Baker (steve-stevebaker) wrote :

We did a debug session on Cedric's lab, and the issue was the libvirt VMs not being in UEFI mode.
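
For anyone hitting the same thing, a rough Python sketch of one way to check whether a libvirt domain definition is set up for UEFI. It is a heuristic under stated assumptions, not the check used in the debug session: it treats a domain as UEFI if its <os> element has firmware="efi" or contains a <loader type="pflash"> element. The function name is_uefi_domain and both sample definitions are hypothetical; real XML would come from something like virsh dumpxml.

```python
import xml.etree.ElementTree as ET


def is_uefi_domain(domain_xml):
    """Heuristically report whether a libvirt domain XML uses UEFI firmware."""
    root = ET.fromstring(domain_xml)
    os_el = root.find("os")
    if os_el is None:
        return False
    # Firmware auto-selection: <os firmware="efi">
    if os_el.get("firmware") == "efi":
        return True
    # Explicit firmware image: <loader type="pflash">...</loader>
    loader = os_el.find("loader")
    return loader is not None and loader.get("type") == "pflash"


# Hypothetical sample definitions, trimmed to the relevant part.
BIOS_DOMAIN = """
<domain type="kvm">
  <name>example-bios</name>
  <os>
    <type arch="x86_64" machine="q35">hvm</type>
    <boot dev="hd"/>
  </os>
</domain>
"""

UEFI_DOMAIN = """
<domain type="kvm">
  <name>example-uefi</name>
  <os firmware="efi">
    <type arch="x86_64" machine="q35">hvm</type>
    <boot dev="hd"/>
  </os>
</domain>
"""

print(is_uefi_domain(BIOS_DOMAIN), is_uefi_domain(UEFI_DOMAIN))
```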

The jobs are still intermittently green, which suggests the image is not the cause of the failures, so I think we can close this bug now?

Ronelle Landy (rlandy) wrote :

Chatted with Steve and Rafael. We are closing this out - we can open another bug for specific issues.

Changed in tripleo:
status: Triaged → Fix Released