StarlingX

Main controller install fails - Layered build

Bug #1863340 reported by Cristopher Lemus on 2020-02-14

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Invalid	High	Scott Little

Bug Description

Brief Description
-----------------
Installation of the main controller fails during the initial boot. The message "Warning: Could not boot." is logged to the console.

Severity
--------
Critical: System won't boot.

Steps to Reproduce
------------------
Using the latest layered build, after selecting the boot options (all-in-one and serial console), the error appears and the system won't boot.

Expected Behavior
------------------
Main controller installation completes, the system boots, and the setup steps can continue.

Actual Behavior
----------------
Main controller does not boot.

Reproducibility
---------------
100%

System Configuration
--------------------
Simplex - Baremetal

Branch/Pull Time/Commit
-----------------------
Layered build taken from: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/flock/20200213T223559Z/

Last Pass
---------
This is the first try with the layered build; the issue does not occur on the master branch.

Timestamp/Logs
--------------
Added error messages found on console output: http://paste.openstack.org/show/789593/

Full console log attached: "iso_setup_installation.txt"

Test Activity
-------------
Sanity (for layered build).

Cristopher Lemus (cjlemusc) wrote on 2020-02-14:

Serial Console output (164.0 KiB, text/plain)

description:	updated

Ghada Khalil (gkhalil) wrote on 2020-02-14:

stx.4.0 / high priority - serious issue w/ layered build

tags:	added: stx.4.0 stx.build
Changed in starlingx:
importance:	Undecided → High
status:	New → Triaged
assignee:	nobody → Scott Little (slittle1)

Cristopher Lemus (cjlemusc) wrote on 2020-02-14:

Previously I stated that I used the latest image available, which is:

However, I think I actually used the build before the latest one (latest - 1). I'm pulling the latest image available and will update the bug.

Cristopher Lemus (cjlemusc) wrote on 2020-02-16:

I tried this again with build 20200213T223559Z, which is the latest one available. I hit the exact same error as with the previous ISO, on different hardware (a completely different server, with different disks).

[ 205.774489] dracut-initqueue[2043]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 205.774758] dracut-initqueue[2043]: Warning: Could not boot.
[ OK ] Reached target System Initialization.
[ OK ] Listening on Open-iSCSI iscsiuio Socket.
[ OK ] Reached target Sockets.
         Starting Device-Mapper Multipath Device Controller...
[ OK ] Started Device-Mapper Multipath Device Controller.
         Starting Open-iSCSI...
[ OK ] Started Show Plymouth Boot Screen.
[ OK ] Reached target Paths.
[ OK ] Started Forward Password Requests to Plymouth Directory Watch.
[ OK ] Reached target Basic System.
[ OK ] Started Open-iSCSI.
         Starting dracut initqueue hook...
[ 17.735444] dracut-initqueue[2043]: RTNETLINK answers: File exists
[ 143.984130] dracut-initqueue[2043]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 144.511867] dracut-initqueue[2043]: Warning: dracut-initqueue timeout - starting timeout scripts

Don Penney (dpenney) wrote on 2020-02-18:

This looks like it's failing to find the LiveOS/squashfs.img file. Assuming you're still using a custom network boot setup, please verify that this file is accessible via the path you've specified to the boot cmdline.

Scott Little (slittle1) wrote on 2020-02-19:

I've installed the load on both virtual and hardware environments. Installed from CDROM and over PXE. So far I'm not reproducing your results.

Can you provide details on your installation method?

Scott

Cristopher Lemus (cjlemusc) wrote on 2020-02-19:

Hi Scott, Don,

Just to confirm, we are still using our custom network boot to do the initial install of controller-0. That's part of our automation.

In general, the automation for the install consists of:

- Mount the ISO and expose it through a local HTTP server under /var/www/html/; clients are able to see everything on it.
- Use the TFTP service to send a uefi/shim.efi file that redirects the boot to the HTTP location listed above. The TFTP service is exposed on /var/lib/tftpboot/.
- I think the EFI file is constructed to include the HTTP route and some boot options to select between AIO and standard, and between serial and graphical console. I will check for further details and update ASAP.

I think this method was implemented before 2.0. It still works for the master branch: results were sent for today's build (Feb 19), where it worked properly.

I can also confirm that installing from a USB drive plugged into the server works just fine. There are no issues with that install; it is just our automation that is broken.

Has something changed for the layered build vs master build? Some EFI options? Paths? Boot options?

I will check for further details, especially on the EFI file, but if you have any suggestions, I'd appreciate them a lot.

Cristopher Lemus (cjlemusc) wrote on 2020-02-19:

These are the boot options that the automation sets to boot:

linuxefi uefi/images/vmlinuz inst.ks=http://192.168.200.3/stx/bootimage/smallsystem_ks.cfg boot_device=sda rootfs_device=sda biosdevname=0 usbcore.autosuspend=-1 console=ttyS0,115200 inst.text serial inst.stage2=http://192.168.200.3/stx/bootimage inst.gpt security_profile=standard user_namespace.enable=1 inst.repo=http://192.168.200.3/stx/bootimage
initrdefi uefi/images/initrd.img
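
To make it easier to see which file the initramfs is expected to fetch with these options, here is a minimal sketch that parses the command line and derives the stage2 image URL. It assumes the installer looks for LiveOS/squashfs.img under the inst.stage2 location, as suggested earlier in this thread. The helper names (parse_cmdline, squashfs_url) are made up for illustration and are not part of our automation or of StarlingX.

```python
import shlex

# Boot options copied from the linuxefi line above (kernel path dropped).
CMDLINE = (
    "inst.ks=http://192.168.200.3/stx/bootimage/smallsystem_ks.cfg "
    "boot_device=sda rootfs_device=sda biosdevname=0 usbcore.autosuspend=-1 "
    "console=ttyS0,115200 inst.text serial "
    "inst.stage2=http://192.168.200.3/stx/bootimage inst.gpt "
    "security_profile=standard user_namespace.enable=1 "
    "inst.repo=http://192.168.200.3/stx/bootimage"
)


def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    args = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        args[key] = value if sep else None
    return args


def squashfs_url(cmdline):
    """Return the assumed stage2 image URL: <inst.stage2>/LiveOS/squashfs.img."""
    stage2 = parse_cmdline(cmdline).get("inst.stage2")
    if not stage2:
        raise ValueError("inst.stage2 is not set on the command line")
    return stage2.rstrip("/") + "/LiveOS/squashfs.img"


if __name__ == "__main__":
    print(squashfs_url(CMDLINE))
```

The printed URL is the one to request by hand from a client on the same network when checking that the HTTP server exposes the mounted ISO correctly.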

Don Penney (dpenney) wrote on 2020-02-19:

As mentioned previously, you should look into using the pxeboot_*.cfg kickstarts for your network boot, but that would not be related to this issue.

I would verify that stx/bootimage/LiveOS/squashfs.img exists and that you can do a:
wget http://192.168.200.3/stx/bootimage/LiveOS/squashfs.img

Check the httpd logs on 192.168.200.3 to see if you see the request, or an error.

Probably also good to verify the checksums of the squashfs.img, initrd.img, and vmlinuz files.

I'm assuming there are no changes in your DHCP config.

You can also check your console log to verify that your boot interface is coming up. Maybe you can post a console log for a working session, up to the point where anaconda starts, for comparison.
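
The checksum comparison suggested above could be sketched roughly as follows. This is only an illustration: the file paths in the usage comment are hypothetical, and SHA-256 is an assumption, since the thread does not say which digest (if any) is published alongside the build.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(pairs):
    """Given {label: (reference_path, served_path)}, return labels whose digests differ."""
    return [
        label
        for label, (reference, served) in pairs.items()
        if file_sha256(reference) != file_sha256(served)
    ]


# Example usage (hypothetical paths: a known-good copy vs. the copy served over HTTP/TFTP):
# find_mismatches({
#     "squashfs.img": ("/mnt/good/LiveOS/squashfs.img",
#                      "/var/www/html/stx/bootimage/LiveOS/squashfs.img"),
#     "vmlinuz": ("/mnt/good/vmlinuz", "/var/lib/tftpboot/uefi/images/vmlinuz"),
# })
```

An empty list means every served file matches its reference copy; any label returned points at a file that was likely corrupted during download, mount, or copy.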

Cristopher Lemus (cjlemusc) wrote on 2020-02-21:

Hi Don,

I tried using the latest layered build: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/flock/20200220T172409Z/outputs/iso/

With that build, the exact same installation method worked without hitting this issue. I assume a corrupt file was the cause, probably introduced during the download or the mount.

We'll use this build for the sanity run.

Thanks for the reminder about pxeboot_*.cfg, we'll check and adapt our automation.

Cristopher Lemus (cjlemusc) wrote on 2020-02-21:

I have tried a handful of installs using the latest build: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/flock/20200220T172409Z/

This issue is no longer appearing. I think it's safe to close it.

Scott Little (slittle1) on 2020-05-07