No boot possible - Cannot open root device

Bug #1950792 reported by Justin Lamp
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| MAAS | Expired | Medium | Unassigned | |
| linux (Ubuntu) | Expired | Undecided | Unassigned | |

Bug Description

MAAS cannot boot my X11DPi-NT, so commissioning fails. The failure happens immediately after GRUB.

The boot fails with a message saying that it is not able to open the root device. Because the URL is somehow cut off a bit, I even tried serving the squashfs from a Python web server on a different port, but to no avail. The Python web server doesn't see a single connection attempt.

(Custom kernel params: `root=squash:http://10.0.33.253:8080/squashfs`; Python web server: `python3 -m http.server 8080`)
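For anyone repeating this test, a minimal sketch of a file server that also keeps an in-memory record of every request may make the "no connection at all" observation easier to confirm. It uses only the Python standard library; the names `LoggingHandler` and `make_server` are hypothetical, and port 8080 is simply the port from the report above.

```python
import functools
import http.server
import threading


class LoggingHandler(http.server.SimpleHTTPRequestHandler):
    """Serve files and remember every request line, including failed ones."""

    # (client_ip, request_line) tuples shared by all handler instances.
    seen = []

    def log_request(self, code="-", size="-"):
        # Called for every response, so 404s are recorded as well.
        LoggingHandler.seen.append((self.client_address[0], self.requestline))
        super().log_request(code, size)


def make_server(directory, host="0.0.0.0", port=8080):
    """Build (but do not start) a threaded server rooted at `directory`."""
    handler = functools.partial(LoggingHandler, directory=directory)
    return http.server.ThreadingHTTPServer((host, port), handler)
```

Calling `make_server(".").serve_forever()` in the directory holding the squashfs would behave much like `python3 -m http.server 8080`, while an empty `LoggingHandler.seen` afterwards would indicate that the machine never reached the server.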

Steps to reproduce:

1. Set up MAAS ;)
2. Try to commission the server
3. See the kernel and initrd download fly by
4. Boom, kernel panic

Hardware:
Supermicro X11DPi-NT with UEFI http Networkboot enabled
NIC: Intel X722

Revision history for this message
Justin Lamp (modzilla) wrote :

Oh I forgot to mention that I am running the latest 3.1~beta5

Revision history for this message
Alberto Donato (ack) wrote :

can you please attach regiond.log and rackd.log from around the time when the server has been powered on?

Changed in maas:
status: New → Incomplete
Revision history for this message
Justin Lamp (modzilla) wrote :

Sorry it took so long, was quite busy, but here we go!

Revision history for this message
Alberto Donato (ack) wrote :

Hi, could you please confirm that boot works normally with other machines (different hardware) and it's only this specific model that is having issues?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Justin Lamp (modzilla) wrote :

Yes I have successfully booted a KVM! :)

Revision history for this message
Alberto Donato (ack) wrote :

I wonder if this could be a network issue during the bootloader phase.

Could you please make sure the bmc firmware is up to date?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Justin Lamp (modzilla) wrote :

I installed the latest firmware (BMC and BIOS) and the issue still persists. I even started a KVM on that host and it did just boot into the os.

Revision history for this message
Alberto Donato (ack) wrote :

From what I see, this could indeed be an issue with a missing driver for the specific network card in the initrd.

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1950792

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Stefan Fleischmann (sfleischmann) wrote :

I am having similar problems since updating from 3.0 to the latest release 3.2.6

We mainly use legacy boot (instead of UEFI) and the boot order of all our nodes is set to:
1. Network/PXE
2. harddisk

A deployed node would boot PXE, and then MAAS immediately tells it (not sure how exactly) to boot from the hard drive instead. This doesn't seem to work anymore with 3.2.6. Instead of booting, the node stops with a kernel panic. The error messages I can see on the screen are (nothing in the SOL at this early stage):

```
VFS: Cannot open root device "squash:http://172.17.100.3:5248/images/ubuntu/amd64/ga-20.04/fo" or unknown-block(0,0): error -6
Please append a correct "root=" boot option; here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
```
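As a rough aid for comparing such messages across machines, the sketch below pulls the quoted root device out of a panic line and reports its length. The `expected_suffix` default of `squashfs` is an assumption taken from the custom URL in the original report, not something MAAS is confirmed to guarantee, and the function names are hypothetical.

```python
import re

# Matches the quoted root device in a "VFS: Cannot open root device" line.
ROOT_RE = re.compile(r'VFS: Cannot open root device "([^"]*)"')


def root_device(panic_text):
    """Return the quoted root device string, or None if no such line exists."""
    match = ROOT_RE.search(panic_text)
    return match.group(1) if match else None


def describe_root(panic_text, expected_suffix="squashfs"):
    """Summarise the root device: its text, its length, and a truncation hint."""
    root = root_device(panic_text)
    if root is None:
        return None
    return {
        "root": root,
        "length": len(root),
        # Assumption: a complete URL would end with `expected_suffix`.
        "looks_truncated": not root.endswith(expected_suffix),
    }


# The first line of the panic output quoted above.
PANIC = (
    'VFS: Cannot open root device '
    '"squash:http://172.17.100.3:5248/images/ubuntu/amd64/ga-20.04/fo" '
    'or unknown-block(0,0): error -6'
)

if __name__ == "__main__":
    print(describe_root(PANIC))
```

If the reported length is the same on every affected machine regardless of the URL, that would hint at a fixed-size limit somewhere in the chain, though this is a guess rather than something established in this thread.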

If I set the hard disk as the first boot device in the BIOS, the node boots okay. Commissioning a node works, and new deployments work up to the point where the node would boot into the newly installed OS; then the above error prevents it from succeeding. I then need to manually change the boot order to allow the deployment to succeed.

Any idea what has changed here from 3.0 to 3.2?

What is the recommended boot order for legacy boot now? I suppose nowadays MAAS requests PXE boot via IPMI whenever needed, so hard disk as first boot option should work. But I remember this was not always the case when we started using it.

Revision history for this message
Stefan Fleischmann (sfleischmann) wrote :

Happy to provide more info/logs/whatever if you can tell me where to look ;-)

Revision history for this message
Boris Lukashev (rageltman) wrote :

Still a problem on _random_ Supermicro hardware: I have 15 identical nodes (AS -1123US-TR4 on H11DSU-iN motherboards), set the kernel parameters to `console=ttyS0 console=tty0`, and now _one_ node will not commission, and therefore will not provision, with this error. MAAS (snap) at 3.2.7-12037-g.c688dd446 or 3.3.2-13177-g.a73a6e2bd.

On a related note: the maas UI doesn't update the parameters passed when saved, user has to refresh the page or it keeps showing up as the last set of parameters at page-load.

After setting the parameters back to blank, this one node still cannot boot from MAAS - whatever broke, broke persistently.

Changing which image is sent (focal or jammy) does nothing, same with selecting various kernels. Once a user touches the kernel command line parameters, even if they subsequently remove them, it will break at least some nodes' ability to boot - this seems like a high severity bug which has been open for 2y.

My case is occurring in pure UEFI mode (legacy boot disabled), after a failed deployment of Ceph via Juju already installed Ubuntu on the machine. Prior to the installation of Ubuntu, the host did boot correctly with kernel parameters set.

Running `maas admin maas set-config name=kernel_opts value=" console=ttyS0 console=tty0 "`, installing an OS, and then trying to boot the node from PXE via MAAS again causes the breakage. Subsequently clearing `kernel_opts` to read as
```
$ maas admin maas get-config name=kernel_opts
Success.
Machine-readable output follows:
""
```
does not fix whatever the initial setting broke. Looks like there is a side-effect that cannot be undone.

Revision history for this message
Boris Lukashev (rageltman) wrote :

Figured out what made that one node different from the rest: its UEFI network boot stack had both PXE and HTTP enabled, while the others were PXE only.
After I added the kernel boot parameters, MAAS started to truncate the HTTP URI by a few characters (and now keeps doing that even if the kernel command-line options are cleared), which doesn't impact the PXE clients but does break the HTTP ones (this node worked fine before the params were added).

For anyone stuck in this situation, the workaround seems to be to disable HTTP-based image acquisition in the system's BIOS configuration menu - PXE only on the IP stack desired will avoid the oddly truncated HTTP URI which occurs after setting kernel boot parameters.

Revision history for this message
Bartosz Woronicz (mastier1) wrote :

I also encountered something similar: some compute nodes are randomly unable to boot.
I may try disabling this HTTP boot.

$ maas root maas get-config name=kernel_opts
Success.
Machine-readable output follows:
"console=tty0 console=ttyS0,115200n8"

Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

We suspect that on the affected systems the length of the kernel options cmdline exceeds the allowed maximum. Would it be possible to share the kernel options MAAS sends for machines that fail to deploy this way? The kernel options could be retrieved using IPMI console.
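Once the command line has been captured (for example from the IPMI console or `/proc/cmdline`), a small sketch like the following could be used to measure it. The default `max_total` of 2048 is an assumption based on the usual x86 kernel command-line size; the limit that actually matters on the affected systems is not established in this thread, and the function name is hypothetical.

```python
def cmdline_report(cmdline, max_total=2048):
    """Summarise a kernel command line: total length and its longest arguments.

    `max_total` is an assumed limit; pass the value relevant to the
    bootloader or kernel under suspicion.
    """
    cmdline = cmdline.strip()
    args = cmdline.split()
    return {
        "total_length": len(cmdline),
        "exceeds_limit": len(cmdline) > max_total,
        # Longest arguments first, since long URLs are the likely culprits.
        "longest_args": sorted(args, key=len, reverse=True)[:3],
    }


if __name__ == "__main__":
    # Only the user-set options quoted earlier in this thread, not the
    # full command line that MAAS generates.
    print(cmdline_report("console=tty0 console=ttyS0,115200n8"))
```

Comparing the report for a machine that boots with one that fails would show whether the total length, or a single long argument such as the `root=` URL, differs between them.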

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired