Setup issue: Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

Bug #1938559 reported by Alexandru Dimofte
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Critical
Jiping Ma

Bug Description

Brief Description
-----------------
Setup step fails on all baremetal machines trying to install stx.

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
Try to install stx: 20210729T224105Z

Expected Behavior
------------------
Setup should work fine.

Actual Behavior
----------------
Stx install fails during setup.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
<One node system, Two node system, Multi-node system, Dedicated storage>

Branch/Pull Time/Commit
-----------------------
master 20210729T224105Z

Last Pass
---------
20210728T020149Z

Timestamp/Logs
--------------
15:20:38 [2021-07-30T12:20:38.869Z] ==============================================================================
15:20:38 [2021-07-30T12:20:38.869Z] Setup :: Install and configure StarlingX ISO.
15:20:38 [2021-07-30T12:20:38.869Z] ==============================================================================
15:20:45 [2021-07-30T12:20:45.417Z] Check ISO Basic Mounting BareMetal :: Test basic ISO structure and... | PASS |
15:20:45 [2021-07-30T12:20:45.418Z] ------------------------------------------------------------------------------
15:42:37 [2021-07-30T12:42:37.441Z] Install ISO BareMetal :: Installation of controller node and defin... | FAIL |
15:42:37 [2021-07-30T12:42:37.441Z] TIMEOUT: Timeout exceeded.
15:42:37 [2021-07-30T12:42:37.441Z] <pexpect.pty_spawn.spawn object at 0x7ffb1ec52250>
15:42:37 [2021-07-30T12:42:37.441Z] command: /usr/bin/ipmitool
15:42:37 [2021-07-30T12:42:37.441Z] args: ['/usr/bin/ipmitool', '-I', 'lanplus', '-H', '192.168.100.72', '-U', 'starlingx', '-P', 'Passw0rd', 'sol', 'activate']
15:42:37 [2021-07-30T12:42:37.441Z] buffer (last 100 chars): b'8299] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---\r\n'
15:42:37 [2021-07-30T12:42:37.441Z] before (last 100 chars): b'8299] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---\r\n'
15:42:37 [2021-07-30T12:42:37.441Z] after: <class 'pexpect.exceptions.TIMEOUT'>
15:42:37 [2021-07-30T12:42:37.441Z] match: None
15:42:37 [2021-07-30T12:42:37.441Z] match_index: None
15:42:37 [2021-07-30T12:42:37.441Z] exitstatus: None
15:42:37 [2021-07-30T12:42:37.441Z] flag_eof: False
15:42:37 [2021-07-30T12:42:37.441Z] pid: 36799
15:42:37 [2021-07-30T12:42:37.441Z] child_fd: 15
15:42:37 [2021-07-30T12:42:37.441Z] closed: False
15:42:37 [2021-07-30T12:42:37.441Z] timeout: 1200
15:42:37 [2021-07-30T12:42:37.441Z] delimiter: <class 'pexpect.exceptions.EOF'>
15:42:37 [2021-07-30T12:42:37.441Z] logfile: <_io.BufferedWriter name='/localdisk/starlingx/s2_duplex/test/automated-robot-suite/Results/20210730142058_Setup/iso_setup_installation.txt'>
15:42:37 [2021-07-30T12:42:37.441Z] logfile_read: None
15:42:37 [2021-07-30T12:42:37.441Z] logfile_send: None
15:42:37 [2021-07-30T12:42:37.441Z] maxread: 2000
15:42:37 [2021-07-30T12:42:37.441Z] ignorecase: False
15:42:37 [2021-07-30T12:42:37.441Z] searchwindowsize: None
15:42:37 [2021-07-30T12:42:37.441Z] delaybeforesend: 0.05
15:42:37 [2021-07-30T12:42:37.441Z] delayafterclose: 0.1
15:42:37 [2021-07-30T12:42:37.441Z] delayafterterminate: 0.1
15:42:37 [2021-07-30T12:42:37.441Z] searcher: searcher_re:
15:42:37 [2021-07-30T12:42:37.441Z] 0: re.compile(b'Welcome to')
15:42:37 [2021-07-30T12:42:37.441Z] ------------------------------------------------------------------------------
15:47:43 [2021-07-30T12:47:43.927Z] Ansible Bootstrap Configuration :: Configure controller with local... | FAIL |
15:47:43 [2021-07-30T12:47:43.927Z] Keyword 'Connect to Controller Node' failed after retrying for 5 minutes. The last error was: NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 192.168.201.52

Test Activity
-------------
Sanity

Workaround
----------
-

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

[ 17.189672] md: ... autorun DONE.^M
[ 17.193024] /dev/root: Can't open blockdev^M
[ 17.196220] tsc: Refined TSC clocksource calibration: 2294.609 MHz^M
[ 17.197109] VFS: Cannot open root device "(null)" or unknown-block(0,0): error -6^M
[ 17.203318] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x211350aecf9, max_idle_ns: 440795286341 ns^M
[ 17.210771] Please append a correct "root=" boot option; here are the available partitions:^M
[ 17.229149] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)^M
[ 17.229169] clocksource: Switched to clocksource tsc^M
[ 17.237405] CPU: 79 PID: 1 Comm: swapper/0 Tainted: G I 5.10.30-200.3.tis.el7.x86_64 #1^M
[ 17.237406] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0011.032620200659 03/26/2020^M
[ 17.237407] Call Trace:^M
[ 17.264546] dump_stack+0x57/0x6a^M
[ 17.267870] panic+0x102/0x2d2^M
[ 17.270929] mount_block_root+0x29d/0x325^M
[ 17.274937] prepare_namespace+0x135/0x164^M
[ 17.275720] hub 1-2:1.0: USB hub found^M
[ 17.279041] ? rest_init+0xc4/0xc4^M
[ 17.279045] kernel_init+0xa/0x10c^M
[ 17.282989] hub 1-2:1.0: 5 ports detected^M
[ 17.286208] ret_from_fork+0x1f/0x30^M
[ 17.289611] Kernel Offset: 0x1d000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M
[ 17.313699] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) ]---^M

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

sys_stxval@kasstxj:/localdisk/starlingx/s1_simplex/test/automated-robot-suite/Results/20210805152634_Setup$ cat iso_setup_installation.txt | grep failed
[ 8.764826] pci 0000:3b:00.0: BAR 6: failed to assign [mem size 0x00100000 pref]
[ 8.778885] pci 0000:3b:00.0: BAR 14: failed to assign [mem size 0x00100000]
[ 8.792597] pci 0000:3c:03.0: BAR 14: failed to assign [mem size 0x00100000]
[ 8.813857] pci 0000:3d:00.1: BAR 6: failed to assign [mem size 0x00080000 pref]
[ [ 14.510132] Initramfs unpacking failed: XZ-compressed data is corrupt

Revision history for this message
Austin Sun (sunausti) wrote :

This is use pxe to install controller-0, if use iso , it is ok. need check the different between pxe and iso install

Revision history for this message
Austin Sun (sunausti) wrote :

one suspect item from log
error: timeout reading `uefi/images/initrd.img'.

Revision history for this message
Austin Sun (sunausti) wrote :

If not use pxe to install controller, but use iso install directly , the controller-0 is installed successfully.

Revision history for this message
Jiping Ma (jma11) wrote :

Hi, Austin

Could you share the steps that use pxe to install controller?

Revision history for this message
Jiping Ma (jma11) wrote :

it is strange why it use "uefi/images/initrd.img".

it is "/initrd.img" in grub.cfg file in pxe-network-installer.

# Standard Controller menu
submenu 'UEFI Standard Controller Configuration' --id=standard {
  menuentry 'Serial Console' --id=serial {
    linuxefi /vmlinuz inst.ks=hd:LABEL=oe_iso_boot:/ks.cfg boot_device=sda rootfs_device=sda biosdevname=0 usbcore.autosuspend=-1 console=ttyS0,115200 inst.text serial inst.stage2=hd:LABEL=oe_iso_boot inst.gpt security_profile=standard user_namespace.enable=1
    initrdefi /initrd.img
  }
  menuentry 'Graphical Console' --id=graphical {
    linuxefi /vmlinuz inst.ks=hd:LABEL=oe_iso_boot:/ks.cfg boot_device=sda rootfs_device=sda biosdevname=0 usbcore.autosuspend=-1 console=tty0 serial inst.text inst.stage2=hd:LABEL=oe_iso_boot inst.gpt security_profile=standard user_namespace.enable=1
    initrdefi /initrd.img
  }
}

# Create a space in the menu
menuentry ' '{
     echo " "
}

Revision history for this message
Jiping Ma (jma11) wrote :

There missed initrd=initrd.img in command line. please check it and try add it and test.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Jiping:
    because they are using uefi setup , so need to use uefi initrd.img.

Revision history for this message
Austin Sun (sunausti) wrote :

i tried to reproduce this in virtual env.
qemu-img create -f qcow2 ../root.img 10G
if use “qemu-system-x86_64 -kernel vmlinuz -initrd initrd.img -hda ../root.img -append "root=/dev/hda"”, I got similar issue.
but if use command
qemu-system-x86_64 -m 2048 -kernel vmlinuz -initrd initrd.img -hda ../root.img -append "root=/dev/hda".
So I think maybe somehow the memory is not recognized in kernel.
and Yes you are correct Jiping.

Kernel command line: BOOT_IMAGE=uefi/images/vmlinuz inst.ks=http://192.168.200.3/stx/bootimage/smallsystem_ks.cfg boot_device=sda rootfs_device=sda biosdevname=0 usbcore.autosuspend=-1 console=ttyS0,115200 inst.text serial inst.stage2=http://192.168.200.3/stx/bootimage inst.gpt security_profile=standard user_namespace.enable=1 inst.repo=http://192.168.200.3/stx/bootimage

The initrd=xxx/initrd.img is missing.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Jiping:
the grub cfg
timeout=5
menuentry 'duplex'{
linuxefi uefi/images/vmlinuz inst.ks=http://192.168.200.3/stx/bootimage/smallsystem_ks.cfg boot_device=sda rootfs_device=sda biosdevname=0 usbcore.autosuspend=-1 console=ttyS0,115200 inst.text serial inst.stage2=http://192.168.200.3/stx/bootimage inst.gpt security_profile=standard user_namespace.enable=1 inst.repo=http://192.168.200.3/stx/bootimage
initrdefi uefi/images/initrd.img

initrd.img is in initrdefi.

Revision history for this message
M. Vefa Bicakci (vbicakci) wrote :

Hello Austin,

Last week Jiping asked about the UEFI+PXE bootstrap set-up. Do you have any information regarding this? Without being able to locally reproduce this issue, it will be very difficult for us to resolve it. For what it is worth, I have been using PXE bootstrap with a UEFI-based system without issues, but due to another assignment, I did not verify the specific ISO image used in your tests.

I think that we need to focus on the error message mentioning a time-out reading the initramfs image, in the log file you have posted to this bug:

  timeout reading `uefi/images/initrd.img'.

The kernel also reports a corrupted initramfs image while decompressing it, so something is wrong with the PXE bootstrap sequence:

  [ 14.510132] Initramfs unpacking failed: XZ-compressed data is corrupt

How large is the initramfs file in question? (uefi/images/initrd.img) If it is replaced with a smaller placeholder file, does the *file transfer* work without a read time-out during PXE bootstrap? (This is intended for debugging, of course; the system would definitely not boot up with a placeholder file.)

Regarding the qemu command lines you have mentioned, qemu defaults to 128 MiB of virtual machine memory. It is likely that the kernel and the (uncompressed) initramfs do not fit inside the virtual machine's 128 MiB-large system RAM, which would explain the similar error message you have encountered.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Vefa:
    Yes. I think we should focusing on reading "uefi/images/initrd.img" , The initrd.img which is unzip from the built iso. the info is
-rw-r--r-- 1 18876 751 70758984 Aug 18 00:02 /var/lib/tftpboot/uefi/images/initrd.img

This initrd.img can be uncompress in pxe server, no finding suspect point.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Vefa and Ma:
    Verification team did below test and some finding.
Case 1: stx vmlinuz + stx initrd.img + stx shim.efi + stx grubx64.efi ----- FAIL
Case 2: stx vmlinuz + stx initrd.img + stx shim.efi + Centos8 grubx64.efi ----- PASS

for grub2 rpm , stx is still using grub2-efi-x64-2.02-0.86, centos8 is using grub2-tools-2.02-99.el8. I found there is one fix between this https://src.fedoraproject.org/rpms/grub2/c/f7e054f3d693af43a903faf33b70673f7aca4508?branch=f34 , might be related with this one.

so i think either we should upgrade grub2 package , or pick-up this patch.

Revision history for this message
M. Vefa Bicakci (vbicakci) wrote :

Hello Austin,

Good find. This issue is most likely encountered occurred because the size of the initrd.img file is larger with the v5.10 kernel. Using StarlingX ISO files that I built for another task:

# 5.10 kernel's initrd
$ iso-info -f -i bootimage-5.10.iso | grep 'initrd\.img' | head -n1
 72868840 /initrd.img

# 3.10 kernel's initrd
$ iso-info -f -i bootimage-3.10.iso | grep 'initrd\.img' | head -n1
 61240096 /initrd.img

I was suspecting that the size of the file is a factor, but I did not have proof due to not being able to reproduce the issue locally. Also note that I mentioned replacing initrd.img with a smaller test file in my previous comment in this bug report. That would have helped with discovering the bug as well.

In any case, I agree with you that the patch you identified should fix this issue.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Vefa:

But as you may know, we have upgraded kernel to 4.18 last year, the initrd.img size is larger than 67107840 Bytes. but no issue found in kernel 4.18 . but only kernel 5.10 has this issue.

sunausti@sunausti-deploy:/local/sunausti/images$ ls -la 20210726T043006Z/tmp/initrd.img
-rw-r--r-- 1 18876 751 70146056 7月 26 15:28 20210726T043006Z/tmp/initrd.img
sunausti@sunausti-deploy:/local/sunausti/images$ ls -la 20210810T043006Z/tmp/initrd.img
-rw-r--r-- 1 18876 751 70754524 8月 10 15:46 20210810T043006Z/tmp/initrd.img

but from test result , we should upgrade grub2 version to 2.02-99.
Thanks

Revision history for this message
M. Vefa Bicakci (vbicakci) wrote :

Hello Austin,

I did not realize that the size of the 4.18 kernel's initramfs image is very close to the size of the 5.10 kernel's initramfs image.

If upgrading grub2 resolves this issue, and if this issue is not related to the initramfs image size, then I cannot think of how the 5.10 kernel would cause this issue. My understanding is that the kernel is not involved at all when the PXE/TFTP transfer is taking place.

Finally, I should note that I was only trying to help Jiping, and I do not have more time to work on this issue. Please work with Jiping.

Revision history for this message
Jiping Ma (jma11) wrote (last edit ):

Sorry for the delay reply. I did not receive the related email. any way I am sorry for the delay.

Hi, Austin

The size of initrd.img is 70146056 when you met the issue?
Our test team will try to reproduce it. could you share the initrd.img that can reproduce the issue?
I am not sure if we can reproduce the issue, or we apply the patch that you provide, could your test team help to test?

Thanks

Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: marking as critical / stx.6.0 since this issue is causing red sanity on the stx master branch.

Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.6.0 stx.distro.other
Changed in starlingx:
assignee: nobody → Jiping Ma (jma11)
Revision history for this message
Jiping Ma (jma11) wrote :

Our test team already helped to verify https://src.fedoraproject.org/rpms/grub2/c/f7e054f3d693af43a903faf33b70673f7aca4508?branch=f34 that can fix the issue. I will submitted the code for the review soon.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/806829

Changed in starlingx:
status: New → In Progress
Revision history for this message
Jiping Ma (jma11) wrote :

 https://review.opendev.org/c/starlingx/integ/+/806829
had submitted the code for the review.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/806829
Committed: https://opendev.org/starlingx/integ/commit/572cd24c4971c713f5fe06df55a8d78c3553c4ed
Submitter: "Zuul (22348)"
Branch: master

commit 572cd24c4971c713f5fe06df55a8d78c3553c4ed
Author: Jiping Ma <email address hidden>
Date: Tue Aug 31 20:07:04 2021 -0700

    tftp: roll over block counter to prevent timeouts with data packets

    The block number is a 16-bit counter which only allows to fetch
    files no bigger than 65535 * blksize. To avoid this limit, the
    counter is rolled over. This behavior isn't defined in RFC 1350
    but is handled by many TFTP servers and it's what GRUB was doing
    before implicitly due an overflow.

    Closes-Bug: #1938559

    Signed-off-by: Jiping Ma <email address hidden>
    Change-Id: I4a0c04cf15b87f6f40c7a979d20fb3c57657cefd

Changed in starlingx:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers