Boot test failed with Linaro OpenEmbedded minimal armv8 pre-built image

Bug #1282076 reported by Soumya Basak
This bug affects 1 person
Affects                 Status      Importance  Assigned to       Milestone
Linaro Linux Baseline   Confirmed   Critical    Andrey Konovalov
Linaro OpenEmbedded     Confirmed   Critical    Riku Voipio

Bug Description

Hi,
An issue was observed with the Linaro OpenEmbedded minimal armv8 pre-built image:

http://snapshots.linaro.org/openembedded/pre-built/vexpress64/latest

using the Foundation model with this image:

http://snapshots.linaro.org/openembedded/pre-built/vexpress64/latest/vexpress64-openembedded_minimal-armv8-gcc-4.8_20140218-609.img.gz

The manual boot test of the image failed.

The same issue is reproduced with the daily builds on LAVA:

https://validation.linaro.org/scheduler/job/111355

Please look into the issue, as it affects the QA weekly and daily tests.

Refer to the snapshot boot log for details.

Revision history for this message
Soumya Basak (soumya-basak) wrote :
description: updated
Fathi Boudra (fboudra)
Changed in linaro-oe:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → 14.02
Changed in linaro-oe:
assignee: nobody → Riku Voipio (riku-voipio)
Revision history for this message
Riku Voipio (riku-voipio) wrote :

This has been tricky to reproduce; I only got the bug to show up a few times. I did hit it in LAVA, though, so I will move the investigation there.

Revision history for this message
Riku Voipio (riku-voipio) wrote :

I only added the variable VERBOSE=yes to /etc/default/rcS inside the rootfs, to find out exactly where it hangs, and now the bug is no longer reproducible in LAVA either:

https://validation.linaro.org/dashboard/streams/public/team/linaro/pre-built-vexpress64/bundles/85c3c4032d8812f5708f44cf9f02d4d9e839c5d9/
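
For anyone repeating this experiment, the rcS change can be scripted. The following is a minimal sketch, not the method used in this comment: the helper name set_default and the sample file contents are made up for illustration, and it only edits text in memory, leaving mounting the image and writing the file back to the reader.

```python
def set_default(text, key, value):
    """Return text with KEY=value set.

    An existing uncommented assignment is replaced in place;
    otherwise the assignment is appended at the end.
    """
    lines = text.splitlines()
    assignment = f"{key}={value}"
    for i, line in enumerate(lines):
        if line.strip().startswith(f"{key}="):
            lines[i] = assignment
            break
    else:
        # No existing assignment found, so add one.
        lines.append(assignment)
    return "\n".join(lines) + "\n"


# Illustrative contents only, not the real /etc/default/rcS from the image.
sample = "TMPTIME=0\nVERBOSE=no\n"
print(set_default(sample, "VERBOSE", "yes"), end="")
```

Since this change alone made the hang disappear in LAVA, it is worth keeping an unmodified copy of the image for comparison.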

Revision history for this message
Fathi Boudra (fboudra) wrote :

Ryan is reporting that it isn't booting for him:
"Populating dev cache" times out on LAVA, but it *eventually* progresses to the next stage "random: nonblocking pool is initialized", but that has hung for about 10 mins so far.

the terminal is responsive to keypresses (enter) but it's not booted to the cmdline yet.

I tried to boot that kernel with the 14.01 minimal image; it dies with "Kernel panic - not syncing: No working init found".
I tried the 14.01 kernel with this snapshot's minimal disk image; "Populating dev cache" again took a very long time and hung there.

conclusion: the rootfs is buggered

Revision history for this message
Soumya Basak (soumya-basak) wrote :

Boot test with the OE 14.02 release builds:
http://snapshots.linaro.org/openembedded/pre-built/vexpress64/611/

The boot test passes for the minimal, lamp, and leg builds.

Revision history for this message
Riku Voipio (riku-voipio) wrote :

I don't think the bug has been found, but since people are no longer hitting it, I'm dropping the severity.

Changed in linaro-oe:
importance: Critical → Medium
Fathi Boudra (fboudra)
Changed in linaro-oe:
milestone: 14.02 → 14.04
Revision history for this message
Fathi Boudra (fboudra) wrote :

The build boots fine but still fails in LAVA. Milosz, have you seen that before?

https://validation.linaro.org/scheduler/job/119418 (submitted by CI)
vs
https://validation.linaro.org/scheduler/job/119478 (submitted by me)

Revision history for this message
Milosz Wasilewski (mwasilew) wrote :

I also had some strange boot failures, even with tests submitted manually: https://validation.linaro.org/scheduler/job/117286. Then it booted just fine when running locally. This is not exactly the same problem, but it is still worrying that the LAVA results are not always reproducible. I'll point the LAVA team here.

Revision history for this message
Milosz Wasilewski (mwasilew) wrote :

It seems it's model 05 that makes the difference:
https://validation.linaro.org/scheduler/job/119674

Revision history for this message
Fathi Boudra (fboudra) wrote :

It seems we still have the same unknown issue as before:
https://validation.linaro.org/scheduler/job/128385/log_file

Changed in linaro-oe:
milestone: 14.04 → 14.05
importance: Medium → Critical
Revision history for this message
koen (koenkooi) wrote :

it does boot using the fvp model:

~/linaro/Foundation_v8pkg/models/Linux64_GCC-4.1/Foundation_v8 --block-device=/home/koen/linaro/test/vexpress64-openembedded_lamp-armv8-gcc-4.8_20140522-651.img --data=/media/1/fvp_bl1.bin@0x0 --data=/media/1/uefi_fvp-base.bin@0x8000000 --gicv3 --no-secure-memory

Image from https://validation.linaro.org/scheduler/job/128386/log_file

Revision history for this message
Milosz Wasilewski (mwasilew) wrote :

@koen
you seem to be using the Foundation model. The job you pointed to doesn't use uefi_fvp-base.bin any more (it's included in fvp_fip.bin now). I'm getting the same hang in init locally with the GICv2 DTB. With the GICv3 DTB the kernel crashes. Here is my command line:

/home/milosz/FVP_Base_AEMv8A-AEMv8A/models/Linux64_GCC-4.1/FVP_Base_AEMv8A-AEMv8A -C bp.virtioblockdevice.image_path=$PWD/vexpress64-openembedded_lamp-armv8-gcc-4.8_20140522-651.img -C bp.secure_memory=0 -C bp.smsc_91c111.mac_address="72:43:BB:97:73:BA" -C pctl.startup=0.0.0.0 -C bp.pl011_uart0.untimed_fifos=1 -C bp.flashloader0.fname=$PWD/fvp_fip.bin -C bp.secureflashloader.fname=$PWD/bl1.bin -C cluster1.NUM_CORES=4 -C bp.smsc_91c111.enabled=true -C cache_state_modelled=0 -C cluster0.NUM_CORES=4

bl1.bin points to fvp_bl1.bin from the image.

Revision history for this message
koen (koenkooi) wrote :

I know it's a different model; I didn't have access to the base model till 2 hours ago :) My main goal was to see whether the problem is with the filesystem itself (e.g. a missing /bin/sh or something) or something else.

Revision history for this message
Fathi Boudra (fboudra) wrote :

fwiw, LSK boots fine on rtsm aemv8a using the same rootfs. Adding linux-linaro. Once again, it's a release blocker.

Changed in linaro-linux-baseline:
importance: Undecided → Critical
assignee: nobody → Andrey Konovalov (andrey-konovalov)
Revision history for this message
Ryan Harkin (ryanharkin) wrote :

Having spoken to various people about the CPU idle problem and whether or not the problems we are having are related, I decided to do some boot testing of our various releases.

Here are my notes of recent releases vs bootability:

- with networking enabled
- by 14.05, I mean the 1st release image/build with CONFIG_CPU_IDLE=y
    http://releases.linaro.org/14.05/openembedded/aarch64/vexpress64-openembedded_minimal-armv8-gcc-4.8_20140525-654.img.gz

14.01: boots fine every time
14.02: hung on 1st boot, 2nd was fine
14.03: booted fine 1st time
14.04: booted fine 1st time
14.05: hung on 1st, 2nd, 3rd boot (as far as I can tell, it always hangs)

Using TF, kernel and DTB from 14.01 and the rootfs from 14.05:
Boot #1: hung
Boot #2: hung
Boot #3: booted
Boot #4: booted
Boot #5: booted

Using TF, kernel and DTB from 14.05 and the rootfs from 14.01:
Boot #1: booted
Boot #2: booted
Boot #3: booted
Boot #4: booted

So, after all those tests, I decided to re-run pure 14.05 again:
Boot #1: hung
Boot #2: hung
Boot #3: hung
Boot #4: hung

note: not all hangs happen in the same place, but there are some common hang points for each release, eg after getting DHCP lease.
note: by hang, I mean the model appears to still be alive and responding to keypresses, but whatever is running gets blocked somehow and doesn't continue. I didn't wait very long for it to continue, usually about 1 minute. The model responds to pings.
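
The mixed-component runs above can be tallied with a short script to make the hang rate per combination easier to compare. This is a minimal sketch: the outcomes are copied from the notes, while the dictionary labels and the function name summarize are chosen here for illustration.

```python
from collections import Counter

# Outcomes copied from the boot notes above; the labels are illustrative.
runs = {
    "14.01 TF/kernel/DTB + 14.05 rootfs": ["hung", "hung", "booted", "booted", "booted"],
    "14.05 TF/kernel/DTB + 14.01 rootfs": ["booted"] * 4,
    "14.05 (re-run)": ["hung"] * 4,
}


def summarize(runs):
    """Return {label: (number of hangs, number of boots attempted)}."""
    summary = {}
    for label, outcomes in runs.items():
        counts = Counter(outcomes)
        summary[label] = (counts["hung"], len(outcomes))
    return summary


for label, (hung, total) in summarize(runs).items():
    print(f"{label}: {hung}/{total} hung")
```

With only four or five boots per combination, these counts are suggestive rather than conclusive, especially since the hang is intermittent.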

Changed in linaro-linux-baseline:
status: New → Confirmed
Revision history for this message
Andrey Konovalov (andrey-konovalov) wrote :

Looks like there is a workaround in the kernel configuration.

Adding my.conf with:
CONFIG_PREEMPT=y
# CONFIG_DEBUG_PREEMPT is not set
makes the kernel boot on the FVP Base model.

I've used (with the ll-20140526.0 tree, so CONFIG_CPU_IDLE was enabled):
ARCH=arm64 scripts/kconfig/merge_config.sh linaro/configs/linaro-base.conf linaro/configs/linaro-base64.conf linaro/configs/distribution.conf linaro/configs/kvm-guest.conf linaro/configs/kvm-host.conf linaro/configs/vexpress64.conf my.conf

Adding this my.conf results in the following change to .config:
-----8<-----
-INLINE_READ_UNLOCK y
-INLINE_READ_UNLOCK_IRQ y
-INLINE_SPIN_UNLOCK_IRQ y
-INLINE_WRITE_UNLOCK y
-INLINE_WRITE_UNLOCK_IRQ y
-TREE_RCU y
 PREEMPT n -> y
 PREEMPT_NONE y -> n
 PREEMPT_RCU n -> y
+DEBUG_PREEMPT n
+PREEMPT_COUNT y
+PREEMPT_TRACER n
+PROVE_RCU_DELAY n
+RCU_BOOST n
+RCU_CPU_STALL_VERBOSE y
+TREE_PREEMPT_RCU y
+UNINLINE_SPIN_UNLOCK y
-----8<-----
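
The listing above resembles what the kernel tree's scripts/diffconfig prints, though the comment does not say which tool produced it. To compare two .config files without a kernel tree at hand, a minimal sketch along the same lines follows. The function names and the two small config fragments are made up for illustration and are not the real configs from this bug.

```python
import re


def parse_config(text):
    """Map option names (without the CONFIG_ prefix) to their values.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^CONFIG_(\w+)=(.*)$", line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = re.match(r"^# CONFIG_(\w+) is not set$", line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def diff_config(old_text, new_text):
    """List removed (-), changed (space) and added (+) options."""
    old, new = parse_config(old_text), parse_config(new_text)
    lines = []
    for name in sorted(old.keys() - new.keys()):
        lines.append(f"-{name} {old[name]}")
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            lines.append(f" {name} {old[name]} -> {new[name]}")
    for name in sorted(new.keys() - old.keys()):
        lines.append(f"+{name} {new[name]}")
    return lines


# Tiny illustrative fragments, not the actual .config files.
old_cfg = "CONFIG_TREE_RCU=y\n# CONFIG_PREEMPT is not set\nCONFIG_PREEMPT_NONE=y\n"
new_cfg = (
    "CONFIG_PREEMPT=y\n"
    "# CONFIG_PREEMPT_NONE is not set\n"
    "# CONFIG_DEBUG_PREEMPT is not set\n"
    "CONFIG_PREEMPT_COUNT=y\n"
)
print("\n".join(diff_config(old_cfg, new_cfg)))
```

Running it on the .config produced with and without my.conf should show which options the merge actually changed.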

Revision history for this message
Andrey Konovalov (andrey-konovalov) wrote :

Forgot to mention that CONFIG_DEBUG_PREEMPT=y (the default if CONFIG_PREEMPT=y) also works. I just wanted to check whether this is one of those cases where adding more debug info "fixes" the issue.
