Comment 9 for bug 1946149

Mauricio Faria de Oliveira (mfo) wrote :

Today I wanted to try and instrument the boot process a bit,
since we have no serial console in the nitro metal instances.

I was looking at pstore_blk (hoping we could use panic_on_warn
or panic_on_oops to capture something), but it seems to be
available only in 5.8+.
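
For reference, on a 5.8+ kernel the idea would be roughly this
(a minimal sketch; the spare block device path is an assumption,
and kmsg_size is in KiB):

  # back pstore with a spare block device to persist kernel messages
  sudo modprobe pstore_blk blkdev=/dev/nvme1n1 kmsg_size=64
  # make oopses/warnings fatal so a panic record gets captured
  sudo sysctl -w kernel.panic_on_oops=1 kernel.panic_on_warn=1
  # after the next boot, any records show up under /sys/fs/pstore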

So I decided to start with grub: keep a progress variable
in grubenv, use grub-reboot to boot 4.15.0-1113-aws _once_
(as it's expected to fail), then (force) stop and start again,
and check grubenv from 5.4.0-*-aws (which works).
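
A minimal sketch of that setup (the variable name and menu entry
title below are just placeholders):

  # pick the 4.15 kernel for the next boot only
  sudo grub-reboot 'Advanced options for Ubuntu>Ubuntu, with Linux 4.15.0-1113-aws'
  # mark the attempt from userspace before rebooting
  sudo grub-editenv /boot/grub/grubenv set boot_progress=4.15-attempted
  sudo reboot
  # ... force stop/start, boot 5.4, then inspect what survived:
  sudo grub-editenv /boot/grub/grubenv list
  # (grub itself can also record progress from a menuentry in
  #  grub.cfg, e.g.:  set boot_progress=grub-saw-4.15
  #                   save_env boot_progress )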

Interestingly, in one of those attempts 4.15.0-1113-aws WORKED.

In another attempt, I could see the progress variable set by
both the 4.15 _and_ 5.4 kernels, so it seems that grub booted
4.15 but the system didn't make it to fully booted (i.e., grub
itself seems to be working correctly).

In the other attempts I noticed that once we try to boot 4.15,
the instance becomes weird and slow to react even to the
'Force stop' method (used after a plain 'Stop' doesn't work).

...

So, since 4.15 worked/booted once, the instances seem to act
weird afterwards, and Ian just posted that he got a different
result / questioned his previous result (i.e., it might well be
a _different_ result), I wonder if somehow this particular
instance type is acting up.

Given that 4.15 worked/booted ~20 times under kexec, it's not
unreasonable to suspect something specific to the normal
(non-kexec) boot path.
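
For context, those kexec boots amount to something like this
(a sketch; standard Ubuntu paths, and --reuse-cmdline assumes
the running kernel's command line is suitable for 4.15):

  # stage the 4.15 kernel/initrd, reusing the current cmdline
  sudo kexec -l /boot/vmlinuz-4.15.0-1113-aws \
      --initrd=/boot/initrd.img-4.15.0-1113-aws --reuse-cmdline
  # jump straight into it, bypassing firmware and grub
  sudo kexec -e

Since a kexec boot skips the firmware and bootloader handoff,
success there points at how the kernel comes up under a normal
(firmware-initiated) boot.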

I think we should probably engage AWS Support to ask for a
console log using an internally available method (I've seen
that done elsewhere, IIRC), and also to clarify differences in
the boot disk among instance types r5.metal (fails), r5d.metal
(works), and r5d.24xlarge (works) -- they all have EBS/nvme
as '/'.
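
For reference, the externally-available way to pull console
output (which presumably comes up short on these instances,
hence the ask) would be something like:

  # latest console output via the public EC2 API
  # (the instance ID below is a placeholder)
  aws ec2 get-console-output --instance-id i-0123456789abcdef0 \
      --latest --output text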