Ampere AltraMax sometimes hangs after "EFI stub: Exiting boot services..."

Bug #1990294 reported by dann frazier
This bug affects 1 person
Affects          Status    Importance   Assigned to    Milestone
grub2 (Ubuntu)   Invalid   Undecided    Unassigned
linux (Ubuntu)   Invalid   Undecided    Taihsiang Ho

Bug Description

A kernel trace occurs while bringing up a secondary CPU with the focal HWE kernel 5.15.0-43-generic.

Steps to reproduce: enable earlycon and run a reboot loop
Expected result: the system boots and is ready to use
Actual result: a kernel trace occurs while bringing up a CPU. See the call trace below[1].

[1] (copied from comment #7)

[ 3.321372] arch_timer: Enabling local workaround for ARM erratum 1418040
[ 3.321394] CPU245: Booted secondary processor 0x0100310100 [0x413fd0c1]
[ 8.536939] CPU246: failed to come online
[ 8.550338] Detected PIPT I-cache on CPU246
[ 8.555817] CPU246: failed in unknown state : 0x0
[ 8.555987] GICv3: CPU246: found redistributor 100370000 region 1:0x0000500100f00000
[ 8.562727] ------------[ cut here ]------------
[ 8.562809] GICv3: CPU246: using allocated LPI pending table @0x0000080002680000
[ 8.569364] Dying CPU not properly vacated!
[ 16.668834] WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:9215 sched_cpu_dying+0x1d0/0x2a0
[ 16.681496] Modules linked in:
[ 16.684597] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.0-43-generic #46~20.04.1-Ubuntu
[ 16.693010] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 16.700095] pc : sched_cpu_dying+0x1d0/0x2a0
[ 16.704432] lr : sched_cpu_dying+0x1d0/0x2a0
[ 16.708770] sp : ffff80000881bbe0
[ 16.712132] x29: ffff80000881bbe0 x28: ffffcca930eada80 x27: 0000000000000000
[ 16.719390] x26: ffff7395107ce000 x25: 00000000000000f6 x24: ffff403e40ed6728
[ 16.726651] x23: ffffcca930eb2618 x22: 0000000000000000 x21: 00000000000000f6
[ 16.733907] x20: ffffcca930719100 x19: ffff403e40ee7100 x18: 0000000000000014
[ 16.741165] x17: 3836323030303830 x16: 3030303078304020 x15: 656c62617420676e
[ 16.748425] x14: 69646e6570204950 x13: 3030303038363230 x12: 3030383030303030
[ 16.755683] x11: 78304020656c6261 x10: 7420676e69646e65 x9 : ffffcca92e7e39a0
[ 16.762940] x8 : 6c6c6120676e6973 x7 : 205d393038323635 x6 : c0000000fffeffff
[ 16.770198] x5 : 000000000017ffe8 x4 : 0000000000000000 x3 : ffff80000881b8c8
[ 16.777457] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 16.784717] Call trace:
[ 16.787197] sched_cpu_dying+0x1d0/0x2a0
[ 16.791182] cpuhp_invoke_callback+0x14c/0x5a8
[ 16.795699] cpuhp_invoke_callback_range+0x5c/0xb8
[ 16.800568] _cpu_up+0x234/0x360
[ 16.803844] cpu_up+0xb8/0x110
[ 16.806945] bringup_nonboot_cpus+0x7c/0xb0
[ 16.811195] smp_init+0x3c/0x98
[ 16.814385] kernel_init_freeable+0x1a4/0x39c
[ 16.818813] kernel_init+0x2c/0x158
[ 16.822359] ret_from_fork+0x10/0x20
[ 16.825992] ---[ end trace bb9ca924bc23dd10 ]---
[ 16.830685] CPU246 enqueued tasks (0 total):
[ 16.835356] ------------[ cut here ]------------
[ 16.835431] GICv3: CPU246: found redistributor 100370000 region 1:0x0000500100f00000
[ 16.840045] Dying CPU not properly vacated!
[ 16.847925] WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:9215 sched_cpu_dying+0x1d0/0x2a

== Original Description ==

When kernel testing rebooted into the 5.15.0-43-generic HWE kernel, no output appeared on the console after the EFI stub messages:

Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint A0
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint 92
Checkpoint AD
EFI stub: Booting Linux Kernel...
EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

dann frazier (dannf) wrote :

I put our system in a reboot loop to see if I could reproduce and, after 16 boots of 5.15.0-43-generic, it hung again. Seems we have a real issue here.

As a next step, I suggest installing the same OS (focal) and downgrading to the same kernel (5.15.0-43) on the same system (papat) and adding "earlycon" to the kernel command line, then restarting a reboot loop. If the system still hangs w/ earlycon, we might get more information out of the kernel.

= Quick way to set up a reboot loop =
sudo apt-add-repository -y ppa:dannf/dannf && sudo apt install -y dannf
dannf-setup-reboot-loop.sh
sudo reboot

To stop the reboot loop, interrupt GRUB and add "stop" to the kernel command line.
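
For the "earlycon" step suggested above, here is a minimal sketch of one way to make the change persistent. It assumes the stock Ubuntu layout, where the kernel command line is built from GRUB_CMDLINE_LINUX in /etc/default/grub; the helper name add_kernel_param is hypothetical and is not part of the dannf package.

```python
import re

# Matches only the GRUB_CMDLINE_LINUX="..." line, not GRUB_CMDLINE_LINUX_DEFAULT.
_CMDLINE = re.compile(r'^(GRUB_CMDLINE_LINUX=")([^"]*)(")[ \t]*$', re.MULTILINE)


def add_kernel_param(grub_default_text: str, param: str = "earlycon") -> str:
    """Return the config text with `param` present in GRUB_CMDLINE_LINUX."""

    def _append(match: re.Match) -> str:
        params = match.group(2).split()
        if param not in params:  # keep the edit idempotent
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = _CMDLINE.subn(_append, grub_default_text)
    if count == 0:
        # No such line yet: append one instead of touching other variables.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += f'GRUB_CMDLINE_LINUX="{param}"\n'
    return new_text


if __name__ == "__main__":
    # Print the edited config for review; nothing is written back to disk.
    with open("/etc/default/grub") as fh:
        print(add_kernel_param(fh.read()), end="")
```

After reviewing the output and saving it back to /etc/default/grub, run "sudo update-grub" and reboot for the new command line to take effect.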

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Libera.chat.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1990294/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
dann frazier (dannf)
no longer affects: ubuntu
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1990294

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Ike Panhc (ikepanhc) wrote :

I tried to reproduce this issue on an Ampere Altra SoC (not AltraMax) with focal grub and the 5.15.0-43-generic kernel, but could not reproduce it. It rebooted 222 times successfully.

I will try to reproduce on an AltraMax machine.

dann frazier (dannf) wrote :

Whether or not we can reproduce on other machines, I think we still need to try to diagnose what is going on with papat. I think there's a series of steps we can do to help diagnose that - I'd suggest starting w/ the "earlycon" test.

Olivier Gayot (ogayot)
tags: added: foundations-triage-discuss
Hon Ming Hui (hm-hui) wrote :

Here is the log that reproduces the hang.

dann frazier (dannf) wrote :

Thanks! OK, that shows that we've successfully booted the kernel, but it is having an issue bringing up a CPU. I've pasted the relevant snippet below. It isn't obvious to me if this is a kernel issue or some other platform issue. As a next step, I'd suggest attempting to reproduce w/ the latest mainline kernel (continuing w/ earlycon enabled). If the problem goes away, perhaps we can identify the mainline change that "fixed" it. If the problem persists, we probably need to notify the vendor.

[ 3.321372] arch_timer: Enabling local workaround for ARM erratum 1418040
[ 3.321394] CPU245: Booted secondary processor 0x0100310100 [0x413fd0c1]
[ 8.536939] CPU246: failed to come online
[ 8.550338] Detected PIPT I-cache on CPU246
[ 8.555817] CPU246: failed in unknown state : 0x0
[ 8.555987] GICv3: CPU246: found redistributor 100370000 region 1:0x0000500100f00000
[ 8.562727] ------------[ cut here ]------------
[ 8.562809] GICv3: CPU246: using allocated LPI pending table @0x0000080002680000
[ 8.569364] Dying CPU not properly vacated!
[ 16.668834] WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:9215 sched_cpu_dying+0x1d0/0x2a0
[ 16.681496] Modules linked in:
[ 16.684597] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.0-43-generic #46~20.04.1-Ubuntu
[ 16.693010] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 16.700095] pc : sched_cpu_dying+0x1d0/0x2a0
[ 16.704432] lr : sched_cpu_dying+0x1d0/0x2a0
[ 16.708770] sp : ffff80000881bbe0
[ 16.712132] x29: ffff80000881bbe0 x28: ffffcca930eada80 x27: 0000000000000000
[ 16.719390] x26: ffff7395107ce000 x25: 00000000000000f6 x24: ffff403e40ed6728
[ 16.726651] x23: ffffcca930eb2618 x22: 0000000000000000 x21: 00000000000000f6
[ 16.733907] x20: ffffcca930719100 x19: ffff403e40ee7100 x18: 0000000000000014
[ 16.741165] x17: 3836323030303830 x16: 3030303078304020 x15: 656c62617420676e
[ 16.748425] x14: 69646e6570204950 x13: 3030303038363230 x12: 3030383030303030
[ 16.755683] x11: 78304020656c6261 x10: 7420676e69646e65 x9 : ffffcca92e7e39a0
[ 16.762940] x8 : 6c6c6120676e6973 x7 : 205d393038323635 x6 : c0000000fffeffff
[ 16.770198] x5 : 000000000017ffe8 x4 : 0000000000000000 x3 : ffff80000881b8c8
[ 16.777457] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 16.784717] Call trace:
[ 16.787197] sched_cpu_dying+0x1d0/0x2a0
[ 16.791182] cpuhp_invoke_callback+0x14c/0x5a8
[ 16.795699] cpuhp_invoke_callback_range+0x5c/0xb8
[ 16.800568] _cpu_up+0x234/0x360
[ 16.803844] cpu_up+0xb8/0x110
[ 16.806945] bringup_nonboot_cpus+0x7c/0xb0
[ 16.811195] smp_init+0x3c/0x98
[ 16.814385] kernel_init_freeable+0x1a4/0x39c
[ 16.818813] kernel_init+0x2c/0x158
[ 16.822359] ret_from_fork+0x10/0x20
[ 16.825992] ---[ end trace bb9ca924bc23dd10 ]---
[ 16.830685] CPU246 enqueued tasks (0 total):
[ 16.835356] ------------[ cut here ]------------
[ 16.835431] GICv3: CPU246: found redistributor 100370000 region 1:0x0000500100f00000
[ 16.840045] Dying CPU not properly vacated!
[ 16.847925] WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:9215 sched_cpu_dying+0x1d0/0x2a
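
When a reboot loop produces many console captures, a small script can flag the ones that show this failure. The sketch below is an illustration only: it searches for the two message strings that appear in the snippet above, and the function name and returned keys are hypothetical.

```python
import re

# Message strings taken verbatim from the console snippet above.
ONLINE_FAIL = re.compile(r"CPU(\d+): failed to come online")
VACATE_WARN = "Dying CPU not properly vacated!"


def summarize_bringup_failures(log_text: str) -> dict:
    """Report which CPUs failed to come online and how often the
    scheduler warning fired in one console log."""
    failed_cpus = []
    vacate_warnings = 0
    for line in log_text.splitlines():
        match = ONLINE_FAIL.search(line)
        if match:
            cpu = int(match.group(1))
            if cpu not in failed_cpus:
                failed_cpus.append(cpu)
        if VACATE_WARN in line:
            vacate_warnings += 1
    return {"failed_cpus": failed_cpus, "vacate_warnings": vacate_warnings}
```

Running it over each saved console log gives a quick per-boot summary, which makes it easier to see whether the same CPU fails on every occurrence.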

Taihsiang Ho (tai271828) wrote :

I will proceed with mainline kernel testing.

Changed in linux (Ubuntu):
status: Incomplete → In Progress
assignee: nobody → Taihsiang Ho (taihsiangho)
Taihsiang Ho (tai271828) wrote :

I rebooted the system with mainline kernel v6.0.9 and saw a similar kernel panic at the 302nd reboot. See the attached log papat-221211-reboot_mainline-6.0.9-060009-generic_kernel-panic.log for the kernel panic message.

Taihsiang Ho (tai271828) wrote :

Kernel panic message snippet from the log attachment papat-221211-reboot_mainline-6.0.9-060009-generic_kernel-panic.log:

[ 835.504178] ------------[ cut here ]------------
[ 835.504183] kernel BUG at kernel/irq_work.c:235!
[ 835.504184] Internal error: Oops - BUG: 0 [#1] SMP
[ 835.504186] Modules linked in:
[ 835.504187] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W L 6.0.9-060009-generic #202211161102
[ 835.504189] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 835.504190] pc : irq_work_run_list+0x98/0xa0
[ 835.504194] lr : irq_work_run+0x2c/0x60
[ 835.504196] sp : ffff80000883bc60
[ 835.504197] x29: ffff80000883bc60 x28: ffff403e40bca798 x27: 0000000000000000
[ 835.504200] x26: ffff78e12694f000 x25: 0000000000000000 x24: 0000000000000000
[ 835.504202] x23: ffffc75d180f48d4 x22: 00000000000002e9 x21: ffffc75d1a27b798
[ 835.504205] x20: 0000000000000095 x19: ffffc75d1a287ff0 x18: ffff80000881b050
[ 835.504207] x17: 0000000000000000 x16: 0000000000000000 x15: 6765722030303030
[ 835.504209] x14: 0000000000000001 x13: 3a296c61746f7420 x12: 302820736b736174
[ 835.504212] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc75d180f48f4
[ 835.504214] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 835.504217] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 835.504219] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff083e7f10eff8
[ 835.504221] Call trace:
[ 835.504222] irq_work_run_list+0x98/0xa0
[ 835.504224] smpcfd_dying_cpu+0x24/0x40
[ 835.504227] cpuhp_invoke_callback+0x230/0x610
[ 835.504229] cpuhp_invoke_callback_range+0xe4/0x11c
[ 835.504231] _cpu_up+0x1f4/0x294
[ 835.504233] cpu_up+0xd0/0x13c
[ 835.504234] bringup_nonboot_cpus+0x98/0xdc
[ 835.504236] smp_init+0x40/0xb8
[ 835.504237] kernel_init_freeable+0x144/0x1c8
[ 835.504240] kernel_init+0x3c/0x180
[ 835.504241] ret_from_fork+0x10/0x20
[ 835.504244] Code: d2800002 d2800010 d2800011 d65f03c0 (d4210000)
[ 835.504249] ---[ end trace 0000000000000000 ]---
[ 835.504250] note: swapper/0[1] exited with preempt_count 1
[ 835.504435] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 835.504437] SMP: stopping secondary CPUs
[ 835.506210] x6 : 0000000000000000
[ 835.506213] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 835.506216] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 00000000000000e0
[ 835.506218] Call trace:
[ 835.506219] arch_cpu_idle+0x1c/0x60
[ 835.506221] default_idle_call+0x78/0x1f0
[ 835.506223] cpuidle_idle_call+0x198/0x200
[ 835.506225] do_idle+0xb8/0x120
[ 835.506227] cpu_startup_entry+0x30/0x40
[ 835.506229] secondary_start_kernel+0xf4/0x14c
[ 835.506231] __secondary_switched+0xb4/0xb8
[ 835.514617] Call trace:
[ 835.518072] arch_cpu_idle+0x1c/0x60
[ 835.525418] default_idle_call+0x78/0x1f0
[ 835.528518] cpuidle_idle_call+0x198/0x200
[ 835.538432] do_idle+0xb8/0x120
[ 835.545516] cpu_startup_entry+0x30/0x40
[ 835.549499] secondary_start_kernel+0xf4/0x14c
[ 835.553486] __secondary_switched+0xb4/0xb8
[ 836.948113] ---[ end Kernel panic - not syncing: Attempted to kill init! exi...

Taihsiang Ho (tai271828) wrote :

Similar to comment #6. Here is the log of another Mt. Jade with AltraMax reproducing this issue with Ubuntu kernel.

[ 8.308100] ------------[ cut here ]------------
[ 8.308185] GICv3: CPU248: using allocated LPI pending table @0x00000800026a0000
[ 8.314742] Dying CPU not properly vacated!
[ 16.480898] WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:9215 sched_cpu_dying+0x1d0/0x2a0
[ 16.493557] Modules linked in:
[ 16.496658] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.0-43-generic #46~20.04.1-Ubuntu
[ 16.505065] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 16.512146] pc : sched_cpu_dying+0x1d0/0x2a0
[ 16.516483] lr : sched_cpu_dying+0x1d0/0x2a0
[ 16.520822] sp : ffff80000881bbe0
[ 16.524187] x29: ffff80000881bbe0 x28: ffffc747c462da80 x27: 0000000000000000
[ 16.531443] x26: ffff78f67d08a000 x25: 00000000000000f8 x24: ffff403e40f12728
[ 16.538703] x23: ffffc747c4632618 x22: 0000000000000000 x21: 00000000000000f8
[ 16.545961] x20: ffffc747c3e99100 x19: ffff403e40f23100 x18: 0000000000000014
[ 16.553222] x17: 6136323030303830 x16: 3030303078304020 x15: 656c62617420676e
[ 16.560480] x14: 69646e6570204950 x13: 3030303061363230 x12: 3030383030303030
[ 16.567738] x11: 78304020656c6261 x10: 7420676e69646e65 x9 : ffffc747c1f639a0
[ 16.574995] x8 : 6c6c6120676e6973 x7 : 205d353831383033 x6 : c0000000fffeffff
[ 16.582255] x5 : 000000000017ffe8 x4 : 0000000000000000 x3 : ffff80000881b8c8
[ 16.589512] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 16.596772] Call trace:
[ 16.599252] sched_cpu_dying+0x1d0/0x2a0
[ 16.603238] cpuhp_invoke_callback+0x14c/0x5a8
[ 16.607757] cpuhp_invoke_callback_range+0x5c/0xb8
[ 16.612627] _cpu_up+0x234/0x360
[ 16.615904] cpu_up+0xb8/0x110
[ 16.619003] bringup_nonboot_cpus+0x7c/0xb0
[ 16.623254] smp_init+0x3c/0x98
[ 16.626444] kernel_init_freeable+0x1a4/0x39c
[ 16.630871] kernel_init+0x2c/0x158
[ 16.634415] ret_from_fork+0x10/0x20
[ 16.638046] ---[ end trace b4903e61a7ade0b4 ]---
[ 16.642739] CPU248 enqueued tasks (0 total):

Taihsiang Ho (tai271828)
Changed in grub2 (Ubuntu):
status: New → Invalid
description: updated
Taihsiang Ho (tai271828) wrote :

I updated the bug description to make it clearer. I also marked grub2 as Invalid because, according to the log and the reproducer, this currently looks like a kernel issue.

Taihsiang Ho (tai271828) wrote :

I upgraded our Mt. Jade AltraMax to the latest firmware from Ampere. With the latest Ubuntu Focal, the HWE kernel, and the upgraded firmware[1], I could no longer reproduce this issue after rebooting a Mt. Jade with AltraMax more than 900 times. Setting the bug to Invalid on linux.

[1] Version details:
HWE kernel 5.15.0-58-generic
Firmware Revision 2.17.104000
SCP Revision 2.10.20221028
UEFI Revision 2.10.20221028
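
As a rough sanity check on the 900-reboot result, the sketch below estimates how likely a run of that length would pass by chance if the hang still occurred at the rates seen earlier in this bug (once in 16 boots on 5.15.0-43, once in 302 boots on mainline 6.0.9). It assumes boots are independent and that each single observation is a fair estimate of the per-boot failure rate; both are assumptions, not measurements, and the function name is illustrative.

```python
def prob_no_failure(per_boot_failure_rate: float, boots: int) -> float:
    """Probability that `boots` consecutive boots all succeed, assuming
    independent boots with a fixed per-boot failure rate."""
    return (1.0 - per_boot_failure_rate) ** boots


if __name__ == "__main__":
    # Rates are single-observation estimates from earlier comments (assumption).
    for label, rate in (("1 in 16", 1 / 16), ("1 in 302", 1 / 302)):
        print(f"{label}: P(900 clean boots) = {prob_no_failure(rate, 900):.3g}")
```

Under those assumptions the chance of 900 clean boots is about 5% at the 1-in-302 rate and effectively zero at the 1-in-16 rate, which is consistent with the firmware upgrade having changed the behaviour, though it does not prove it.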

Changed in linux (Ubuntu):
status: In Progress → Invalid