After triggering kdump, the board hangs up during startup with SuperMicro_E300-9A-16CN8TP_7 board

Bug #1999646 reported by Peng Zhang
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Low         Peng Zhang

Bug Description

Brief Description
-----------------
The regression test fails after kdump is triggered.

Severity
--------
Major

Steps to Reproduce
------------------

systemctl enable kdump-tools.service

systemctl start kdump-tools.service

echo 1 >/proc/sysrq-trigger

echo 'c' > /proc/sysrq-trigger
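
Before forcing the panic, it can help to confirm that a crash kernel is actually armed, so that a later hang can be attributed to the kdump kernel itself rather than to a missing reservation. The following is a minimal sketch, not part of the original test case; the function names are hypothetical. It relies on `/proc/cmdline` carrying a `crashkernel=` entry and on `/sys/kernel/kexec_crash_loaded` reading `1` once the crash kernel has been loaded.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not from the test suite.

# Succeeds if the given kernel command line reserves crash memory.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the given file is readable and contains exactly "1".
crash_kernel_loaded() {
    [ -r "$1" ] && [ "$(cat "$1")" = "1" ]
}

# Combines both checks against the running system.
kdump_ready() {
    has_crashkernel "$(cat /proc/cmdline)" &&
        crash_kernel_loaded /sys/kernel/kexec_crash_loaded
}

# Usage (as root), only crashing the system when kdump is armed:
#   kdump_ready && echo c > /proc/sysrq-trigger
```

Nothing runs when the file is sourced; the crash is triggered only if the usage line is executed explicitly.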

Console log output:

[ 18.482648] VFIO - User Level meta-driver version: 0.3
[ 18.952121] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 19.054102] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 19.111597] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 19.155138] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 19.241955] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 19.318081] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 19.362139] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 19.435703] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 19.520510] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 19.594132] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 19.648704] QAT: AE0 is inactive!!
[ 19.704611] QAT: failed to get device out of reset
[ 19.730592] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 19.780597] c3xxx 0000:01:00.0: qat_hal_clr_reset error
[ 19.836611] c3xxx 0000:01:00.0: Failed to init the AEs
[ 19.866616] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 19.901595] c3xxx 0000:01:00.0: Failed to initialise Acceleration Engine
[ 19.940759] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 20.007538] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 20.087888] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 20.163881] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 20.280608] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 20.480607] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 20.663116] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 20.779631] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 20.939283] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 21.081144] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 21.279007] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 21.382258] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 21.436147] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 21.561718] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 21.587704] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 21.610171] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 21.636609] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 21.663210] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 21.685083] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 21.711660] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 21.738243] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 21.760099] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 21.786628] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 21.813209] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 21.835079] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 21.862321] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 21.890105] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 21.912995] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 21.940680] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 21.968435] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 21.991016] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 22.019114] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 22.047945] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 22.071048] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 22.099798] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 22.128628] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)
[ 22.152006] ACPI BIOS Error (bug): Could not resolve symbol [\_PR.CPU0._PCT], AE_NOT_FOUND (20200925/psargs-330)
[ 22.180754] ACPI Error: Aborting method _PR.CPUD._PCT due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[ 22.209566] ACPI Error: AE_NOT_FOUND, Evaluating _PCT (20200925/processor_perflib-203)

telnet> q
Connection closed.
sjiao@pek-lpgtest20:test$
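
The console output above is dominated by a few messages that repeat with different timestamps, which makes the less frequent lines (for example the QAT and c3xxx failures) easy to miss. As a rough aid, a saved copy of the log can be condensed by dropping the timestamp prefix and counting distinct messages. This is a sketch under the assumption that every kernel line starts with a `[ seconds.micros]` prefix; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: strip the "[ 19.155138] " style prefix, then count
# how often each distinct message occurs, most frequent first.
summarize_console_log() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] *//' "$@" | sort | uniq -c | sort -rn
}

# Usage, with console.log being a saved copy of the serial output:
#   summarize_console_log console.log
```

Lines without a timestamp prefix, such as the telnet prompt, pass through unchanged and are counted like any other message.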

Expected Behavior
------------------
The test case passes.

Actual Behavior
----------------
After kdump is triggered, the SuperMicro_E300-9A-16CN8TP_7 board hangs during startup.

Reproducibility
---------------
<Reproducible>

System Configuration
--------------------
root@controller-0:/var/rootdirs/opt/wr-test/testcases/stx/livepatch# cat /proc/cmdline
net.naming-scheme=vSTX7_0 BOOT_IMAGE=/1/vmlinuz-5.10.0-6-amd64 rw rootwait ostree_boot=LABEL=otaboot ostree_root=/dev/mapper/cgts-vg-rootlv rd.lvm.lv=cgts/root-lv ostree_var=/dev/mapper/cgtsvg-var-lv ostree=/ostree/1 console=ttyS0,115200 console=tty1 iommu=pt hugepagesz=2M hugepages=0 default_hugepagesz=2M rcu_nocbs=2-15 kthread_cpus=0,1 irqaffinity=2-15 nmi_watchdog=panic,1 softlockup_panic=1 intel_iommu=on selinux=0 enforcing=0 softdog.soft_panic=1 systemd.unified_cgroup_hierarchy=0 user_namespace.enable=1 biosdevname=0 crashkernel=2048M apparmor=0 security=apparmor

root@controller-0:/var/rootdirs/opt/wr-test/testcases/stx/livepatch# cat /proc/version
Linux version 5.10.0-6-amd64 (root@stx-stx-pkgbuilder-6d969f9849-s7wwv) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PREEMPT StarlingX Debian 5.10.152-1.stx.25 (2022-11-27)

wrcp build Parameter:
---------------------
root@controller-0:/var/rootdirs/opt/wr-test/testcases/stx/livepatch# cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"

JOB="n/a"
BUILD_BY="pyan"
BUILD_NUMBER="n/a"
BUILD_HOST="stx-stx-builder-5bf7b4486-kdb8k"
BUILD_DATE="2022-11-27 15:55:43 +0000"

BUILD_DIR="/localdisk/loadbuild/pyan/stx"
WRS_SRC_DIR="/localdisk/designer/pyan/stx/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/pyan/stx/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

wrcp platform Parameter:
---------------------
root@controller-0:/var/rootdirs/opt/wr-test/testcases/stx/livepatch# cat /etc/platform/platform.conf
nodetype=controller
subfunction=controller,worker
system_type=All-in-one
http_port=8080
management_interface=eno1
INSTALL_UUID=9348118d-1d20-40c2-b39c-08cbf4642eca

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
before kernel upgrade

Timestamp/Logs
--------------
Attach the logs for debugging (use attachments in Launchpad or for large collect files use: https://files.starlingx.kube.cengn.ca/)
Provide a snippet of logs here and the timestamp when issue was seen.
Please indicate the unique identifier in the logs to highlight the problem

Test Activity
-------------
Testing

Workaround
----------
Describe workaround if available

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/867738

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.distro.other
Changed in starlingx:
assignee: nobody → Peng Zhang (pzhang2)
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/867738
Committed: https://opendev.org/starlingx/integ/commit/470193ffc9fcee9ca3eb53090cc5001f5f27980c
Submitter: "Zuul (22348)"
Branch: master

commit 470193ffc9fcee9ca3eb53090cc5001f5f27980c
Author: Peng Zhang <email address hidden>
Date: Sat Dec 17 08:38:58 2022 +0800

    kdump-tools: disable AER to fix kdump hung issue

    This issue is detected after kernel updated from 5.10.112 version to
    5.10.152 version. Bad commit is d83d886e69bd (PCI/ERR: Recover from
    RCEC AER errors) which comes from linux-yocto 5.10 stable tree. It
    will lead to board hang up after triggering kdump.

    This issue can be reproduced on board whose name is Supermicro
    A2SDi-16C-TP8F, bios version is 1.4 and build date is 01/29/2021.

    We don't need pci AER functionality enabled in the kdump kernel, and it
    causes some boards to hang in certain situations as kernel AER error log
    shows. So we just disable it.

    KERNEL AER ERROR LOG:
    [ 7.409296] pcieport 0000:00:05.0: AER: Multiple Corrected error
    received: 0000:00:05.0
    [ 7.417311] BUG: kernel NULL pointer dereference, address:
    0000000000000028
    [ 7.418296] #PF: supervisor read access in kernel mode
    [ 7.418296] #PF: error_code(0x0000) - not-present page
    [ 7.418296] PGD 0 P4D 0
    [ 7.418296] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [ 7.418296] CPU: 0 PID: 93 Comm: irq/25-aerdrv Not tainted
    5.10.0-6-amd64 #1 Debian 5.10.152-1.stx.25
    [ 7.418296] Hardware name: Supermicro
    SYS-E300-9A-16CN8TP/A2SDi-16C-TP8F, BIOS 1.4 01/29/2021
    [ 7.418296] RIP: 0010:pci_walk_bus+0x25/0x90
    [ 7.418296] Code: 00 00 00 00 00 0f 1f 44 00 00 41 56 41 55 49 89 fd
    48 c7 c7 20 37 9a 99 41 54 49 89 f4 55 48 89 d5 53 4c 89 eb e8 2b 5a 56
    00 <49> 8b 7d 28 eb 1f 48 8b 47 18 48 85 c0 74 31 4c 8b 70 28 48 89 c3
    [ 7.418296] RSP: 0000:ffffa60040173dc8 EFLAGS: 00010282
    [ 7.418296] RAX: ffff8b553fded001 RBX: 0000000000000000 RCX:
    0000000000000000
    [ 7.418296] RDX: ffff8b553fded000 RSI: ffffffff9833c6e0 RDI:
    ffffffff999a3720
    [ 7.418296] RBP: ffffa60040173e10 R08: 0000000000000002 R09:
    ffffa60040173d74
    [ 7.418296] R10: 0000000000000001 R11: 0000000000000000 R12:
    ffffffff9833c6e0
    [ 7.418296] R13: 0000000000000000 R14: 0000000000000028 R15:
    ffff8b555e206328
    [ 7.418296] FS: 0000000000000000(0000) GS:ffff8b55bec00000(0000)
    knlGS:0000000000000000
    [ 7.418296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 7.418296] CR2: 0000000000000028 CR3: 000000087d80a000 CR4:
    00000000003506f0
    [ 7.418296] Call Trace:
    [ 7.418296] find_source_device+0x34/0x5a
    [ 7.418296] aer_isr.cold+0x89/0x9e
    [ 7.418296] ? __set_cpus_allowed_ptr+0xb6/0x220
    [ 7.418296] ? disable_irq_nosync+0x10/0x10
    [ 7.418296] irq_thread_fn+0x20/0x60
    [ 7.418296] irq_thread+0x104/0x1b0
    [ 7.418296] ? irq_finalize_oneshot.part.0+0xe0/0xe0
    [ 7.418296] ? irq_thread_check_affinity+0xa0/0xa0
    [ 7.418296] kthread+0x133/0x150
    [ 7.418296] ? __kthread_bi...
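
The commit text shown here is truncated and does not include the diff, so the exact change is not visible in this report. One common way to disable AER for the crash kernel only is to pass the documented kernel parameter `pci=noaer` on the kdump kernel command line, which Debian's kdump-tools builds from `KDUMP_CMDLINE_APPEND` in `/etc/default/kdump-tools`. The sketch below assumes that approach; the helper name is hypothetical, and whether the merged fix does exactly this is an assumption.

```shell
#!/bin/sh
# Sketch only: add a parameter to a kernel command-line string,
# leaving the string unchanged if the parameter is already present.
append_param() {
    current=$1
    param=$2
    case " $current " in
        *" $param "*) printf '%s\n' "$current" ;;
        *) printf '%s\n' "${current:+$current }$param" ;;
    esac
}

# Hypothetical usage inside /etc/default/kdump-tools:
#   KDUMP_CMDLINE_APPEND=$(append_param "$KDUMP_CMDLINE_APPEND" "pci=noaer")
```

The idempotent append keeps the setting from accumulating duplicates if the configuration is regenerated; kdump-tools would need to be restarted so the crash kernel is reloaded with the new command line.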

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.8.0