Comment 2 for bug 1999646

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/867738
Committed: https://opendev.org/starlingx/integ/commit/470193ffc9fcee9ca3eb53090cc5001f5f27980c
Submitter: "Zuul (22348)"
Branch: master

commit 470193ffc9fcee9ca3eb53090cc5001f5f27980c
Author: Peng Zhang <email address hidden>
Date: Sat Dec 17 08:38:58 2022 +0800

    kdump-tools: disable AER to fix kdump hung issue

    This issue was detected after the kernel was updated from version
    5.10.112 to 5.10.152. The offending commit is d83d886e69bd ("PCI/ERR:
    Recover from RCEC AER errors"), which comes from the linux-yocto 5.10
    stable tree. It causes the board to hang after kdump is triggered.

    The issue can be reproduced on a Supermicro A2SDi-16C-TP8F board with
    BIOS version 1.4 (build date 01/29/2021).

    We don't need PCI AER functionality enabled in the kdump kernel, and it
    causes some boards to hang in certain situations, as the kernel AER
    error log below shows, so we simply disable it.

    KERNEL AER ERROR LOG:
    [ 7.409296] pcieport 0000:00:05.0: AER: Multiple Corrected error received: 0000:00:05.0
    [ 7.417311] BUG: kernel NULL pointer dereference, address: 0000000000000028
    [ 7.418296] #PF: supervisor read access in kernel mode
    [ 7.418296] #PF: error_code(0x0000) - not-present page
    [ 7.418296] PGD 0 P4D 0
    [ 7.418296] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [ 7.418296] CPU: 0 PID: 93 Comm: irq/25-aerdrv Not tainted 5.10.0-6-amd64 #1 Debian 5.10.152-1.stx.25
    [ 7.418296] Hardware name: Supermicro SYS-E300-9A-16CN8TP/A2SDi-16C-TP8F, BIOS 1.4 01/29/2021
    [ 7.418296] RIP: 0010:pci_walk_bus+0x25/0x90
    [ 7.418296] Code: 00 00 00 00 00 0f 1f 44 00 00 41 56 41 55 49 89 fd 48 c7 c7 20 37 9a 99 41 54 49 89 f4 55 48 89 d5 53 4c 89 eb e8 2b 5a 56 00 <49> 8b 7d 28 eb 1f 48 8b 47 18 48 85 c0 74 31 4c 8b 70 28 48 89 c3
    [ 7.418296] RSP: 0000:ffffa60040173dc8 EFLAGS: 00010282
    [ 7.418296] RAX: ffff8b553fded001 RBX: 0000000000000000 RCX: 0000000000000000
    [ 7.418296] RDX: ffff8b553fded000 RSI: ffffffff9833c6e0 RDI: ffffffff999a3720
    [ 7.418296] RBP: ffffa60040173e10 R08: 0000000000000002 R09: ffffa60040173d74
    [ 7.418296] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff9833c6e0
    [ 7.418296] R13: 0000000000000000 R14: 0000000000000028 R15: ffff8b555e206328
    [ 7.418296] FS: 0000000000000000(0000) GS:ffff8b55bec00000(0000) knlGS:0000000000000000
    [ 7.418296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 7.418296] CR2: 0000000000000028 CR3: 000000087d80a000 CR4: 00000000003506f0
    [ 7.418296] Call Trace:
    [ 7.418296] find_source_device+0x34/0x5a
    [ 7.418296] aer_isr.cold+0x89/0x9e
    [ 7.418296] ? __set_cpus_allowed_ptr+0xb6/0x220
    [ 7.418296] ? disable_irq_nosync+0x10/0x10
    [ 7.418296] irq_thread_fn+0x20/0x60
    [ 7.418296] irq_thread+0x104/0x1b0
    [ 7.418296] ? irq_finalize_oneshot.part.0+0xe0/0xe0
    [ 7.418296] ? irq_thread_check_affinity+0xa0/0xa0
    [ 7.418296] kthread+0x133/0x150
    [ 7.418296] ? __kthread_bind_mask+0x60/0x60
    [ 7.418296] ret_from_fork+0x22/0x30
    [ 7.418296] Modules linked in:
    [ 7.418296] CR2: 0000000000000028

    TEST PLAN:
    PASS: build-pkgs -c -p kdump-tools
    PASS: build-pkgs -c -p kdump-tools-rt
    PASS: boot
    PASS: on both affected and unaffected platforms:
          systemctl enable kdump-tools.service
          systemctl start kdump-tools.service
          echo 1 > /proc/sysrq-trigger
          echo 'c' > /proc/sysrq-trigger
          vmcore is created successfully
          system reboots automatically

    Closes-Bug: 1999646

    Change-Id: I9ffc6e96d4b7fbd0b29a806d4d96dfc8e89dc4c6
    Signed-off-by: Peng Zhang <email address hidden>
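
The commit message above says AER is disabled in the kdump kernel but does not show the change itself. The following is a minimal sketch of one way this could be done. It assumes the crash kernel's command line is extended through the KDUMP_CMDLINE_APPEND variable in /etc/default/kdump-tools and that the pci=noaer kernel parameter is used. The helper name append_noaer and the sample option string are hypothetical and are not taken from the merged patch.

```shell
#!/bin/sh
# Hypothetical helper: append pci=noaer to a kernel command-line string
# unless it is already present. pci=noaer is the kernel parameter that
# disables PCIe Advanced Error Reporting (AER).
append_noaer() {
    case " $1 " in
        *" pci=noaer "*) printf '%s\n' "$1" ;;            # already present
        *)               printf '%s\n' "${1:+$1 }pci=noaer" ;;
    esac
}

# Sample value only (an assumption, not the shipped default).
KDUMP_CMDLINE_APPEND="reset_devices nr_cpus=1 irqpoll"
KDUMP_CMDLINE_APPEND=$(append_noaer "$KDUMP_CMDLINE_APPEND")
echo "KDUMP_CMDLINE_APPEND=\"$KDUMP_CMDLINE_APPEND\""
```

If the option is applied this way, /proc/cmdline in the kdump kernel should contain pci=noaer after a crash is triggered, and the AER interrupt thread (irq/25-aerdrv in the log above) should no longer be started.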