Cavium ThunderX CN88XX crashes on boot

Bug #1857074 reported by Sean Feole on 2019-12-20
This bug affects 7 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Bionic
Undecided
Unassigned

Bug Description

Series: Bionic
Kernel: 4.15.0-74.84 linux-generic
Steps to reproduce: Install 4.15.0-74.84 Kernel and boot the system.

The following crash was observed while testing the proposed kernel for the 2019.12.02 SRU Cycle.
This kernel was built to include fixes for the following bugs:

  * [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX
    (LP: #1853326)
    - Revert "arm64: Use firmware to detect CPUs that are not affected by
      Spectre-v2"
    - Revert "arm64: Get rid of __smccc_workaround_1_hvc_*"

  * [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX2 and
    Kunpeng920 (LP: #1852723)
    - SAUCE: arm64: capabilities: Move setup_boot_cpu_capabilities() call to
      correct place

The following crash appears to be a NEW bug, not related to the prior bugs listed above.
This bug DOES NOT APPEAR to be related to LP#1857073.

This is another NEW BUG.

Hostname: Starmie

Probable Cause is unknown at this point and still under investigation.

[ OK ] Found device WDC_WD5003ABYZ-011FA0 efi.
         Mounting /boot/efi...
[ OK ] Mounted /boot/efi.
[ OK ] Reached target Local File Systems.
         Starting AppArmor initialization...
         Starting Tell Plymouth To Write Out Runtime Data...
         Starting ebtables ruleset management...
[ 20.942427] kernel BUG at /build/linux-pWET3k/linux-4.15.0/fs/buffer.c:1240!
[ 20.951416] Internal error: Oops - BUG: 0 [#1] SMP
[ 20.958153] Modules linked in: nls_iso8859_1 thunderx_edac thunderx_zip cavium_rng_vf shpchp cavium_rng gpio_keys uio_pdrv_genirq ipmi_ssif uio ipmi_devintf ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect aes_ce_blk sysimgblt fb_sys_fops aes_ce_cipher crc32_ce drm crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce ahci thunder_bgx libahci thunder_xcv i2c_thunderx mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 21.044326] Process systemd (pid: 1, stack limit = 0x000000005af6f18b)
[ 21.053858] CPU: 1 PID: 1 Comm: systemd Not tainted 4.15.0-74-generic #84-Ubuntu
[ 21.063931] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 21.074790] pstate: 20400085 (nzCv daIf +PAN -UAO)
[ 21.082096] pc : __find_get_block+0x2e8/0x398
[ 21.088917] lr : __getblk_gfp+0x3c/0x2a8
[ 21.095379] sp : ffff0000099ab7e0
[ 21.101062] x29: ffff0000099ab7e0 x28: 0000000000000000
[ 21.108699] x27: 0000000000000000 x26: 0000000000000000
[ 21.116265] x25: 0000000000000001 x24: 0000000000000000
[ 21.123788] x23: 0000000000000008 x22: ffff801f26116c80
[ 21.131302] x21: ffff801f26116c80 x20: 000000000000245c
[ 21.138808] x19: 0000000000001000 x18: 0000ffffa59c3a70
[ 21.146300] x17: 0000000000000000 x16: 0000000000000000
[ 21.153730] x15: 0000000000000020 x14: 0000000000000012
[ 21.161083] x13: 2f7374696e752f64 x12: 0101010101010101
[ 21.168397] x11: 7f7f7f7f7f7f7f7f x10: ffff00000972d000
[ 21.175689] x9 : 0000000000000000 x8 : ffff801f7ba7e3c0
[ 21.183042] x7 : ffff801f7ba7e3e0 x6 : 0000000000000000
[ 21.190667] x5 : 0000000000000004 x4 : 0000000000000020
[ 21.197955] x3 : 0000000000000008 x2 : 0000000000001000
[ 21.205680] x1 : 000000000000245c x0 : 0000000000000080
[ 21.212918] Call trace:
[ 21.217257] __find_get_block+0x2e8/0x398
[ 21.223160] __getblk_gfp+0x3c/0x2a8
[ 21.228644] ext4_getblk+0xcc/0x1b0
[ 21.233991] ext4_bread_batch+0x78/0x1c8
[ 21.239726] ext4_find_entry+0x2d4/0x598
[ 21.245416] ext4_lookup+0xac/0x278
[ 21.250612] lookup_slow+0xac/0x190
[ 21.255736] walk_component+0x228/0x340
[ 21.261151] link_path_walk+0x2f4/0x568
[ 21.266499] path_parentat+0x44/0x88
[ 21.271521] filename_parentat+0xa0/0x170
[ 21.276924] filename_create+0x60/0x168
[ 21.282082] SyS_symlinkat+0x80/0x128
[ 21.287013] el0_svc_naked+0x30/0x34
[ 21.291835] Code: 17ffffe7 a90363b7 a9046bb9 f9002bbb (d4210000)
[ 21.299191] ---[ end trace b07cecc329f07f48 ]---
[ 21.347488] systemd: 35 output lines suppressed due to ratelimiting
[ 21.355094] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 21.355094]
[ 21.366666] SMP: stopping secondary CPUs
[ 21.371817] Kernel Offset: disabled
[ 21.376517] CPU features: 0x00901108
[ 21.381310] Memory Limit: none
[ 21.385617] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 21.385617]


Sean Feole (sfeole) wrote :

Full console output for starmie

description: updated
description: updated
dann frazier (dannf) wrote :

Our testing found the same issue on 2 other CN88XX systems (seidel & anuchin); I'll attach the logs here. Interestingly, while both systems hit the oops when booting the bionic 4.15.0-74.84, neither system had a problem with the xenial hwe 4.15.0-74.84.

dann frazier (dannf) on 2020-01-07
Changed in linux (Ubuntu Bionic):
status: New → Confirmed
dann frazier (dannf) wrote :

fyi, I was able to reproduce this w/ upstream 4.14.164, built w/ an Ubuntu-based config.

dann frazier (dannf) wrote :

v4.14.151 upstream fails as well - but with a different symptom (see below). v4.14.150 seems fine, so I'll try and bisect between the two. Of course, that's really just a shot in the dark, as we know this issue is finicky..

[ 34.896151] Unable to handle kernel paging request at virtual address 748214240b9fa200
[ 34.908263] Mem abort info:
[ 34.915217] Exception class = IABT (current EL), IL = 32 bits
[ 34.925365] SET = 0, FnV = 0
[ 34.932606] EA = 0, S1PTW = 0
[ 34.939877] [748214240b9fa200] address between user and kernel address ranges
[ 34.951247] Internal error: Oops: 86000004 [#1] SMP
[ 34.960370] Modules linked in: nls_iso8859_1 sch_fq_codel thunderx_edac thunderx_zip cavium_rng_vf cavium_rng gpio_keys shpchp uio_pdrv_genirq ipmi_ssif ipmi_devintf uio ipmi_msghandler ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper aes_ce_blk syscopyarea sysfillrect aes_ce_cipher sysimgblt thunder_bgx crc32_ce fb_sys_fops crct10dif_ce ghash_ce drm sha2_ce sha256_arm64 sha1_ce ahci libahci i2c_thunderx thunder_xcv i2c_smbus mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 35.060004] Process apparmor_parser (pid: 1188, stack limit = 0xffff00001b948000)
[ 35.071987] CPU: 16 PID: 1188 Comm: apparmor_parser Not tainted 4.14.151 #1
[ 35.083436] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 35.096272] task: ffff801f68160f00 task.stack: ffff00001b948000
[ 35.106632] PC is at 0x748214240b9fa200
[ 35.114857] LR is at 0x748214240b9fa200
[ 35.122977] pc : [<748214240b9fa200>] lr : [<748214240b9fa200>] pstate: 20400145
[ 35.134654] sp : ffff00001b94bca0
[ 35.142170] x29: ffff801f713f2020 x28: 0000000000000040
[ 35.151657] x27: 0000000000000000 x26: ffff00001b94be00
[ 35.161073] x25: ffff801f713f2020 x24: ffff00001b94bdd8
[ 35.170410] x23: ffff801f713f1ea8 x22: ffff801f71bd8000
[ 35.179678] x21: 0000000000000000 x20: ffff0000082352bc
[ 35.185726] audit: type=1400 audit(1578951898.572:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=1185 comm="apparmor_parser"
[ 35.188933] x19: ffff00001b94bcb0 x18: 0000ffff8916aa70
[ 35.188938] x17: 0000000000000000 x16: 0000000000000000
[ 35.188941] x15: 0000000000000000 x14: 0000000000000000
[ 35.213254] audit: type=1400 audit(1578951898.572:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=1185 comm="apparmor_parser"
[ 35.222441] x13: 0000000000000000 x12: 0000000000000000
[ 35.222445] x11: 0000000000000000 x10: 0000000000000001
[ 35.222448] x9 : 0000000000000228 x8 : 0000000000000e4a
[ 35.281468] audit: type=1400 audit(1578951898.668:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=1199 comm="apparmor_parser"
[ 35.2854...


On Mon, 13 Jan 2020 23:51:13 -0000
dann frazier <email address hidden> wrote:

> v4.14.151 upstream fails as well - but with a different symptom (see
> below). v4.14.150 seems fine, so I'll try and bisect between the two. Of
> course, that's really just a shot in the dark, as we know this issue is
> finicky..

The cause of that is:
cce360b54ce6 arm64: capabilities: Filter the entries based on a given mask

and the fixup for the above is:
d4af3c4b81f4 arm64: cpufeature: Enable Qualcomm Falkor/Kryo errata 1003

We don't have the second commit, and it doesn't apply cleanly either, because
capability handling changed between the two commits and we don't have those
changes. It seems capability handling is broken and we're seeing the problem
that's described in d4af3c4b81f4.

> [ 34.896151] Unable to handle kernel paging request at virtual address 748214240b9fa200
> [ 34.908263] Mem abort info:
> [ 34.915217] Exception class = IABT (current EL), IL = 32 bits
> [ 34.925365] SET = 0, FnV = 0
> [ 34.932606] EA = 0, S1PTW = 0
> [ 34.939877] [748214240b9fa200] address between user and kernel address ranges
> [ 34.951247] Internal error: Oops: 86000004 [#1] SMP
> [ 34.960370] Modules linked in: nls_iso8859_1 sch_fq_codel thunderx_edac thunderx_zip cavium_rng_vf cavium_rng gpio_keys shpchp uio_pdrv_genirq ipmi_ssif ipmi_devintf uio ipmi_msghandler ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper aes_ce_blk syscopyarea sysfillrect aes_ce_cipher sysimgblt thunder_bgx crc32_ce fb_sys_fops crct10dif_ce ghash_ce drm sha2_ce sha256_arm64 sha1_ce ahci libahci i2c_thunderx thunder_xcv i2c_smbus mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
> [ 35.060004] Process apparmor_parser (pid: 1188, stack limit = 0xffff00001b948000)
> [ 35.071987] CPU: 16 PID: 1188 Comm: apparmor_parser Not tainted 4.14.151 #1
> [ 35.083436] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
> [ 35.096272] task: ffff801f68160f00 task.stack: ffff00001b948000
> [ 35.106632] PC is at 0x748214240b9fa200
> [ 35.114857] LR is at 0x748214240b9fa200
> [ 35.122977] pc : [<748214240b9fa200>] lr : [<748214240b9fa200>] pstate: 20400145
> [ 35.134654] sp : ffff00001b94bca0
> [ 35.142170] x29: ffff801f713f2020 x28: 0000000000000040
> [ 35.151657] x27: 0000000000000000 x26: ffff00001b94be00
> [ 35.161073] x25: ffff801f713f2020 x24: ffff00001b94bdd8
> [ 35.170410] x23: ffff801f713f1ea8 x22: ffff801f71bd8000
> [ 35.179678] x21: 0000000000000000 x20: ffff0000082352bc
> [ 35.185726] audit: type=1400 audit(1578951898.572:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=1185 comm="apparmor_parser"
> [ 35.188933] x19: ffff00001b94bcb0 x18: 0000ffff8916aa70
> [ 35.188938] x17: 0000000000000000 x16: 0000000000000000
> [ 35.188941] x15: 0000000000000000 x14: 0000000000000000
> [ 3...


Hrmm. Never mind the Qualcomm errata, this is a Cavium box :-( But I get consistent failures, so I do believe that commit cce360b54ce6 causes issues, although the errata workarounds are indeed enabled:

[ 0.000000] CPU features: enabling workaround for Cavium erratum 27456
[ 0.000000] CPU features: enabling workaround for Cavium erratum 30115

Juerg Haefliger (juergh) wrote :

We certainly want this:

commit 71c751f2a43fa03fae3cf5f0067ed3001a397013
Author: Mark Rutland <email address hidden>
Date: Mon Apr 23 11:41:33 2018 +0100

    arm64: add sentinel to kpti_safe_list

    We're missing a sentinel entry in kpti_safe_list. Thus is_midr_in_range_list()
    can walk past the end of kpti_safe_list. Depending on the contents of memory,
    this could erroneously match a CPU's MIDR, cause a data abort, or other bad
    outcomes.

    Add the sentinel entry to avoid this.

    Fixes: be5b299830c63ed7 ("arm64: capabilities: Add support for checks based on a list of MIDRs")
    Signed-off-by: Mark Rutland <email address hidden>
    Reported-by: Jan Kiszka <email address hidden>
    Tested-by: Jan Kiszka <email address hidden>
    Reviewed-by: Suzuki K Poulose <email address hidden>
    Cc: Catalin Marinas <email address hidden>
    Cc: Suzuki K Poulose <email address hidden>
    Cc: Will Deacon <email address hidden>
    Signed-off-by: Will Deacon <email address hidden>

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 536d572e5596..9d1b06d67c53 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -868,6 +868,7 @@ static bool unmap_kernel_at_el0(const struct arm64_cpu_capabilities *entry,
        static const struct midr_range kpti_safe_list[] = {
                MIDR_ALL_VERSIONS(MIDR_CAVIUM_THUNDERX2),
                MIDR_ALL_VERSIONS(MIDR_BRCM_VULCAN),
+               { /* sentinel */ }
        };
        char const *str = "command line option";
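
To illustrate why the missing sentinel matters, here is a minimal user-space sketch of a sentinel-terminated list walk. It is not the kernel code: `struct id_range`, `id_in_range_list()`, `safe_list`, and the numeric values are hypothetical stand-ins for `struct midr_range`, `is_midr_in_range_list()`, and `kpti_safe_list`. The point is only that the walker receives no length, so the zeroed last entry is the sole thing that stops it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct midr_range. */
struct id_range {
    uint32_t model;   /* 0 marks the sentinel (end of list) */
    uint32_t rv_min;  /* lowest matching revision */
    uint32_t rv_max;  /* highest matching revision */
};

/*
 * Walk the list until the zeroed sentinel entry is reached.
 * No length is passed in, so a list without a sentinel would be
 * read past its end, which is the failure the commit describes.
 */
static bool id_in_range_list(uint32_t model, uint32_t rv,
                             const struct id_range *list)
{
    for (; list->model != 0; list++) {
        if (list->model == model &&
            rv >= list->rv_min && rv <= list->rv_max)
            return true;
    }
    return false;
}

/* Illustrative values only; these are not real MIDR encodings. */
static const struct id_range safe_list[] = {
    { 0x0af, 0, 0xff },
    { 0x516, 0, 0xff },
    { 0 }               /* sentinel */
};
```

Dropping the final `{ 0 }` entry from `safe_list` would leave the loop reading whatever happens to follow the array in memory, so the result would depend on unrelated data.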


On Tue, Jan 14, 2020 at 8:35 AM Juerg Haefliger
<email address hidden> wrote:
>
> We certainly want this:
>
> commit 71c751f2a43fa03fae3cf5f0067ed3001a397013
> Author: Mark Rutland <email address hidden>
> Date: Mon Apr 23 11:41:33 2018 +0100
>
> arm64: add sentinel to kpti_safe_list

Agreed, nice catch. Unfortunately, even with that, I'm still seeing
the crash (see below). Attached is the backport of the above I used on
top of 4.15.0-74.84, in case you find I did it wrong.

bestovius login: [ 95.760101] kernel BUG at fs/buffer.c:1240!
[ 95.764335] Internal error: Oops - BUG: 0 [#1] SMP
[ 95.769171] Modules linked in: nls_iso8859_1 thunderx_edac thunderx_zip cavium_rng_vf shpchp cavium_rng gpio_keys uio_pdrv_genirq ipmi_ssif uio ipmi_devintf ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect aes_ce_blk sysimgblt thunder_bgx fb_sys_fops aes_ce_cipher crc32_ce drm crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce ahci libahci thunder_xcv i2c_thunderx mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 95.837496] Process curl (pid: 1848, stack limit = 0x000000004a994c79)
[ 95.844086] CPU: 1 PID: 1848 Comm: curl Not tainted 4.15.18 #1
[ 95.849972] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 95.858403] pstate: 20400085 (nzCv daIf +PAN -UAO)
[ 95.863243] pc : __find_get_block+0x2e8/0x398
[ 95.867640] lr : __getblk_gfp+0x3c/0x2a8
[ 95.871596] sp : ffff00001fe23970
[ 95.874939] x29: ffff00001fe23970 x28: ffff801ff29d8400
[ 95.880302] x27: 000000000000000b x26: 0000000000000010
[ 95.885665] x25: 0000000003880020 x24: 0000000000000001
[ 95.891027] x23: 0000000000000008 x22: ffff801f25098000
[ 95.896390] x21: ffff801f25098000 x20: 0000000003880020
[ 95.901752] x19: 0000000000001000 x18: 0000fffff5966764
[ 95.907115] x17: 0000000000000000 x16: 0000000000000000
[ 95.912477] x15: 000032d49d85d7a0 x14: 002e941340578278
[ 95.917840] x13: 0000fffff5965d60 x12: 0000000000000000
[ 95.923202] x11: ffff000009578c08 x10: 0000000000000040
[ 95.928565] x9 : ffff801f69456ca8 x8 : ffff801f2526ac00
[ 95.933927] x7 : ffff0000083bb9a8 x6 : 0000000000000010
[ 95.939290] x5 : 000000000000001c x4 : 0000000000000100
[ 95.944653] x3 : 0000000000000008 x2 : 0000000000001000
[ 95.950016] x1 : 0000000003880020 x0 : 0000000000000080
[ 95.955378] Call trace:
[ 95.957846] __find_get_block+0x2e8/0x398
[ 95.961892] __getblk_gfp+0x3c/0x2a8
[ 95.965502] __ext4_get_inode_loc+0xe4/0x418
[ 95.969810] ext4_reserve_inode_write+0x5c/0xf8
[ 95.974381] ext4_mark_inode_dirty+0x5c/0x228
[ 95.978777] ext4_da_write_end+0x2c4/0x2e8
[ 95.982912] generic_perform_write+0x108/0x1b0
[ 95.987397] __generic_file_write_iter+0x14c/0x1c0
[ 95.992232] ext4_file_write_iter+0x120/0x3d0
[ 95.996629] new_sync_write+0xd8/0x138
[ 96.00...


tags: added: patch

Hi,
Not sure this is useful (since it might be obvious), but adding `nopti` to kernel parameters works around the issue, indicating this is indeed related to kpti.

Juerg Haefliger (juergh) wrote :

Yeah I've noticed that as well but at the time thought that nopti just changes the timing somehow and masks the problem. But I'm not so sure anymore so thanks for mentioning it!

dann frazier (dannf) wrote :

I built a kernel with the proposed patches[*] and ran a reboot/kernel compile test on 4 systems. The tests survived 46 total iterations (~12/system) before I interrupted. Two systems failed with "Synchronous External Abort: synchronous parity or ECC error" errors.

I've reverted the systems back to 4.15.0-70 - the kernel before the cpufeature/errata patches that caused this - to see if these SEA errors are a regression.

[*] https://lists.ubuntu.com/archives/kernel-team/2020-January/106909.html


On Thu, 16 Jan 2020 02:14:16 -0000
dann frazier <email address hidden> wrote:

> I built a kernel with the proposed patches[*] and ran a reboot/kernel
> compile test on 4 systems. The tests survived 46 total iterations
> (~12/system) before I interrupted. Two systems failed with "Synchronous
> External Abort: synchronous parity or ECC error" errors.
>
> I've reverted the systems back to 4.15.0-70 - the kernel before the
> cpufeature/errata patches that caused this - to see if these SEA errors
> are a regression.
>
> [*] https://lists.ubuntu.com/archives/kernel-team/2020-January/106909.html
>

I've run 75 iterations of reboot/compile-kernel and encountered 3 gcc
segmentation faults. Unfortunately, my test didn't capture the dmesg log but
it's likely that these are due to the ECC problems we're (still?) seeing.

There was also another issue during one of the reboots which is probably
unrelated and due to a flaky BMC:

[ 33.896320] ipmi_ssif 0-0012: IPMI message handler: device id demangle failed: -22
[ 33.896354] ipmi_ssif 0-0012: Unable to get the device id: -5
[ 33.987825] ipmi_ssif 0-0012: Found new BMC (man_id: 0x000000, prod_id: 0xaabb, dev_id: 0x20)
[ 33.987858] Unable to handle kernel read from unreadable memory at virtual address 00000018
[ 33.999300] Mem abort info:
[ 34.005475] ESR = 0x96000004
[ 34.011454] Exception class = DABT (current EL), IL = 32 bits
[ 34.020168] SET = 0, FnV = 0
[ 34.025893] EA = 0, S1PTW = 0
[ 34.031617] Data abort info:
[ 34.037060] ISV = 0, ISS = 0x00000004
[ 34.043448] CM = 0, WnR = 0
[ 34.048949] user pgtable: 4k pages, 48-bit VAs, pgd = 000000002799ee91
[ 34.058063] [0000000000000018] *pgd=0000000000000000
[ 34.065624] Internal error: Oops: 96000004 [#1] SMP
[ 34.073090] Modules linked in: nls_iso8859_1 sch_fq_codel thunderx_zip thunderx_edac ib_iser cavium_rng_vf rdma_cm ipmi_ssif(+) ipmi_devintf shpchp cavium_rng iw_cm ipmi_msghandler ib_cm gpio_keys uio_pdrv_genirq uio ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear ast i2c_algo_bit drm_kms_helper nicvf syscopyarea sysfillrect sysimgblt fb_sys_fops ttm nicpf drm aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce ahci thunder_bgx libahci i2c_thunderx thunder_xcv mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 34.161807] Process kworker/64:1 (pid: 651, stack limit = 0x00000000b0697881)
[ 34.172016] CPU: 64 PID: 651 Comm: kworker/64:1 Not tainted 4.15.18+ #40
[ 34.181723] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 34.193113] Workqueue: events redo_bmc_reg [ipmi_msghandler]
[ 34.201840] pstate: 80400005 (Nzcv daif +PAN -UAO)
[ 34.209589] pc : smi_send.isra.4+0x80/0x158 [ipmi_msghandler]
[ 34.218275] lr : smi_send.isra.4+0x150/0x158 [ipmi_msghandler]
[ 34.227046] sp : ffff0000128c3b10
[ 34.233209] x29: ffff0000128c3b10 x28: 0000000000000020
[ 34.241305] x27: 0000000000000002 x26: 00000...


dann frazier (dannf) wrote :

I've now seen an occurrence of the SEA/ECC issue on a system w/ the 4.15.0-70 kernel, so I think we can safely assume this is not a regression related to this bug.

@alexandru-avadanii Thanks for mentioning 'nopti', which made me realize that there was something fishy about KPTI that I hadn't paid much attention to before.

Juerg Haefliger (juergh) wrote :

The root cause of this problem is that the order in which errata and CPU features are evaluated and enabled is reversed. On ThunderX boxes that have erratum 27456 enabled, KPTI needs to be turned off to prevent I-cache clobbering. But due to the reversed order, the callback to enable KPTI is registered before the kernel determines that KPTI should be off. When the CPUs are onlined, the callback is executed, so *both* KPTI and the erratum workaround end up enabled, which results in all sorts of weird and inconsistent issues.
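
The ordering problem described above can be sketched as a toy model. This is not the kernel's capability code: `struct cpu_state`, `check_errata()`, `check_features()`, and `boot_conflicts()` are hypothetical names, and the real decision logic is more involved. The sketch only shows that a KPTI decision taken before the errata pass cannot see the erratum, whereas running the errata pass first avoids the conflict.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of one CPU during boot. */
struct cpu_state {
    bool has_erratum_27456;  /* hardware property of the simulated CPU */
    bool erratum_detected;   /* set once the errata pass has run */
    bool kpti_enabled;       /* set by the feature pass */
};

/* Errata pass: record which workarounds this CPU needs. */
static void check_errata(struct cpu_state *cpu)
{
    cpu->erratum_detected = cpu->has_erratum_27456;
}

/* Feature pass: KPTI is enabled unless the erratum is already known. */
static void check_features(struct cpu_state *cpu)
{
    cpu->kpti_enabled = !cpu->erratum_detected;
}

/* Returns true when KPTI and the erratum workaround end up active together. */
static bool boot_conflicts(bool errata_first)
{
    struct cpu_state cpu = { .has_erratum_27456 = true };

    if (errata_first) {
        check_errata(&cpu);
        check_features(&cpu);
    } else {
        /* Reversed order: the KPTI decision is made too early. */
        check_features(&cpu);
        check_errata(&cpu);
    }
    return cpu.kpti_enabled && cpu.erratum_detected;
}
```

In this model `boot_conflicts(false)` reports the conflicting state and `boot_conflicts(true)` does not, which mirrors the intent suggested by the title of the fix, "arm64: Check for errata before evaluating cpu features".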

On Wed, Jan 15, 2020 at 11:28 PM Juerg Haefliger
<email address hidden> wrote:
>
> On Thu, 16 Jan 2020 02:14:16 -0000
> dann frazier <email address hidden> wrote:
>
> > I built a kernel with the proposed patches[*] and ran a reboot/kernel
> > compile test on 4 systems. The tests survived 46 total iterations
> > (~12/system) before I interrupted. Two systems failed with "Synchronous
> > External Abort: synchronous parity or ECC error" errors.
> >
> > I've reverted the systems back to 4.15.0-70 - the kernel before the
> > cpufeature/errata patches that caused this - to see if these SEA errors
> > are a regression.
> >
> > [*] https://lists.ubuntu.com/archives/kernel-team/2020-January/106909.html
> >
>
> I've run 75 iterations of reboot/compile-kernel and encountered 3 gcc
> segmentation faults. Unfortunately, my test didn't capture the dmesg log but
> it's likely that these are due to the ECC problems we're (still?) seeing.

I've seen those on every machine so far when run long enough. Since I
believe we've clearly demonstrated that this is an unrelated failure,
I've split it out into bug 1860013 - let's track it there.

> There was also another issue during one of the reboots which is probably
> unrelated and due to a flaky BMC:

Let's track that in bug 1857073. Even if it is a flaky BMC, the IPMI
driver should handle the failure gracefully.
Did you see this on host 'wright' as well?

 -dann

dann frazier (dannf) on 2020-01-16
summary: - Cavium ThunderX CN88XX Panic : Unknown reason
+ Cavium ThunderX CN88XX crashes on boot

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
dann frazier (dannf) wrote :

The 4.15.0-76 kernel has survived several iterations on 2 nodes, so marking verified.

tags: added: verification-done-bionic
removed: verification-needed-bionic
David Galloway (djgalloway) wrote :

I'm having trouble updating the kernel on one of my Cavium boxes due to this bug. Setting `nopti` in kernel parameters didn't stop the segfaults and anytime I try to run `apt update` after enabling the -proposed repo, apt hangs due to a segfault and I can't run it again unless I reboot. apt still won't exit cleanly though. Hoping for a tip on upgrading the kernel.

dann frazier (dannf) wrote :

@djgalloway: If you haven't removed your previously working older kernel, you should be able to boot into it from the GRUB menu. Another suggestion would be to boot into rescue mode from an Ubuntu 18.04.3 ISO.

Juerg Haefliger (juergh) wrote :

@djgalloway: Sorry, on arm64 this is 'kpti=0' instead of 'nopti' (which is x86 and ppc). Of course they had to name it differently :-/.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-76.86

---------------
linux (4.15.0-76.86) bionic; urgency=medium

  * bionic/linux: 4.15.0-76.86 -proposed tracker (LP: #1860123)

  * Integrate Intel SGX driver into linux-azure (LP: #1844245)
    - [Packaging] Add systemd service to load intel_sgx

  * [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX
    (LP: #1853326) // Bionic kernel panic on Cavium ThunderX CN88XX
    (LP: #1853485) // Cavium ThunderX CN88XX crashes on boot (LP: #1857074)
    - arm64: Check for errata before evaluating cpu features
    - arm64: add sentinel to kpti_safe_list

linux (4.15.0-75.85) bionic; urgency=medium

  * bionic/linux: 4.15.0-75.85 -proposed tracker (LP: #1859705)

  * use-after-free in i915_ppgtt_close (LP: #1859522) // CVE-2020-7053
    - SAUCE: drm/i915: Fix use-after-free when destroying GEM context

  * CVE-2019-14615
    - drm/i915/gen9: Clear residual context state on context switch

  * PAN is broken for execute-only user mappings on ARMv8 (LP: #1858815)
    - arm64: Revert support for execute-only user mappings

  * [Regression] usb usb2-port2: Cannot enable. Maybe the USB cable is bad?
    (LP: #1856608)
    - SAUCE: Revert "usb: handle warm-reset port requests on hub resume"

  * Miscellaneous Ubuntu changes
    - update dkms package versions

 -- Marcelo Henrique Cerri <email address hidden> Fri, 17 Jan 2020 10:59:22 -0300

Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Released
dann frazier (dannf) on 2020-01-29
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
status: Fix Released → Invalid