Kernel panic on ThunderX

Bug #1749685 reported by Paolo Pisati
This bug affects 1 person
Affects          Status    Importance   Assigned to   Milestone
linux (Ubuntu)   Invalid   Undecided    Unassigned

Bug Description

While testing on lundmark, I observed occasional panics on 4.13.0-32.35~16.04.1-generic. I got this one while deploying the board:

Booting under MAAS direction... [ grub.cfg-40:8d:5c:ba 606B 100% 1.56KiB/s ]
EFI stub: Booting Linux Kernel... [ boot-initrd 46.78MiB 100% 6.57MiB/s ]
EFI stub: EFI_RNG_PROTOCOL unavailable, no randomness supplied
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] random: get_random_bytes called from start_kernel+0x50/0x460 with crng_init=0
[ 0.000000] Linux version 4.13.0-32-generic (buildd@bos01-arm64-018) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.5)) #35~16.04.1-Ubuntu SMP Thu Jan 25 10:10:26 UTC 2018 (Ubuntu 4.13.0-32.35~16.04.1-generic 4.13.13)
[ 0.000000] Boot CPU: AArch64 Processor [431f0a11]
[ 0.000000] Machine model: cavium,thunder-88xx
[ 0.000000] efi: Getting EFI parameters from FDT:
[ 0.000000] efi: EFI v2.40 by American Megatrends
[ 0.000000] efi: ESRT=0x1ffce5ac18 SMBIOS 3.0=0x1ffce5a918 ACPI 2.0=0x1ffeb46000
[ 0.000000] esrt: Reserving ESRT space from 0x0000001ffce5ac18 to 0x0000001ffce5ac50.
[ 0.000000] NUMA: NODE_DATA [mem 0x1fff0c4d00-0x1fff0c7fff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000500000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x0000001fff0fffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
...
[ 0.000000] Kernel command line: BOOT_IMAGE=ubuntu/arm64/hwe-16.04/xenial/daily/boot-kernel nomodeset root=squash:http://10.229.32.21:5248/images/ubuntu/arm64/hwe-16.04/xenial/daily/squashfs ro ip=::::lundmark:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{datasource_list: [MAAS]}end_cc cloud-config-url=http://10.229.32.21:5240/MAAS/metadata/latest/by-id/ttctk4/?op=get_preseed apparmor=0 log_host=10.229.32.21 log_port=514 BOOTIF=01-40:8d:5c:ba:cd:d4
...
[ 9.058541] Synchronous External Abort: synchronous parity or ECC error (0x86000018) at 0x0000ffff9658fc9c
[ 9.058545] Internal error: : 86000018 [#1] SMP
[ 9.058548] Modules linked in: ast(+) i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect aes_ce_blk sysimgblt aes_ce_cipher fb_sys_fops crc32_ce crct10dif_ce drm ghash_ce sha2_ce sha1_ce ahci libahci thunder_bgx(+) i2c_thunderx(+) thunder_xcv i2c_smbus ipmi_ssif mdio_thunder thunderx_mmc mdio_cavium ipmi_devintf ipmi_msghandler aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 9.058588] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.13.0-32-generic #35~16.04.1-Ubuntu
[ 9.058589] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 9.058591] task: ffff801f700c6900 task.stack: ffff801f700cc000
[ 9.058600] PC is at __remove_hrtimer+0x48/0xa8
[ 9.058602] LR is at __remove_hrtimer+0x5c/0xa8
[ 9.058604] pc : [<ffff000008153d88>] lr : [<ffff000008153d9c>] pstate: 004001c5
[ 9.058606] sp : ffff801f79787e60
[ 9.058607] x29: ffff801f79787e60 x28: ffff801f700c6900
[ 9.058611] x27: 000000021bc7a0ca x26: ffff000008fcd000
[ 9.058614] x25: 0000000000000001 x24: ffff000008fcd000
[ 9.058617] x23: ffff0000093b9658 x22: ffff801f7978f598
[ 9.058620] x21: ffff801f7978f5c0 x20: ffff801f7978f580
[ 9.058624] x19: ffff801f7978fa00 x18: 0000ffffc8e8cb78
[ 9.058627] x17: 000000000000668a x16: 0000000000000000
[ 9.058630] x15: 0000ffff968adcc0 x14: 343030302c333030
[ 9.058633] x13: 302c323030302c31 x12: 3030302c30303030
[ 9.058636] x11: 0000aaaac635fa10 x10: 0000000000000b00
[ 9.058639] x9 : 0000000000000040 x8 : ffff801f780026f0
[ 9.058643] x7 : 0000000000000000 x6 : ffff801f7978fa00
[ 9.058646] x5 : 0000000000000000 x4 : ffff801f7978fb58
[ 9.058649] x3 : ffff801f7978fb58 x2 : ffff801f7978fb58
[ 9.058652] x1 : ffff801f7978f5d0 x0 : 0000000000000001
[ 9.058656] Process swapper/7 (pid: 0, stack limit = 0xffff801f700cc000)
[ 9.058658] Stack: (0xffff801f79787e60 to 0xffff801f700d0000)
[ 9.058660] Call trace:
[ 9.058662] Exception stack(0xffff801f79787c70 to 0xffff801f79787da0)
[ 9.058665] 7c60: ffff801f7978fa00 0001000000000000
[ 9.058668] 7c80: 000000000242d000 ffff000008153d88 00000000004001c5 ffff801f79787d38
[ 9.058671] 7ca0: ffff801f79787cd0 ffff0000081070a4 ffff801f79787cd0 ffff0000081070b4
[ 9.058674] 7cc0: ffff801f6f1b9e00 ffff0000093b8c08 ffff801f79787d50 ffff000008107318
[ 9.058677] 7ce0: ffff801f6f1b9e00 ffff801f79791c10 ffff801f79795020 0000000000000000
[ 9.058680] 7d00: 0000000000000100 0000000000000007 ffff000009555658 ffff0000093b8000
[ 9.058682] 7d20: ffff801f79787d50 0000000000040d00 0000000000000001 ffff801f7978f5d0
[ 9.058685] 7d40: ffff801f7978fb58 ffff801f7978fb58 ffff801f7978fb58 0000000000000000
[ 9.058688] 7d60: ffff801f7978fa00 0000000000000000 ffff801f780026f0 0000000000000040
[ 9.058691] 7d80: 0000000000000b00 0000aaaac635fa10 3030302c30303030 302c323030302c31
[ 9.058695] [<ffff000008153d88>] __remove_hrtimer+0x48/0xa8
[ 9.058697] [<ffff000008153f74>] __hrtimer_run_queues+0xbc/0x2a8
[ 9.058700] [<ffff000008154b08>] hrtimer_interrupt+0xa8/0x228
[ 9.058707] [<ffff0000088cde24>] arch_timer_handler_phys+0x3c/0x50
[ 9.058711] [<ffff00000813dbc4>] handle_percpu_devid_irq+0x8c/0x230
[ 9.058714] [<ffff000008137914>] generic_handle_irq+0x34/0x50
[ 9.058716] [<ffff000008138018>] __handle_domain_irq+0x68/0xc0
[ 9.058719] [<ffff0000080816ec>] gic_handle_irq+0xcc/0x188
[ 9.058721] Exception stack(0xffff801f700cfe00 to 0xffff801f700cff30)
[ 9.058724] fe00: ffff000008fcd000 0000000000000000 0000000000000000 ffff000008fd6000
[ 9.058727] fe20: 0000801f707b7000 ffff801f700cff20 0000801f707b7000 ffff0000093b8698
[ 9.058729] fe40: 0000000000000000 ffff801f700cfe90 0000000000000b00 0000aaaac635fa10
[ 9.058732] fe60: 3030302c30303030 302c323030302c31 343030302c333030 0000ffff968adcc0
[ 9.058735] fe80: 0000000000000000 0000000000006688 0000ffffc8e8cb78 ffff000008fcd000
[ 9.058738] fea0: ffff0000093b9658 ffff0000093b9000 ffff000008fdd348 0000000000000000
[ 9.058741] fec0: 0000000000000000 ffff801f700c6900 0000000000000000 0000000000000000
[ 9.058743] fee0: 0000000000000000 ffff801f700cff30 ffff0000080859bc ffff801f700cff30
[ 9.058746] ff00: ffff0000080859c0 0000000000400145 ffff801f700cff20 ffff0000081489d8
[ 9.058748] ff20: ffffffffffffffff ffff000008148a54
[ 9.058751] [<ffff00000808315c>] el1_irq+0xdc/0x180
[ 9.058754] [<ffff0000080859c0>] arch_cpu_idle+0x30/0x168
[ 9.058760] [<ffff000008122bc4>] do_idle+0x114/0x1e0
[ 9.058763] [<ffff000008122e64>] cpu_startup_entry+0x2c/0x30
[ 9.058767] [<ffff000008092308>] secondary_start_kernel+0x108/0x118
[ 9.058770] [<00000000018781c4>] 0x18781c4
[ 9.058773] Code: 370000c0 a94153f3 a9425bf5 f9401bf7 (a8c47bfd)
[ 9.058803] ---[ end trace 963acec48f21d263 ]---
[ 9.058805] Kernel panic - not syncing: Fatal exception in interrupt
[ 9.058824] SMP: stopping secondary CPUs
[ 9.059063] Kernel Offset: disabled
[ 9.059066] CPU features: 0x101108
[ 9.059067] Memory Limit: none
[ 9.656325] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
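
The hex value the kernel prints next to "Synchronous External Abort" is the raw exception syndrome (ESR_ELx). As a reading aid, here is a minimal Python sketch that splits such a value into its architectural fields. The helper name decode_esr and the two lookup tables are illustrative assumptions, not part of any kernel tooling, and the tables only cover the abort classes relevant to this report.

```python
# Sketch: split an AArch64 ESR_ELx value into its fields.
# Layout per the Arm architecture: EC = bits [31:26], IL = bit 25,
# ISS = bits [24:0]. For instruction/data aborts the low 6 bits of
# ISS hold the fault status code (FSC).
# The tables below are intentionally partial (hypothetical helper data).

EC_NAMES = {
    0x20: "Instruction Abort from a lower Exception level",
    0x21: "Instruction Abort taken without a change in Exception level",
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

FSC_NAMES = {
    0x10: "Synchronous External abort, not on translation table walk",
    0x18: "Synchronous parity or ECC error on memory access, "
          "not on translation table walk",
}


def decode_esr(esr):
    """Return the EC, IL, ISS and FSC fields of a 32-bit ESR value."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    fsc = iss & 0x3F  # only meaningful when EC is an abort class
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not in this partial table"),
        "il": il,
        "iss": iss,
        "fsc": fsc,
        "fsc_name": FSC_NAMES.get(fsc, "not in this partial table"),
    }


if __name__ == "__main__":
    info = decode_esr(0x86000018)  # value from the panic above
    print(hex(info["ec"]), info["ec_name"])
    print(hex(info["fsc"]), info["fsc_name"])
```

For 0x86000018 this yields EC 0x21 (an instruction abort) with FSC 0x18, which is the "synchronous parity or ECC error" class named in the kernel message.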

Tags: artful
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1749685

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: artful
Paolo Pisati (p-pisati) wrote :

Apparently lundmark has ECC memory problems:

lundmark login: [13906.806163] Synchronous External Abort: synchronous parity or ECC error (0x96000018) at 0x0000ffffa1a36000
[13906.819864] Internal error: : 96000018 [#2] SMP
[13906.826338] Modules linked in: nls_iso8859_1 i2c_thunderx thunderx_edac i2c_smbus thunderx_zip gpio_keys shpchp cavium_rng_vf cavium_rng uio_pdrv_genirq ipmi_ssif uio ipmi_devintf ipmi_msghandler ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt aes_ce_blk fb_sys_fops aes_ce_cipher crc32_ce drm crct10dif_ce ghash_ce sha2_ce sha1_ce ahci libahci thunder_bgx thunder_xcv mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd
[13906.907111] CPU: 0 PID: 43189 Comm: cc1 Tainted: G D 4.13.0-33-generic #36~kpti414backport
[13906.921356] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[13906.932232] task: ffff801ebfb65a00 task.stack: ffff000017938000
[13906.940682] PC is at __arch_copy_to_user+0x140/0x248
[13906.948211] LR is at cp_new_stat+0x188/0x198
[13906.954972] pc : [<ffff000008a82040>] lr : [<ffff0000082d10d8>] pstate: 80000145
[13906.964884] sp : ffff00001793bd40
[13906.970651] x29: ffff00001793bd40 x28: ffff801ebfb65a00
[13906.978453] x27: ffff000008ab2000 x26: 0000000000000050
[13906.986192] x25: 0000000000000124 x24: 0000000000000015
[13906.993908] x23: 0000000000000000 x22: 000000001793bd98
[13907.001571] x21: ffff801ebfb65a00 x20: ffff0000093d8000
[13907.009221] x19: ffff00001793be30 x18: 0000000000000000
[13907.016833] x17: 0000ffffa2987748 x16: ffff0000082d17a0
[13907.024426] x15: ffffffffffffffff x14: ff00000000000000
[13907.032072] x13: ffffffffffffffff x12: 0000000000000002
[13907.039698] x11: 0000000000000004 x10: 0000000000000000
[13907.047360] x9 : 000003e8000003e8 x8 : 00000001000081b4
[13907.054885] x7 : 00000000019a0839 x6 : 000000001793bdb0
[13907.062466] x5 : 000000001793be18 x4 : 0000000000000008
[13907.070147] x3 : 0000000000000802 x2 : fffffffffffffff8
[13907.077750] x1 : ffff00001793bda0 x0 : 000000001793bd98
[13907.085277] Process cc1 (pid: 43189, stack limit = 0xffff000017938000)
[13907.094022] Stack: (0xffff00001793bd40 to 0xffff00001793c000) [22/7865]
[13907.102004] bd40: ffff00001793be00 ffff0000082d17f8 ffff0000093d8000 0000000000000004
[13907.112119] bd60: 000000001793bd98 0000ffffa298775c ffff801ce388f710 0000000000000802
[13907.122311] bd80: 00000000019a0839 00000001000081b4 000003e8000003e8 0000000000000000
[13907.132302] bda0: 0000000000000000 000000000000075c 0000000000001000 0000000000000008
[13907.142201] bdc0: 000000005a857e32 000000001eb68da6 000000005a857a9a 000000002a0b1064
[13907.152053] bde0: 000000005a857a9a 000000002a0b1064 0000000000000000 0000000000040d00
[13907.161858] be00: ffff00001793bff0 ffff000008083c00 ffffffffffffff2c 0000801f7375f000
[13907.17160...


dann frazier (dannf) wrote :

Since this appears to be a hardware issue, I'll mark this Invalid. FYI, we're working with the vendor to figure out next steps for lundmark.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
dann frazier (dannf) wrote :

I've tried a couple of memory testers:
 - Userspace 'memtester', which passed overnight
 - Kernel's 'memtest' cmdline arg, which also passed during early boot. Running stress-ng afterwards still reported errors.

Attached is the console log from the kernel 'memtest' run. Note that I saw 3 ECC errors here; the last 2 do appear to be at userspace addresses:

$ grep 'ECC error' console.log
lundmark login: [13515.008197] Synchronous External Abort: synchronous parity or ECC error (0x86000018) at 0x0000aaabd45d5fff
[13516.695278] Synchronous External Abort: synchronous parity or ECC error (0x86000018) at 0x0000ffffa12d2e90
[13525.127804] Synchronous External Abort: synchronous parity or ECC error (0x86000018) at 0x0000ffff80007fc8
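
For longer console logs, the same grep can be turned into a small script that also labels each faulting address. The following is a rough Python sketch, not a tool used in this report. It assumes a 48-bit virtual address split (top 16 bits all zero for user space, all one for kernel space), which is consistent with the kernel addresses in the traces here but is still an assumption about the configuration. The names ECC_RE, classify_address and scan are made up for illustration.

```python
import re

# Matches lines such as:
# [13516.695278] Synchronous External Abort: synchronous parity or
#                ECC error (0x86000018) at 0x0000ffffa12d2e90
# A console prefix like "lundmark login: " before the timestamp is fine,
# because search() is used rather than match().
ECC_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+Synchronous External Abort: "
    r"synchronous parity or ECC error \((?P<esr>0x[0-9a-fA-F]+)\) "
    r"at (?P<addr>0x[0-9a-fA-F]+)"
)


def classify_address(addr):
    """Label a 64-bit virtual address, ASSUMING a 48-bit VA split."""
    top = addr >> 48
    if top == 0x0000:
        return "user"
    if top == 0xFFFF:
        return "kernel"
    return "non-canonical"


def scan(lines):
    """Return (timestamp, esr, address, label) for each ECC error line."""
    events = []
    for line in lines:
        m = ECC_RE.search(line)
        if m:
            addr = int(m.group("addr"), 16)
            events.append(
                (float(m.group("time")), int(m.group("esr"), 16),
                 addr, classify_address(addr))
            )
    return events


if __name__ == "__main__":
    # "console.log" is the file name used in the grep above.
    with open("console.log", errors="replace") as fh:
        for ts, esr, addr, label in scan(fh):
            print(f"{ts:>14.6f}  esr={esr:#010x}  addr={addr:#018x}  {label}")
```

The label only reflects which half of the address space the faulting address falls in; it says nothing about which DIMM or physical page is bad.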
