oops in set_next_entity / ipmi_msghandler

Bug #1754053 reported by dann frazier
This bug affects 1 person
Affects          Status      Importance   Assigned to   Milestone
linux (Ubuntu)   Expired     High         Unassigned
Artful           Won't Fix   High         Unassigned
Bionic           Expired     High         Unassigned

Bug Description

Seen on a Cavium CRB1S a couple of times while running the com.canonical.certification::disk/disk_stress_ng_sda testcase from the canonical-certification-server test suite:

[ 1823.116031] Unable to handle kernel read from unreadable memory at virtual address 00000038
[ 1823.124479] user pgtable: 4k pages, 48-bit VAs, pgd = ffff801f4bf00000
[ 1823.131068] [0000000000000038] *pgd=0000000000000000
[ 1823.136080] Internal error: Oops: 96000004 [#1] SMP
[ 1823.141002] Modules linked in: nls_iso8859_1 i2c_thunderx thunderx_zip thunderx_edac i2c_smbus shpchp cavium_rng_vf cavium_rng gpio_keys ipmi_ssif ipmi_devintf ipmi_msghandler uio_pdrv_genirq uio ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect aes_ce_blk sysimgblt aes_ce_cipher fb_sys_fops crc32_ce crct10dif_ce drm ghash_ce sha2_ce sha1_ce ahci libahci thunder_bgx thunder_xcv mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 1823.204136] CPU: 30 PID: 0 Comm: swapper/30 Not tainted 4.13.0-36-generic #40~16.04.1-Ubuntu
[ 1823.212655] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 1823.221085] task: ffff801f7320bc00 task.stack: ffff801f73210000
[ 1823.227067] PC is at set_next_entity+0x28/0x5e8
[ 1823.231640] LR is at pick_next_task_fair+0xa0/0x580
[ 1823.236561] pc : [<ffff00000810e310>] lr : [<ffff000008118f28>] pstate: 604001c5
[ 1823.244026] sp : ffff801f73213dc0
[ 1823.247368] x29: ffff801f73213dc0 x28: ffff801f7ca19800
[ 1823.252731] x27: ffff000008a76aac x26: ffff801f7ca19868
[ 1823.258093] x25: ffff801f7320c2c0 x24: ffff0000093c8000
[ 1823.263455] x23: 0000000000000000 x22: ffff000008fe1000
[ 1823.268817] x21: ffff801f7ca19868 x20: ffff801f7ca19868
[ 1823.274178] x19: 0000000000000000 x18: 0000ffffca622758
[ 1823.279541] x17: 00000000002711a2 x16: 0000000000000000
[ 1823.284903] x15: 00003dc0f68eb41e x14: 0001318c81308142
[ 1823.290266] x13: 00000003e8000000 x12: 00000000000000b4
[ 1823.295628] x11: 0000000000000000 x10: 00000000000000b4
[ 1823.300990] x9 : ffff000008aa5c98 x8 : ffff801f7320c760
[ 1823.306353] x7 : 0000000000000020 x6 : 003541e45d3fef78
[ 1823.311715] x5 : 0000000000000018 x4 : ffff000008118e88
[ 1823.317077] x3 : ffff000008aa5a70 x2 : ffff00000810cb18
[ 1823.324394] x1 : 0000000000000000 x0 : ffff000008118f28
[ 1823.331704] Process swapper/30 (pid: 0, stack limit = 0xffff801f73210000)
[ 1823.340536] Stack: (0xffff801f73213dc0 to 0xffff801f73214000)
[ 1823.348292] 3dc0: ffff801f73213e10 ffff000008118f28 0000000000000000 ffff801f7ca19868
[ 1823.358135] 3de0: ffff801f73213f28 ffff000008fe1000 0000000000000000 ffff0000093c8000
[ 1823.367943] 3e00: ffff801f7320c2c0 ffff801f7ca19868 ffff801f73213eb0 ffff000008a75f98
[ 1823.377737] 3e20: ffff801f7ca19800 ffff801f7320bc00 ffff801f73213f28 ffff000008fe1000
[ 1823.387523] 3e40: 0000000000000000 ffff0000093c8000 ffff801f7320c2c0 ffff801f7ca19800
[ 1823.397271] 3e60: ffff000008a76aac ffff0000093c8000 ffff801f7ca19868 ffff00000814a4e4
[ 1823.406965] 3e80: ffff801f73213eb0 ffff000008aa5bb0 ffff801f7320bc00 ffff801f73213f28
[ 1823.416614] 3ea0: ffff801f7ca19800 0000000000040d00 ffff801f73213f40 ffff000008a76aac
[ 1823.426211] 3ec0: ffff801f7320bc00 ffff0000093c9658 ffff0000093c9000 ffff000008ff1348
[ 1823.435758] 3ee0: 0000000000000000 0000000000000000 ffff801f7320bc00 0000000000000000
[ 1823.445266] 3f00: 0000000000000000 0000000000000000 ffff801f73213f60 ffff000008aa5a70
[ 1823.454714] 3f20: ffff000008fe1000 ffff0000093c9658 ffff000000000000 0000000000040d00
[ 1823.464120] 3f40: ffff801f73213f60 ffff000008123238 ffff000008fe1000 0000000000040d00
[ 1823.473470] 3f60: ffff801f73213fb0 ffff000008123534 0000000000000084 000000000000001e
[ 1823.482770] 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1823.492031] 3fa0: 0000000000000000 ffff000008123524 ffff801f73213fd0 ffff000008092308
[ 1823.501281] 3fc0: 000000000000001e ffffffffffffffff 0000000000000000 00000000018831c4
[ 1823.510542] 3fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1823.519768] Call trace:
[ 1823.523564] Exception stack(0xffff801f73213bd0 to 0xffff801f73213d00)
[ 1823.531424] 3bc0: 0000000000000000 0001000000000000
[ 1823.540734] 3be0: 000000000243d000 ffff00000810e310 00000000604001c5 0000000000000001
[ 1823.550059] 3c00: 0000000000000000 0000000000000000 ffff801f73dbe880 ffff801f73dbd900
[ 1823.559408] 3c20: 0000000000000033 000000000000c000 0000000000000001 0000000000000017
[ 1823.568763] 3c40: 0000000000000017 0000000000000017 0000000000000017 0000000000000400
[ 1823.578098] 3c60: 0000000000000097 0000000000000001 0000000000000001 0000000000000000
[ 1823.587422] 3c80: 0000000000000000 0000000000040d00 ffff000008118f28 0000000000000000
[ 1823.596740] 3ca0: ffff00000810cb18 ffff000008aa5a70 ffff000008118e88 0000000000000018
[ 1823.606057] 3cc0: 003541e45d3fef78 0000000000000020 ffff801f7320c760 ffff000008aa5c98
[ 1823.615418] 3ce0: 00000000000000b4 0000000000000000 00000000000000b4 00000003e8000000
[ 1823.624791] [<ffff00000810e310>] set_next_entity+0x28/0x5e8
[ 1823.631915] [<ffff000008118f28>] pick_next_task_fair+0xa0/0x580
[ 1823.639406] [<ffff000008a75f98>] __schedule+0x130/0x8b0
[ 1823.646207] [<ffff000008a76aac>] schedule_idle+0x2c/0x48
[ 1823.653104] [<ffff000008123238>] do_idle+0xb8/0x1e0
[ 1823.659536] [<ffff000008123534>] cpu_startup_entry+0x2c/0x30
[ 1823.666729] [<ffff000008092308>] secondary_start_kernel+0x108/0x118
[ 1823.674507] [<00000000018831c4>] 0x18831c4
[ 1823.680110] Code: aa0003f5 aa1e03e0 aa0103f3 d503201f (b9403a60)
[ 1823.687766] ---[ end trace 331ab1a448238eaa ]---
[ 1823.693924] Kernel panic - not syncing: Attempted to kill the idle task!
[ 1823.702219] SMP: stopping secondary CPUs
[ 1824.796049] SMP: failed to stop secondary CPUs 10,30
[ 1824.802592] Unable to handle kernel read from unreadable memory at virtual address 00000000
[ 1824.812593] user pgtable: 4k pages, 48-bit VAs, pgd = ffff801f4bf00000
[ 1824.820785] [0000000000000000] *pgd=0000000000000000
[ 1824.827405] Internal error: Oops: 86000004 [#2] SMP
[ 1824.833951] Modules linked in: nls_iso8859_1 i2c_thunderx thunderx_zip thunderx_edac i2c_smbus shpchp cavium_rng_vf cavium_rng gpio_keys ipmi_ssif ipmi_devintf ipmi_msghandler uio_pdrv_genirq uio ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect aes_ce_blk sysimgblt aes_ce_cipher fb_sys_fops crc32_ce crct10dif_ce drm ghash_ce sha2_ce sha1_ce ahci libahci thunder_bgx thunder_xcv mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 1824.911861] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G D 4.13.0-36-generic #40~16.04.1-Ubuntu
[ 1824.925653] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
[ 1824.936146] task: ffff801f7320bc00 task.stack: ffff801f73210000
[ 1824.944207] PC is at 0x0
[ 1824.948879] LR is at panic_event+0x84/0x110 [ipmi_msghandler]
[ 1824.956803] pc : [<0000000000000000>] lr : [<ffff0000019de93c>] pstate: 204001c5
[ 1824.966439] sp : ffff801f732138b0
[ 1824.971992] x29: ffff801f732138b0 x28: ffff801f7320bc00
[ 1824.979561] x27: ffff000008a76aac x26: ffff801f7ca19868
[ 1824.987103] x25: ffff801f7320c2c0 x24: ffff801f7320bc00
[ 1824.994623] x23: 0000000000000000 x22: ffff00000955cbd8
[ 1825.002107] x21: 0000000000000001 x20: ffff0000019e5080
[ 1825.009575] x19: ffff801f7140c000 x18: ffff0000093c8c08
[ 1825.017034] x17: 0000000000000001 x16: 0000000000000007
[ 1825.024493] x15: ffff00008956e757 x14: 0000000000000001
[ 1825.031943] x13: 0000000000000020 x12: ffff801f7269e520
[ 1825.039396] x11: 00000000ffffffff x10: 0000000000000001
[ 1825.046838] x9 : ffff000014bcdba0 x8 : ffff000008b21a08
[ 1825.054265] x7 : 0000000000000001 x6 : 000000000000050d
[ 1825.061675] x5 : ffff0000019e5160 x4 : 0000000000000000
[ 1825.069054] x3 : ffff0000019de8b8 x2 : 0000000000000000
[ 1825.076389] x1 : 0000000000000001 x0 : ffff801f27eed000
[ 1825.083697] Process swapper/30 (pid: 0, stack limit = 0xffff801f73210000)
[ 1825.092518] Stack: (0xffff801f732138b0 to 0xffff801f73214000)
[ 1825.100269] 38a0: ffff801f732138e0 ffff0000080fca1c
[ 1825.110125] 38c0: ffff000009402760 00000000fffffffe 0000000000000000 0000000000000100
[ 1825.119984] 38e0: ffff801f73213920 ffff0000080fcb7c ffff00000955cfd8 0000000000000000
[ 1825.129845] 3900: ffff00000955cbd8 ffff000008d33800 00000000000001c0 ffff000008d33800
[ 1825.139672] 3920: ffff801f73213950 ffff0000080d2e38 ffff00000955c000 ffff00000955cbd8
[ 1825.149449] 3940: 0000000000000000 0000000000040d00 ffff801f73213a30 ffff0000080d7ffc
[ 1825.159185] 3960: 000000000000000b ffff801f7320bc00 0000000000000001 ffff000008d2b0e8
[ 1825.168875] 3980: 00000000000001c0 ffff801f7320bc00 ffff801f7320c2c0 ffff801f7ca19868
[ 1825.178511] 39a0: ffff801f73213a30 ffff801f73213a30 ffff801f732139f0 00000000ffffffc8
[ 1825.188101] 39c0: ffff801f7320bc00 ffff801f73213a30 ffff801f73213a30 ffff801f732139f0
[ 1825.197633] 39e0: 00000000ffffffc8 0000000000040d00 ffff801f7320bc00 0000000000000000
[ 1825.207130] 3a00: ffff801f7320bc00 00000000ffffffff 0000000000000000 0000000000000001
[ 1825.216594] 3a20: 000000000000050a 0000000000000001 ffff801f73213ab0 ffff00000808c130
[ 1825.226037] 3a40: ffff00000954f000 ffff801f73213c90 0000000000000001 ffff000008d2b0e8
[ 1825.235452] 3a60: 00000000000001c0 ffff801f7320bc00 ffff801f7320c2c0 ffff801f7ca19868
[ 1825.244836] 3a80: ffff000008a76aac ffff801f7320bc00 ffff801f73213ab0 ffff00000808c084
[ 1825.254205] 3aa0: ffff00000954f000 0000000000040d00 ffff801f73213af0 ffff00000809f48c
[ 1825.263562] 3ac0: 0000000096000004 0000000000000021 ffff801f73213c90 0000000000000038
[ 1825.272888] 3ae0: 0000000000000000 0000000000000025 ffff801f73213b20 ffff000008a7cf70
[ 1825.282180] 3b00: ffff801f73213c90 ffff801f7320bc00 0000000096000004 0000000000000038
[ 1825.291503] 3b20: ffff801f73213b90 ffff000008a7d16c 0000000000000038 0000000096000004
[ 1825.300829] 3b40: ffff801f73213c90 0000000000000038 ffff801f73213c90 0000000000000025
[ 1825.310168] 3b60: ffff801f7320c2c0 ffff801f7ca19868 ffff000008a76aac ffff801f7320bc00
[ 1825.319526] 3b80: 00000000000001c0 ffff0000093c9000 ffff801f73213bc0 ffff000008081244
[ 1825.328915] 3ba0: ffff000008a97218 ffff0000093c8000 0000000096000004 ffff0000081352e4
[ 1825.338313] 3bc0: ffff801f73213dc0 ffff000008082f38 0000000000000000 0001000000000000
[ 1825.347717] 3be0: 000000000243d000 ffff00000810e310 00000000604001c5 0000000000000001
[ 1825.357136] 3c00: 0000000000000000 0000000000000000 ffff801f73dbe880 ffff801f73dbd900
[ 1825.366570] 3c20: 0000000000000033 000000000000c000 0000000000000001 0000000000000017
[ 1825.376013] 3c40: 0000000000000017 0000000000000017 0000000000000017 0000000000000400
[ 1825.385442] 3c60: 0000000000000097 0000000000000001 0000000000000001 0000000000000000
[ 1825.394855] 3c80: 0000000000000000 0000000000040d00 ffff000008118f28 0000000000000000
[ 1825.404272] 3ca0: ffff00000810cb18 ffff000008aa5a70 ffff000008118e88 0000000000000018
[ 1825.413707] 3cc0: 003541e45d3fef78 0000000000000020 ffff801f7320c760 ffff000008aa5c98
[ 1825.423164] 3ce0: 00000000000000b4 0000000000000000 00000000000000b4 00000003e8000000
[ 1825.432609] 3d00: 0001318c81308142 00003dc0f68eb41e 0000000000000000 00000000002711a2
[ 1825.442035] 3d20: 0000ffffca622758 0000000000000000 ffff801f7ca19868 ffff801f7ca19868
[ 1825.451452] 3d40: ffff000008fe1000 0000000000000000 ffff0000093c8000 ffff801f7320c2c0
[ 1825.460847] 3d60: ffff801f7ca19868 ffff000008a76aac ffff801f7ca19800 ffff801f73213dc0
[ 1825.470244] 3d80: ffff000008118f28 ffff801f73213dc0 ffff00000810e310 00000000604001c5
[ 1825.479645] 3da0: ffff801f7ca1d280 053d35967b5b3cc0 ffffffffffffffff ffff00000816545c
[ 1825.489059] 3dc0: ffff801f73213e10 ffff000008118f28 0000000000000000 ffff801f7ca19868
[ 1825.498477] 3de0: ffff801f73213f28 ffff000008fe1000 0000000000000000 ffff0000093c8000
[ 1825.507882] 3e00: ffff801f7320c2c0 ffff801f7ca19868 ffff801f73213eb0 ffff000008a75f98
[ 1825.517299] 3e20: ffff801f7ca19800 ffff801f7320bc00 ffff801f73213f28 ffff000008fe1000
[ 1825.526702] 3e40: 0000000000000000 ffff0000093c8000 ffff801f7320c2c0 ffff801f7ca19800
[ 1825.536122] 3e60: ffff000008a76aac ffff0000093c8000 ffff801f7ca19868 ffff00000814a4e4
[ 1825.545523] 3e80: ffff801f73213eb0 ffff000008aa5bb0 ffff801f7320bc00 ffff801f73213f28
[ 1825.554920] 3ea0: ffff801f7ca19800 0000000000040d00 ffff801f73213f40 ffff000008a76aac
[ 1825.564302] 3ec0: ffff801f7320bc00 ffff0000093c9658 ffff0000093c9000 ffff000008ff1348
[ 1825.573671] 3ee0: 0000000000000000 0000000000000000 ffff801f7320bc00 0000000000000000
[ 1825.583046] 3f00: 0000000000000000 0000000000000000 ffff801f73213f60 ffff000008aa5a70
[ 1825.592410] 3f20: ffff000008fe1000 ffff0000093c9658 ffff000000000000 0000000000040d00
[ 1825.601787] 3f40: ffff801f73213f60 ffff000008123238 ffff000008fe1000 0000000000040d00
[ 1825.611157] 3f60: ffff801f73213fb0 ffff000008123534 0000000000000084 000000000000001e
[ 1825.620526] 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1825.629870] 3fa0: 0000000000000000 ffff000008123524 ffff801f73213fd0 ffff000008092308
[ 1825.639228] 3fc0: 000000000000001e ffffffffffffffff 0000000000000000 00000000018831c4
[ 1825.648591] 3fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1825.657933] Call trace:
[ 1825.661858] Exception stack(0xffff801f732136c0 to 0xffff801f732137f0)
[ 1825.669837] 36c0: ffff801f7140c000 0001000000000000 000000000243d000 0000000000000000
[ 1825.679249] 36e0: 00000000204001c5 ffff000008136c1c ffff801f732137e0 ffff000008d2d0d0
[ 1825.688680] 3700: ffff0000093c9658 ffff0000093c8000 00000000000001c0 ffff801f7320bc00
[ 1825.698110] 3720: ffff801f7320c2c0 ffff801f7ca19868 ffff000008a76aac ffff801f7320bc00
[ 1825.707540] 3740: ffff801f732137e0 ffff000008136c1c ffff801f73213840 0000000000000000
[ 1825.716996] 3760: ffff801f732138f0 ffff801f732138f0 ffff801f732138b0 0000000000040d00
[ 1825.726461] 3780: ffff801f27eed000 0000000000000001 0000000000000000 ffff0000019de8b8
[ 1825.735955] 37a0: 0000000000000000 ffff0000019e5160 000000000000050d 0000000000000001
[ 1825.745461] 37c0: ffff000008b21a08 ffff000014bcdba0 0000000000000001 00000000ffffffff
[ 1825.754945] 37e0: ffff801f7269e520 0000000000000020
[ 1825.761460] [< (null)>] (null)
[ 1825.767801] [<ffff0000080fca1c>] notifier_call_chain+0x5c/0xa0
[ 1825.775268] [<ffff0000080fcb7c>] atomic_notifier_call_chain+0x3c/0x50
[ 1825.783352] [<ffff0000080d2e38>] panic+0x15c/0x29c
[ 1825.789770] [<ffff0000080d7ffc>] do_exit+0x834/0xa38
[ 1825.796341] [<ffff00000808c130>] die+0x1b0/0x1c8
[ 1825.802536] [<ffff00000809f48c>] __do_kernel_fault+0xbc/0x130
[ 1825.809856] [<ffff000008a7cf70>] do_page_fault+0x250/0x3e0
[ 1825.816920] [<ffff000008a7d16c>] do_translation_fault+0x6c/0x7c
[ 1825.824410] [<ffff000008081244>] do_mem_abort+0x6c/0xe0
[ 1825.831203] Exception stack(0xffff801f73213bd0 to 0xffff801f73213d00)
[ 1825.839239] 3bc0: 0000000000000000 0001000000000000
[ 1825.848713] 3be0: 000000000243d000 ffff00000810e310 00000000604001c5 0000000000000001
[ 1825.858191] 3c00: 0000000000000000 0000000000000000 ffff801f73dbe880 ffff801f73dbd900
[ 1825.867681] 3c20: 0000000000000033 000000000000c000 0000000000000001 0000000000000017
[ 1825.877170] 3c40: 0000000000000017 0000000000000017 0000000000000017 0000000000000400
[ 1825.886628] 3c60: 0000000000000097 0000000000000001 0000000000000001 0000000000000000
[ 1825.896079] 3c80: 0000000000000000 0000000000040d00 ffff000008118f28 0000000000000000
[ 1825.905509] 3ca0: ffff00000810cb18 ffff000008aa5a70 ffff000008118e88 0000000000000018
[ 1825.914942] 3cc0: 003541e45d3fef78 0000000000000020 ffff801f7320c760 ffff000008aa5c98
[ 1825.924403] 3ce0: 00000000000000b4 0000000000000000 00000000000000b4 00000003e8000000
[ 1825.933872] [<ffff000008082f38>] el1_da+0x24/0xa0
[ 1825.940210] [<ffff000008118f28>] pick_next_task_fair+0xa0/0x580
[ 1825.947770] [<ffff000008a75f98>] __schedule+0x130/0x8b0
[ 1825.954632] [<ffff000008a76aac>] schedule_idle+0x2c/0x48
[ 1825.961544] [<ffff000008123238>] do_idle+0xb8/0x1e0
[ 1825.967994] [<ffff000008123534>] cpu_startup_entry+0x2c/0x30
[ 1825.975196] [<ffff000008092308>] secondary_start_kernel+0x108/0x118
[ 1825.982996] [<00000000018831c4>] 0x18831c4
[ 1825.988599] Code: bad PC value
[ 1825.993141] ---[ end trace 331ab1a448238eab ]---
[ 1825.999255] Kernel panic - not syncing: Attempted to kill the idle task!
[ 1826.007524] SMP: stopping secondary CPUs
[ 1827.101309] SMP: failed to stop secondary CPUs 10,30
[ 1827.107810] Kernel Offset: disabled
[ 1827.112822] CPU features: 0x101108
[ 1827.117715] Memory Limit: none
[ 1827.122259] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
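
For readers decoding the log above: on arm64 the number after "Oops:" is the ESR_EL1 syndrome value, and the word in parentheses on the "Code:" line is the instruction at the faulting PC. The sketch below is a reading aid rather than part of the original report. The helper names are made up for illustration, and only the two exception classes and the one fault status code that appear in this log are named.

```python
# Reading aid for the oops above; helper names are illustrative, not kernel APIs.

# ESR_ELx exception classes seen in this log (bits [31:26]).
EC_NAMES = {
    0x21: "instruction abort, same exception level",
    0x25: "data abort, same exception level",
}
# Fault status code seen in this log (bits [5:0] of the ISS).
FSC_NAMES = {0x04: "translation fault, level 0"}


def decode_esr(esr):
    """Split an ESR_ELx value into (EC, IL, fault status code)."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    fsc = esr & 0x3F
    return ec, il, fsc


def decode_ldr_w_uimm(insn):
    """Decode a 32-bit LDR (immediate, unsigned offset) into (Rt, Rn, byte offset)."""
    if (insn & 0xFFC00000) != 0xB9400000:
        raise ValueError("not a 32-bit LDR with unsigned immediate offset")
    imm12 = (insn >> 10) & 0xFFF
    rn = (insn >> 5) & 0x1F
    rt = insn & 0x1F
    return rt, rn, imm12 * 4  # imm12 is scaled by the 4-byte access size


for esr in (0x96000004, 0x86000004):
    ec, il, fsc = decode_esr(esr)
    print(f"{esr:#010x}: {EC_NAMES.get(ec, hex(ec))}; {FSC_NAMES.get(fsc, hex(fsc))}")

rt, rn, offset = decode_ldr_w_uimm(0xB9403A60)
print(f"ldr w{rt}, [x{rn}, #{offset:#x}]")
```

With x19 = 0 in the first register dump, a load from [x19, #0x38] matches the reported fault address 0x38, which is consistent with a NULL structure pointer being dereferenced in set_next_entity. The second oops is an instruction abort with the PC at 0x0 and the link register in panic_event, which is consistent with a call through a NULL function pointer from the ipmi_msghandler panic notifier.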

dann frazier (dannf)
description: updated
Changed in linux (Ubuntu Artful):
status: New → Confirmed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1754053

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: artful
Changed in linux (Ubuntu Artful):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
tags: added: kernel-da-key
dann frazier (dannf)
description: updated
dann frazier (dannf) wrote :

These observations were on a system in our lab called "seuss". My suggestions for next steps would be:

1) Determine how reliable this failure is.
 a) Deploy xenial/hwe on seuss, and downgrade the kernel to the version in the crash log (4.13.0-36-generic #40~16.04.1-Ubuntu).
 b) Run the com.canonical.certification::disk/disk_stress_ng_sda test ~10 times, and see how frequently we hit this failure.

2) Test another CRB1S system just like in 1). Does it also hit this issue?

3) Test w/ the latest upstream kernel (https://wiki.ubuntu.com/Kernel/MainlineBuilds). Is it still reproducible?

4) If the failure is found to be reliable in #1 but never occurs with the latest mainline kernel in #3, it may be a candidate for bisection. I'd suggest bisecting with upstream git, first verifying that v4.13 fails and master does not (to rule out Ubuntu-specific patches). Remember this will be backwards from a typical bisect - "good" here means it fails, "bad" means it does not fail. Therefore, the first "bad" commit would be the one that fixes it.
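
The inverted bisect in step 4 amounts to a binary search for the first commit at which the failure stops reproducing. A minimal sketch of that logic follows. `first_fixed` and the `fails` predicate are hypothetical stand-ins for building, booting, and stress-testing a kernel at a given commit, and the sketch assumes a linear history in which the failure, once fixed, stays fixed.

```python
def first_fixed(commits, fails):
    """Return the first commit (oldest to newest order) where fails() is False.

    Assumes fails(commits[0]) is True and fails(commits[-1]) is False,
    mirroring "v4.13 fails and master does not".
    """
    if len(commits) < 2:
        raise ValueError("need at least a failing and a non-failing commit")
    if not fails(commits[0]) or fails(commits[-1]):
        raise ValueError("endpoints must be failing (oldest) and fixed (newest)")

    lo, hi = 0, len(commits) - 1
    # Invariant: commits[lo] still fails, commits[hi] no longer fails.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Toy history: the (hypothetical) fix lands in commit 13 of 20.
history = list(range(20))
print(first_fixed(history, lambda commit: commit < 13))
```

Because the crash is intermittent, each `fails` evaluation would in practice need several test runs (see step 1) before a commit can be called fixed. Newer git releases can also express the inversion directly with `git bisect start --term-old=<term> --term-new=<term>`, which avoids having to remember that "good" means "fails".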

Manoj Iyer (manjo) wrote :

-- linux-hwe --
Reproduced with linux-hwe on xenial running stress-ng sda. We either see soft lockups or we see the kernel oopsing on random fs/jbd2/ code, which I believe is a side effect of lockups.

Manoj Iyer (manjo) wrote :

-- linux-hwe-edge --
Reproduced the issue with linux-hwe-edge running stress-ng sda. Again we see soft lockups that make the system unstable.

Manoj Iyer (manjo) wrote :

-- linux mainline --
The current mainline kernel (4.16 04-01-2018) from http://kernel.ubuntu.com/~kernel-ppa/mainline/ was installed and tested using stress-ng sda. Again, soft lockups made the system unstable.

Manoj Iyer (manjo) wrote :

Based on all the testing with different kernels, I suspect this might be a HW/FW issue, and we might want to seek guidance from Cavium on how to effectively address it on Seuss. Please note that the issue is not reproduced on a similar CRB system, and occurs *only on* Seuss. Were there any older kernel versions on which these tests passed on Seuss? If so, we can do a reverse bisect to find the offending patch.

Manoj Iyer (manjo) wrote :

-- pre-req for apt-add-repository --
$ sudo apt-get install -y python-software-properties

-- Install cert --
$ sudo apt-add-repository -y ppa:hardware-certification/public
$ sudo apt-get update
$ sudo apt-get install -y canonical-certification-server

Edit this file for iperf3:
        /etc/xdg/canonical-certification.conf

Towards the end of this file, add the iperf3 server IP address.

-- To recreate this issue please run --
$ sudo /usr/lib/plainbox-provider-checkbox/bin/disk_stress_ng sda --really-run

Manoj Iyer (manjo) wrote :

Collected more data points on this issue.

1. Tried offlining CPUs. We found that the crash was typically on CPU:40, so I offlined CPU40 and repeated the test. The test seemed to make progress but panicked on a different CPU. I tried offlining several more CPUs, but the crash just moved on to other CPUs.

2. Changed the I/O scheduler from CFQ to NOOP. This made no difference either: the crash was seen on CPU:44, and offlining CPU44 yielded the same results.
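
Both experiments above go through sysfs. A rough sketch of the writes involved, assuming the standard /sys/devices/system/cpu/cpuN/online and /sys/block/<dev>/queue/scheduler files. The helper names and the `sysfs_root` parameter (there so the functions can be tried against a scratch directory) are illustrative, and on a real system the writes need root.

```python
from pathlib import Path


def set_cpu_online(cpu, online, sysfs_root="/sys"):
    """Write 1 or 0 to the CPU's 'online' file to hotplug it in or out."""
    path = Path(sysfs_root) / "devices" / "system" / "cpu" / f"cpu{cpu}" / "online"
    path.write_text("1" if online else "0")
    return path


def set_io_scheduler(device, scheduler, sysfs_root="/sys"):
    """Select the I/O scheduler for a block device, e.g. 'noop' for 'sda'."""
    path = Path(sysfs_root) / "block" / device / "queue" / "scheduler"
    path.write_text(scheduler)
    return path


# The two experiments would then look like this (not executed here):
#   set_cpu_online(40, False)
#   set_io_scheduler("sda", "noop")
```

Which scheduler names are accepted depends on the kernel and block layer in use; reading the same scheduler file lists the available choices.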

Panics seem to happen either in the scheduler or in ext4 code (note that we are running stress on SDA). According to Cavium engineering, this could be due to a bad L2 cache or memory. While tailing /var/log/syslog and /var/log/kern.log during the tests, I did see messages like this:

Jun 12 14:57:55 seuss ipmievd: Voltage sensor CPU_VTT_DDR02 Upper Non-critical going high Asserted (Reading 0.77 > Threshold 0.77 Volts)
Jun 12 14:57:56 seuss ipmievd: Voltage sensor CPU_VTT_DDR02 Upper Non-critical going high Deasserted (Reading 0.76 > Threshold 0.77 Volts)
Jun 12 14:57:57 seuss ipmievd: Voltage sensor CPU_VTT_DDR13 Upper Non-critical going high Deasserted (Reading 0.76 > Threshold 0.77 Volts)
Jun 12 14:57:58 seuss ipmievd: Voltage sensor CPU_VTT_DDR13 Upper Non-critical going high Asserted (Reading 0.77 > Threshold 0.77 Volts)

We have other CRB1S systems that function as expected, and the stress-ng tests do not cause any panics on them. I am tempted to consider this a hardware issue with this particular CRB1S.
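
To check whether the voltage excursions line up with the crashes, the ipmievd messages can be tallied per sensor. A small sketch, assuming the line format shown in the excerpt above. The regular expression and the function names are illustrative and were written against those four lines only.

```python
import re
from collections import Counter

# Pattern inferred from the four syslog lines quoted in this comment.
LINE_RE = re.compile(
    r"ipmievd: Voltage sensor (?P<sensor>\S+) (?P<event>.+?) "
    r"(?P<state>Asserted|Deasserted) "
    r"\(Reading (?P<reading>[\d.]+) > Threshold (?P<threshold>[\d.]+) Volts\)"
)


def parse_ipmievd(lines):
    """Yield (sensor, state, reading, threshold) for each matching line."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield (m["sensor"], m["state"], float(m["reading"]), float(m["threshold"]))


def count_asserts(lines):
    """Count 'Asserted' events per sensor."""
    return Counter(
        sensor for sensor, state, _, _ in parse_ipmievd(lines) if state == "Asserted"
    )


sample = [
    "Jun 12 14:57:55 seuss ipmievd: Voltage sensor CPU_VTT_DDR02 Upper Non-critical going high Asserted (Reading 0.77 > Threshold 0.77 Volts)",
    "Jun 12 14:57:56 seuss ipmievd: Voltage sensor CPU_VTT_DDR02 Upper Non-critical going high Deasserted (Reading 0.76 > Threshold 0.77 Volts)",
    "Jun 12 14:57:57 seuss ipmievd: Voltage sensor CPU_VTT_DDR13 Upper Non-critical going high Deasserted (Reading 0.76 > Threshold 0.77 Volts)",
    "Jun 12 14:57:58 seuss ipmievd: Voltage sensor CPU_VTT_DDR13 Upper Non-critical going high Asserted (Reading 0.77 > Threshold 0.77 Volts)",
]
print(count_asserts(sample))
```

Running the same tally on a known-good CRB1S under the same stress load would show whether these threshold crossings are specific to seuss.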

Changed in linux (Ubuntu Bionic):
status: Incomplete → Won't Fix
Changed in linux (Ubuntu Artful):
status: Confirmed → Won't Fix
Changed in linux (Ubuntu):
status: Incomplete → Won't Fix
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
status: Won't Fix → Triaged
Changed in linux (Ubuntu Bionic):
status: Won't Fix → Triaged
Manoj Iyer (manjo) wrote :

If CONFIG_ARM64_SW_TTBR0_PAN is disabled in the kernel, we don't see these random crashes anymore. I ran the /usr/lib/plainbox-provider-checkbox/bin/disk_stress_ng and /usr/lib/plainbox-provider-checkbox/bin/memory_stress_ng tests on the CRB (seuss) over the last couple of days and did not see any crash. However, with disk_stress_ng I did see the following message on the console, which is a different issue from the one tracked in this bug.

[ 4163.900212] EXT4-fs error (device sda2) in ext4_free_inode:344: Corrupt filesystem

The next step would be to root-cause why disabling CONFIG_ARM64_SW_TTBR0_PAN fixes the RCU stalls and random crashes under stress-ng.
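
Whether a given kernel was built with the option can be read from its config file. A minimal sketch, assuming the usual Ubuntu location /boot/config-<release>; `kconfig_value` and `running_kernel_config` are hypothetical helpers, not existing tools.

```python
import platform
import re
from pathlib import Path


def kconfig_value(config_text, option):
    """Return the option's value ('y', 'm', ...), 'n' if explicitly unset, else None."""
    for raw in config_text.splitlines():
        line = raw.strip()
        m = re.fullmatch(rf"{re.escape(option)}=(.*)", line)
        if m:
            return m.group(1)
        if line == f"# {option} is not set":
            return "n"
    return None


def running_kernel_config(boot_dir="/boot"):
    """Read the running kernel's config; assumes the Ubuntu-style file name."""
    return (Path(boot_dir) / f"config-{platform.release()}").read_text()


# On the machine under test (not executed here):
#   kconfig_value(running_kernel_config(), "CONFIG_ARM64_SW_TTBR0_PAN")
```

Recording this value alongside each test run makes it easier to tell the PAN-enabled and PAN-disabled results apart later.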

Manoj Iyer (manjo) wrote :

processor : 0
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x0a1
CPU revision : 1
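
The implementer and part fields identify the core. A short sketch that decodes them, using the two identifiers from the kernel's arch/arm64/include/asm/cputype.h that apply here (implementer 0x43 is Cavium, part 0x0a1 is ThunderX). The lookup tables are deliberately incomplete and the function name is illustrative.

```python
# Only the IDs relevant to this report; cputype.h in the kernel tree lists more.
IMPLEMENTERS = {0x43: "Cavium"}
PARTS = {(0x43, 0x0A1): "ThunderX"}


def identify_cpu(cpuinfo_text):
    """Return (implementer name, part name) from one /proc/cpuinfo stanza."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    imp = int(fields["CPU implementer"], 16)
    part = int(fields["CPU part"], 16)
    # Fall back to the raw hex value for IDs not in the tables above.
    return IMPLEMENTERS.get(imp, hex(imp)), PARTS.get((imp, part), hex(part))


sample = """\
processor : 0
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x0a1
CPU revision : 1
"""
print(identify_cpu(sample))
```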

Changed in linux (Ubuntu):
status: Triaged → Incomplete
Changed in linux (Ubuntu Bionic):
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Bionic) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Bionic):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired