ubuntu 16.04.2: crashed at deactivate_slab+0x18c/0x640 when testing dlpar

Bug #1658968 reported by bugproxy
This bug affects 1 person
Affects                             Status      Importance  Assigned to       Milestone
The Ubuntu-power-systems project    Opinion     Undecided   Unassigned
linux (Ubuntu)                      Incomplete  Undecided   Taco Screen team

Bug Description

Problem Description
===============================
When testing cpu, memory and slot DLPAR on roselp4, the system crashed.
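For readers who want to exercise CPU hot-unplug without the DLPAR tooling, the following is a minimal sketch, not the reporter's test. It assumes the generic sysfs attribute `/sys/devices/system/cpu/cpuN/online` is available and that the script runs as root. Writing "0" there goes through `device_offline` and `cpu_subsys_offline`, which also appear in the call trace below, but it does not go through the DLPAR release path that `drmgr` uses, so it may or may not trigger the same crash. The helper names `set_cpu_online` and `cycle_cpu` are hypothetical.

```python
import os

DEFAULT_ROOT = "/sys/devices/system/cpu"


def set_cpu_online(cpu, online, sysfs_root=DEFAULT_ROOT):
    """Write '1' or '0' to cpu<N>/online and return the path written."""
    path = os.path.join(sysfs_root, f"cpu{cpu}", "online")
    with open(path, "w") as f:
        f.write("1" if online else "0")
    return path


def cycle_cpu(cpu, rounds=1, sysfs_root=DEFAULT_ROOT):
    """Offline and re-online one CPU `rounds` times (hypothetical helper)."""
    for _ in range(rounds):
        set_cpu_online(cpu, False, sysfs_root)
        set_cpu_online(cpu, True, sysfs_root)
```

Calling `cycle_cpu(82, rounds=10)` as root would repeatedly offline and online CPU 82; the CPU number is only an example taken from the oops header.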

---uname output---
Linux roselp4 4.8.0-34-generic #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = lpar

Stack trace output:
 [ 3289.065350] Unable to handle kernel paging request for data at address 0xc0000404565d6a00
[ 3289.065375] Faulting instruction address: 0xc0000000002e6eec
[ 3289.065379] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3289.065382] SMP NR_CPUS=2048 NUMA pSeries
[ 3289.065386] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) mlx4_en(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_core(OE) mlx_compat(OE) binfmt_misc pseries_rng vmx_crypto sunrpc knem(OE) autofs4 dm_round_robin btrfs xor raid6_pq lpfc crc32c_vpmsum ipr scsi_transport_fc devlink be2net scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: mlx4_core]
[ 3289.065424] CPU: 82 PID: 40197 Comm: drmgr Tainted: G OE 4.8.0-34-generic #36~16.04.1-Ubuntu
[ 3289.065427] task: c00000045081ce00 task.stack: c00000044d414000
[ 3289.065430] NIP: c0000000002e6eec LR: c0000000002e7718 CTR: c0000000002e7630
[ 3289.065433] REGS: c00000044d417470 TRAP: 0300 Tainted: G OE (4.8.0-34-generic)
[ 3289.065435] MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 24082822 XER: 20000000
[ 3289.065446] CFAR: c000000000008750 DAR: c0000404565d6a00 DSISR: 40000000 SOFTE: 0
               GPR00: c0000000002e7718 c00000044d4176f0 c0000000014a6600 c00000047e01f480
               GPR04: 0000000000000010 0000000082000075 0000000000000075 0000000000000001
               GPR08: 0000000002000000 0000000000000000 0000000082000075 0000000000000009
               GPR12: 0000000084002828 c000000007b4e200 0000000000000000 0000000000000000
               GPR16: 0000000000000000 0000000000000000 0000000000000000 c000000000d7a800
               GPR20: 0000000010000050 c000000000fd4e6c c0000003e7933840 c0000000014daae0
               GPR24: c00000000138dc48 0000000000000000 0000000000000001 c00000047e00fe80
               GPR28: c0000404565d6a00 c00000047e01f480 c0000004565de700 f000000001159740
[ 3289.065486] NIP [c0000000002e6eec] deactivate_slab+0x18c/0x640
[ 3289.065489] LR [c0000000002e7718] slab_cpuup_callback+0xe8/0x170
[ 3289.065491] Call Trace:
[ 3289.065493] [c00000044d4176f0] [c0000000002e715c] deactivate_slab+0x3fc/0x640 (unreliable)
[ 3289.065498] [c00000044d417810] [c0000000002e7718] slab_cpuup_callback+0xe8/0x170
[ 3289.065502] [c00000044d417880] [c0000000000f98c8] notifier_call_chain+0x98/0x110
[ 3289.065506] [c00000044d4178d0] [c0000000000ca564] __cpu_notify+0x54/0xa0
[ 3289.065509] [c00000044d4178f0] [c0000000000ca77c] cpu_notify_nofail+0x2c/0x40
[ 3289.065512] [c00000044d417910] [c0000000000ca7e4] notify_dead+0x54/0x170
[ 3289.065515] [c00000044d4179b0] [c0000000000c98c4] cpuhp_invoke_callback+0x84/0x250
[ 3289.065519] [c00000044d417a10] [c0000000000c9bfc] cpuhp_down_callbacks+0x8c/0x110
[ 3289.065523] [c00000044d417a60] [c00000000024e328] _cpu_down+0x168/0x2b0
[ 3289.065526] [c00000044d417ac0] [c0000000000cc068] do_cpu_down+0x68/0xb0
[ 3289.065530] [c00000044d417b00] [c000000000738448] cpu_subsys_offline+0x28/0x40
[ 3289.065534] [c00000044d417b20] [c00000000072f9e4] device_offline+0x104/0x140
[ 3289.065538] [c00000044d417b60] [c00000000009a7bc] dlpar_cpu_remove+0x24c/0x350
[ 3289.065542] [c00000044d417c40] [c00000000009aa50] dlpar_cpu_release+0x70/0xe0
[ 3289.065545] [c00000044d417c90] [c000000000021a04] arch_cpu_release+0x44/0x80
[ 3289.065548] [c00000044d417cb0] [c000000000738c8c] cpu_release_store+0x4c/0x80
[ 3289.065552] [c00000044d417ce0] [c00000000072b7b0] dev_attr_store+0x40/0x70
[ 3289.065555] [c00000044d417d00] [c0000000003e1e1c] sysfs_kf_write+0x6c/0xa0
[ 3289.065559] [c00000044d417d20] [c0000000003e0cdc] kernfs_fop_write+0x17c/0x250
[ 3289.065563] [c00000044d417d70] [c000000000322b20] __vfs_write+0x40/0x80
[ 3289.065566] [c00000044d417d90] [c000000000323ec4] vfs_write+0xd4/0x270
[ 3289.065571] [c00000044d417de0] [c000000000325acc] SyS_write+0x6c/0x110
[ 3289.065575] [c00000044d417e30] [c000000000009584] system_call+0x38/0xec
[ 3289.065577] Instruction dump:
[ 3289.065579] b0df0018 60420000 815f0018 55490bfe 5529f83e 7d294378 913f0018 7c2004ac
[ 3289.065585] e93f0000 792907a4 f93f0000 e93d0022 <7d5c482a> 2faa0000 419e0064 7f86e378
[ 3289.065596] ---[ end trace 7f6da25673d4d05e ]---
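When comparing this trace against other reports, it can help to reduce it to a list of symbols and offsets. The following is a minimal sketch that assumes the ppc64 frame layout shown above (timestamp, stack pointer, return address, `symbol+offset/size`); the function name `parse_frames` and the dictionary keys are illustrative, not part of any existing tool.

```python
import re

# Matches call-trace frames such as:
# [ 3289.065498] [c00000044d417810] [c0000000002e7718] slab_cpuup_callback+0xe8/0x170
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+"                      # dmesg timestamp
    r"\[([0-9a-f]{16})\]\s+"                   # stack pointer
    r"\[([0-9a-f]{16})\]\s+"                   # return address
    r"([\w.]+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)"   # symbol+offset/size
    r"(\s+\(unreliable\))?"                    # optional unwinder note
)


def parse_frames(text):
    """Return one dict per call-trace frame found in `text`."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if not m:
            continue  # skips register dumps, NIP/LR lines, and so on
        addr = int(m.group(2), 16)
        offset = int(m.group(4), 16)
        frames.append({
            "sp": int(m.group(1), 16),
            "addr": addr,
            "symbol": m.group(3),
            "offset": offset,
            "size": int(m.group(5), 16),
            "start": addr - offset,  # computed start address of the function
            "unreliable": m.group(6) is not None,
        })
    return frames


SAMPLE = """\
[ 3289.065493] [c00000044d4176f0] [c0000000002e715c] deactivate_slab+0x3fc/0x640 (unreliable)
[ 3289.065498] [c00000044d417810] [c0000000002e7718] slab_cpuup_callback+0xe8/0x170
[ 3289.065502] [c00000044d417880] [c0000000000f98c8] notifier_call_chain+0x98/0x110
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame["symbol"], hex(frame["offset"]), hex(frame["start"]))
```

The computed `start` for `slab_cpuup_callback` equals the CTR value in the register dump, and the `start` for `deactivate_slab` agrees with the NIP minus its `0x18c` offset, which is a quick sanity check that the frames were parsed correctly.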

Oops output:
 Oops: Kernel access of bad area, sig: 11 [#1]

== Comment: #12 - Ping Tian Han <email address hidden> - 2017-01-17 20:57:37 ==
Looks like this bug can be reproduced without the CadetE card. I think the problem occurs on the BabyBlueTip card:

0292:60:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
        Subsystem: IBM MT27520 Family [ConnectX-3 Pro]
        Kernel driver in use: mlx4_core
        Kernel modules: mlx4_core

== Comment: #14 - Carol L. Soto <email address hidden> - 2017-01-19 10:39:38 ==
I cannot find in the report the stack trace that this bugzilla is about.
However, the report does show the known issue where running DLPAR of memory and CPU with Mellanox cards causes the card to hit EEH. I think that was a kernel issue.

You can try to recreate the crash this bugzilla describes with DLPAR of the IO card, but the test that you are running will hit the known issue I explained.

== Comment: #15 - Ping Tian Han <email address hidden> - 2017-01-19 19:12:47 ==
(In reply to comment #14)
> I cannot find in the report the stack trace that this bugzilla is about.
> However, the report does show the known issue where running DLPAR of
> memory and CPU with Mellanox cards causes the card to hit EEH. I think
> that was a kernel issue.
>
> You can try to recreate the crash this bugzilla describes with DLPAR of the
> IO card, but the test that you are running will hit the known issue I explained.

Thanks. Looks like this is a Mellanox card issue.

Mirroring the bug to Canonical for their awareness.

bugproxy (bugproxy) wrote : dmesg captured by kdump

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-150353 severity-high targetmilestone-inin---

bugproxy (bugproxy) wrote : sosreport of roselp4

Default Comment by Bridge

bugproxy (bugproxy) wrote : /proc/meminfo and /proc/cpuinfo

Default Comment by Bridge

bugproxy (bugproxy) wrote : dmesg from roselp4

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)

Andrew Cloke (andrew-cloke) wrote :

As this is "for our awareness", marking as incomplete.

Changed in linux (Ubuntu):
status: New → Incomplete

Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: New → Opinion

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-11-20 04:01 EDT-------
*** This bug has been marked as a duplicate of bug 142747 ***

Frank Heimes (fheimes) wrote :

(IBM Bugzilla duplicate)
