[LTCTest][Opal][FW860] Oops: Kernel access of bad area, sig: 11 [#1] during frozen PE EEH error injection.

Bug #1683699 reported by bugproxy
Affects: linux (Ubuntu)
Status: Triaged
Importance: High
Assigned to: Canonical Kernel Team

Bug Description

== Comment: #0 - Pridhiviraj Paidipeddi <email address hidden> - 2016-08-13 08:28:54 ==
---Problem Description---
Installed the latest FW860 firmware (build SV860_028) on P8 PowerNV 8284-22A hardware, then installed Ubuntu 16.10 on top of it. During EEH frozen PE error injection, the kernel reported "Oops: Kernel access of bad area, sig: 11 [#1]".

Contact Information = <email address hidden>

---uname output---
Linux lep8b 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:04:07 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = PowerNV 8284-22A

---System Hang---
 The system hangs and needs a hard power off/on to bring it up again.

---Debugger---
A debugger is not configured

---Steps to Reproduce---
 1. Install FW860 firmware at level SV860_028 on P8 PowerNV 8284-22A hardware.
2. Install Ubuntu 16.10 on top of it.
3. Inject the frozen PE EEH error below.
echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
4. A kernel Oops is observed almost immediately.
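
For readers unfamiliar with the string written in step 3, here is a minimal C sketch of how five colon-separated hex fields such as 0:0:4:0:0 can be split apart. The field names (pe_no, type, func, addr, mask) and the struct and function names are assumptions made for illustration; they are not stated anywhere in this report, and the sketch is not the kernel's parser.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one injection request; the field names
 * are assumptions for illustration, not taken from this report. */
struct ei_request {
    unsigned int  pe_no; /* PE number (first field) */
    unsigned int  type;  /* error type (second field) */
    unsigned int  func;  /* error function (third field) */
    unsigned long addr;  /* address (fourth field) */
    unsigned long mask;  /* address mask (fifth field) */
};

/* Parse "a:b:c:d:e" (all hex). Returns 0 on success, -1 if fewer
 * than five fields could be read. */
static int parse_ei_request(const char *s, struct ei_request *r)
{
    assert(s != NULL && r != NULL);
    int n = sscanf(s, "%x:%x:%x:%lx:%lx",
                   &r->pe_no, &r->type, &r->func, &r->addr, &r->mask);
    return n == 5 ? 0 : -1;
}
```

Under these assumed field names, the string from step 3 parses to pe_no 0 with func 4 and every other field zero, which is consistent with the "Frozen PE#0 on PHB#4" message in the trace below.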

*Additional Instructions for <email address hidden>:
-Post a private note with access information to the machine that the bug is occurring on.

Call Traces:
root@lep8b:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
[ 271.110859] EEH: Frozen PE#0 on PHB#4 detected
[ 271.110967] EEH: PE location: N/A, PHB location: N/A
0004:00:00.0 0604: 1014:03dc
0
root@lep8b:~# [ 277.108098] Unable to handle kernel paging request for data at address 0x00000010
[ 277.108183] Faulting instruction address: 0xc000000000083c7c
[ 277.108198] Oops: Kernel access of bad area, sig: 11 [#1]
[ 277.108253] SMP NR_CPUS=2048 NUMA PowerNV
[ 277.108310] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables leds_powernv ibmpowernv powernv_rng ipmi_powernv uio_pdrv_genirq ipmi_msghandler uio ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure be2net lpfc vxlan ip6_udp_tunnel udp_tunnel scsi_transport_fc ipr
[ 277.109391] CPU: 9 PID: 973 Comm: eehd Not tainted 4.4.0-34-generic #53-Ubuntu
[ 277.109467] task: c000000feb3c2a20 ti: c000000feb408000 task.ti: c000000feb408000
[ 277.109542] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[ 277.109617] REGS: c000000feb40b760 TRAP: 0300 Not tainted (4.4.0-34-generic)
[ 277.109691] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28008822 XER: 00000000
[ 277.109880] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
GPR00: c000000000083c78 c000000feb40b9e0 c0000000015b5d00 0000000000000000
GPR04: 0000000000000001 c000000feb40bac0 c000002d74b54220 0000000000000f9f
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000026
GPR12: c000000000083c20 c000000007b45580 c0000000000e63d8 c000002d74c40100
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468
GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000
GPR28: c00000000161a3f0 0000000000000001 c000002ffff81000 c0000000fe440000
[ 277.110878] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
[ 277.110931] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
[ 277.110981] Call Trace:
[ 277.111009] [c000000feb40b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 277.111098] [c000000feb40ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 277.111175] [c000000feb40bb00] [c000000000af472c] eeh_reset_device+0xd8/0x228
[ 277.111255] [c000000feb40bba0] [c00000000003c4c0] eeh_handle_normal_event+0x390/0x440
[ 277.111429] [c000000feb40bc20] [c00000000003c964] eeh_handle_event+0x184/0x370
[ 277.111601] [c000000feb40bcd0] [c00000000003cd28] eeh_event_handler+0x1d8/0x1e0
[ 277.111772] [c000000feb40bd80] [c0000000000e64e0] kthread+0x110/0x130
[ 277.111910] [c000000feb40be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 277.112068] Instruction dump:
[ 277.112143] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 277.112385] 419e0054 7fe3fb78 4bfb7065 60000000 <e9230010> 2fa90000 419e00dc e9290010
[ 277.112629] ---[ end trace a6aa80c26ba676f6 ]---
[ 277.116859]
[ 277.116910] Sending IPI to other CPUs
[ 277.118085] IPI complete
[ 277.120271] kexec: waiting for cpu 0 (physical 32) to enter OPAL
 -> smp_release_cpus()
spinning_secondaries = 191
 <- smp_release_cpus()
 <- setup_system()
[ 0.397633] Kernel panic - not syncing: Out of memory and no killable processes...
[ 0.397633]
[ 0.397769] CPU: 4 PID: 1 Comm: swapper/1 Not tainted 4.4.0-34-generic #53-Ubuntu
[ 0.397843] Call Trace:
[ 0.397870] [c00000000c583190] [c000000008af983c] dump_stack+0xb0/0xf0 (unreliable)
[ 0.397959] [c00000000c5831d0] [c000000008af5a70] panic+0x100/0x2c0
[ 0.398035] [c00000000c583260] [c000000008231e04] out_of_memory+0x5e4/0x5f0
[ 0.398114] [c00000000c583310] [c00000000823a434] __alloc_pages_nodemask+0xc54/0xc90
[ 0.398204] [c00000000c583500] [c0000000082a0a6c] alloc_page_interleave+0x6c/0xe0
[ 0.398292] [c00000000c583550] [c0000000082a1558] alloc_pages_current+0x138/0x1a0
[ 0.398381] [c00000000c5835a0] [c00000000822cdcc] __page_cache_alloc+0x11c/0x160
[ 0.398470] [c00000000c5835e0] [c00000000822cf84] pagecache_get_page+0x174/0x2a0
[ 0.398558] [c00000000c583650] [c00000000822d4b4] grab_cache_page_write_begin+0x54/0x80
[ 0.398646] [c00000000c583690] [c00000000831d484] simple_write_begin+0x54/0x180
[ 0.398735] [c00000000c5836e0] [c00000000822ca64] generic_perform_write+0x104/0x280
[ 0.398823] [c00000000c583780] [c00000000822ed08] __generic_file_write_iter+0x208/0x250
[ 0.398912] [c00000000c5837e0] [c00000000822ee40] generic_file_write_iter+0xf0/0x280
[ 0.399000] [c00000000c583830] [c0000000082e1844] new_sync_write+0xc4/0x120
[ 0.399076] [c00000000c5838d0] [c0000000082e2640] vfs_write+0xc0/0x230
[ 0.399152] [c00000000c583920] [c0000000082e367c] SyS_write+0x6c/0x110
[ 0.399229] [c00000000c583970] [c000000008ea700c] xwrite+0x4c/0xb4
[ 0.399305] [c00000000c5839b0] [c000000008ea7164] do_copy+0xf0/0x170
[ 0.399381] [c00000000c5839e0] [c000000008ea6774] write_buffer+0x5c/0x88
[ 0.399458] [c00000000c583a10] [c000000008ea67fc] flush_buffer+0x5c/0xf0
[ 0.399534] [c00000000c583a60] [c000000008eea034] __gunzip+0x378/0x470
[ 0.399610] [c00000000c583ae0] [c000000008ea75ac] unpack_to_rootfs+0x1f8/0x34c
[ 0.399699] [c00000000c583ba0] [c000000008ea7910] populate_rootfs+0x94/0x164
[ 0.399775] [c00000000c583c20] [c00000000800b49c] do_one_initcall+0x12c/0x2a0
[ 0.399852] [c00000000c583cf0] [c000000008ea4204] kernel_init_freeable+0x28c/0x37c
[ 0.399940] [c00000000c583dc0] [c00000000800be0c] kernel_init+0x2c/0x160
[ 0.400016] [c00000000c583e30] [c000000008009538] ret_from_kernel_thread+0x5c/0xa4
[ 0.418756] ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
[ 0.418756]

root@lep8b:~# uname -a
Linux lep8b 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:04:07 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
root@lep8b:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.10 (Yakkety Yak)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.10"
VERSION_ID="16.10"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
UBUNTU_CODENAME=yakkety
root@lep8b:~# update_flash -d
Current firwmare version :
  P side : FW860.00 (SV860_026)
  T side : FW860.00 (SV860_028)
  Boot side : FW860.00 (SV860_028)
root@lep8b:~# cat /sys/firmware/opal/msglog | grep -i skiboot
[45182541432,5] SkiBoot skiboot-5.3.0-rc2 starting...
root@lep8b:~#
root@lep8b:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 RAID bus controller: IBM Obsidian-E PCI-E SCSI controller (rev 01)
0001:00:00.0 PCI bridge: IBM Device 03dc
0001:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8732 32-lane, 8-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
0001:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8732 32-lane, 8-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
0001:02:08.0 PCI bridge: PLX Technology, Inc. PEX 8732 32-lane, 8-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
0001:02:09.0 PCI bridge: PLX Technology, Inc. PEX 8732 32-lane, 8-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
0001:03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0001:03:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0001:03:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0001:03:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0001:04:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) (rev 01)
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 Fibre Channel: Emulex Corporation Lancer-X: LightPulse Fibre Channel Host Adapter (rev 10)
0002:01:00.1 Fibre Channel: Emulex Corporation Lancer-X: LightPulse Fibre Channel Host Adapter (rev 10)
0003:00:00.0 PCI bridge: IBM Device 03dc
0003:01:00.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:01.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:08.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:09.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:10.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:11.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0003:04:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) (rev 01)
0003:05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:05:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:05:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:05:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:0b:00.0 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)
0003:0b:00.1 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0005:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0005:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0005:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0005:01:00.4 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator (Lancer) (rev 10)
0005:01:00.5 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator (Lancer) (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0006:01:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0006:01:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0006:01:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

== Comment: #1 - Milton D. Miller II <email address hidden> - 2016-09-09 19:05:32 ==
The faulting opcode is a load from offset 0x10 off a base register,
and the DAR was 0x10, so the base pointer was NULL.

Disassembly of the printed opcodes shows that an out-of-module call was
made and its result used as a base; the loaded value was compared against
NULL, and then used in turn as a base with the same
16-byte offset.
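
As a cross-check of the two paragraphs above, the word marked <e9230010> in the instruction dump can be decoded by hand. The following is a minimal C sketch, assuming the Power ISA DS-form layout for ld (primary opcode 58, extended opcode 0). The struct and function names are hypothetical and not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a Power ISA DS-form instruction (hypothetical struct name). */
struct ds_form {
    unsigned int opcode; /* primary opcode, top 6 bits */
    unsigned int rt;     /* target register number */
    unsigned int ra;     /* base register number */
    int32_t      disp;   /* signed byte displacement (DS field with low 2 bits cleared) */
    unsigned int xo;     /* extended opcode, low 2 bits */
};

static struct ds_form decode_ds(uint32_t insn)
{
    struct ds_form f;
    int32_t d = (int32_t)(insn & 0xfffc);

    if (d & 0x8000)      /* sign-extend the 16-bit displacement */
        d -= 0x10000;

    f.opcode = insn >> 26;
    f.rt     = (insn >> 21) & 0x1f;
    f.ra     = (insn >> 16) & 0x1f;
    f.disp   = d;
    f.xo     = insn & 0x3;
    return f;
}

/* Effective address of an ld, given the current value of the RA register.
 * For RA = 0 the architecture uses a literal zero base instead of GPR0. */
static uint64_t ld_effective_address(struct ds_form f, uint64_t ra_value)
{
    assert(f.opcode == 58 && f.xo == 0); /* only valid for ld */
    uint64_t base = (f.ra == 0) ? 0 : ra_value;
    return base + (uint64_t)(int64_t)f.disp;
}
```

Decoding 0xe9230010 this way gives ld r9,16(r3). With GPR03 = 0 in the register dump, the effective address is 0x10, which matches the DAR.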

Looking at upstream, eeh_pe_bus_get can return NULL,
and in pnv_eeh_reset both the returned bus and bus->parent are
passed to pci_is_root_bus, which checks the word at offset 16 for NULL.
The parent field comes immediately after a list head, so the offsets line up.

Without looking at the full function disassembly, it would appear that
pnv_eeh_reset needs to handle the case where the bus returned from
eeh_pe_bus_get is NULL before checking whether the bus or its parent is a root bus.
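
A rough illustration of the ordering described above, written as a self-contained C sketch with stand-in types. struct mock_bus, mock_is_root_bus and choose_reset are hypothetical names used only for illustration; this is not the kernel code and not the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a PCI bus: two leading pointers play the role of the
 * list head, so parent sits right after it (offset 16 on 64-bit). */
struct mock_bus {
    struct mock_bus *next;
    struct mock_bus *prev;
    struct mock_bus *parent; /* NULL for a root bus */
};

/* A bus with no parent is a root bus. Calling this with a NULL bus
 * would read offset 16 from address 0, the pattern seen in the oops. */
static bool mock_is_root_bus(const struct mock_bus *bus)
{
    assert(bus != NULL);
    return bus->parent == NULL;
}

enum reset_target { RESET_NONE, RESET_PHB, RESET_BRIDGE };

/* Decide what to reset, checking the lookup result for NULL first. */
static enum reset_target choose_reset(const struct mock_bus *bus)
{
    if (bus == NULL)
        return RESET_NONE; /* bus lookup failed: bail out, do not dereference */

    /* Short-circuit keeps bus->parent from being inspected when bus is a root bus. */
    if (mock_is_root_bus(bus) || mock_is_root_bus(bus->parent))
        return RESET_PHB;

    return RESET_BRIDGE;
}
```

The only point of the sketch is the ordering: the NULL test on the looked-up bus comes before either root-bus check, so a failed lookup becomes an error return rather than a load from address 0x10.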

== Comment: #2 - Russell Currey <email address hidden> - 2016-09-11 21:46:21 ==
Thanks for the details Milton, you're right. I'll write a patch to fix this in EEH and make sure all eeh_pe_bus_get calls check for failure.

== Comment: #3 - Russell Currey <email address hidden> - 2016-09-12 00:19:27 ==

== Comment: #4 - Russell Currey <email address hidden> - 2016-09-12 00:20:25 ==
Attached a patch that should stop the oops, can you test?

Note that not being able to find a bus is still an issue that we need to find the cause of.

== Comment: #5 - Milton D. Miller II <email address hidden> - 2016-09-12 12:36:18 ==
Originator: There is a second problem that the kdump process failed because it ran out of memory.

Please open a second defect to investigate that (unless you are aware of instructions setting up kdump that were not followed).

You should be able to recreate that via echo c > /proc/sysrq-trigger and look for the message:

[ 0.397633] Kernel panic - not syncing: Out of memory and no killable processes...

[note: it appears to have failed unpacking the initrd early in the dump process on your machine. This may be related to the partition definition such as memory size and distribution policy]

== Comment: #6 - Pridhiviraj Paidipeddi <email address hidden> - 2017-04-11 07:15:03 ==
@mamatha
Please create an Ubuntu mirror request for this; the patches are merged upstream.
https://patchwork.ozlabs.org/patch/668552/

Please backport the patches to the respective 16.04.2 / 16.10 kernels.

bugproxy (bugproxy) wrote : eeh_pe_bus_get null check patches

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-144961 severity-high targetmilestone-inin1610
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → kerneloops (Ubuntu)
affects: kerneloops (Ubuntu) → linux (Ubuntu)
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
status: New → Triaged
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-08-01 06:20 EDT-------
Ubuntu 16.10 was supported for only 9 months, up to July 2017, so closing the issue.

Brad Figg (brad-figg)
tags: added: cscc