[Ubuntu 1804][boston][ixgbe] EEH causes kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352 (i2S)

Bug #1776389 reported by bugproxy on 2018-06-12
This bug affects 1 person
Affects                           Importance  Assigned to
The Ubuntu-power-systems project  High        Canonical Kernel Team
linux (Ubuntu)                    High        Canonical Kernel Team
Bionic                            High        Canonical Kernel Team

Bug Description

== Comment: #0 - ABDUL HALEEM <> - 2018-02-16 08:26:15 ==
Problem:
------------
Injecting error multiple times causes kernel crash.

echo 0x0:1:4:0x6000008000000:0xfff80000 > /sys/kernel/debug/powerpc/PCI0000/err_injct
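For readers unfamiliar with the injection string, the sketch below splits it into its five colon-separated hexadecimal fields. The field names (pe_no, err_type, func, addr, mask) are an assumption based on how the PowerNV EEH debugfs interface is commonly described; this report does not state them, so treat the labels as illustrative.

```python
from typing import NamedTuple


class ErrInject(NamedTuple):
    # Field names are assumed, not taken from this bug report.
    pe_no: int
    err_type: int
    func: int
    addr: int
    mask: int


def parse_err_inject(spec: str) -> ErrInject:
    """Split a colon-separated err_injct string into integer fields."""
    parts = spec.strip().split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    # Every field is hexadecimal, with or without a leading 0x.
    return ErrInject(*(int(p, 16) for p in parts))


cmd = parse_err_inject("0x0:1:4:0x6000008000000:0xfff80000")
print(cmd)
```

Parsing the string before writing it to debugfs is a cheap way to catch a malformed field in a test script that injects errors in a loop.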

EEH: PHB#0 failure detected, location: N/A
EEH: PHB#0-PE#0 has failed 6 times in the
last hour and has been permanently disabled.
EEH: Unable to recover from failure from PHB#0-PE#0.
Please try reseating or replacing it
ixgbe 0000:01:00.1: Adapter removed
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache joydev input_leds mac_hid idt_89hpesx ofpart ipmi_powernv cmdlinepart ipmi_devintf ipmi_msghandler at24 powernv_flash mtd opal_prd ibmpowernv uio_pdrv_genirq vmx_crypto uio sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas qla2xxx ast hid_generic ttm drm_kms_helper ixgbe syscopyarea usbhid igb sysfillrect sysimgblt nvme_fc fb_sys_fops hid nvme_fabrics crct10dif_vpmsum crc32c_vpmsum drm i40e scsi_transport_fc aacraid i2c_algo_bit mdio
CPU: 28 PID: 972 Comm: eehd Not tainted 4.15.0-10-generic #11-Ubuntu
NIP: c00000000077f080 LR: c00000000077f070 CTR: c0000000000aac30
REGS: c000000ff1deb5a0 TRAP: 0700 Not tainted (4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002822 XER: 20040000
CFAR: c00000000018bddc SOFTE: 1
GPR00: c00000000077f070 c000000ff1deb820 c0000000016ea600 c000000fbb5fac00
GPR04: 00000000000002c5 0000000000000000 0000000000000000 0000000000000000
GPR08: c000000fbb5fac00 0000000000000001 c000000fec617a00 c000000fdfd86488
GPR12: 0000000000000040 c000000007a33400 c000000000138be8 c000000ff90ec1c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000f48d10
GPR24: c000000000f48ce8 c000200e4fcf4000 c000000fc6900b18 c000200e4fcf4000
GPR28: c000200e4fcf4288 c008000010624480 0000000000000000 c000000fbb633ea0
NIP [c00000000077f080] free_msi_irqs+0xa0/0x260
LR [c00000000077f070] free_msi_irqs+0x90/0x260
Call Trace:
[c000000ff1deb820] [c00000000077f070] free_msi_irqs+0x90/0x260 (unreliable)
[c000000ff1deb880] [c00000000077fa68] pci_disable_msix+0x128/0x170
[c000000ff1deb8c0] [c00800001060b5c8] ixgbe_reset_interrupt_capability+0x90/0xd0 [ixgbe]
[c000000ff1deb8f0] [c0080000105d52f4] ixgbe_remove+0xec/0x240 [ixgbe]
[c000000ff1deb990] [c0000000007670ec] pci_device_remove+0x6c/0x110
[c000000ff1deb9d0] [c00000000085d194] device_release_driver_internal+0x224/0x310
[c000000ff1deba20] [c00000000075b398] pci_stop_bus_device+0x98/0xe0
[c000000ff1deba60] [c00000000075b588] pci_stop_and_remove_bus_device+0x28/0x40
[c000000ff1deba90] [c00000000005e1d0] pci_hp_remove_devices+0x90/0x130
[c000000ff1debb20] [c00000000005e184] pci_hp_remove_devices+0x44/0x130
[c000000ff1debbb0] [c00000000003ec04] eeh_handle_normal_event+0x134/0x580
[c000000ff1debc60] [c00000000003f160] eeh_handle_event+0x30/0x338
[c000000ff1debd10] [c00000000003f830] eeh_event_handler+0x140/0x200
[c000000ff1debdc0] [c000000000138d88] kthread+0x1a8/0x1b0
[c000000ff1debe30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
Instruction dump:
419effe0 3bc00000 4800000c 60420000 807f0010 7c7e1a14 78630020 4ba0cd3d
60000000 e9430158 312affff 7d295110 <0b090000> 813f0014 395e0001 7d5e07b4
---[ end trace 23c446a470e60864 ]---
ixgbe 0000:01:00.0: Adapter removed

Sending IPI to other CPUs
OPAL: Switch to big-endian OS
OPAL: Switch to little-endian OS
PHB#0000[0:0]: eeh_freeze_clear on fenced PHB

---uname output---
Linux ltciofvtr-bostonlc1 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:21:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Boston-LC

0000:00:00.0 PCI bridge [0604]: IBM Device [1014:04c1]
0000:01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)

# ethtool -i enp1s0f0
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x800006da
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

 Userspace tool common name: EEH

== Comment: #6 - Mauro Rodrigues <> - 2018-03-19 11:54:03 ==
Although it probably will not be accepted as is, I'll send a solution upstream.

Long story short: we add ixgbe_free_irq right before ixgbe_clear_interrupt_scheme in ixgbe_remove.
That has a side effect: the crash happens on the hotplug remove path, but with the patch applied the usual removal path (for instance, an unbind through sysfs) frees the interrupts twice.
To avoid that, I'll send a patch that integrates free_irq into the clear interrupt scheme code path.
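The ordering problem described in this comment can be illustrated with a small self-contained model. It is only a sketch: the class and method names are hypothetical stand-ins, not ixgbe driver code and not the patch that was eventually merged. It also assumes that the BUG at msi.c:352 is the check in free_msi_irqs that fires when an MSI vector still has a handler attached.

```python
class KernelBug(Exception):
    """Stands in for a kernel BUG_ON() firing."""


class ToyAdapter:
    """Hypothetical model of a NIC with MSI-X vectors and IRQ handlers."""

    def __init__(self, nvec: int) -> None:
        self.nvec = nvec
        self.handlers = set()        # vectors that still have a handler
        self.irqs_requested = False
        self.msix_enabled = False

    def open(self) -> None:
        self.msix_enabled = True
        self.handlers = set(range(self.nvec))
        self.irqs_requested = True

    def free_irq(self) -> bool:
        # Idempotent: a second call is a no-op instead of a double free.
        if not self.irqs_requested:
            return False
        self.handlers.clear()
        self.irqs_requested = False
        return True

    def disable_msix(self) -> None:
        # Assumed equivalent of the check that triggered the BUG.
        if self.handlers:
            raise KernelBug("vector still has a handler attached")
        self.msix_enabled = False

    def remove(self, free_first: bool) -> None:
        if free_first:
            self.free_irq()
        self.disable_msix()


# Removal without freeing the IRQs first reproduces the failure mode.
broken = ToyAdapter(4)
broken.open()
try:
    broken.remove(free_first=False)
    crashed = False
except KernelBug:
    crashed = True

# Freeing first works even if an earlier path already freed the IRQs.
fixed = ToyAdapter(4)
fixed.open()
fixed.free_irq()
fixed.remove(free_first=True)
print(crashed, fixed.msix_enabled)
```

Tracking whether the IRQs are currently requested is one way to let both the error-recovery removal and the ordinary unbind path call the same teardown code safely.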

== Comment: #8 - Mauro Rodrigues <> - 2018-04-18 12:23:34 ==
waiting for upstream feedback at:
http://patchwork.ozlabs.org/patch/900279/

which reads "ixgbe: Fix free irq process when removing device due to PCI Errors"

== Comment: #9 - Mauro Rodrigues <> - 2018-05-03 11:56:49 ==
The v3 of the patch is going through Intel's queue for further testing
http://patchwork.ozlabs.org/patch/907695/
which reads: "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

== Comment: #11 - Mauro Rodrigues <> - 2018-06-11 10:06:35 ==
This got merged into Torvalds' tree last week and I didn't notice until now.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/intel/ixgbe?id=b212d815e77c72be921979119c715166cc8987b1

which reads:
"ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

I'll submit it to the Canonical mailing list today.

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-164762 severity-high targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g

------- Comment From <email address hidden> 2018-06-12 08:59 EDT-------
I've just sent the fix to the kernel mailing list for review: https://lists.ubuntu.com/archives/kernel-team/2018-June/093281.html

Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → High
Manoj Iyer (manjo) on 2018-06-18
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-19 19:30 EDT-------
The patch has been applied to bionic master-next; see
http://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/commit/drivers/net/ethernet/intel/ixgbe?h=master-next&id=123dad8e7f35b815fdf6d0647b056c096f14d052

Thank you,

Mauro

Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-21 03:33 EDT-------
Verified on 4.15.0-24-generic: the adapter recovers cleanly after error injection, with no Oops messages.

[ 3473.707228] EEH: PHB#2 failure detected, location: N/A
[ 3473.707308] CPU: 96 PID: 20922 Comm: lspci Not tainted 4.15.0-24-generic #26-Ubuntu
[ 3473.707310] Call Trace:
[ 3473.707321] [c0002038006fbb00] [c000000000ce04bc] dump_stack+0xb0/0xf4 (unreliable)
[ 3473.707328] [c0002038006fbb40] [c00000000003ade4] eeh_dev_check_failure+0x234/0x5b0
[ 3473.707335] [c0002038006fbbe0] [c0000000000adc58] pnv_pci_read_config+0x128/0x160
[ 3473.707340] [c0002038006fbc20] [c00000000075d1ac] pci_user_read_config_dword+0x8c/0x180
[ 3473.707345] [c0002038006fbc70] [c0000000007722f4] pci_read_config+0x104/0x2d0
[ 3473.707350] [c0002038006fbcf0] [c0000000004a05f0] sysfs_kf_bin_read+0x70/0xd0
[ 3473.707354] [c0002038006fbd10] [c00000000049f540] kernfs_fop_read+0xe0/0x290
[ 3473.707358] [c0002038006fbd60] [c0000000003d517c] __vfs_read+0x3c/0x70
[ 3473.707361] [c0002038006fbd80] [c0000000003d526c] vfs_read+0xbc/0x1b0
[ 3473.707364] [c0002038006fbdd0] [c0000000003d5ae4] SyS_pread64+0xc4/0xf0
[ 3473.707369] [c0002038006fbe30] [c00000000000b284] system_call+0x58/0x6c
[ 3473.707381] EEH: Detected error on PHB#2
[ 3473.707384] EEH: This PCI device has failed 8 times in the last hour
[ 3473.707385] EEH: Notify device drivers to shutdown
[ 3473.707402] ixgbe 0002:01:00.0: Adapter removed
[ 3473.730202] ixgbe 0002:01:00.1: Adapter removed
[ 3473.752641] EEH: Collect temporary log
[ 3473.752644] PHB4 PHB#2 Diag-data (Version: 1)
[ 3473.752645] brdgCtl: 00000002
[ 3473.752649] RootSts: 00060040 00402000 c1010008 00100107 00004000
[ 3473.752651] RootErrSts: 00000024 00000020 00000000
[ 3473.752653] sourceId: 01000000
[ 3473.752655] nFir: 0000800000000000 0030001c00000000 0000800000000000
[ 3473.752657] PhbSts: 0000001c00000000 0000001c00000000
[ 3473.752659] Lem: 1001000104300100 0000000000000000 1000000000000000
[ 3473.752661] PhbErr: 00000da000000000 0000010000000000 2148000098000240 a008400000000000
[ 3473.752664] PhbTxeErr: 0000000600000000 0000000200000000 0000000000000000 0000000000000000
[ 3473.752666] RxeArbErr: 0000100030000020 0000000000000020 4000010000000000 0000000000000000
[ 3473.752668] RxeMrgErr: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
[ 3473.752670] RegbErr: 00d0000000000000 0010000000000000 4800012c00000000 0000000007000000
[ 3473.752673] PE[000] A/B: a700000300000000 8101000001010000
[ 3473.752677] PE[100] A/B: 8000000000003bfe 80000000300c3de9
[ 3473.752680] EEH: Reset without hotplug activity
[ 3477.113186] EEH: Notify device drivers the completion of reset
[ 3477.113197] ixgbe 0002:01:00.0: enabling device (0140 -> 0142)
[ 3477.174161] ixgbe 0002:01:00.0: pci_cleanup_aer_uncorrect_error_status failed 0xffffffea
[ 3477.174239] ixgbe 0002:01:00.1: enabling device (0140 -> 0142)
[ 3477.238148] ixgbe 0002:01:00.1: pci_cleanup_aer_uncorrect_error_status failed 0xffffffea
[ 3477.238220] EEH: Notify device driver to resume
[ 3477.669705] ixgbe 0002:01:00.0 enP2p1s0f0: detected SFP+: 3
[ 3478.037802] ixgbe 0002:01:00.1 ...
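A check like the manual one in this comment could be scripted. The helper below is a hedged sketch, with a hypothetical function name and marker list: it scans kernel log text for the crash markers seen in the original report and for the EEH resume message seen in the verified log.

```python
import re

# Markers taken from the crash log in the original report.
CRASH_MARKERS = re.compile(r"kernel BUG at|Oops:")
RESUME_MESSAGE = "EEH: Notify device driver to resume"


def summarize_eeh_log(text: str) -> dict:
    """Report crash lines and whether EEH reached the resume step."""
    lines = text.splitlines()
    crashes = [line for line in lines if CRASH_MARKERS.search(line)]
    resumed = any(RESUME_MESSAGE in line for line in lines)
    return {"crashes": crashes, "resumed": resumed}


good = """[ 3473.707402] ixgbe 0002:01:00.0: Adapter removed
[ 3477.238220] EEH: Notify device driver to resume"""

bad = """ixgbe 0000:01:00.1: Adapter removed
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352!
Oops: Exception in kernel mode, sig: 5 [#1]"""

good_summary = summarize_eeh_log(good)
bad_summary = summarize_eeh_log(bad)
print(good_summary["resumed"], len(bad_summary["crashes"]))
```

In practice the input would come from dmesg output captured after each injection; the two samples here are excerpts copied from the logs in this report.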


Changed in ubuntu-power-systems:
status: Triaged → Fix Committed
Manoj Iyer (manjo) on 2018-07-09
Changed in linux (Ubuntu Bionic):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.17.0-6.7

---------------
linux (4.17.0-6.7) cosmic; urgency=medium

  * linux: 4.17.0-6.7 -proposed tracker (LP: #1783396)

  * [Regression] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383:
    comm stress-ng: bg 4705: bad block bitmap checksum (LP: #1781709)
    - SAUCE: Revert "UBUNTU: SAUCE: ext4: fix ext4_validate_inode_bitmap: comm
      stress-ng: Corrupt inode bitmap"
    - SAUCE: ext4: check for allocation block validity with block group locked

  * Cosmic update to 4.17.9 stable release (LP: #1783201)
    - userfaultfd: hugetlbfs: fix userfaultfd_huge_must_wait() pte access
    - mm: hugetlb: yield when prepping struct pages
    - mm: teach dump_page() to correctly output poisoned struct pages
    - PCI / ACPI / PM: Resume bridges w/o drivers on suspend-to-RAM
    - ACPICA: Drop leading newlines from error messages
    - ACPI / battery: Safe unregistering of hooks
    - drm/amdgpu: Make struct amdgpu_atif private to amdgpu_acpi.c
    - tracing: Avoid string overflow
    - tracing: Fix missing return symbol in function_graph output
    - scsi: sg: mitigate read/write abuse
    - scsi: aacraid: Fix PD performance regression over incorrect qd being set
    - scsi: target: Fix truncated PR-in ReadKeys response
    - s390: Correct register corruption in critical section cleanup
    - drbd: fix access after free
    - vfio: Use get_user_pages_longterm correctly
    - ARM: dts: imx51-zii-rdu1: fix touchscreen pinctrl
    - ARM: dts: omap3: Fix am3517 mdio and emac clock references
    - ARM: dts: dra7: Disable metastability workaround for USB2
    - cifs: Fix use after free of a mid_q_entry
    - cifs: Fix memory leak in smb2_set_ea()
    - cifs: Fix slab-out-of-bounds in send_set_info() on SMB2 ACE setting
    - cifs: Fix infinite loop when using hard mount option
    - drm: Use kvzalloc for allocating blob property memory
    - drm/udl: fix display corruption of the last line
    - drm/amdgpu: Add amdgpu_atpx_get_dhandle()
    - drm/amdgpu: Dynamically probe for ATIF handle (v2)
    - jbd2: don't mark block as modified if the handle is out of credits
    - ext4: add corruption check in ext4_xattr_set_entry()
    - ext4: always verify the magic number in xattr blocks
    - ext4: make sure bitmaps and the inode table don't overlap with bg
      descriptors
    - ext4: always check block group bounds in ext4_init_block_bitmap()
    - ext4: only look at the bg_flags field if it is valid
    - ext4: verify the depth of extent tree in ext4_find_extent()
    - ext4: include the illegal physical block in the bad map ext4_error msg
    - ext4: clear i_data in ext4_inode_info when removing inline data
    - ext4: never move the system.data xattr out of the inode body
    - ext4: avoid running out of journal credits when appending to an inline file
    - ext4: add more inode number paranoia checks
    - ext4: add more mount time checks of the superblock
    - ext4: check superblock mapped prior to committing
    - HID: i2c-hid: Fix "incomplete report" noise
    - HID: hiddev: fix potential Spectre v1
    - HID: debug: check length before copy_to_user()
    - HID: core: allow concurrent registr...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-21 18:58 EDT-------
(In reply to comment #20)

Sorry for the delay here; unfortunately I was away and our testers couldn't verify this sooner.

I've verified the fix in Bionic's proposed 4.15.0-33.36. Is this the correct kernel to verify? If so, I'll mark it as verified accordingly.

Hi Mauro,

That is the correct kernel, thanks for verifying it!

I have marked the verification as done on our side.

Thank you.

tags: added: verification-done-bionic
removed: verification-needed-bionic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-08-22 08:43 EDT-------
Great!
Thank you Kleber!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-33.36

---------------
linux (4.15.0-33.36) bionic; urgency=medium

  * linux: 4.15.0-33.36 -proposed tracker (LP: #1787149)

  * RTNL assertion failure on ipvlan (LP: #1776927)
    - ipvlan: drop ipv6 dependency
    - ipvlan: use per device spinlock to protect addrs list updates
    - SAUCE: fix warning from "ipvlan: drop ipv6 dependency"

  * ubuntu_bpf_jit test failed on Bionic s390x systems (LP: #1753941)
    - test_bpf: flag tests that cannot be jited on s390

  * HDMI/DP audio can't work on the laptop of Dell Latitude 5495 (LP: #1782689)
    - drm/nouveau: fix nouveau_dsm_get_client_id()'s return type
    - drm/radeon: fix radeon_atpx_get_client_id()'s return type
    - drm/amdgpu: fix amdgpu_atpx_get_client_id()'s return type
    - platform/x86: apple-gmux: fix gmux_get_client_id()'s return type
    - ALSA: hda: use PCI_BASE_CLASS_DISPLAY to replace PCI_CLASS_DISPLAY_VGA
    - vga_switcheroo: set audio client id according to bound GPU id

  * locking sockets broken due to missing AppArmor socket mediation patches
    (LP: #1780227)
    - UBUNTU SAUCE: apparmor: fix apparmor mediating locking non-fs, unix sockets

  * Update2 for ocxl driver (LP: #1781436)
    - ocxl: Fix page fault handler in case of fault on dying process

  * netns: unable to follow an interface that moves to another netns
    (LP: #1774225)
    - net: core: Expose number of link up/down transitions
    - dev: always advertise the new nsid when the netns iface changes
    - dev: advertise the new ifindex when the netns iface changes

  * [Bionic] Disk IO hangs when using BFQ as io scheduler (LP: #1780066)
    - block, bfq: fix occurrences of request finish method's old name
    - block, bfq: remove batches of confusing ifdefs
    - block, bfq: add requeue-request hook

  * HP ProBook 455 G5 needs mute-led-gpio fixup (LP: #1781763)
    - ALSA: hda: add mute led support for HP ProBook 455 G5

  * [Bionic] bug fixes to improve stability of the ThunderX2 i2c driver
    (LP: #1781476)
    - i2c: xlp9xx: Fix issue seen when updating receive length
    - i2c: xlp9xx: Make sure the transfer size is not more than
      I2C_SMBUS_BLOCK_SIZE

  * x86/kvm: fix LAPIC timer drift when guest uses periodic mode (LP: #1778486)
    - x86/kvm: fix LAPIC timer drift when guest uses periodic mode

  * Please include ax88179_178a and r8152 modules in d-i udeb (LP: #1771823)
    - [Config:] d-i: Add ax88179_178a and r8152 to nic-modules

  * Nvidia fails after switching its mode (LP: #1778658)
    - PCI: Restore config space on runtime resume despite being unbound

  * Kernel error "task zfs:pid blocked for more than 120 seconds" (LP: #1781364)
    - SAUCE: (noup) zfs to 0.7.5-1ubuntu16.3

  * CVE-2018-12232
    - [PATCH 1/1] socket: close race condition between sock_close() and
      sockfs_setattr()

  * CVE-2018-10323
    - xfs: set format back to extents if xfs_bmap_extents_to_btree

  * change front mic location for more lenovo m7/8/9xx machines (LP: #1781316)
    - ALSA: hda/realtek - Fix the problem of two front mics on more machines
    - ALSA: hda/realtek - two more lenovo models need fixup of MIC_LOCATION

  * Cephfs + fscache: unab...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc