Some PCIe errors not surfaced through rasdaemon

Bug #1769730 reported by dann frazier
This bug affects 1 person
Affects          Status         Importance   Assigned to    Milestone
linux (Ubuntu)   Fix Released   Undecided    dann frazier
  Bionic         Fix Released   Undecided    dann frazier

Bug Description

[Impact]
APEI (the ACPI Platform Error Interface) is supposed to report PCIe errors to the AER (Advanced Error Reporting) driver, which surfaces them to userspace. However, we currently pass along only "recoverable" errors and drop errors of other types (e.g. correctable), hiding signs of faulty hardware from the user.

[Test Case]
$ sudo apt install rasdaemon
# On a system that supports ACPI EINJ (dmesg | grep "ACPI: EINJ"), use the attached script to inject a correctable PCIe error.
$ sudo ras-mc-ctl --errors
# There should be an entry for the injected error, as shown below:
No Memory errors.

PCIe AER events:
1 2018-05-07 17:55:46 +0000 Fatal error: Receiver Error

No Extlog errors.

No MCE errors.
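
The attached injection script is not reproduced in this report. As a rough sketch of what such an injection can look like, the following assumes the kernel's EINJ debugfs interface (the error_type, flags, param4 and error_inject files under /sys/kernel/debug/apei/einj) as described in the kernel's EINJ documentation. The EINJ_DIR variable and both function names are illustrative assumptions, and the attached einj-aer.sh may well work differently.

```shell
#!/bin/sh
# Sketch only: inject a correctable PCIe error through the ACPI EINJ debugfs
# interface. EINJ_DIR and the function names are illustrative assumptions,
# not taken from the attached einj-aer.sh.

EINJ_DIR="${EINJ_DIR:-/sys/kernel/debug/apei/einj}"

# Pack "segment:bus:device.function" (hex fields, e.g. 0000:00:00.0) into the
# EINJ param4 layout: segment in bits 31-24, bus in 23-16, device in 15-11,
# function in 10-8.
sbdf_to_param4() {
    seg=${1%%:*}; rest=${1#*:}
    bus=${rest%%:*}; devfn=${rest#*:}
    dev=${devfn%%.*}; fn=${devfn#*.}
    printf '0x%08x\n' $(( (0x$seg << 24) | (0x$bus << 16) | (0x$dev << 11) | (0x$fn << 8) ))
}

# Usage: inject_pcie_correctable 0000:00:00.0
inject_pcie_correctable() {
    # 0x40 is the EINJ error type for "PCI Express Correctable".
    echo 0x40 > "$EINJ_DIR/error_type" || return 1
    # Flag bit 2 (0x4) marks the PCIe SBDF in param4 as valid.
    echo 0x4 > "$EINJ_DIR/flags" || return 1
    sbdf_to_param4 "$1" > "$EINJ_DIR/param4" || return 1
    # Writing to error_inject triggers the injection.
    echo 1 > "$EINJ_DIR/error_inject"
}
```

The functions need root, a mounted debugfs, and firmware that actually advertises the PCIe correctable error type; the target device address is whatever root port is being tested.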

[Fix]
There is a 2-patch upstream fix that addresses this issue and cherry-picks cleanly into Ubuntu. The fix stops artificially limiting the PCIe errors passed down to the AER driver to those that are recoverable.

[Regression Risk]
The above test was run on x86 and ARM platforms to mitigate regression risk.

dann frazier (dannf)
Changed in linux (Ubuntu):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
assignee: nobody → dann frazier (dannf)
Changed in linux (Ubuntu):
assignee: nobody → dann frazier (dannf)
dann frazier (dannf) wrote :
description: updated
dann frazier (dannf)
description: updated
Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
dann frazier (dannf) wrote :

Verification:

ubuntu@awrep3:~$ cat /proc/version
Linux version 4.15.0-23-generic (buildd@bos02-arm64-002) (gcc version 7.3.0 (Ubuntu/Linaro 7.3.0-16ubuntu3)) #25-Ubuntu SMP Wed May 23 17:59:52 UTC 2018
ubuntu@awrep3:~$ sudo ras-mc-ctl --errors
No Memory errors.

PCIe AER events:
1 2018-05-07 17:55:46 +0000 Fatal error: Receiver Error

No Extlog errors.

No MCE errors.

ubuntu@awrep3:~$ sudo ./einj-aer.sh
Injecting PCI Express Correctable Error
[ 782.454317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 782.454321] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 782.454324] {1}[Hardware Error]: event severity: corrected
[ 782.454329] {1}[Hardware Error]: precise tstamp: 2018-05-25 15:02:13
[ 782.454332] {1}[Hardware Error]: Error 0, type: corrected
[ 782.454335] {1}[Hardware Error]: section_type: PCIe error
[ 782.454337] {1}[Hardware Error]: port_type: 4, root port
[ 782.454340] {1}[Hardware Error]: version: 3.0
[ 782.454342] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 782.454345] {1}[Hardware Error]: device_id: 0000:00:00.0
[ 782.454347] {1}[Hardware Error]: slot: 0
[ 782.454349] {1}[Hardware Error]: secondary_bus: 0x01
[ 782.454351] {1}[Hardware Error]: vendor_id: 0x17cb, device_id: 0x0401
[ 782.454354] {1}[Hardware Error]: class_code: 000406
[ 782.454356] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
[ 782.454398] pcieport 0000:00:00.0: aer_status: 0x00000001, aer_mask: 0x0000e000
[ 782.460780] Receiver Error
[ 782.460784] pcieport 0000:00:00.0: aer_layer=Physical Layer, aer_agent=Receiver ID
ubuntu@awrep3:~$ sudo ras-mc-ctl --errors
No Memory errors.

PCIe AER events:
1 2018-05-07 17:55:46 +0000 Fatal error: Receiver Error
2 2018-05-25 15:02:37 +0000 Fatal error: Receiver Error

No Extlog errors.

No MCE errors.
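
The before/after comparison in this verification can also be done mechanically. A minimal sketch, assuming the `ras-mc-ctl --errors` output layout shown in this report (a "PCIe AER events:" heading, one event per line, then a blank line); count_aer_events is a hypothetical helper name, not part of rasdaemon:

```shell
# Count entries under the "PCIe AER events:" heading of `ras-mc-ctl --errors`
# output read from stdin. Prints 0 if the heading is absent.
count_aer_events() {
    awk '
        /^PCIe AER events:/ { in_aer = 1; next }   # section starts here
        in_aer && /^[[:space:]]*$/ { in_aer = 0 }  # a blank line ends it
        in_aer { n++ }                             # one event per line
        END { print n + 0 }
    '
}

# Possible use around an injection (commented out so that sourcing this
# file has no side effects):
#   before=$(sudo ras-mc-ctl --errors | count_aer_events)
#   sudo ./einj-aer.sh
#   after=$(sudo ras-mc-ctl --errors | count_aer_events)
#   [ "$after" -gt "$before" ] && echo "injected error was recorded"
```

On the transcript above, the count would go from 1 before the injection to 2 after it.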

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-23.25

---------------
linux (4.15.0-23.25) bionic; urgency=medium

  * linux: 4.15.0-23.25 -proposed tracker (LP: #1772927)

  * arm64 SDEI support needs trampoline code for KPTI (LP: #1768630)
    - arm64: mmu: add the entry trampolines start/end section markers into
      sections.h
    - arm64: sdei: Add trampoline code for remapping the kernel

  * Some PCIe errors not surfaced through rasdaemon (LP: #1769730)
    - ACPI: APEI: handle PCIe AER errors in separate function
    - ACPI: APEI: call into AER handling regardless of severity

  * qla2xxx: Fix page fault at kmem_cache_alloc_node() (LP: #1770003)
    - scsi: qla2xxx: Fix session cleanup for N2N
    - scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
    - scsi: qla2xxx: Serialize session deletion by using work_lock
    - scsi: qla2xxx: Serialize session free in qlt_free_session_done
    - scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
    - scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
    - scsi: qla2xxx: Prevent relogin trigger from sending too many commands
    - scsi: qla2xxx: Fix double free bug after firmware timeout
    - scsi: qla2xxx: Fixup locking for session deletion

  * Several hisi_sas bug fixes (LP: #1768974)
    - scsi: hisi_sas: dt-bindings: add an property of signal attenuation
    - scsi: hisi_sas: support the property of signal attenuation for v2 hw
    - scsi: hisi_sas: fix the issue of link rate inconsistency
    - scsi: hisi_sas: fix the issue of setting linkrate register
    - scsi: hisi_sas: increase timer expire of internal abort task
    - scsi: hisi_sas: remove unused variable hisi_sas_devices.running_req
    - scsi: hisi_sas: fix return value of hisi_sas_task_prep()
    - scsi: hisi_sas: Code cleanup and minor bug fixes

  * [bionic] machine stuck and bonding not working well when nvmet_rdma module
    is loaded (LP: #1764982)
    - nvmet-rdma: Don't flush system_wq by default during remove_one
    - nvme-rdma: Don't flush delete_wq by default during remove_one

  * Warnings/hang during error handling of SATA disks on SAS controller
    (LP: #1768971)
    - scsi: libsas: defer ata device eh commands to libata

  * Hotplugging a SATA disk into a SAS controller may cause crash (LP: #1768948)
    - ata: do not schedule hot plug if it is a sas host

  * ISST-LTE:pKVM:Ubuntu1804: rcu_sched self-detected stall on CPU follow by CPU
    ATTEMPT TO RE-ENTER FIRMWARE! (LP: #1767927)
    - powerpc/powernv: Handle unknown OPAL errors in opal_nvram_write()
    - powerpc/64s: return more carefully from sreset NMI
    - powerpc/64s: sreset panic if there is no debugger or crash dump handlers

  * fsnotify: Fix fsnotify_mark_connector race (LP: #1765564)
    - fsnotify: Fix fsnotify_mark_connector race

  * Hang on network interface removal in Xen virtual machine (LP: #1771620)
    - xen-netfront: Fix hang on device removal

  * HiSilicon HNS NIC names are truncated in /proc/interrupts (LP: #1765977)
    - net: hns: Avoid action name truncation

  * Ubuntu 18.04 kernel crashed while in degraded mode (LP: #1770849)
    - SAUCE: powerpc/perf: Fix memory allocation for...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released