mce: ras: When injecting a 1-bit ECC error, no MCE log is recorded in dmesg

Bug #1857413 reported by fan jinke on 2019-12-24
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   Undecided    Unassigned
Disco            Fix Released   Undecided    Po-Hsu Lin

Bug Description

== SRU Justification ==
With the 5.0 Disco kernel, no MCE log is recorded in dmesg when a
1-bit ECC error is injected.

== Fix ==
  * 09cbd219 (RAS/CEC: Increment cec_entered under the mutex lock)
  * de0e0624 (RAS/CEC: Check count_threshold unconditionally)

Commit de0e0624 is the actual fix for this issue. 09cbd219 fixes a
race condition and is included so that de0e0624 applies as a clean
cherry-pick.

Both commits have already landed in newer kernels.

== Test ==
A test kernel can be found here:
https://people.canonical.com/~phlin/kernel/lp-1857413-ras-err-msg/

The bug reporter, fan jinke, verified that the patched kernel logs
the error correctly.
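
A minimal sketch of how that verification could be scripted, assuming the dmesg output has already been captured as text. The function names are hypothetical, and the marker strings are simply the ones visible in the log excerpts in this report:

```python
# Sketch only: scan captured dmesg text for the corrected-ECC report
# that the patched kernel is expected to print after error injection.
# The marker strings are taken from the log excerpts in this bug report.

def hardware_error_lines(dmesg_text):
    """Return every dmesg line tagged with [Hardware Error]."""
    return [line for line in dmesg_text.splitlines()
            if "[Hardware Error]" in line]


def has_corrected_ecc_report(dmesg_text):
    """True if both the MCE notice and the DRAM ECC decode are present."""
    lines = hardware_error_lines(dmesg_text)
    logged = any("Machine check events logged" in l for l in lines)
    decoded = any("DRAM ECC error" in l for l in lines)
    return logged and decoded


sample = """\
[ 1561.511210] mce: [Hardware Error]: Machine check events logged
[ 1561.511221] [Hardware Error]: Corrected error, no action required.
[ 1561.511556] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
"""
print(has_corrected_ecc_report(sample))
```

On a test machine, the text passed in would be the output of dmesg collected after the injection; on the unpatched kernel the check is expected to return False because no such lines are logged.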

== Regression Potential ==
Low. The changes are limited to the RAS Correctable Errors Collector,
and the fix has been verified to work as expected.

== Original Bug Report ==
With the Linux kernel, when a 1-bit ECC error is injected, MCE log entries are recorded in dmesg, for example:

[ 1561.511210] mce: [Hardware Error]: Machine check events logged
[ 1561.511221] [Hardware Error]: Corrected error, no action required.
[ 1561.511311] [Hardware Error]: CPU:0 (18:0:2) MC16_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2040000000011b
[ 1561.511388] [Hardware Error]: Error Addr: 0x000000077cd66940
[ 1561.511439] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000010ce0a400d01
[ 1561.511499] [Hardware Error]: Unified Memory Controller Extended Error Code: 0
[ 1561.511556] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
[ 1561.511646] EDAC MC0: 1 CE on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x7fcd66 offset:0x940 grain:0 syndrome:0x10ce)
[ 1561.511648] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
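
The flag list printed next to MC16_STATUS can be cross-checked against the raw value. The sketch below assumes the usual x86 MCA status bit positions (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, TCC 55, SyndV 53, CECC 46, UECC 45, Deferred 44, Poison 43); the table and function name are illustrative and not taken from this report:

```python
# Sketch only: decode the high flag bits of an MCA status value.
# Bit positions are an assumption based on the x86 MCA status layout.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59,
    "AddrV": 58, "PCC": 57, "TCC": 55, "SyndV": 53,
    "CECC": 46, "UECC": 45, "Deferred": 44, "Poison": 43,
}


def decode_mca_status(status):
    """Return the names of the flag bits that are set, highest bit first."""
    return [name for name, bit in STATUS_BITS.items()
            if (status >> bit) & 1]


flags = decode_mca_status(0xdc2040000000011b)
print(flags)
# UC is clear, which matches the "CE" (corrected error) tag in the log line.
```

For the value in the log above this yields Val, Over, En, MiscV, AddrV, SyndV and CECC, which is consistent with the Over|CE|MiscV|AddrV|SyndV|CECC fields printed by the kernel.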

*But there is no such log when using "Ubuntu 18.04.3 LTS"*

The related upstream commit is de0e0624d86ff9fc512dedb297f8978698abf21a.

After merging this commit, the Ubuntu kernel's dmesg records the MCE log as well.
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw----+ 1 root audio 116, 1 Dec 24 17:20 seq
 crw-rw----+ 1 root audio 116, 33 Dec 24 17:20 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.10-0ubuntu27
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 19.04
InstallationDate: Installed on 2019-12-24 (0 days ago)
InstallationMedia: Ubuntu-Server 19.04 "Disco Dingo" - Release amd64 (20190416.1)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
MachineType: Sugon HygonH210
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.0.0-13-generic root=UUID=43f8bc11-d850-4e79-9d14-1232ef50040f ro
ProcVersionSignature: Ubuntu 5.0.0-13.14-generic 5.0.6
RelatedPackageVersions:
 linux-restricted-modules-5.0.0-13-generic N/A
 linux-backports-modules-5.0.0-13-generic N/A
 linux-firmware 1.178
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: disco
Uname: Linux 5.0.0-13-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 03/15/2019
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 210ER119
dmi.board.asset.tag: Default string
dmi.board.name: HygonH210
dmi.board.vendor: Sugon
dmi.board.version: Default string
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 17
dmi.chassis.vendor: Sugon
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr210ER119:bd03/15/2019:svnSugon:pnHygonH210:pvrDefaultstring:rvnSugon:rnHygonH210:rvrDefaultstring:cvnSugon:ct17:cvrDefaultstring:
dmi.product.family: Rack
dmi.product.name: HygonH210
dmi.product.sku: Default string
dmi.product.version: Default string
dmi.sys.vendor: Sugon

fan jinke (fanjinke) on 2019-12-24
description: updated

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1857413

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
fan jinke (fanjinke) wrote :

Ubuntu 19.04 server also has the same problem.
cat /etc/os-release
NAME="Ubuntu"
VERSION="19.04 (Disco Dingo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 19.04"
VERSION_ID="19.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=disco
UBUNTU_CODENAME=disco

uname -a
Linux ubuntu 5.0.0-13-generic #14-Ubuntu SMP Mon Apr 15 14:59:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

fan jinke (fanjinke) wrote : CRDA.txt

apport information

tags: added: apport-collected disco
description: updated

Po-Hsu Lin (cypressyew) wrote :

Hello,
thanks for the bug report and the fix SHA1.
Could you check whether this Disco kernel works: https://people.canonical.com/~phlin/kernel/lp-1857413-ras-err-msg/

It contains the following two commits:
  * 09cbd2197e9291d6a3d3f42873f06ca1f388c1a4
  * de0e0624d86ff9fc512dedb297f8978698abf21a

fan jinke (fanjinke) wrote :

So sorry for the late reply.

The debs at https://people.canonical.com/~phlin/kernel/lp-1857413-ras-err-msg/ work well, and so does Ubuntu 19.10 server.

dmesg log:
[ 316.984470] mce: [Hardware Error]: Machine check events logged
[ 316.984475] [Hardware Error]: Corrected error, no action required.
[ 316.984537] [Hardware Error]: CPU:0 (18:0:2) MC16_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2040000000011b
[ 316.984610] [Hardware Error]: Error Addr: 0x00000007de33d040
[ 316.984654] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000040100a400f00
[ 316.984712] [Hardware Error]: Unified Memory Controller Extended Error Code: 0
[ 316.984765] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
[ 316.984881] WARNING: CPU: 0 PID: 109 at drivers/edac/edac_mc.c:1243 edac_mc_handle_error+0x53f/0x590
[ 316.984883] Modules linked in: msr nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua amd64_edac_mod ipmi_ssif edac_mce_amd kvm_amd ccp kvm irqbypass ipmi_si input_leds ipmi_devintf ipmi_msghandler k10temp mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ast ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci igb drm libahci dca i2c_algo_bit
[ 316.984938] CPU: 0 PID: 109 Comm: kworker/0:2 Not tainted 5.0.0-38-generic #41
[ 316.984939] Hardware name: Sugon HygonH210/HygonH210, BIOS 210ER119 03/15/2019
[ 316.984946] Workqueue: events mce_gen_pool_process
[ 316.984951] RIP: 0010:edac_mc_handle_error+0x53f/0x590
[ 316.984953] Code: 77 6e 20 41 b9 72 79 00 00 49 89 84 24 88 05 00 00 48 8b 45 b8 c7 40 08 6d 65 6d 6f 66 44 89 48 0c c6 40 0e 00 e9 6c fd ff ff <0f> 0b 49 c7 82 b0 06 00 00 01 00 00 00 31 c0 e9 48 fe ff ff 40 84
[ 316.984955] RSP: 0018:ffffb03743b33c68 EFLAGS: 00010246
[ 316.984958] RAX: 0000000000000000 RBX: ffffffff8a9b81f1 RCX: 0000000000000001
[ 316.984959] RDX: 0000000000000000 RSI: ffffffff8a9b81f7 RDI: ffff9e7219335c9a
[ 316.984960] RBP: ffffb03743b33ce8 R08: ffffffff8a973dc8 R09: 000000007568c237
[ 316.984961] R10: ffff9e7219335800 R11: ffff9e7219335c99 R12: 0000000000000002
[ 316.984962] R13: ffff9e7219335c9a R14: ffff9e7219335800 R15: 00000000ffffffff
[ 316.984964] FS: 0000000000000000(0000) GS:ffff9e721d000000(0000) knlGS:0000000000000000
[ 316.984965] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 316.984967] CR2: 000055d228ad3d90 CR3: 0000000852780000 CR4: 00000000003406f0
[ 316.984968] Call Trace:
[ 316.984991] __log_ecc_error+0x62/0x90 [amd64_edac_mod]
[ 316.984995] decode_umc_error+0xac/0x190 [amd64_edac_mod]
[ 316.985002] amd_decode_mce.cold.27+0xa7c/0xa81 [edac_mce_amd]
[ 316.985011] notifier_call_chain+0x4c/0x70
[ 316.985014] blocking_notifier_call_chain+0x43/0x60
[ 316.985016] mce_gen_pool_process+0x41/0x70
[ 316.985023] process_one_work+0x20f/0x410
[ 316.985025] worker_thread+0x34/0x400
[ 316.985028] kthread+0x120/0x140
[ 316.985031] ? process_one_work+...


Po-Hsu Lin (cypressyew) on 2019-12-31
Changed in linux (Ubuntu Disco):
status: New → In Progress
assignee: nobody → Po-Hsu Lin (cypressyew)
Po-Hsu Lin (cypressyew) on 2019-12-31
Changed in linux (Ubuntu):
status: Incomplete → Fix Released
Po-Hsu Lin (cypressyew) wrote :

Thanks for testing!
I will SRU this to the Disco kernel.

https://lists.ubuntu.com/archives/kernel-team/2019-December/106569.html

description: updated
Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-disco' to 'verification-done-disco'. If the problem still exists, change the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-disco
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.0.0-40.44

---------------
linux (5.0.0-40.44) disco; urgency=medium

  * disco/linux: 5.0.0-40.44 -proposed tracker (LP: #1859724)

  * use-after-free in i915_ppgtt_close (LP: #1859522) // CVE-2020-7053
    - SAUCE: drm/i915: Fix use-after-free when destroying GEM context

  * CVE-2019-14615
    - drm/i915/gen9: Clear residual context state on context switch

  * System hang with kernel traces while entering reboot process on a Disco
    ARM64 moonshot node (LP: #1859582)
    - Revert "RDMA/cm: Fix memory leak in cm_add/remove_one"

linux (5.0.0-39.43) disco; urgency=medium

  * disco/linux: 5.0.0-39.43 -proposed tracker (LP: #1858547)

  * [Regression] usb usb2-port2: Cannot enable. Maybe the USB cable is bad?
    (LP: #1856608)
    - SAUCE: Revert "usb: handle warm-reset port requests on hub resume"

  * PAN is broken for execute-only user mappings on ARMv8 (LP: #1858815)
    - arm64: Revert support for execute-only user mappings

  * Fix unusable USB hub on Dell TB16 after S3 (LP: #1855312)
    - SAUCE: USB: core: Make port power cycle a seperate helper function
    - SAUCE: USB: core: Attempt power cycle port when it's in eSS.Disabled state

  * [sas-1126]scsi: hisi_sas: Fix out of bound at debug_I_T_nexus_reset()
    (LP: #1853992)
    - scsi: hisi_sas: Fix out of bound at debug_I_T_nexus_reset()

  * [sas-1126]scsi: hisi_sas: Assign NCQ tag for all NCQ commands (LP: #1853995)
    - scsi: hisi_sas: Assign NCQ tag for all NCQ commands

  * [sas-1126]scsi: hisi_sas: Fix the conflict between device gone and host
    reset (LP: #1853997)
    - scsi: hisi_sas: Fix the conflict between device gone and host reset

  * scsi: hisi_sas: Check sas_port before using it (LP: #1855952)
    - scsi: hisi_sas: Check sas_port before using it

  * CVE-2019-18885
    - btrfs: refactor btrfs_find_device() take fs_devices as argument
    - btrfs: merge btrfs_find_device and find_device

  * Integrate Intel SGX driver into linux-azure (LP: #1844245)
    - [Packaging] Add systemd service to load intel_sgx

  * [SRU][B/OEM-B/OEM-OSP1/D/E/F] Add LG I2C touchscreen multitouch support
    (LP: #1857541)
    - SAUCE: HID: multitouch: Add LG MELF0410 I2C touchscreen support

  * cifs: DFS Caching feature causing problems traversing multi-tier DFS setups
    (LP: #1854887)
    - cifs: Fix retrieval of DFS referrals in cifs_mount()

  * qede driver causes 100% CPU load (LP: #1855409)
    - qede: Handle infinite driver spinning for Tx timestamp.

  * [roce-1126]RDMA/hns: bugfix for slab-out-of-bounds when loading hip08 driver
    (LP: #1853989)
    - RDMA/hns: Bugfix for slab-out-of-bounds when unloading hip08 driver
    - RDMA/hns: bugfix for slab-out-of-bounds when loading hip08 driver

  * [roce-1126]RDMA/hns: Fixs hw access invalid dma memory error (LP: #1853990)
    - RDMA/hns: Fixs hw access invalid dma memory error

  * [hns-1126]net: hns3: revert to old channel when setting new channel num fail
    (LP: #1853983)
    - net: hns3: revert to old channel when setting new channel num fail

  * [hns-1126]net: hns3: fix port setting handle for fibre port
    (LP: #1853984)
    - net: hns3: fix port setting handle for fibre...

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released