new platform support: Intel SkyLake, AMD Scalable MCA

Bug #1871965 reported by dann frazier
This bug affects 1 person
Affects              Status         Importance   Assigned to    Milestone
rasdaemon (Ubuntu)   Fix Released   Undecided    Unassigned
  Bionic             Fix Released   Undecided    dann frazier
  Eoan               Fix Released   Undecided    dann frazier
  Focal              Fix Released   Undecided    Unassigned

Bug Description

[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.

[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject

EINJ=/sys/kernel/debug/mce-inject

# See /sys/kernel/debug/mce-inject/README

echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
# Writing the bank number triggers the injection
echo 0 > $EINJ/bank

# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10

For Skylake, I regression tested using mce-inject with its "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https://github.com/andikleen/mce-inject
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
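When repeating this check across several machines, it can help to count the logged events instead of reading the output by eye. A minimal sketch, assuming the output keeps the layout shown above (an "MCE events:" heading followed by numbered lines); count_mce_events is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: count the numbered entries under the "MCE events:" heading
# of `ras-mc-ctl --errors` output read from stdin.
# Assumes the layout shown in this report; other versions may differ.
count_mce_events() {
    awk '
        /^MCE events:/ { in_mce = 1; next }   # start of the MCE section
        in_mce && /^[0-9]+ / { n++ }          # one numbered line per event
        END { print n + 0 }                   # print 0 when nothing matched
    '
}

# Usage on a live system (requires rasdaemon):
#   sudo ras-mc-ctl --errors | count_mce_events
```

After running the "corrected" test once on a clean database, the expected count is 2.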

[Fix]
https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33

[Regression Risk]
The new code should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause, for example, a crash in rasdaemon. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support, as I don't have that hardware. That patch is a trivial "do the same thing as AMD Zen" change, as it is a derivative platform.

dann frazier (dannf)
Changed in rasdaemon (Ubuntu Focal):
status: New → Fix Released
dann frazier (dannf)
Changed in rasdaemon (Ubuntu Eoan):
assignee: nobody → dann frazier (dannf)
Changed in rasdaemon (Ubuntu Bionic):
assignee: nobody → dann frazier (dannf)
dann frazier (dannf)
description: updated
Changed in rasdaemon (Ubuntu Eoan):
status: New → In Progress
Changed in rasdaemon (Ubuntu Bionic):
status: New → In Progress
dann frazier (dannf)
description: updated
description: updated
description: updated
Brian Murray (brian-murray) wrote : Please test proposed package

Hello dann, or anyone else affected,

Accepted rasdaemon into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rasdaemon/0.6.0-1.2ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in rasdaemon (Ubuntu Eoan):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-eoan
Brian Murray (brian-murray) wrote :

Hello dann, or anyone else affected,

Accepted rasdaemon into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rasdaemon/0.6.0-1ubuntu0.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in rasdaemon (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
dann frazier (dannf) wrote :

= eoan verification =
== AMD EPYC ==
# dmesg
[ 2188.007718] mce: [Hardware Error]: Machine check events logged
[ 2188.007722] [Hardware Error]: Deferred error, no action required.
[ 2188.009446] [Hardware Error]: CPU:0 (17:31:0) MC0_STATUS[-|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0x9c2030000000011b
[ 2188.011080] [Hardware Error]: Error Addr: 0x000000035dd8bfc0
[ 2188.012685] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000030b404000
[ 2188.014282] [Hardware Error]: Load Store Unit Ext. Error Code: 0, Load queue parity error.
[ 2188.015912] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2020-04-15 17:40:23 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e974708, cpuid=0x00830f10
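The "cache level / tx / mem-tx" line in the dmesg output above comes from the low 16 bits of the status value. The sketch below shows one way to split them, assuming the x86 MCA memory-hierarchy error-code form 0000 0001 RRRR TTLL. The helper name decode_mem_errcode and the label tables are assumptions chosen to match the kernel's wording seen here; they are not taken from rasdaemon.

```shell
#!/bin/bash
# Sketch: decode the low 16 bits of an MCA status value, assuming the
# memory-hierarchy error-code form 0000 0001 RRRR TTLL.
# decode_mem_errcode and the label tables are illustrative assumptions.
decode_mem_errcode() {
    local ec=$(( $1 & 0xffff ))
    local -a ll=(RESV L1 L2 L3/GEN)                   # LL: cache level
    local -a tt=(INSN DATA GEN RESV)                  # TT: transaction type
    local -a rrrr=(GEN RD WR DRD DWR IRD PRF EV SNP)  # RRRR: memory transaction
    if (( (ec & 0xff00) != 0x0100 )); then
        echo "not a memory-hierarchy error code"
        return 1
    fi
    local l=$(( ec & 3 ))
    local t=$(( (ec >> 2) & 3 ))
    local r=$(( (ec >> 4) & 0xf ))
    echo "cache level: ${ll[l]}, tx: ${tt[t]}, mem-tx: ${rrrr[r]:-RESV}"
}

# The status value from the verification above
decode_mem_errcode 0x9c2030000000011b
```

For the status used in this verification, the three fields come out the same as in the last dmesg line.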

== SkyLake ==
$ sudo ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2020-04-15 18:20:30 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e97506f, cpuid=0x00050654, bank=0x00000001
2 2020-04-15 18:20:30 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e97506f, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002

tags: added: verification-done-eoan
removed: verification-needed-eoan
dann frazier (dannf) wrote :

= bionic verification =

== AMD EPYC ==

# dmesg
[ 631.470101] mce: [Hardware Error]: Machine check events logged
[ 631.470104] [Hardware Error]: Deferred error, no action required.
[ 631.470153] [Hardware Error]: CPU:0 (17:31:0) MC0_STATUS[-|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0x9c2030000000011b
[ 631.470213] [Hardware Error]: Error Addr: 0x000000035dd8bfc0
[ 631.470245] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000030b404000
[ 631.470287] [Hardware Error]: Load Store Unit Ext. Error Code: 0, Load queue parity error.
[ 631.470332] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2020-04-15 18:06:00 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e974d09, cpuid=0x00830f10

== Skylake ==
$ sudo ./mce-inject < test/corrected
$ dmesg | tail
[ 18.176600] EXT4-fs (sda2): resizing filesystem from 97545216 to 97546513 blocks
[ 18.176939] EXT4-fs (sda2): resized filesystem to 97546513
[ 19.097678] new mount options do not match the existing superblock, will be ignored
[ 3952.080562] mce: Machine check injector initialized
[ 3960.953025] mce: Starting machine check poll CPU 0
[ 3960.953063] mce: Machine check poll done on CPU 0
[ 3960.953174] mce: [Hardware Error]: Machine check events logged
[ 3960.953328] mce: Starting machine check poll CPU 1
[ 3960.953360] mce: Machine check poll done on CPU 1
[ 3960.953378] mce: [Hardware Error]: Machine check events logged
$ sudo ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2020-04-15 19:41:29 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e976369, cpuid=0x00050654, bank=0x00000001
2 2020-04-15 19:41:29 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e976369, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for rasdaemon has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rasdaemon - 0.6.0-1ubuntu0.2

---------------
rasdaemon (0.6.0-1ubuntu0.2) bionic; urgency=medium

  * Add support for Intel SkyLake and AMD Scalable MCA platforms.
    (LP: #1871965)

 -- dann frazier <email address hidden> Mon, 13 Apr 2020 18:37:29 -0600

Changed in rasdaemon (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rasdaemon - 0.6.0-1.2ubuntu0.1

---------------
rasdaemon (0.6.0-1.2ubuntu0.1) eoan; urgency=medium

  * Add support for Intel SkyLake and AMD Scalable MCA platforms.
    (LP: #1871965)

 -- dann frazier <email address hidden> Mon, 13 Apr 2020 18:40:10 -0600

Changed in rasdaemon (Ubuntu Eoan):
status: Fix Committed → Fix Released