new platform support: Intel SkyLake, AMD Scalable MCA
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
rasdaemon (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Fix Released | Undecided | dann frazier |
Eoan | Fix Released | Undecided | dann frazier |
Focal | Fix Released | Undecided | Unassigned |
Bug Description
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject
EINJ=/sys/
# See /sys/kernel/
echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
echo 0 > $EINJ/bank
# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=
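To sanity-check that the injected status value matches the decoded text ("Deferred error", "UECC"), the flag bits can be pulled out by hand. This is only a rough sketch: `bit` and `decode_status` are hypothetical helpers, not part of rasdaemon, and the bit positions are assumptions based on the Linux kernel's MCI_STATUS_* definitions and the AMD SMCA status layout, so verify them against the processor documentation before relying on them.

```shell
#!/bin/bash
# bit HEX N: print bit N (0-63) of a 64-bit hex value.
# The value is split into two 32-bit halves so the shell's signed
# 64-bit arithmetic never sees a constant with bit 63 set.
bit() {
    local hex hi lo
    hex=${1#0x}
    hex=$(printf '%16s' "$hex" | tr ' ' '0')   # left-pad to 16 hex digits
    hi=$(printf '%s' "$hex" | cut -c1-8)       # bits 63..32
    lo=$(printf '%s' "$hex" | cut -c9-16)      # bits 31..0
    if [ "$2" -ge 32 ]; then
        echo $(( (0x$hi >> ($2 - 32)) & 1 ))
    else
        echo $(( (0x$lo >> $2) & 1 ))
    fi
}

# decode_status HEX: print selected status flags.
# Assumed bit positions: 63 VAL, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC,
# 53 SYNDV, 46 CECC, 45 UECC, 44 DEFERRED, 43 POISON.
decode_status() {
    local s=$1
    printf 'VAL=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
        "$(bit "$s" 63)" "$(bit "$s" 61)" "$(bit "$s" 60)" \
        "$(bit "$s" 59)" "$(bit "$s" 58)" "$(bit "$s" 57)"
    printf 'SYNDV=%d CECC=%d UECC=%d DEFERRED=%d POISON=%d\n' \
        "$(bit "$s" 53)" "$(bit "$s" 46)" "$(bit "$s" 45)" \
        "$(bit "$s" 44)" "$(bit "$s" 43)"
}

# The status value written to $EINJ/status in the test case.
decode_status 0x9c2030000000011b
```

Under those assumptions, the value from the test case has DEFERRED and UECC set and UC clear, which lines up with the "Deferred error" and "UECC" strings in the ras-mc-ctl output.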
For Skylake, I regression-tested using mce-test with the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https:/
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=
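For scripted verification, the number of recorded MCE events can be counted from that output. A minimal sketch follows, assuming the "MCE events:" section is the last one printed, as in the output above; `count_mce_events` is a hypothetical helper, not something shipped with rasdaemon.

```shell
#!/bin/bash
# count_mce_events: read `ras-mc-ctl --errors` style output on stdin and
# print the number of non-empty lines after the "MCE events:" header.
# Assumption: no other section follows the MCE events list.
count_mce_events() {
    awk '
        /^MCE events:/ { in_mce = 1; next }   # start counting after the header
        in_mce && NF   { n++ }                # count each non-empty event line
        END            { print n + 0 }        # prints 0 if the header never appeared
    '
}
```

It would be used as `sudo ras-mc-ctl --errors | count_mce_events`, comparing the count before and after the injection.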
[Fix]
https:/
https:/
https:/
https:/
[Regression Risk]
The new code should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause, e.g., a crash in rasdaemon. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support, as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform.
Changed in rasdaemon (Ubuntu Focal):
status: New → Fix Released
Changed in rasdaemon (Ubuntu Eoan):
assignee: nobody → dann frazier (dannf)
Changed in rasdaemon (Ubuntu Bionic):
assignee: nobody → dann frazier (dannf)
description: updated
Changed in rasdaemon (Ubuntu Eoan):
status: New → In Progress
Changed in rasdaemon (Ubuntu Bionic):
status: New → In Progress
description: updated
description: updated
description: updated
Hello dann, or anyone else affected,
Accepted rasdaemon into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rasdaemon/0.6.0-1.2ubuntu0.1 in a few hours, and then in the -proposed repository.
Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.
Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!
N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.