2020-04-09 23:36:47 |
dann frazier |
bug |
|
|
added bug |
2020-04-09 23:39:23 |
dann frazier |
nominated for series |
|
Ubuntu Eoan |
|
2020-04-09 23:39:23 |
dann frazier |
bug task added |
|
rasdaemon (Ubuntu Eoan) |
|
2020-04-09 23:39:23 |
dann frazier |
nominated for series |
|
Ubuntu Focal |
|
2020-04-09 23:39:23 |
dann frazier |
bug task added |
|
rasdaemon (Ubuntu Focal) |
|
2020-04-09 23:39:23 |
dann frazier |
nominated for series |
|
Ubuntu Bionic |
|
2020-04-09 23:39:23 |
dann frazier |
bug task added |
|
rasdaemon (Ubuntu Bionic) |
|
2020-04-09 23:39:31 |
dann frazier |
rasdaemon (Ubuntu Focal): status |
New |
Fix Released |
|
2020-04-10 00:35:54 |
dann frazier |
rasdaemon (Ubuntu Eoan): assignee |
|
dann frazier (dannf) |
|
2020-04-10 00:35:56 |
dann frazier |
rasdaemon (Ubuntu Bionic): assignee |
|
dann frazier (dannf) |
|
2020-04-14 00:23:49 |
dann frazier |
description |
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
[Fix]
[Regression Risk] |
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject
EINJ=/sys/kernel/debug/mce-inject
# See /sys/kernel/debug/mce-inject/README
echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
echo 0 > $EINJ/bank
# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https://github.com/andikleen/mce-inject
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
[Regression Risk]
The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. |
|
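The "Wait for MCE to appear in dmesg" step in the test case above can be automated with a small polling helper. The following is only a sketch: the function name wait_for_status and its timeout argument are hypothetical, and it assumes the command being polled prints event lines containing "status=<value>," as in the sample ras-mc-ctl output.

```shell
#!/bin/bash
# Hypothetical helper, not part of rasdaemon or the original test case.
# wait_for_status STATUS TIMEOUT CMD...
# Runs CMD about once per second until its output contains
# "status=STATUS," or until TIMEOUT seconds have passed.
wait_for_status() {
    local status=$1 timeout=$2 i
    shift 2
    for ((i = 0; i <= timeout; i++)); do
        # grep -F treats the pattern as a fixed string, not a regex.
        if "$@" 2>/dev/null | grep -qF "status=$status,"; then
            return 0
        fi
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$timeout" ]; then
            sleep 1
        fi
    done
    return 1
}

# Intended use on a test system (not executed here):
#   wait_for_status 0x9c2030000000011b 30 sudo ras-mc-ctl --errors
```

The helper returns 0 as soon as the event is recorded and 1 on timeout, so it can gate the rest of a verification script.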
2020-04-14 00:23:59 |
dann frazier |
rasdaemon (Ubuntu Eoan): status |
New |
In Progress |
|
2020-04-14 00:24:03 |
dann frazier |
rasdaemon (Ubuntu Bionic): status |
New |
In Progress |
|
2020-04-14 00:34:41 |
dann frazier |
description |
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject
EINJ=/sys/kernel/debug/mce-inject
# See /sys/kernel/debug/mce-inject/README
echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
echo 0 > $EINJ/bank
# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https://github.com/andikleen/mce-inject
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
[Regression Risk]
The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. |
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject
EINJ=/sys/kernel/debug/mce-inject
# See /sys/kernel/debug/mce-inject/README
echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
echo 0 > $EINJ/bank
# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https://github.com/andikleen/mce-inject
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
[Regression Risk]
The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform. |
|
2020-04-14 00:35:42 |
dann frazier |
description |
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject
EINJ=/sys/kernel/debug/mce-inject
# See /sys/kernel/debug/mce-inject/README
echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
echo 0 > $EINJ/bank
# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https://github.com/andikleen/mce-inject
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
[Regression Risk]
The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform. |
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject
EINJ=/sys/kernel/debug/mce-inject
# See /sys/kernel/debug/mce-inject/README
echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
echo 0 > $EINJ/bank
# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https://github.com/andikleen/mce-inject
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
[Fix]
https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33
[Regression Risk]
The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform. |
|
2020-04-14 00:36:57 |
dann frazier |
description |
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject
EINJ=/sys/kernel/debug/mce-inject
# See /sys/kernel/debug/mce-inject/README
echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
echo 0 > $EINJ/bank
# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https://github.com/andikleen/mce-inject
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
[Fix]
https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33
[Regression Risk]
The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform. |
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.
[Test Case]
On an AMD SMCA-capable system:
#!/bin/bash
modprobe mce-inject
EINJ=/sys/kernel/debug/mce-inject
# See /sys/kernel/debug/mce-inject/README
echo hw > $EINJ/flags
echo 0x9c2030000000011b > $EINJ/status
echo 0x040000035dd8bfc0 > $EINJ/addr
echo 0x0000c2030b404000 > $EINJ/synd
echo 0 > $EINJ/bank
# Wait for MCE to appear in dmesg
sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
git clone https://github.com/andikleen/mce-inject
cd mce-inject
make
sudo ./mce-inject < test/corrected
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE events:
1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
[Fix]
https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33
[Regression Risk]
The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause, e.g., a crash in rasdaemon. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform. |
|
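As a cross-check on the expected output in the test case above, the injected status value can be decoded by hand. The sketch below is illustrative only: the helper names are hypothetical, and the bit positions are an assumption based on the AMD MCA_STATUS layout (Val 63, En 60, MiscV 59, AddrV 58, SyndV 53, CECC 46, UECC 45, Deferred 44, Poison 43), not taken from the rasdaemon sources.

```shell
#!/bin/bash
# Illustrative sketch; bit positions are assumed, see the note above.

# status_bit HEX N -> prints 1 if bit N (0..63) of HEX is set, else 0.
status_bit() {
    local hex=${1#0x}
    # Left-pad to 16 hex digits and split into two 32-bit halves, which
    # avoids signed 64-bit overflow in shell arithmetic.
    hex=$(printf '%16s' "$hex" | tr ' ' '0')
    local hi=$((16#${hex:0:8})) lo=$((16#${hex:8:8}))
    if [ "$2" -ge 32 ]; then
        echo $(( (hi >> ($2 - 32)) & 1 ))
    else
        echo $(( (lo >> $2) & 1 ))
    fi
}

# decode_status HEX -> prints the names of the set flags, space separated.
decode_status() {
    local out="" entry
    for entry in 63:VAL 62:OVER 61:UC 60:EN 59:MISCV 58:ADDRV 57:PCC \
                 53:SYNDV 46:CECC 45:UECC 44:DEFERRED 43:POISON; do
        if [ "$(status_bit "$1" "${entry%%:*}")" -eq 1 ]; then
            out="$out ${entry#*:}"
        fi
    done
    echo "${out# }"
}

decode_status 0x9c2030000000011b
```

Under these assumptions the call should print "VAL EN MISCV ADDRV SYNDV UECC DEFERRED", which is consistent with the "Deferred error" and "mci UECC" fields in the sample ras-mc-ctl output.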
2020-04-14 21:56:10 |
Brian Murray |
rasdaemon (Ubuntu Eoan): status |
In Progress |
Fix Committed |
|
2020-04-14 21:56:14 |
Brian Murray |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2020-04-14 21:56:15 |
Brian Murray |
bug |
|
|
added subscriber SRU Verification |
2020-04-14 21:56:19 |
Brian Murray |
tags |
|
verification-needed verification-needed-eoan |
|
2020-04-14 22:06:16 |
Brian Murray |
rasdaemon (Ubuntu Bionic): status |
In Progress |
Fix Committed |
|
2020-04-14 22:06:24 |
Brian Murray |
tags |
verification-needed verification-needed-eoan |
verification-needed verification-needed-bionic verification-needed-eoan |
|
2020-04-15 18:20:59 |
dann frazier |
tags |
verification-needed verification-needed-bionic verification-needed-eoan |
verification-done-eoan verification-needed verification-needed-bionic |
|
2020-04-15 19:42:43 |
dann frazier |
tags |
verification-done-eoan verification-needed verification-needed-bionic |
verification-done verification-done-bionic verification-done-eoan |
|
2020-04-22 01:34:18 |
Chris Halse Rogers |
removed subscriber Ubuntu Stable Release Updates Team |
|
|
|
2020-04-22 01:35:30 |
Launchpad Janitor |
rasdaemon (Ubuntu Bionic): status |
Fix Committed |
Fix Released |
|
2020-04-22 01:49:36 |
Launchpad Janitor |
rasdaemon (Ubuntu Eoan): status |
Fix Committed |
Fix Released |
|