Activity log for bug #1871965

Date Who What changed Old value New value Message
2020-04-09 23:36:47 dann frazier bug added bug
2020-04-09 23:39:23 dann frazier nominated for series Ubuntu Eoan
2020-04-09 23:39:23 dann frazier bug task added rasdaemon (Ubuntu Eoan)
2020-04-09 23:39:23 dann frazier nominated for series Ubuntu Focal
2020-04-09 23:39:23 dann frazier bug task added rasdaemon (Ubuntu Focal)
2020-04-09 23:39:23 dann frazier nominated for series Ubuntu Bionic
2020-04-09 23:39:23 dann frazier bug task added rasdaemon (Ubuntu Bionic)
2020-04-09 23:39:31 dann frazier rasdaemon (Ubuntu Focal): status New Fix Released
2020-04-10 00:35:54 dann frazier rasdaemon (Ubuntu Eoan): assignee dann frazier (dannf)
2020-04-10 00:35:56 dann frazier rasdaemon (Ubuntu Bionic): assignee dann frazier (dannf)
2020-04-14 00:23:49 dann frazier description [Impact] rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform. [Test Case] [Fix] [Regression Risk] [Impact] rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform. [Test Case] On an AMD SMCA-capable system: #!/bin/bash modprobe mce-inject EINJ=/sys/kernel/debug/mce-inject # See /sys/kernel/debug/mce-inject/README echo hw > $EINJ/flags echo 0x9c2030000000011b > $EINJ/status echo 0x040000035dd8bfc0 > $EINJ/addr echo 0x0000c2030b404000 > $EINJ/synd echo 0 > $EINJ/bank # Wait for MCE to appear in dmesg sudo ras-mc-ctl --errors There should be a new MCE event in the output: 1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10 For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there. git clone https://github.com/andikleen/mce-inject cd mce-inject make sudo ./mce-inject < test/corrected sudo ras-mc-ctl --errors No Memory errors. No PCIe AER errors. No Extlog errors. MCE events: 1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001 2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002 [Regression Risk] The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. 
On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms.
2020-04-14 00:23:59 dann frazier rasdaemon (Ubuntu Eoan): status New In Progress
2020-04-14 00:24:03 dann frazier rasdaemon (Ubuntu Bionic): status New In Progress
2020-04-14 00:34:41 dann frazier description [Impact] rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform. [Test Case] On an AMD SMCA-capable system: #!/bin/bash modprobe mce-inject EINJ=/sys/kernel/debug/mce-inject # See /sys/kernel/debug/mce-inject/README echo hw > $EINJ/flags echo 0x9c2030000000011b > $EINJ/status echo 0x040000035dd8bfc0 > $EINJ/addr echo 0x0000c2030b404000 > $EINJ/synd echo 0 > $EINJ/bank # Wait for MCE to appear in dmesg sudo ras-mc-ctl --errors There should be a new MCE event in the output: 1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10 For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there. git clone https://github.com/andikleen/mce-inject cd mce-inject make sudo ./mce-inject < test/corrected sudo ras-mc-ctl --errors No Memory errors. No PCIe AER errors. No Extlog errors. MCE events: 1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001 2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002 [Regression Risk] The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. 
[Impact] rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform. [Test Case] On an AMD SMCA-capable system: #!/bin/bash modprobe mce-inject EINJ=/sys/kernel/debug/mce-inject # See /sys/kernel/debug/mce-inject/README echo hw > $EINJ/flags echo 0x9c2030000000011b > $EINJ/status echo 0x040000035dd8bfc0 > $EINJ/addr echo 0x0000c2030b404000 > $EINJ/synd echo 0 > $EINJ/bank # Wait for MCE to appear in dmesg sudo ras-mc-ctl --errors There should be a new MCE event in the output: 1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10 For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there. git clone https://github.com/andikleen/mce-inject cd mce-inject make sudo ./mce-inject < test/corrected sudo ras-mc-ctl --errors No Memory errors. No PCIe AER errors. No Extlog errors. MCE events: 1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001 2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002 [Regression Risk] The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. 
Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform.
2020-04-14 00:35:42 dann frazier description [Impact] rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform. [Test Case] On an AMD SMCA-capable system: #!/bin/bash modprobe mce-inject EINJ=/sys/kernel/debug/mce-inject # See /sys/kernel/debug/mce-inject/README echo hw > $EINJ/flags echo 0x9c2030000000011b > $EINJ/status echo 0x040000035dd8bfc0 > $EINJ/addr echo 0x0000c2030b404000 > $EINJ/synd echo 0 > $EINJ/bank # Wait for MCE to appear in dmesg sudo ras-mc-ctl --errors There should be a new MCE event in the output: 1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10 For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there. git clone https://github.com/andikleen/mce-inject cd mce-inject make sudo ./mce-inject < test/corrected sudo ras-mc-ctl --errors No Memory errors. No PCIe AER errors. No Extlog errors. MCE events: 1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001 2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002 [Regression Risk] The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. 
Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform. [Impact] rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform. [Test Case] On an AMD SMCA-capable system: #!/bin/bash modprobe mce-inject EINJ=/sys/kernel/debug/mce-inject # See /sys/kernel/debug/mce-inject/README echo hw > $EINJ/flags echo 0x9c2030000000011b > $EINJ/status echo 0x040000035dd8bfc0 > $EINJ/addr echo 0x0000c2030b404000 > $EINJ/synd echo 0 > $EINJ/bank # Wait for MCE to appear in dmesg sudo ras-mc-ctl --errors There should be a new MCE event in the output: 1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10 For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there. git clone https://github.com/andikleen/mce-inject cd mce-inject make sudo ./mce-inject < test/corrected sudo ras-mc-ctl --errors No Memory errors. No PCIe AER errors. No Extlog errors. 
MCE events: 1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001 2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002 [Fix] https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974 https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529 https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6 https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33 [Regression Risk] The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform.
2020-04-14 00:36:57 dann frazier description [Impact] rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform. [Test Case] On an AMD SMCA-capable system: #!/bin/bash modprobe mce-inject EINJ=/sys/kernel/debug/mce-inject # See /sys/kernel/debug/mce-inject/README echo hw > $EINJ/flags echo 0x9c2030000000011b > $EINJ/status echo 0x040000035dd8bfc0 > $EINJ/addr echo 0x0000c2030b404000 > $EINJ/synd echo 0 > $EINJ/bank # Wait for MCE to appear in dmesg sudo ras-mc-ctl --errors There should be a new MCE event in the output: 1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10 For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there. git clone https://github.com/andikleen/mce-inject cd mce-inject make sudo ./mce-inject < test/corrected sudo ras-mc-ctl --errors No Memory errors. No PCIe AER errors. No Extlog errors. 
MCE events: 1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001 2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002 [Fix] https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974 https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529 https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6 https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33 [Regression Risk] The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform. [Impact] rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform. 
[Test Case]
On an AMD SMCA-capable system:
    #!/bin/bash
    modprobe mce-inject
    EINJ=/sys/kernel/debug/mce-inject
    # See /sys/kernel/debug/mce-inject/README
    echo hw > $EINJ/flags
    echo 0x9c2030000000011b > $EINJ/status
    echo 0x040000035dd8bfc0 > $EINJ/addr
    echo 0x0000c2030b404000 > $EINJ/synd
    echo 0 > $EINJ/bank
    # Wait for MCE to appear in dmesg
    sudo ras-mc-ctl --errors
There should be a new MCE event in the output:
    1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.
    git clone https://github.com/andikleen/mce-inject
    cd mce-inject
    make
    sudo ./mce-inject < test/corrected
    sudo ras-mc-ctl --errors
    No Memory errors.
    No PCIe AER errors.
    No Extlog errors.
    MCE events:
    1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
    2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
[Fix]
    https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
    https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
    https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
    https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33
[Regression Risk]
The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause, e.g., a crash in rasdaemon. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform.
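As a reading aid for the status= values in the test output above, here is a minimal sketch that decodes a few MCi_STATUS flag bits in shell. The helper name decode_mci_status is hypothetical and not part of rasdaemon. The bit positions are an assumption based on the commonly documented x86 MCA status layout (63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC) plus the AMD-specific bits 53 SyndV, 46 CECC, 45 UECC and 44 Deferred. They are not taken from the rasdaemon patches listed under [Fix].

```shell
#!/bin/sh
# decode_mci_status: hypothetical helper, not part of rasdaemon.
# Input: a 64-bit MCi_STATUS value written as 0x followed by 16 hex digits.
decode_mci_status() {
    hex=${1#0x}
    # All flags decoded here live in the upper 32 bits, so only that half
    # is converted; this keeps the arithmetic in range for a POSIX shell.
    hi=$(printf '%s' "$hex" | cut -c1-8)
    flags=""
    # Each entry is <bit position in the full register>:<flag name>.
    # SYNDV, CECC, UECC and DEFERRED are assumed AMD-specific meanings.
    for entry in 63:VAL 62:OVER 61:UC 60:EN 59:MISCV 58:ADDRV 57:PCC \
                 53:SYNDV 46:CECC 45:UECC 44:DEFERRED; do
        bit=${entry%%:*}
        name=${entry#*:}
        if [ $(( (0x$hi >> (bit - 32)) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
    done
    echo "${flags# }"
}

decode_mci_status 0x9c2030000000011b   # status injected in the AMD test
decode_mci_status 0x9400000000000000   # status from the mce-inject "corrected" test
```

Under these assumptions, the first call reports DEFERRED and UECC, which lines up with the "Deferred error" and "mci UECC" fields in the AMD output. The second call reports only VAL, EN and ADDRV with UC clear, consistent with "Corrected_error Error_enabled".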
2020-04-14 21:56:10 Brian Murray rasdaemon (Ubuntu Eoan): status In Progress Fix Committed
2020-04-14 21:56:14 Brian Murray bug added subscriber Ubuntu Stable Release Updates Team
2020-04-14 21:56:15 Brian Murray bug added subscriber SRU Verification
2020-04-14 21:56:19 Brian Murray tags verification-needed verification-needed-eoan
2020-04-14 22:06:16 Brian Murray rasdaemon (Ubuntu Bionic): status In Progress Fix Committed
2020-04-14 22:06:24 Brian Murray tags verification-needed verification-needed-eoan verification-needed verification-needed-bionic verification-needed-eoan
2020-04-15 18:20:59 dann frazier tags verification-needed verification-needed-bionic verification-needed-eoan verification-done-eoan verification-needed verification-needed-bionic
2020-04-15 19:42:43 dann frazier tags verification-done-eoan verification-needed verification-needed-bionic verification-done verification-done-bionic verification-done-eoan
2020-04-22 01:34:18 Chris Halse Rogers removed subscriber Ubuntu Stable Release Updates Team
2020-04-22 01:35:30 Launchpad Janitor rasdaemon (Ubuntu Bionic): status Fix Committed Fix Released
2020-04-22 01:49:36 Launchpad Janitor rasdaemon (Ubuntu Eoan): status Fix Committed Fix Released