Activity log for bug #1917471

Date Who What changed Old value New value Message
2021-03-02 14:30:31 prabhakar pujeri bug added bug
2021-03-02 14:31:35 prabhakar pujeri bug added subscriber Jerry Clement
2021-03-16 19:30:48 Jeff Lane  affects kernel-sru-workflow linux
2021-03-16 19:31:42 Jeff Lane  summary Bus Fatal Error observed when reboot on BCM5720 [Regression] Bus Fatal Error observed when reboot on BCM5720
2021-03-17 07:51:19 Andrew Cloke bug added subscriber Andrew Cloke
2022-03-09 15:01:25 Sujith Pandel bug added subscriber Sujith Pandel
2022-03-14 06:22:10 Kai-Heng Feng bug added subscriber Kai-Heng Feng
2022-03-17 06:03:59 Sujith Pandel bug added subscriber Michael Reed
2022-03-17 06:04:10 Sujith Pandel bug added subscriber Narendra K
2022-03-17 06:04:19 Sujith Pandel bug added subscriber Vinay HM
2022-03-17 20:22:52 Jeff Lane  bug task added linux (Ubuntu)
2022-03-17 20:23:32 Jeff Lane  bug task deleted linux (Ubuntu)
2022-03-17 20:23:43 Launchpad Janitor linux (Ubuntu): status New Confirmed
2022-03-17 20:23:43 Jeff Lane  affects linux linux (Ubuntu)
2022-03-17 20:23:56 Jeff Lane  nominated for series Ubuntu Jammy
2022-03-17 20:23:56 Jeff Lane  bug task added linux (Ubuntu Jammy)
2022-03-17 20:23:56 Jeff Lane  nominated for series Ubuntu Focal
2022-03-17 20:23:56 Jeff Lane  bug task added linux (Ubuntu Focal)
2022-03-17 20:23:56 Jeff Lane  nominated for series Ubuntu Impish
2022-03-17 20:23:56 Jeff Lane  bug task added linux (Ubuntu Impish)
2022-03-17 20:24:05 Jeff Lane  linux (Ubuntu Focal): status New In Progress
2022-03-17 20:24:07 Jeff Lane  linux (Ubuntu Focal): importance Undecided Medium
2022-03-17 20:24:09 Jeff Lane  linux (Ubuntu Impish): importance Undecided Medium
2022-03-17 20:24:11 Jeff Lane  linux (Ubuntu Jammy): importance Undecided Medium
2022-03-17 20:24:13 Jeff Lane  linux (Ubuntu Focal): assignee Jeff Lane (bladernr)
2022-03-17 20:24:15 Jeff Lane  linux (Ubuntu Impish): assignee Jeff Lane (bladernr)
2022-03-17 20:24:17 Jeff Lane  linux (Ubuntu Jammy): assignee Jeff Lane (bladernr)
2022-03-17 20:24:19 Jeff Lane  linux (Ubuntu Impish): status New In Progress
2022-03-17 20:24:21 Jeff Lane  linux (Ubuntu Jammy): status New In Progress
2022-03-17 20:36:48 Jeff Lane  linux (Ubuntu Impish): status In Progress Fix Released
2022-03-17 20:36:49 Jeff Lane  linux (Ubuntu Jammy): status In Progress Fix Released
2022-03-18 10:56:11 Vinay HM attachment added Fatal_issue_rmmod_tg3.txt https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471/+attachment/5570368/+files/Fatal_issue_rmmod_tg3.txt
2022-03-22 07:16:17 Vinay HM attachment added fatal_blacklist https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471/+attachment/5571631/+files/fatal_blacklist
2022-03-23 13:47:29 Jeff Lane  description
Old value:
following error messages are observed
[ 146.429212] shutdown[1]: Rebooting.
[ 146.435151] kvm: exiting hardware virtualization
[ 146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 148.088133] [qede_unload:2236(eno12409)]Link is down
[ 148.183618] qede 0000:31:00.1: Ending qede_remove successfully
[ 148.518541] [qede_unload:2236(eno12399)]Link is down
[ 148.625066] qede 0000:31:00.0: Ending qede_remove successfully
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.933984] reboot: Restarting system
[ 148.938319] reboot: machine restart
I have observed the following. when I test older kernel
Kernel version  Fatal Error
5.4.0-42.46     No
5.4.0-45.49     No
5.4.0-47.51     No
5.4.0-48.52     No
5.4.0-51.56     No
5.4.0-52.57     No
5.4.0-53.59     No
5.4.0-54.60     No
5.4.0-58.64     No
5.4.0-59.65     yes
5.4.0-60.67     yes
later I have bisect kernel between 5.4.0-58.64 and 5.4.0-59.65. looks like due to the following patch we are observing this issue. The driver is not handling D3 state properly
PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
https://kernel.ubuntu.com/git/ubuntu/ubuntu-focal.git/commit/?id=b9319dd02269593911403dd5d684368bcef3261d
New value:
impact being noticed a lot, only affects 5.4, fix in subsequent failures
The offending patch was removed in 20.10 and later kernels (it was reverted upstream not long after being merged into mainline but we never reverted it)
following error messages are observed
[ 146.429212] shutdown[1]: Rebooting.
[ 146.435151] kvm: exiting hardware virtualization
[ 146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 148.088133] [qede_unload:2236(eno12409)]Link is down
[ 148.183618] qede 0000:31:00.1: Ending qede_remove successfully
[ 148.518541] [qede_unload:2236(eno12399)]Link is down
[ 148.625066] qede 0000:31:00.0: Ending qede_remove successfully
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.933984] reboot: Restarting system
[ 148.938319] reboot: machine restart
I have observed the following. when I test older kernel
Kernel version  Fatal Error
5.4.0-42.46     No
5.4.0-45.49     No
5.4.0-47.51     No
5.4.0-48.52     No
5.4.0-51.56     No
5.4.0-52.57     No
5.4.0-53.59     No
5.4.0-54.60     No
5.4.0-58.64     No
5.4.0-59.65     yes
5.4.0-60.67     yes
later I have bisect kernel between 5.4.0-58.64 and 5.4.0-59.65. looks like due to the following patch we are observing this issue. The driver is not handling D3 state properly
PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
https://kernel.ubuntu.com/git/ubuntu/ubuntu-focal.git/commit/?id=b9319dd02269593911403dd5d684368bcef3261d
2022-03-24 13:08:36 Jeff Lane  description
Old value:
impact being noticed a lot, only affects 5.4, fix in subsequent failures
The offending patch was removed in 20.10 and later kernels (it was reverted upstream not long after being merged into mainline but we never reverted it)
following error messages are observed
[ 146.429212] shutdown[1]: Rebooting.
[ 146.435151] kvm: exiting hardware virtualization
[ 146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 148.088133] [qede_unload:2236(eno12409)]Link is down
[ 148.183618] qede 0000:31:00.1: Ending qede_remove successfully
[ 148.518541] [qede_unload:2236(eno12399)]Link is down
[ 148.625066] qede 0000:31:00.0: Ending qede_remove successfully
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.933984] reboot: Restarting system
[ 148.938319] reboot: machine restart
I have observed the following. when I test older kernel
Kernel version  Fatal Error
5.4.0-42.46     No
5.4.0-45.49     No
5.4.0-47.51     No
5.4.0-48.52     No
5.4.0-51.56     No
5.4.0-52.57     No
5.4.0-53.59     No
5.4.0-54.60     No
5.4.0-58.64     No
5.4.0-59.65     yes
5.4.0-60.67     yes
later I have bisect kernel between 5.4.0-58.64 and 5.4.0-59.65. looks like due to the following patch we are observing this issue. The driver is not handling D3 state properly
PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
https://kernel.ubuntu.com/git/ubuntu/ubuntu-focal.git/commit/?id=b9319dd02269593911403dd5d684368bcef3261d
New value:
SRU Justification:
[IMPACT]
This is being reported by a hardware partner as it is being noticed a lot both in their internal testing teams and also being reported with some frequency by customers who are seeing these messages in their logs and thus it is generating an unusually high volume of support calls from the field.
In 5.4, commit d60cd06331a3566d3305b3c7b566e79edf4e2095 was introduced upstream and pulled into Ubuntu between 5.4.0-58.64 and 5.4.0-59.65. Upstream, these errors were discovered and that patch was reverted (see Fix Below). We carry the revert commit in all subsequent Focal HWE kernels starting at 5.12, but the fix was never pulled back into Focal 5.4.
according to the hardware partner:
the following error messages are observed when rebooting a machine that uses the BCM5720 chipset, which is a widely used 1GbE controller found on LOMs and OCP NICs as well as many PCIe NIC models.
[ 146.429212] shutdown[1]: Rebooting.
[ 146.435151] kvm: exiting hardware virtualization
[ 146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 148.088133] [qede_unload:2236(eno12409)]Link is down
[ 148.183618] qede 0000:31:00.1: Ending qede_remove successfully
[ 148.518541] [qede_unload:2236(eno12399)]Link is down
[ 148.625066] qede 0000:31:00.0: Ending qede_remove successfully
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.933984] reboot: Restarting system
[ 148.938319] reboot: machine restart
The hardware partner did some bisection and observed the following:
Kernel version  Fatal Error
5.4.0-42.46     No
5.4.0-45.49     No
5.4.0-47.51     No
5.4.0-48.52     No
5.4.0-51.56     No
5.4.0-52.57     No
5.4.0-53.59     No
5.4.0-54.60     No
5.4.0-58.64     No
5.4.0-59.65     yes
5.4.0-60.67     yes
[FIX]
The fix is to apply this patch from upstream:
commit 9d3fcb28f9b9750b474811a2964ce022df56336e
Author: Josef Bacik <josef@toxicpanda.com>
Date: Tue Mar 16 22:17:48 2021 -0400
Revert "PM: ACPI: reboot: Use S5 for reboot"
This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.
This patch causes a panic when rebooting my Dell Poweredge r440. I do not have the full panic log as it's lost at that stage of the reboot and I do not have a serial console. Reverting this patch makes my system able to reboot again.
Example: https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/focal/+ref/1917471
[TEST CASE]
Install the patched kernel on a machine that uses a BCM5720 LOM and reboot the machine and see that the errors no longer appear.
2022-03-24 13:08:49 Jeff Lane  summary [Regression] Bus Fatal Error observed when reboot on BCM5720 [SRU][Regression] Bus Fatal Error observed when reboot on BCM5720
2022-03-24 13:09:36 Jeff Lane  summary [SRU][Regression] Bus Fatal Error observed when reboot on BCM5720 [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which causes Bus Fatal Error when rebooting system with BCM5720 NIC
2022-03-24 13:15:04 Jeff Lane  description
Old value:
SRU Justification:
[IMPACT]
This is being reported by a hardware partner as it is being noticed a lot both in their internal testing teams and also being reported with some frequency by customers who are seeing these messages in their logs and thus it is generating an unusually high volume of support calls from the field.
In 5.4, commit d60cd06331a3566d3305b3c7b566e79edf4e2095 was introduced upstream and pulled into Ubuntu between 5.4.0-58.64 and 5.4.0-59.65. Upstream, these errors were discovered and that patch was reverted (see Fix Below). We carry the revert commit in all subsequent Focal HWE kernels starting at 5.12, but the fix was never pulled back into Focal 5.4.
according to the hardware partner:
the following error messages are observed when rebooting a machine that uses the BCM5720 chipset, which is a widely used 1GbE controller found on LOMs and OCP NICs as well as many PCIe NIC models.
[ 146.429212] shutdown[1]: Rebooting.
[ 146.435151] kvm: exiting hardware virtualization
[ 146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 148.088133] [qede_unload:2236(eno12409)]Link is down
[ 148.183618] qede 0000:31:00.1: Ending qede_remove successfully
[ 148.518541] [qede_unload:2236(eno12399)]Link is down
[ 148.625066] qede 0000:31:00.0: Ending qede_remove successfully
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.933984] reboot: Restarting system
[ 148.938319] reboot: machine restart
The hardware partner did some bisection and observed the following:
Kernel version  Fatal Error
5.4.0-42.46     No
5.4.0-45.49     No
5.4.0-47.51     No
5.4.0-48.52     No
5.4.0-51.56     No
5.4.0-52.57     No
5.4.0-53.59     No
5.4.0-54.60     No
5.4.0-58.64     No
5.4.0-59.65     yes
5.4.0-60.67     yes
[FIX]
The fix is to apply this patch from upstream:
commit 9d3fcb28f9b9750b474811a2964ce022df56336e
Author: Josef Bacik <josef@toxicpanda.com>
Date: Tue Mar 16 22:17:48 2021 -0400
Revert "PM: ACPI: reboot: Use S5 for reboot"
This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.
This patch causes a panic when rebooting my Dell Poweredge r440. I do not have the full panic log as it's lost at that stage of the reboot and I do not have a serial console. Reverting this patch makes my system able to reboot again.
Example: https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/focal/+ref/1917471
[TEST CASE]
Install the patched kernel on a machine that uses a BCM5720 LOM and reboot the machine and see that the errors no longer appear.
New value:
SRU Justification:
[IMPACT]
This is being reported by a hardware partner as it is being noticed a lot both in their internal testing teams and also being reported with some frequency by customers who are seeing these messages in their logs and thus it is generating an unusually high volume of support calls from the field.
In 5.4, commit d60cd06331a3566d3305b3c7b566e79edf4e2095 was introduced upstream and pulled into Ubuntu between 5.4.0-58.64 and 5.4.0-59.65. Upstream, these errors were discovered and that patch was reverted (see Fix Below). We carry the revert commit in all subsequent Focal HWE kernels starting at 5.12, but the fix was never pulled back into Focal 5.4.
according to the hardware partner:
the following error messages are observed when rebooting a machine that uses the BCM5720 chipset, which is a widely used 1GbE controller found on LOMs and OCP NICs as well as many PCIe NIC models.
[ 146.429212] shutdown[1]: Rebooting.
[ 146.435151] kvm: exiting hardware virtualization
[ 146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 148.088133] [qede_unload:2236(eno12409)]Link is down
[ 148.183618] qede 0000:31:00.1: Ending qede_remove successfully
[ 148.518541] [qede_unload:2236(eno12399)]Link is down
[ 148.625066] qede 0000:31:00.0: Ending qede_remove successfully
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.933984] reboot: Restarting system
[ 148.938319] reboot: machine restart
The hardware partner did some bisection and observed the following:
Kernel version  Fatal Error
5.4.0-42.46     No
5.4.0-45.49     No
5.4.0-47.51     No
5.4.0-48.52     No
5.4.0-51.56     No
5.4.0-52.57     No
5.4.0-53.59     No
5.4.0-54.60     No
5.4.0-58.64     No
5.4.0-59.65     yes
5.4.0-60.67     yes
[FIX]
The fix is to apply this patch from upstream:
commit 9d3fcb28f9b9750b474811a2964ce022df56336e
Author: Josef Bacik <josef@toxicpanda.com>
Date: Tue Mar 16 22:17:48 2021 -0400
    Revert "PM: ACPI: reboot: Use S5 for reboot"
    This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.
    This patch causes a panic when rebooting my Dell Poweredge r440. I do
    not have the full panic log as it's lost at that stage of the reboot and
    I do not have a serial console. Reverting this patch makes my system
    able to reboot again.
Example: https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/focal/+ref/1917471
The hardware partner has preemptively pulled our 5.4 tree, applied the fix and tested it in their labs and determined that this does resolve the issue.
[TEST CASE]
Install the patched kernel on a machine that uses a BCM5720 LOM and reboot the machine and see that the errors no longer appear.
2022-03-24 20:04:09 Vinay HM attachment added fatal_issue_1.txt https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471/+attachment/5572643/+files/fatal_issue_1.txt
2022-04-12 18:31:23 Zachary Tahenakos linux (Ubuntu Focal): status In Progress Fix Committed
2022-04-19 15:16:28 Ubuntu Kernel Bot tags verification-needed-focal
2022-04-20 12:36:34 Vinay HM attachment added fatal_error_fix_proposed_kernel.log https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471/+attachment/5581813/+files/fatal_error_fix_proposed_kernel.log
2022-04-20 13:44:34 Michael Reed tags verification-needed-focal verification-done-focal
2022-05-10 09:30:14 Launchpad Janitor linux (Ubuntu Focal): status Fix Committed Fix Released
2022-05-10 09:30:14 Launchpad Janitor cve linked 2020-27820
2022-05-10 09:30:14 Launchpad Janitor cve linked 2021-26401
2022-05-10 09:30:14 Launchpad Janitor cve linked 2022-0001
2022-05-10 09:30:14 Launchpad Janitor cve linked 2022-1016
2022-05-10 09:30:14 Launchpad Janitor cve linked 2022-26490
2022-05-10 09:30:14 Launchpad Janitor cve linked 2022-27223
2022-05-24 08:14:01 Vinay HM attachment added fatalerror_testkernel https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471/+attachment/5592562/+files/fatalerror_testkernel
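
Reading aid for the AER lines quoted in the bug description above: the kernel reports aer_uncor_status 0x00100000 with aer_uncor_mask 0x00010000 and names the error "[20] UnsupReq". The Python sketch below is not part of the bug report; it only illustrates how such a status value maps to bit positions. The function name decode_uncor and the lookup table are hypothetical, and the table is a partial list that assumes the bit positions of the PCIe AER Uncorrectable Error Status register.

```python
# Partial, illustrative table: bit positions assumed from the PCIe AER
# Uncorrectable Error Status register layout. Unlisted bits print as "bit N".
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    12: "Poisoned TLP",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def decode_uncor(status, mask=0):
    """Return (bit, name) for every bit set in status and clear in mask."""
    unmasked = status & ~mask
    return [
        (bit, AER_UNCOR_BITS.get(bit, f"bit {bit}"))
        for bit in range(32)
        if unmasked & (1 << bit)
    ]


# Values copied from the log lines quoted in the bug description.
print(decode_uncor(0x00100000, 0x00010000))
```

With the values from the log, only bit 20 is set and unmasked, which agrees with the "[20] UnsupReq" line the tg3 device reported.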