Activity log for bug #1772675

Date Who What changed Old value New value Message
2018-05-22 14:40:49 Dan Streetman bug added bug
2018-05-22 14:41:06 Dan Streetman nominated for series Ubuntu Cosmic
2018-05-22 14:41:06 Dan Streetman bug task added linux (Ubuntu Cosmic)
2018-05-22 14:41:06 Dan Streetman nominated for series Ubuntu Bionic
2018-05-22 14:41:06 Dan Streetman bug task added linux (Ubuntu Bionic)
2018-05-22 14:41:06 Dan Streetman nominated for series Ubuntu Xenial
2018-05-22 14:41:06 Dan Streetman bug task added linux (Ubuntu Xenial)
2018-05-22 14:41:14 Dan Streetman linux (Ubuntu Xenial): assignee Dan Streetman (ddstreet)
2018-05-22 14:41:15 Dan Streetman linux (Ubuntu Bionic): assignee Dan Streetman (ddstreet)
2018-05-22 14:41:17 Dan Streetman linux (Ubuntu Cosmic): assignee Dan Streetman (ddstreet)
2018-05-22 14:41:20 Dan Streetman linux (Ubuntu Cosmic): status New In Progress
2018-05-22 14:41:22 Dan Streetman linux (Ubuntu Bionic): status New In Progress
2018-05-22 14:41:24 Dan Streetman linux (Ubuntu Xenial): status New In Progress
2018-05-22 14:54:15 Nobuto Murata bug added subscriber Nobuto Murata
2018-05-22 15:01:12 Adam Thorn bug added subscriber Adam Thorn
2018-05-23 06:50:01 Stefan Kooman bug added subscriber Stefan Kooman
2018-05-25 14:51:54 Dan Streetman linux (Ubuntu Xenial): importance Undecided Low
2018-05-25 14:51:56 Dan Streetman linux (Ubuntu Bionic): importance Undecided Low
2018-05-25 14:51:57 Dan Streetman linux (Ubuntu Cosmic): importance Undecided Low
2018-08-06 04:43:13 Andrew Ruthven bug added subscriber Andrew Ruthven
2018-10-05 15:17:25 Dan Streetman linux (Ubuntu Xenial): assignee Dan Streetman (ddstreet)
2018-10-05 15:17:27 Dan Streetman linux (Ubuntu Bionic): assignee Dan Streetman (ddstreet)
2018-10-05 15:17:29 Dan Streetman linux (Ubuntu Cosmic): assignee Dan Streetman (ddstreet)
2018-10-05 15:17:42 Dan Streetman linux (Ubuntu Xenial): status In Progress Incomplete
2018-10-05 15:17:45 Dan Streetman linux (Ubuntu Bionic): status In Progress Incomplete
2018-10-05 15:17:47 Dan Streetman linux (Ubuntu Cosmic): status In Progress Incomplete
2019-01-08 04:17:27 Launchpad Janitor linux (Ubuntu Bionic): status Incomplete Expired
2019-01-08 04:17:29 Launchpad Janitor linux (Ubuntu Cosmic): status Incomplete Expired
2019-01-08 04:17:30 Launchpad Janitor linux (Ubuntu Xenial): status Incomplete Expired
2019-01-08 04:17:31 Launchpad Janitor linux (Ubuntu): status Incomplete Expired
2021-02-19 17:46:01 Heitor Alves de Siqueira linux (Ubuntu): assignee Heitor Alves de Siqueira (halves)
2021-02-19 17:46:03 Heitor Alves de Siqueira linux (Ubuntu Xenial): assignee Heitor Alves de Siqueira (halves)
2021-02-19 17:46:05 Heitor Alves de Siqueira linux (Ubuntu Cosmic): assignee Heitor Alves de Siqueira (halves)
2021-02-19 17:46:08 Heitor Alves de Siqueira linux (Ubuntu Bionic): assignee Heitor Alves de Siqueira (halves)
2021-02-19 17:46:15 Heitor Alves de Siqueira linux (Ubuntu): status Expired Fix Released
2021-02-19 17:46:18 Heitor Alves de Siqueira linux (Ubuntu Xenial): status Expired In Progress
2021-02-19 17:46:19 Heitor Alves de Siqueira linux (Ubuntu Bionic): status Expired In Progress
2021-02-19 17:46:21 Heitor Alves de Siqueira linux (Ubuntu Cosmic): status Expired In Progress
2021-02-19 17:46:24 Heitor Alves de Siqueira tags sts
2021-02-19 17:46:53 Heitor Alves de Siqueira linux (Ubuntu Cosmic): status In Progress Won't Fix
2021-02-23 10:07:16 Denis Pascheka bug added subscriber Denis Pascheka
2021-02-26 12:24:42 Heitor Alves de Siqueira linux (Ubuntu Xenial): importance Low High
2021-02-26 12:24:44 Heitor Alves de Siqueira linux (Ubuntu Bionic): importance Low High
2021-03-04 12:21:18 Heitor Alves de Siqueira description
Old value:
[impact] The i40e driver sometimes causes a "malicious device" event that the firmware detects, which causes the firmware to reset the nic, causing an interruption in the network connection - which can cause further problems, e.g. if the interface is in a bond; the reset will at least cause a temporary interruption in network traffic.
[fix] The fix for this is currently unknown. As the "MDD event" is generated by the i40e firmware, and is completely undocumented, there is no way to tell what the i40e driver did to cause the MDD event.
[test case] the bug is unfortunately very difficult to reproduce, but as shown in this (and previous) bug comments, some users of the i40e have traffic that can consistently reproduce the problem (although usually on the order of days, or longer, to reproduce). Reproducing is easily detected, as the nw traffic will be interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[regression potential] unknown since the specific fix is unknown.
[original description] This is a continuation from bug 1713553 and then bug 1723127; a patch was added in the first bug and then the second bug, to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports. See bug 1713553 and bug 1723127 for more details.
New value:
[Impact] The i40e driver sometimes causes a "malicious device" event that the firmware detects, which causes the firmware to reset the NIC, causing an interruption in the network connection - which can cause further problems, e.g. if the interface is in a bond; the reset will at least cause a temporary interruption in network traffic.
[Fix] In the case of MDD events issued for the PF, they are usually the result of a misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't need to issue a reset to the whole NIC, TX hang checks should handle those if necessary.
[Test Case] The bug is unfortunately difficult to reproduce, as there's no detailed documentation on how the i40e firmware detects and raises MDDs. We have seen reports of this happening in Xenial and Bionic, for workloads stressing i40e bonds in LACP mode. Reproducing is easily detected, as the network traffic will be interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[Regression Potential] Since we're removing resets for the NIC, regressions could show up as issues in connectivity after the MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking.
==
[original description] This is a continuation from bug 1713553 and then bug 1723127; a patch was added in the first bug and then the second bug, to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports. See bug 1713553 and bug 1723127 for more details.
2021-03-04 12:26:19 Heitor Alves de Siqueira description
Old value:
[Impact] The i40e driver sometimes causes a "malicious device" event that the firmware detects, which causes the firmware to reset the NIC, causing an interruption in the network connection - which can cause further problems, e.g. if the interface is in a bond; the reset will at least cause a temporary interruption in network traffic.
[Fix] In the case of MDD events issued for the PF, they are usually the result of a misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't need to issue a reset to the whole NIC, TX hang checks should handle those if necessary.
[Test Case] The bug is unfortunately difficult to reproduce, as there's no detailed documentation on how the i40e firmware detects and raises MDDs. We have seen reports of this happening in Xenial and Bionic, for workloads stressing i40e bonds in LACP mode. Reproducing is easily detected, as the network traffic will be interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[Regression Potential] Since we're removing resets for the NIC, regressions could show up as issues in connectivity after the MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking.
==
[original description] This is a continuation from bug 1713553 and then bug 1723127; a patch was added in the first bug and then the second bug, to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports. See bug 1713553 and bug 1723127 for more details.
New value:
[Impact] The i40e driver sometimes causes a "malicious device" event that the firmware detects, which causes the firmware to reset the NIC, causing an interruption in the network connection - which can cause further problems, e.g. if the interface is in a bond; the reset will at least cause a temporary interruption in network traffic.
[Fix] In the case of MDD events issued for the PF, they are usually the result of a misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't need to issue a reset to the whole NIC, TX hang checks should handle those if necessary.
[Test Case] The bug is unfortunately difficult to reproduce, as there's no detailed documentation on how the i40e firmware detects and raises MDDs. We have seen reports of this happening in Xenial and Bionic, for workloads stressing i40e bonds in LACP mode. Reproducing is easily detected, as the network traffic will be interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[Regression Potential] Since we're removing resets for the NIC, regressions could show up as issues in connectivity after the MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking. The potential for this should however be fairly low, as this patch has been present since kernel 5.2 and hasn't seen any fixes or regressions upstream. Basic smoke tests also showed that the driver continues working as expected.
==
[original description] This is a continuation from bug 1713553 and then bug 1723127; a patch was added in the first bug and then the second bug, to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports. See bug 1713553 and bug 1723127 for more details.
2021-03-04 12:32:27 Heitor Alves de Siqueira summary Intel i40e PF reset due to incorrect MDD detection (continues...again...) Intel i40e PF reset due to incorrect MDD detection
2021-03-04 12:34:08 Heitor Alves de Siqueira summary Intel i40e PF reset due to incorrect MDD detection i40e PF reset due to incorrect MDD event
2021-03-09 20:41:09 Heitor Alves de Siqueira attachment added probe_tx_desc.c https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772675/+attachment/5475181/+files/probe_tx_desc.c
2021-03-10 19:41:10 Heitor Alves de Siqueira description
Old value:
[Impact] The i40e driver sometimes causes a "malicious device" event that the firmware detects, which causes the firmware to reset the NIC, causing an interruption in the network connection - which can cause further problems, e.g. if the interface is in a bond; the reset will at least cause a temporary interruption in network traffic.
[Fix] In the case of MDD events issued for the PF, they are usually the result of a misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't need to issue a reset to the whole NIC, TX hang checks should handle those if necessary.
[Test Case] The bug is unfortunately difficult to reproduce, as there's no detailed documentation on how the i40e firmware detects and raises MDDs. We have seen reports of this happening in Xenial and Bionic, for workloads stressing i40e bonds in LACP mode. Reproducing is easily detected, as the network traffic will be interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[Regression Potential] Since we're removing resets for the NIC, regressions could show up as issues in connectivity after the MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking. The potential for this should however be fairly low, as this patch has been present since kernel 5.2 and hasn't seen any fixes or regressions upstream. Basic smoke tests also showed that the driver continues working as expected.
==
[original description] This is a continuation from bug 1713553 and then bug 1723127; a patch was added in the first bug and then the second bug, to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports. See bug 1713553 and bug 1723127 for more details.
New value:
[Impact] The i40e driver sometimes causes a "malicious device" event that the firmware detects, which causes the firmware to reset the NIC, causing an interruption in the network connection - which can cause further problems, e.g. if the interface is in a bond; the reset will at least cause a temporary interruption in network traffic.
[Fix] In the case of MDD events issued for the PF, they are usually the result of a misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't need to issue a reset to the whole NIC, TX hang checks should handle those if necessary.
[Test Procedure] The bug is unfortunately difficult to reproduce, as there's no detailed documentation on how the i40e firmware detects and raises MDDs. We have seen reports of this happening in Xenial and Bionic, for workloads stressing i40e bonds in LACP mode. Reproducing is easily detected, as the network traffic will be interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
An alternative test procedure makes use of the kprobes attached to the LP bug. The test setup is as follows:
- Create 2 VFs on primary NIC
- Passthrough VF 1 to a Bionic VM
- Start iperf3 client on VM, going through i40evf interface
- Start another iperf3 client on host, going through i40e interface
Both iperf3 clients should be using an external server located on a separate host. By loading the kprobe module while iperf3 is running, we should be able to raise MDDs more consistently. MDD behaviour can change according to firmware version, so we may need to try with different sets of probes. The one with the most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified of new data.
[Regression Potential] Since we're removing resets for the NIC, regressions could show up as issues in connectivity after the MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking. The potential for this should however be fairly low, as this patch has been present since kernel 5.2 and hasn't seen any fixes or regressions upstream. Basic smoke tests also showed that the driver continues working as expected.
==
[original description] This is a continuation from bug 1713553 and then bug 1723127; a patch was added in the first bug and then the second bug, to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports. See bug 1713553 and bug 1723127 for more details.
2021-03-10 19:43:36 Heitor Alves de Siqueira attachment added probe_tx_xenial.c https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772675/+attachment/5475473/+files/probe_tx_xenial.c
2021-03-10 19:55:32 Heitor Alves de Siqueira description
Old value:
[Impact] The i40e driver sometimes causes a "malicious device" event that the firmware detects, which causes the firmware to reset the NIC, causing an interruption in the network connection - which can cause further problems, e.g. if the interface is in a bond; the reset will at least cause a temporary interruption in network traffic.
[Fix] In the case of MDD events issued for the PF, they are usually the result of a misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't need to issue a reset to the whole NIC, TX hang checks should handle those if necessary.
[Test Procedure] The bug is unfortunately difficult to reproduce, as there's no detailed documentation on how the i40e firmware detects and raises MDDs. We have seen reports of this happening in Xenial and Bionic, for workloads stressing i40e bonds in LACP mode. Reproducing is easily detected, as the network traffic will be interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
An alternative test procedure makes use of the kprobes attached to the LP bug. The test setup is as follows:
- Create 2 VFs on primary NIC
- Passthrough VF 1 to a Bionic VM
- Start iperf3 client on VM, going through i40evf interface
- Start another iperf3 client on host, going through i40e interface
Both iperf3 clients should be using an external server located on a separate host. By loading the kprobe module while iperf3 is running, we should be able to raise MDDs more consistently. MDD behaviour can change according to firmware version, so we may need to try with different sets of probes. The one with the most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified of new data.
[Regression Potential] Since we're removing resets for the NIC, regressions could show up as issues in connectivity after the MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking. The potential for this should however be fairly low, as this patch has been present since kernel 5.2 and hasn't seen any fixes or regressions upstream. Basic smoke tests also showed that the driver continues working as expected.
==
[original description] This is a continuation from bug 1713553 and then bug 1723127; a patch was added in the first bug and then the second bug, to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports. See bug 1713553 and bug 1723127 for more details.
New value:
[Impact] The i40e driver sometimes causes a "malicious device" event that the firmware detects, which causes the firmware to reset the NIC, causing an interruption in the network connection - which can cause further problems, e.g. if the interface is in a bond; the reset will at least cause a temporary interruption in network traffic.
[Fix] In the case of MDD events issued for the PF, they are usually the result of a misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't need to issue a reset to the whole NIC, TX hang checks should handle those if necessary.
[Test Procedure] The bug is unfortunately difficult to reproduce, as there's no detailed documentation on how the i40e firmware detects and raises MDDs. We have seen reports of this happening in Xenial and Bionic, for workloads stressing i40e bonds in LACP mode. Reproducing is easily detected, as the network traffic will be interrupted and the system logs will contain a message like:
i40e 0000:02:00.1: TX driver issue detected, PF reset issued
An alternative test procedure makes use of the kprobes attached to the LP bug. The test setup is as follows:
- Create 2 VFs on primary NIC
- Passthrough VF 1 to a Bionic VM
- Start iperf3 client on VM, going through i40evf interface
- Start another iperf3 client on host, going through i40e interface
Both iperf3 clients should be using an external server located on a separate host. By loading the kprobe module while iperf3 is running, we should be able to raise MDDs more consistently. MDD behaviour can change according to firmware version, so we may need to try with different sets of probes. The one with the most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified of new data.
[Regression Potential] Since we're removing resets for the NIC, regressions could show up as issues in connectivity after the MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking. The potential for this should however be fairly low, as this patch has been present since kernel 5.2 and hasn't seen any fixes or regressions upstream. Basic smoke tests also showed that the driver continues working as expected, and that necessary PF resets will be issued by the netdev watchdog in case of any hung queues.
==
[original description] This is a continuation from bug 1713553 and then bug 1723127; a patch was added in the first bug and then the second bug, to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports. See bug 1713553 and bug 1723127 for more details.
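The description above says a reproduction is detected by a specific message in the system logs. A minimal sketch of that check follows, assuming the kernel log has already been collected as lines of text (for example from dmesg output). The function name find_pf_resets and the exact regular expression are illustrative assumptions, not part of the bug report or of any i40e tooling; only the message text itself comes from the description.

```python
import re

# Pattern built from the message quoted in the bug description:
#   i40e 0000:02:00.1: TX driver issue detected, PF reset issued
# The PCI address layout (domain:bus:device.function) is an assumption
# about how the address is printed, based on that single sample line.
PF_RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_pf_resets(lines):
    """Return the PCI address from every line that reports a PF reset."""
    hits = []
    for line in lines:
        match = PF_RESET_RE.search(line)
        if match:
            hits.append(match.group("pci"))
    return hits


# Hypothetical sample input; the timestamps are made up for illustration.
sample = [
    "[1234.5] i40e 0000:02:00.1: TX driver issue detected, PF reset issued",
    "[1240.0] i40e 0000:02:00.1: NIC Link is Up",
]
resets = find_pf_resets(sample)
```

With the patched kernel, the expectation stated in the description is that an MDD event no longer triggers this message, so an empty result after running the kprobe-based test procedure would be consistent with the fix. It would not by itself prove that no MDD event was raised.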
2021-03-10 23:48:33 Kelsey Steele linux (Ubuntu Xenial): status In Progress Fix Committed
2021-03-10 23:48:36 Kelsey Steele linux (Ubuntu Bionic): status In Progress Fix Committed
2021-03-25 15:23:34 Ubuntu Kernel Bot tags sts sts verification-needed-bionic
2021-03-25 15:25:40 Ubuntu Kernel Bot tags sts verification-needed-bionic sts verification-needed-bionic verification-needed-xenial
2021-03-30 19:08:58 Heitor Alves de Siqueira tags sts verification-needed-bionic verification-needed-xenial sts verification-done-xenial verification-needed-bionic
2021-03-30 20:45:40 Heitor Alves de Siqueira tags sts verification-done-xenial verification-needed-bionic sts verification-done-bionic verification-done-xenial
2021-04-12 15:17:02 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2021-04-12 15:17:02 Launchpad Janitor cve linked 2018-13095
2021-04-12 15:17:02 Launchpad Janitor cve linked 2021-3348
2021-04-12 15:32:06 Launchpad Janitor linux (Ubuntu Xenial): status Fix Committed Fix Released
2021-04-12 15:32:06 Launchpad Janitor cve linked 2015-1350
2021-04-12 15:32:06 Launchpad Janitor cve linked 2017-5967
2021-04-12 15:32:06 Launchpad Janitor cve linked 2018-5953
2021-04-12 15:32:06 Launchpad Janitor cve linked 2018-5995
2021-04-12 15:32:06 Launchpad Janitor cve linked 2018-7754
2021-04-12 15:32:06 Launchpad Janitor cve linked 2019-16231
2021-04-12 15:32:06 Launchpad Janitor cve linked 2019-16232
2021-04-12 15:32:06 Launchpad Janitor cve linked 2019-19061