Intel i40e PF reset due to incorrect MDD detection (continues...again...)

Bug #1772675 reported by Dan Streetman on 2018-05-22
This bug affects 4 people
Affects          Importance  Assigned to
linux (Ubuntu)   Low         Unassigned
Xenial           Low         Unassigned
Bionic           Low         Unassigned
Cosmic           Low         Unassigned

Bug Description

[impact]

The i40e driver sometimes triggers a "Malicious Driver Detection" (MDD) event in the NIC firmware, and the firmware responds by resetting the NIC. The reset interrupts network traffic at least temporarily and can cause further problems, e.g. if the interface is part of a bond.

[fix]

The fix for this is currently unknown. As the "MDD event" is generated by the i40e firmware, and is completely undocumented, there is no way to tell what the i40e driver did to cause the MDD event.

[test case]

The bug is unfortunately very difficult to reproduce. As shown in the comments on this and previous bugs, some i40e users have traffic that consistently reproduces the problem, although it usually takes days or longer. A reproduction is easy to detect: network traffic is interrupted and the system logs contain a message like:

i40e 0000:02:00.1: TX driver issue detected, PF reset issued
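
To check collected logs for this message, a minimal sketch such as the following may help. The pattern is built only from the single log line quoted above; the function name `find_pf_resets` and the sample input are illustrative, and the exact message wording may differ between driver versions.

```python
import re

# Pattern derived from the log line quoted in this bug report.
# The PCI address format (domain:bus:device.function) is assumed.
PF_RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_pf_resets(lines):
    """Return the PCI address from every line that reports an i40e PF reset."""
    hits = []
    for line in lines:
        match = PF_RESET_RE.search(line)
        if match:
            hits.append(match.group("pci"))
    return hits


# Illustrative input: one matching line and one unrelated line.
sample = [
    "kernel: i40e 0000:02:00.1: TX driver issue detected, PF reset issued",
    "kernel: some unrelated message",
]
print(find_pf_resets(sample))
```

The same function can be fed the lines of a saved `dmesg` or kernel journal dump to see which ports were reset.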

[regression potential]

Unknown, since the specific fix is unknown.

[original description]

This is a continuation of bug 1713553 and then bug 1723127. A patch was added in each of those bugs to attempt to fix this; the patches may have reduced the issue but, based on further reports, do not appear to have fixed it.

See bug 1713553 and bug 1723127 for more details.

Dan Streetman (ddstreet) wrote :

For details about i40e registers that may be able to help debug the cause of this, see bug 1723127 comment 10.

Also, a (possible) workaround to avoid this error is to disable TSO on the i40e NIC.
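
One way to apply the TSO workaround is through `ethtool -K`, which toggles offload features per interface. The sketch below only builds and, by default, prints the command; the helper names and the interface name `eth0` are placeholders, and actually running the command needs root and an installed `ethtool`.

```python
import subprocess


def offload_off_command(iface, features=("tso",)):
    """Build the `ethtool -K` argument list that turns the given offloads off."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offloads(iface, features=("tso",), dry_run=True):
    """Print the command (dry run) or execute it; returns the argument list."""
    cmd = offload_off_command(iface, features)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Requires root privileges; raises CalledProcessError on failure.
        subprocess.run(cmd, check=True)
    return cmd


# "eth0" is a placeholder; substitute the i40e interface name.
disable_offloads("eth0")
```

A setting changed this way normally does not survive a reboot, so it would have to be reapplied by whatever network configuration mechanism the system uses.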

Changed in linux (Ubuntu Xenial):
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Dan Streetman (ddstreet)
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Dan Streetman (ddstreet) on 2018-05-25
Changed in linux (Ubuntu Xenial):
importance: Undecided → Low
Changed in linux (Ubuntu Bionic):
importance: Undecided → Low
Changed in linux (Ubuntu Cosmic):
importance: Undecided → Low
Dan Streetman (ddstreet) wrote :

As I can't reproduce this, and I have heard no more reports of it, I'm marking this as incomplete. If anyone does still see this problem with the latest Xenial, Bionic, or Cosmic kernel, please add a comment to this bug.

Changed in linux (Ubuntu Xenial):
assignee: Dan Streetman (ddstreet) → nobody
Changed in linux (Ubuntu Bionic):
assignee: Dan Streetman (ddstreet) → nobody
Changed in linux (Ubuntu Cosmic):
assignee: Dan Streetman (ddstreet) → nobody
Changed in linux (Ubuntu Xenial):
status: In Progress → Incomplete
Changed in linux (Ubuntu Bionic):
status: In Progress → Incomplete
Changed in linux (Ubuntu Cosmic):
status: In Progress → Incomplete
Terry Hardie (terryh-orcas) wrote :

We are getting this error on all of our new systems (Dell 14G C6420) running Xenial. I've tried 4.4.0-139-generic. I'm now trying 4.13.0-45-generic to see whether it still shows up there.

Dan Streetman (ddstreet) wrote :

@terryh-orcas,

if you are able to reproduce the problem relatively quickly and easily, then I suggest testing different kernel versions, up to the latest upstream, to see if and where it may be fixed with a newer i40e kernel driver. You can get upstream kernel debs here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D

If you can narrow down the kernel to a specific short range (i.e. kernel X definitely fails, kernel Y never fails), I can review the upstream i40e driver for specific changes to backport.
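
Narrowing the range can be organized as a binary search over the available kernel builds. The sketch below is only an illustration: it assumes the builds are ordered oldest to newest, that the oldest one fails and the newest one does not, and that once the problem is fixed it stays fixed. The function name, the `reproduces` callback, and the version labels are hypothetical.

```python
def narrow_kernel_range(versions, reproduces):
    """Return (last failing, first non-failing) kernel from an ordered list.

    versions:   kernel labels, oldest first; versions[0] is assumed to fail
                and versions[-1] is assumed not to fail.
    reproduces: callable taking a label and returning True if the PF reset
                was seen on that kernel (in practice, a manual test run).
    """
    lo, hi = 0, len(versions) - 1  # lo: known failing, hi: known good
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[lo], versions[hi]


# Hypothetical labels and a pretend outcome where "v4" is the first good build.
builds = ["v1", "v2", "v3", "v4", "v5", "v6"]
print(narrow_kernel_range(builds, lambda v: builds.index(v) < builds.index("v4")))
```

Because a reproduction can take days, each "does not fail" verdict needs a test run long enough to be trusted; a kernel declared good too early would send the search in the wrong direction.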

If you can't reproduce it easily/quickly, there is another debugging method that involves modifying undocumented i40e registers. See bug 1723127 comment 10 for details. If you try that method, you should attempt it with the latest kernel on which you can reproduce the problem. As I don't have the chipset specifications, if you do reproduce it this way and can isolate the problem to a specific register/bit, I'll have to take that info back to Intel to ask them for clarification. Also note that there are two registers whose bits each have to be tested individually, so this method can take a very long time if each reproduction takes you a long time.

Unfortunately, as has been mentioned in this and past bugs, the MDD event is generated by the i40e firmware and there is no documented way to tell what the i40e kernel driver did that the firmware didn't like (assuming it was something the driver did, and not an external or firmware issue). Intel regularly updates their upstream i40e driver with fixes for MDD firmware/driver bugs, so this will likely only be fixed by a patch from Intel upstream, which we would then need to backport to our older stable Ubuntu kernel(s).

Sorry I can't help more.

Oladimeji Fayomi (fayomidimeji) wrote :

Hi,
We have disabled TSO and GSO and we are still experiencing the interface resets. This usually happens under high load.

Kernel version: 4.4.0-135-generic

driver: i40e
version: 2.4.10
firmware-version: 6.01 0x80003493 0.0.0
bus-info: 0000:05:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Dan Streetman (ddstreet) wrote :

@fayomidimeji, my comment 4 applies to you as well.

Additionally, you both might want to verify you are actually seeing this problem, and not something else.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Bionic) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Bionic):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Cosmic) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Cosmic):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Xenial) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Xenial):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired