Intel i40e PF reset due to incorrect MDD detection (continues...)

Bug #1723127 reported by Dan Streetman
This bug affects 9 people
Affects         Status        Importance  Assigned to    Milestone
linux (Ubuntu)  Fix Released  Medium      Dan Streetman
Trusty          Won't Fix     Undecided   Unassigned
Xenial          Fix Released  Medium      Dan Streetman
Artful          Fix Released  Medium      Dan Streetman
Bionic          Fix Released  Medium      Dan Streetman

Bug Description

[impact]

The i40e driver sometimes triggers a "Malicious Driver Detection" (MDD) event in the NIC firmware, and the firmware responds by resetting the NIC. The reset interrupts network traffic at least temporarily, and can cause further problems, e.g. if the interface is part of a bond.

[fix]

The upstream patch adjusts how the driver fragments TX data; the "malicious driver" event detected by the firmware is a result of incorrectly crafted TX fragment descriptors (the firmware places specific, complicated restrictions on them). The patch is from Intel, who suggested it specifically to address this problem. Additionally, I provided a test kernel with the patch to someone who reported the issue to me, and they have been able to run for ~6 weeks so far without reproducing it; previously they could reproduce it in as little as a day, and usually within 2-3 weeks.

[test case]

The bug is unfortunately very difficult to reproduce, but as shown in the comments on this (and the previous) bug, some i40e users have traffic that can consistently reproduce the problem (although it usually takes days, or longer). A reproduction is easy to detect: network traffic will be interrupted and the system logs will contain a message like:

i40e 0000:02:00.1: TX driver issue detected, PF reset issued
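
A minimal sketch of how that check could be scripted, assuming the kernel log is piped in on stdin and contains the message exactly as shown above. The function names count_pf_resets and pf_reset_devices are hypothetical helpers, not part of the driver or of any existing tool.

```shell
# Count PF resets caused by a TX driver issue in a log read from stdin.
count_pf_resets() {
    grep -c 'TX driver issue detected, PF reset issued'
}

# Print the distinct PCI addresses that reported the reset, one per line.
pf_reset_devices() {
    sed -n -E 's/.*i40e ([0-9a-fA-F:.]+): TX driver issue detected, PF reset issued.*/\1/p' | sort -u
}
```

For example, "dmesg | count_pf_resets" would print the number of matching lines in the current kernel ring buffer, and "dmesg | pf_reset_devices" would list which ports were affected.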

[regression potential]

The patch alters how the driver fragments TX data, so a regression would most likely show up as problems with TX traffic and/or additional "malicious driver detection" events.

[original description]

This is a continuation from bug 1713553; a patch was added in that bug to attempt to fix this, and it may have helped reduce the issue but appears not to have fixed it, based on more reports.

The issue is that, when TSO is enabled, the NIC firmware sometimes raises an "MDD event" against the i40e driver, where MDD stands for "Malicious Driver Detection". This is only vaguely defined in the i40e spec, and there is no way to tell what the NIC actually saw that it didn't like. So, the driver can do nothing but print an error message and reset the PF (or VF). Unfortunately, this resets the interface, which interrupts network traffic flow while the PF is resetting.

See bug 1713553 for more details.


Dan Streetman (ddstreet)
Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Dan Streetman (ddstreet)
Dan Streetman (ddstreet) wrote :

continuing conversation from previous (fix released) bug.

@bjozet, it would help a lot if you could test with the hwe 4.10 kernel and let me know if that fails also, or if it seems to be fixed there. If it works, I can review the changes and possibly find something, and/or work with you on a bisect.

Björn Zettergren (bjozet) wrote :

We've been using hwe-edge 4.11 for almost 24 hours without problems. We'll test the regular hwe 4.10 also if you think that narrows the bisect.

Dan Streetman (ddstreet) wrote :

> We'll test the regular hwe 4.10 also if you think that narrows the bisect.

yes please it will help to look just between 4.4 and 4.10. thanks!

Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Dan Streetman (ddstreet)
Björn Zettergren (bjozet) wrote :

As of now, we've been running HWE 4.10 for little more than 16 hours and no problems so far. Previously we'd hit the problem within the hour.

There is however one new log message that we haven't seen before, with either the 1.4.x or the 2.0.x driver. It might be unrelated, though; we can't see any particular performance issues in any of our monitoring/graphs. The message is:

TCP: bond0.5: Driver has suspect GRO implementation, TCP performance may be compromised.

How do we proceed? :-)

Dan Streetman (ddstreet) wrote :

> How do we proceed? :-)

one bug at a time, please. As this NIC's "MDD" behavior doesn't indicate what happened that it disliked, I can't tell if that is related or not to the MDD events, but I suspect not, especially if you have not seen that happen for kernels when you did get MDD events.

since the Ubuntu 4.4.0 isn't an ancestor of the Ubuntu 4.10.0 kernel, to bisect we would need to start at the merge base anyway (mainline 4.4 kernel); and since there are no changes to the i40e driver between mainline 4.10 and Ubuntu 4.10.0, a bisect will be a lot easier if we shift over to the mainline kernel series.

Are you able to test various kernel versions during the bisect process? It may take a while, and it's important to make sure at each step to determine for certain if the kernel is 'good' or 'bad' - an incorrect evaluation at any step leads to an incorrect endpoint.

If you are able to help with a kernel bisect by testing, can you test each of these kernels:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-wily/

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/

I expect the v4.4 to be 'bad' (encounter the MDD event) and 4.10 to be 'good' (no MDD event), based on your evaluation of the Ubuntu kernels based on those versions. If those are good/bad as expected, we can start the bisection between them.

Björn Zettergren (bjozet) wrote :

> one bug at a time, please.

Absolutely! I just mentioned the "GRO implementation" because I wondered if it might have been related. I should have googled up better on it beforehand, that would have enlightened me that it wasn't.

I've tested the v4.4-wily kernel in the first link (4.4.0-040400-generic), and it failed miserably directly after the machine came online. I'm attaching a redacted syslog with relevant messages in it. One thing you'll note is that the i40e driver (1.3.x) complains that the firmware is too new, this might be a problem(?), but there's also a message, just before the "TX driver issue detected":

i40e 0000:02:00.1: FD filter programming failed due to incorrect filter parameters

See the attached file for more details.

We're currently running the second kernel, v4.10 (4.10.0-041000-generic), and it's running fine so far, but the machine has only been up for 30 minutes. I'll let it run for 24 hours and report back tomorrow, or as soon as the status changes, if at all.

Björn Zettergren (bjozet) wrote :

Kernel v4.10 (4.10.0-041000-generic) has been running fine, without any issues, for 24 hours. I'd say it's OK, as you suspected.

Dan Streetman (ddstreet) wrote :

Sorry for the delay.

So we have 2 options on how to continue debugging here:

1. we can try a traditional git bisect. This would involve testing various kernel builds, to try to eventually narrow down the issue to being fixed by a specific commit. It's a long-ish process, depending on how long testing each build takes, and it's critical that verification of 'good' or 'bad' at each step is correct - otherwise the bisect ends at the wrong commit. Each step will involve me building a new kernel, you test with the kernel until it fails or you've tested long enough to be sure that kernel build is 'good'. With hard-to-reproduce problems like this, bisecting can be tough, because if a build doesn't fail for a long time, that doesn't necessarily mean it's "good", it may just not have failed yet, in which case the bisect will end at the wrong commit, which doesn't help with figuring out how to fix anything.

2. Intel has provided me some undocumented commands that will allow controlling what MDD events the nic triggers on. I can provide those instructions, and you can test with each MDD event bit set individually, until the problem reproduces - then we know exactly which MDD source triggered the event, which should help identify what the driver did to cause the MDD event. This way has a much better chance of finding the specific problem, but the downside is you'll need to run undocumented commands with your hardware. I believe there should not be any risk in doing that since the info came from Intel, but I can't personally verify it, as I don't currently have access to this specific NIC.

If you're willing to try #2, I'll add the specific commands/instructions and you can get started testing. Otherwise if you would prefer not to run the undocumented commands, I can start a kernel bisect.

Björn Zettergren (bjozet) wrote :

No worries, we're not in a hurry.

I'd say we go with option #2. Please provide information on how to proceed, and how to undo any changes we test :)

Dan Streetman (ddstreet) wrote :

> I'd say we go with option #2. Please provide information on how to proceed, and how to
> undo any changes we test :)

ok, so first, these instructions may cause the card to hang; the system may need to be rebooted or the driver reloaded. The changes here can be undone by resetting the card, i.e. by rebooting or reloading the driver.

Also please note these instructions are ONLY FOR i40e NICs!

The process here is to clear all the nic's hardware asserts, and then enable each of them one-by-one and try to reproduce the MDD event. That way, when it reproduces, we know exactly which hw assert triggered it.

First, find your nic's pci address, e.g. ethtool -i NIC | grep bus-info

Then (as root) cd to "/sys/kernel/debug/i40e/BUSID" (replace BUSID with your nic's actual pci addr). You should see a "command" file there.
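
A minimal sketch of those two steps combined, assuming debugfs is mounted at /sys/kernel/debug and that "ethtool -i" prints a "bus-info:" line as usual. The function name i40e_debugfs_dir is a hypothetical helper, and eth0 in the usage note is only a placeholder for your interface name.

```shell
# Read `ethtool -i NIC` output on stdin and print the matching
# i40e debugfs directory (the one that should contain "command").
i40e_debugfs_dir() {
    awk '$1 == "bus-info:" { print "/sys/kernel/debug/i40e/" $2 }'
}
```

As root, something like cd "$(ethtool -i eth0 | i40e_debugfs_dir)" should then land in the directory described above.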

Now zero out the registers:

$ echo write 0xe648c 0 > command
$ echo write 0x442f4 0 > command

Then, set a single bit; starting with 0x1 on the first register:

$ echo write 0xe648c 0x1 > command

Do normal testing. There are 3 possibilities at this step:

a) you test long enough to be sure the problem was avoided
b) your system and/or nic hangs due to an "uncaught" MDD event
c) you reproduce the problem, and see the TX error and PF reset

For either (a) or (b), that means this bit isn't the one we're looking for, so move to the next bit:

$ echo write 0xe648c 0 > command
$ echo write 0x442f4 0 > command
$ echo write 0xe648c 0x2 > command

Then retest. Replace "0x2" with incrementing bits, as you test each bit. Note this is setting individual bits, so the sequence to test is (in hex) 1, 2, 4, 8, 10, 20, 40, 80, 100, etc. This is a 32 bit register so the highest bit to test is 0x80000000. If you test all bits in register 0xe648c without reproducing the problem, then move on to register 0x442f4 testing bit-by-bit again starting at 0x1 again. You should be able to reproduce the problem with one of the bits set in one of these two registers, according to what I've been told by Intel.
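
To avoid typos in the hex values, the sequence above can be generated rather than typed by hand. This is only a sketch: the function names mdd_step and mdd_plan are hypothetical, and it just prints the command strings (a dry run) without touching the hardware.

```shell
# The two registers named in the instructions above.
REG_A=0xe648c
REG_B=0x442f4

# Print the three debugfs commands for one test step.
# $1 = register under test, $2 = bit index (0..31)
mdd_step() {
    printf 'write %s 0\n' "$REG_A"
    printf 'write %s 0\n' "$REG_B"
    printf 'write %s 0x%x\n' "$1" $((1 << $2))
}

# Print the whole plan: every bit of the first register, then every
# bit of the second, each step preceded by a comment line.
mdd_plan() {
    for reg in "$REG_A" "$REG_B"; do
        bit=0
        while [ "$bit" -lt 32 ]; do
            printf '# step: register %s, bit %d\n' "$reg" "$bit"
            mdd_step "$reg" "$bit"
            bit=$((bit + 1))
        done
    done
}
```

To apply a single step, each printed line would be echoed into the "command" file one at a time, for example: mdd_step 0xe648c 1 | while read -r line; do echo "$line" > command; done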

As you set each bit, you should get output in your dmesg and/or syslog or kern.log, indicating the current value of the registers, e.g.:

write: 0xe648c = 0x1

You can also manually read the registers at any time with:

$ echo read 0xe648c > command
$ echo read 0x442f4 > command

you should see the results in dmesg/logs, e.g.:

read: 0xe648c = 0x1
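
A small sketch for pulling those register values back out of the log, assuming the "read:" and "write:" lines appear in the format shown above, possibly preceded by a timestamp and device prefix (the prefix is an assumption). The function name reg_values is a hypothetical helper.

```shell
# Read kernel log text on stdin and print "REGISTER VALUE" for every
# read:/write: line, in the order the lines were logged.
reg_values() {
    sed -n -E 's/.*(read|write): (0x[0-9a-fA-F]+) = (0x[0-9a-fA-F]+).*/\2 \3/p'
}
```

For example, "dmesg | reg_values | tail -n 2" would show the two most recently logged register values.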

Once/if you do reproduce the problem, make note of the values for both registers (i.e. what bit was set), and report that back here. I'll check with Intel to find what the specific bit indicates the problem was.

Thanks!

Björn Zettergren (bjozet) wrote :

Sorry for the delay, I've not forgotten about this, just been swamped with other things. Will hopefully have time to do the tests next week.

Stefan Kooman (stefan-n1) wrote :

Hi there. I can confirm this problem still exists in newest kernels and with the latest intel drivers as of today:

Jan 19 16:05:19 osd9 kernel: [511271.581413] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
Jan 19 16:09:08 osd9 kernel: [511500.919380] i40e 0000:02:00.0: TX driver issue detected, PF reset issued

driver: i40e-2.4.3 (and xenial / 4.13 shipped driver: 2.1.14-k)
kernel: 4.13.0-25-generic #29~16.04.2-Ubuntu SMP Tue Jan 9 12:16:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux. Kernel loaded with nopti noibrs noibpb (Meltdown / Spectre mitigation disabled).

We can trigger the issue with high load (benchmarking Ceph cluster with fio: 4 clients, 8 threads, iodepth 256, 100% random write, 64K block size).

Only when we use relatively large block size (64K) do we hit this problem. With 4K blocks we do not hit this issue. We haven't tested large random reads (that test is still to be done).

When using openvswitch port-channel (as we do) with jumbo frames ... this port-channel will not come back online after the reset. rmmod i40e / modprobe i40e does the trick though.

Dan Streetman (ddstreet) wrote :

Hello,

can anyone still experiencing this on the 4.4 kernel please test with the kernel from this PPA:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1723127
Test kernel version is 4.4.0-112.135+hf1723127v20180206b2

If anyone would like to test with the 4.13 kernel please let me know and I can build it with the recent upstream patch (248de22e638f10bd5bfc7624a357f940f66ba137) that may finally fix this.

Dan Streetman (ddstreet) wrote :

As mentioned, upstream commit 248de22e638f10bd5bfc7624a357f940f66ba137 ("i40e/i40evf: Account for frags split over multiple descriptors in check linearize") appears to finally fix this. This commit is already included in bionic, but is required in artful and earlier.

In xenial, the commit 5c4654daf2e2f25dfbd7fa572c59937ea6d4198b ("i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K") is also required.

Changed in linux (Ubuntu Artful):
assignee: nobody → Dan Streetman (ddstreet)
importance: Undecided → Medium
status: New → Incomplete
status: Incomplete → In Progress
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Released
Changed in linux (Ubuntu Trusty):
status: New → Won't Fix
Dan Streetman (ddstreet)
description: updated
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-artful
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Dan Streetman (ddstreet) wrote :

Due to the nature of this bug, being very difficult to reproduce, real verification could take weeks instead of only days. However, one reporter has been running with a test kernel I built here
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1723127

which is the base 4.4.0-112 kernel plus the two patches from this bug. In their testing, running on 6 weeks now, the problem has not reproduced and they have seen no other issues. Of course, that test kernel doesn't have all the other patches that the -proposed kernel has, but that testing is likely the best verification we can get for this particular bug. I have also asked the same reporter to switch their testing from my test kernel over to the -proposed kernel, and to report any unexpected issues they see. If they do report any regression, I'll communicate that here.

Based on that justification, I'll mark this bug as verified.

tags: added: verification-done-artful verification-done-xenial
removed: verification-needed-artful verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-121.145

---------------
linux (4.4.0-121.145) xenial; urgency=medium

  * linux: 4.4.0-121.145 -proposed tracker (LP: #1763687)

  * Ubuntu-4.4.0-120.144 fails to boot on arm64* hardware (LP: #1763644)
    - [Config] arm64: disable BPF_JIT_ALWAYS_ON

linux (4.4.0-120.144) xenial; urgency=medium

  * linux: 4.4.0-120.144 -proposed tracker (LP: #1761438)

  * intel-microcode 3.20180312.0 causes lockup at login screen(w/ linux-
    image-4.13.0-37-generic) (LP: #1759920) // CVE-2017-5715 (Spectre v2 Intel)
    - Revert "x86/mm: Only set IBPB when the new thread cannot ptrace current
      thread"
    - x86/speculation: Use Indirect Branch Prediction Barrier in context switch

  * DKMS driver builds fail with: Cannot use CONFIG_STACK_VALIDATION=y, please
    install libelf-dev, libelf-devel or elfutils-libelf-devel (LP: #1760876)
    - [Packaging] include the retpoline extractor in the headers

  * retpoline hints: primary infrastructure and initial hints (LP: #1758856)
    - [Packaging] retpoline-extract: flag *0xNNN(%reg) branches
    - x86/speculation, objtool: Annotate indirect calls/jumps for objtool
    - x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32bit
    - x86/paravirt, objtool: Annotate indirect calls
    - x86/asm: Stop depending on ptrace.h in alternative.h
    - [Packaging] retpoline -- add safe usage hint support
    - [Packaging] retpoline-check -- only report additions
    - [Packaging] retpoline -- widen indirect call/jmp detection
    - [Packaging] retpoline -- elide %rip relative indirections
    - [Packaging] retpoline -- clear hint information from packages
    - SAUCE: modpost: add discard to non-allocatable whitelist
    - KVM: x86: Make indirect calls in emulator speculation safe
    - KVM: VMX: Make indirect call speculation safe
    - x86/boot, objtool: Annotate indirect jump in secondary_startup_64()
    - SAUCE: early/late -- annotate indirect calls in early/late initialisation
      code
    - SAUCE: vga_set_mode -- avoid jump tables
    - [Config] retpoline -- switch to new format
    - [Packaging] final-checks -- remove check for empty retpoline files

  * Xenial update to 4.4.117 stable release (LP: #1756860)
    - IB/mlx4: Fix incorrectly releasing steerable UD QPs when have only ETH ports
    - PM / devfreq: Propagate error from devfreq_add_device()
    - s390: fix handling of -1 in set{,fs}[gu]id16 syscalls
    - ARM: dts: STi: Add gpio polarity for "hdmi,hpd-gpio" property
    - arm: spear600: Add missing interrupt-parent of rtc
    - arm: spear13xx: Fix dmas cells
    - arm: spear13xx: Fix spics gpio controller's warning
    - ALSA: seq: Fix regression by incorrect ioctl_mutex usages
    - KVM/x86: Reduce retpoline performance impact in slot_handle_level_range(),
      by always inlining iterator helper methods
    - x86/cpu: Change type of x86_cache_size variable to unsigned int
    - drm/radeon: adjust tested variable
    - rtc-opal: Fix handling of firmware error codes, prevent busy loops
    - ext4: save error to disk in __ext4_grp_locked_error()
    - ext4: correct documentation for grpid mount option
    - mm: hide a #warning fo...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-39.44

---------------
linux (4.13.0-39.44) artful; urgency=medium

  * linux: 4.13.0-39.44 -proposed tracker (LP: #1761456)

  * intel-microcode 3.20180312.0 causes lockup at login screen(w/ linux-
    image-4.13.0-37-generic) (LP: #1759920) // CVE-2017-5715 (Spectre v2
    Intel) // CVE-2017-5754
    - x86/mm: Reinitialize TLB state on hotplug and resume

  * intel-microcode 3.20180312.0 causes lockup at login screen(w/ linux-
    image-4.13.0-37-generic) (LP: #1759920) // CVE-2017-5715 (Spectre v2 Intel)
    - Revert "x86/mm: Only set IBPB when the new thread cannot ptrace current
      thread"
    - x86/speculation: Use Indirect Branch Prediction Barrier in context switch

  * DKMS driver builds fail with: Cannot use CONFIG_STACK_VALIDATION=y, please
    install libelf-dev, libelf-devel or elfutils-libelf-devel (LP: #1760876)
    - [Packaging] include the retpoline extractor in the headers

  * retpoline hints: primary infrastructure and initial hints (LP: #1758856)
    - [Packaging] retpoline-extract: flag *0xNNN(%reg) branches
    - x86/speculation, objtool: Annotate indirect calls/jumps for objtool
    - x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32bit
    - x86/paravirt, objtool: Annotate indirect calls
    - [Packaging] retpoline -- add safe usage hint support
    - [Packaging] retpoline-check -- only report additions
    - [Packaging] retpoline -- widen indirect call/jmp detection
    - [Packaging] retpoline -- elide %rip relative indirections
    - [Packaging] retpoline -- clear hint information from packages
    - KVM: x86: Make indirect calls in emulator speculation safe
    - KVM: VMX: Make indirect call speculation safe
    - x86/boot, objtool: Annotate indirect jump in secondary_startup_64()
    - SAUCE: early/late -- annotate indirect calls in early/late initialisation
      code
    - SAUCE: vga_set_mode -- avoid jump tables
    - [Config] retpoline -- switch to new format
    - [Packaging] retpoline hints -- handle missing files when RETPOLINE not
      enabled
    - [Packaging] final-checks -- remove check for empty retpoline files

  * retpoline: ignore %cs:0xNNN constant indirections (LP: #1752655)
    - [Packaging] retpoline -- elide %cs:0xNNNN constants on i386

  * zfs system process hung on container stop/delete (LP: #1754584)
    - SAUCE: Fix non-prefaulted page deadlock (LP: #1754584)

  * zfs-linux 0.6.5.11-1ubuntu5 ADT test failure with linux 4.15.0-1.2
    (LP: #1737761)
    - SAUCE: (noup) Update zfs to 0.6.5.11-1ubuntu3.2

  * AT_BASE_PLATFORM in AUXV is absent on kernels available on Ubuntu 17.10
    (LP: #1759312)
    - powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

  * btrfs and tar sparse truncate archives (LP: #1757565)
    - Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
    - Btrfs: fix reported number of inode blocks after buffered append writes

  * efifb broken on ThunderX-based Gigabyte nodes (LP: #1758375)
    - drivers/fbdev/efifb: Allow BAR to be moved instead of claiming it

  * Intel i40e PF reset due to incorrect MDD detection (continues...)
    (LP: #1723127)
    - i40e/i40ev...


Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
haosdent (haosdent) wrote :

Hi, we still encounter this error in the latest 4.4.0 kernel. Our kernel version is

```
$ uname -r
4.4.0-122

$ dpkg -l|grep linux-image-4.4.0-122-generic
ii linux-image-4.4.0-122-generic 4.4.0-122.146 amd64 Linux kernel image for version 4.4.0 on 64 bit x86 SMP
```

Dan Streetman (ddstreet) wrote :

> Hi, we still encounter this error in the latest 4.4.0 kernel.

yes, unfortunately, the last patch seems to have helped reduce the frequency, but I did recently get another report of it happening again. So it seems to not be completely fixed.

To clarify, as I said before, this is an event generated by the i40e nic firmware, and is entirely undocumented, and the firmware event provides no (useful/documented) information about what exactly happened that it didn't like. So, there is literally nothing that I, or any non-Intel person, can do to fix this. The only possible way this can be fixed is to let Intel know (which I have done) and hope they can either point me to another upstream patch that we have not yet backported, or in the case that it's still not fixed upstream (which is possible), provide a new upstream patch to fix it. Or, new firmware, of course.

At this point, please don't add any more comments to this bug, since an upstream commit was backported and released for this bug (Intel pointed me to the upstream commit).

I have opened a new bug to continue this, bug 1772675. Please add new comments to that bug.
