Bug #1713553 "Intel i40e PF reset due to incorrect MDD detection" (linux package, Ubuntu)

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2017-08-28: Missing required logs.

#1

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1713553

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Dan Streetman (ddstreet) on 2017-08-28

Changed in linux (Ubuntu):
status:	Incomplete → In Progress
importance:	Undecided → Medium
assignee:	nobody → Dan Streetman (ddstreet)

Dan Streetman (ddstreet) wrote on 2017-08-28:

#2

Note that there is one additional upstream commit that improves performance by allowing up to 12K per Tx descriptor instead of 8K (the current limit in the Xenial 4.4 kernel), and its changes are related to the fixes for this issue. However, from my reading of the code, I don't think that commit is actually required to fix this problem, so I am not including it in this bug (yet).

commit 5c4654daf2e2f25dfbd7fa572c59937ea6d4198b
Author: Alexander Duyck <email address hidden>
Date: Fri Feb 19 12:17:08 2016 -0800

i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K
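
To make the 8K versus 12K per-descriptor limit mentioned above concrete, here is a minimal Python sketch of how many Tx descriptors one buffer would need under each limit. It is only ceiling division, under the simplifying assumption that every descriptor can carry up to the stated limit. It is not the driver's actual calculation, and the function name and sample sizes are hypothetical.

```python
KIB = 1024


def descriptors_needed(size_bytes: int, max_per_descriptor: int) -> int:
    """Return how many descriptors are needed to cover size_bytes.

    Hypothetical helper: plain ceiling division, not the i40e driver's code.
    """
    if size_bytes <= 0:
        return 0
    # -(-a // b) is integer ceiling division for positive b.
    return -(-size_bytes // max_per_descriptor)


if __name__ == "__main__":
    # Sample buffer sizes chosen for illustration only.
    for size in (4 * KIB, 16 * KIB, 32 * KIB, 64 * KIB):
        at_8k = descriptors_needed(size, 8 * KIB)
        at_12k = descriptors_needed(size, 12 * KIB)
        print(f"{size // KIB} KiB buffer: {at_8k} descriptors at 8K, {at_12k} at 12K")
```

A larger per-descriptor limit means fewer descriptors per buffer, which is why that commit is described as a performance improvement rather than as part of the fix.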

Dan Streetman (ddstreet) wrote on 2017-09-13:

#3

Re: my last comment, testing confirmed that commit 5c4654 is *not* needed to fix this bug, so I am not including it. Only commit 841493a3 as listed in the bug description is required to fix this.

Stefan Bader (smb) on 2017-09-15

Changed in linux (Ubuntu Xenial):
status:	New → Fix Committed

Kleber Sacilotto de Souza (kleber-souza) wrote on 2017-09-25:

#4

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags:

added: verification-needed-xenial

Dan Streetman (ddstreet) wrote on 2017-09-26:

#5

The original reporter verified to me that with the patch the problem did not recur for several days, whereas previously they could reproduce it within a day. Unfortunately, as this problem is hard to reproduce, that is the best verification I can currently provide.

tags:

added: verification-done-xenial
removed: verification-needed-xenial

Launchpad Janitor (janitor) wrote on 2017-10-10:

#6

This bug was fixed in the package linux - 4.4.0-97.120

---------------
linux (4.4.0-97.120) xenial; urgency=low

* linux: 4.4.0-97.120 -proposed tracker (LP: #1718149)

* blk-mq: possible deadlock on CPU hot(un)plug (LP: #1670634)
    - [Config] s390x -- disable CONFIG_{DM, SCSI}_MQ_DEFAULT

* Xenial update to 4.4.87 stable release (LP: #1715678)
    - irqchip: mips-gic: SYNC after enabling GIC region
    - i2c: ismt: Don't duplicate the receive length for block reads
    - i2c: ismt: Return EMSGSIZE for block reads with bogus length
    - ceph: fix readpage from fscache
    - cpumask: fix spurious cpumask_of_node() on non-NUMA multi-node configs
    - cpuset: Fix incorrect memory_pressure control file mapping
    - alpha: uapi: Add support for __SANE_USERSPACE_TYPES__
    - CIFS: remove endian related sparse warning
    - wl1251: add a missing spin_lock_init()
    - xfrm: policy: check policy direction value
    - drm/ttm: Fix accounting error when fail to get pages for pool
    - kvm: arm/arm64: Fix race in resetting stage2 PGD
    - kvm: arm/arm64: Force reading uncached stage2 PGD
    - epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()
    - crypto: algif_skcipher - only call put_page on referenced and used pages
    - Linux 4.4.87

* Xenial update to 4.4.86 stable release (LP: #1715430)
    - scsi: isci: avoid array subscript warning
    - ALSA: au88x0: Fix zero clear of stream->resources
    - btrfs: remove duplicate const specifier
    - i2c: jz4780: drop superfluous init
    - gcov: add support for gcc version >= 6
    - gcov: support GCC 7.1
    - lightnvm: initialize ppa_addr in dev_to_generic_addr()
    - p54: memset(0) whole array
    - lpfc: Fix Device discovery failures during switch reboot test.
    - arm64: mm: abort uaccess retries upon fatal signal
    - x86/io: Add "memory" clobber to insb/insw/insl/outsb/outsw/outsl
    - arm64: fpsimd: Prevent registers leaking across exec
    - scsi: sg: protect accesses to 'reserved' page array
    - scsi: sg: reset 'res_in_use' after unlinking reserved array
    - drm/i915: fix compiler warning in drivers/gpu/drm/i915/intel_uncore.c
    - Linux 4.4.86

* Xenial update to 4.4.85 stable release (LP: #1714298)
    - af_key: do not use GFP_KERNEL in atomic contexts
    - dccp: purge write queue in dccp_destroy_sock()
    - dccp: defer ccid_hc_tx_delete() at dismantle time
    - ipv4: fix NULL dereference in free_fib_info_rcu()
    - net_sched/sfq: update hierarchical backlog when drop packet
    - ipv4: better IP_MAX_MTU enforcement
    - sctp: fully initialize the IPv6 address in sctp_v6_to_addr()
    - tipc: fix use-after-free
    - ipv6: reset fn->rr_ptr when replacing route
    - ipv6: repair fib6 tree in failure case
    - tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP
    - irda: do not leak initialized list.dev to userspace
    - net: sched: fix NULL pointer dereference when action calls some targets
    - net_sched: fix order of queue length updates in qdisc_replace()
    - mei: me: add broxton pci device ids
    - mei: me: add lewisburg device ids
    - Input: trackpoint - add new trackpoint firmware ID
    - Input: elan_i2c - add ELAN0602 ACPI ID to support Lenovo Yoga310
    - ALSA: core: Fix unexpected error at replacing user TLV
    - ALSA: hda - Add stereo mic quirk for Lenovo G50-70 (17aa:3978)
    - ARCv2: PAE40: Explicitly set MSB counterpart of SLC region ops addresses
    - i2c: designware: Fix system suspend
    - drm: Release driver tracking before making the object available again
    - drm/atomic: If the atomic check fails, return its value first
    - drm: rcar-du: lvds: Fix PLL frequency-related configuration
    - drm: rcar-du: lvds: Rename PLLEN bit to PLLON
    - drm: rcar-du: Fix crash in encoder failure error path
    - drm: rcar-du: Fix display timing controller parameter
    - drm: rcar-du: Fix H/V sync signal polarity configuration
    - tracing: Fix freeing of filter in create_filter() when set_str is false
    - cifs: Fix df output for users with quota limits
    - cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()
    - nfsd: Limit end of page list when decoding NFSv4 WRITE
    - perf/core: Fix group {cpu,task} validation
    - Bluetooth: hidp: fix possible might sleep error in hidp_session_thread
    - Bluetooth: cmtp: fix possible might sleep error in cmtp_session
    - Bluetooth: bnep: fix possible might sleep error in bnep_session
    - binder: use group leader instead of open thread
    - binder: Use wake up hint for synchronous transactions.
    - ANDROID: binder: fix proc->tsk check.
    - iio: imu: adis16480: Fix acceleration scale factor for adis16480
    - iio: hid-sensor-trigger: Fix the race with user space powering up sensors
    - staging: rtl8188eu: add RNX-N150NUB support
    - ASoC: simple-card: don't fail if sysclk setting is not supported
    - ASoC: rsnd: disable SRC.out only when stop timing
    - ASoC: rsnd: avoid pointless loop in rsnd_mod_interrupt()
    - ASoC: rsnd: Add missing initialization of ADG req_rate
    - ASoC: rsnd: ssi: 24bit data needs right-aligned settings
    - ASoC: rsnd: don't call update callback if it was NULL
    - ntb_transport: fix qp count bug
    - ntb_transport: fix bug calculating num_qps_mw
    - ACPI: ioapic: Clear on-stack resource before using it
    - ACPI / APEI: Add missing synchronize_rcu() on NOTIFY_SCI removal
    - Linux 4.4.85

* Xenial update to 4.4.84 stable release (LP: #1713729)
    - audit: Fix use after free in audit_remove_watch_rule()
    - parisc: pci memory bar assignment fails with 64bit kernels on dino/cujo
    - crypto: x86/sha1 - Fix reads beyond the number of blocks passed
    - Input: elan_i2c - Add antoher Lenovo ACPI ID for upcoming Lenovo NB
    - ALSA: seq: 2nd attempt at fixing race creating a queue
    - Revert "UBUNTU: SAUCE: (no-up) ALSA: usb-audio: Add quirk for sennheiser
      officerunner"
    - ALSA: usb-audio: Apply sample rate quirk to Sennheiser headset
    - ALSA: usb-audio: Add mute TLV for playback volumes on C-Media devices
    - mm/mempolicy: fix use after free when calling get_mempolicy
    - xen: fix bio vec merging
    - x86/asm/64: Clear AC on NMI entries
    - irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
    - irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup()
    - Sanitize 'move_pages()' permission checks
    - pids: make task_tgid_nr_ns() safe
    - perf/x86: Fix LBR related crashes on Intel Atom
    - usb: optimize acpi companion search for usb port devices
    - usb: qmi_wwan: add D-Link DWM-222 device ID
    - Linux 4.4.84

* Intel i40e PF reset due to incorrect MDD detection (LP: #1713553)
    - i40e: Limit TX descriptor count in cases where frag size is greater than 16K

* Neighbour confirmation broken, breaks ARP cache aging (LP: #1715812)
    - sock: add sk_dst_pending_confirm flag
    - net: add dst_pending_confirm flag to skbuff
    - sctp: add dst_pending_confirm flag
    - tcp: replace dst_confirm with sk_dst_confirm
    - net: add confirm_neigh method to dst_ops
    - net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP
    - net: pending_confirm is not used anymore

* CVE-2017-14106
    - tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0

* [CIFS] Fix maximum SMB2 header size (LP: #1713884)
    - CIFS: Fix maximum SMB2 header size

* Middle button of trackpoint doesn't work (LP: #1715271)
    - Input: trackpoint - assume 3 buttons when buttons detection fails

* kernel BUG at /build/linux-lts-xenial-_hWfOZ/linux-lts-
    xenial-4.4.0/security/apparmor/include/context.h:69! (LP: #1626984)
    - SAUCE: fix oops when disabled and module parameters, are accessed

* Touchpad not detected (LP: #1708852)
    - Input: elan_i2c - add ELAN0608 to the ACPI table

-- Kleber Sacilotto de Souza <kleber.souza@canonical.com>  Tue, 19 Sep 2017 17:55:11 +0200

Changed in linux (Ubuntu Xenial):
status:	Fix Committed → Fix Released

Björn Zettergren (bjozet) wrote on 2017-10-11:

#7

Hi,

Thanks for your efforts with this issue, however we're still experiencing problems with the newest kernel. Sorry about missing the patch-testing-window, we should have been there for you :)

After only 20 minutes of runtime with the new kernel, we saw the following, and networking is basically useless:

[    2.410644] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.4.25-k
[    2.419791] i40e: Copyright (c) 2013 - 2014 Intel Corporation.
[    2.483362] i40e 0000:02:00.0: fw 5.40.47690 api 1.5 nvm 5.40 0x80002d35 18.0.16
[    2.896678] i40e 0000:02:00.0: MAC address: 3c:fd:fe:1a:b5:e0
[    2.903768] i40e 0000:02:00.0: SAN MAC: 3c:fd:fe:1a:b5:e1
[    3.189818] i40e 0000:02:00.0: PCI-Express: Speed 8.0GT/s Width x4
[    3.193934] i40e 0000:02:00.0: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[    3.202198] i40e 0000:02:00.0: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[    3.241095] i40e 0000:02:00.0: Features: PF-id[0] VFs: 64 VSIs: 2 QP: 4 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[    3.279202] i40e 0000:02:00.1: fw 5.40.47690 api 1.5 nvm 5.40 0x80002d35 18.0.16
[    3.531346] i40e 0000:02:00.1: MAC address: 3c:fd:fe:1a:b5:e2
[    3.539557] i40e 0000:02:00.1: SAN MAC: 3c:fd:fe:1a:b5:e3
[    3.761719] i40e 0000:02:00.1: PCI-Express: Speed 8.0GT/s Width x4
[    3.765721] i40e 0000:02:00.1: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[    3.773539] i40e 0000:02:00.1: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[    3.812022] i40e 0000:02:00.1: Features: PF-id[1] VFs: 64 VSIs: 2 QP: 4 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[    3.855168] i40e 0000:02:00.0 p1p1: renamed from eth2
[    3.895278] i40e 0000:02:00.1 p1p2: renamed from eth0
[    7.205832] i40e 0000:02:00.1 p1p2: already using mac address 3c:fd:fe:1a:b5:e2
[    7.208378] i40e 0000:02:00.1 p1p2: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[    7.208401] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e2 vid=0
[    7.208453] i40e 0000:02:00.0 p1p1: set new mac address 3c:fd:fe:1a:b5:e2
[    7.217191] i40e 0000:02:00.0 p1p1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[    7.217215] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e2 vid=0
[    7.240919] i40e 0000:02:00.1 p1p2: set new mac address 3c:fd:fe:1a:b5:e0
[    7.252720] i40e 0000:02:00.0 p1p1: returning to hw mac address 3c:fd:fe:1a:b5:e0
[    7.324791] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[    7.324798] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1109.574733] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1110.011152] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1110.011155] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1110.013749] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1110.013773] i40e 0000:02:00.1 p1p2: speed changed to 0 for port p1p2
[ 1110.013954] bond0: link status up again after 0 ms for interface p1p2
[ 1110.983823] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1110.983825] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1110.985836] bond0: link status up again after 0 ms for interface p1p2
[ 1111.432231] i40e 0000:02:00.0: TX driver issue detected, PF reset issued
[ 1111.981828] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1111.981835] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1111.984816] i40e 0000:02:00.0: TX driver issue detected, PF reset issued
[ 1111.987007] bond0: link status up again after 0 ms for interface p1p1
[ 1112.981796] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1112.981803] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1112.985812] bond0: link status up again after 0 ms for interface p1p1
[ 1114.204548] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1114.983686] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1114.983688] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1114.985692] bond0: link status up again after 0 ms for interface p1p2
[ 1115.752686] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1116.985619] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1116.985624] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1116.988361] i40e 0000:02:00.1 p1p2: speed changed to 0 for port p1p2
[ 1116.989607] bond0: link status up again after 0 ms for interface p1p2

# uname -a
Linux lb05 4.4.0-97-generic #120-Ubuntu SMP Tue Sep 19 17:28:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# modinfo i40e
filename:       /lib/modules/4.4.0-97-generic/kernel/drivers/net/ethernet/intel/i40e/i40e.ko
version:        1.4.25-k

As a workaround we're using i40e driver v2.0.30 via dkms, which works fine without any issues so far, but it would be nice to have this problem fixed properly :-)

If we're going about this in the wrong way, and our problem is not applicable to this fix, please let us know. We're happy to test new patches if there are any.

We're gonna test the HWE 4.10 kernel mentioned and see how that behaves.
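
For anyone comparing kernels or driver versions, the "PF reset issued" lines in a log like the one above can be tallied per PCI function with a short script. The following is a minimal Python sketch and not part of any Ubuntu or Intel tooling. The function name is hypothetical, and the regular expression assumes the message format shown in this report.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
# [ 1109.574733] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
# A syslog prefix before the bracketed timestamp is tolerated because we search.
RESET_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i40e\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.\d):"
    r"\s+TX driver issue detected, PF reset issued"
)


def count_pf_resets(lines):
    """Return (resets per PCI function, timestamp of first reset per function)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = RESET_RE.search(line)
        if match is None:
            continue
        dev = match.group("dev")
        counts[dev] += 1
        # Remember only the earliest reset timestamp for each function.
        first_seen.setdefault(dev, float(match.group("ts")))
    return counts, first_seen


if __name__ == "__main__":
    # A few lines copied from the log in this comment.
    sample = [
        "[ 1109.574733] i40e 0000:02:00.1: TX driver issue detected, PF reset issued",
        "[ 1110.011152] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0",
        "[ 1110.013749] i40e 0000:02:00.1: TX driver issue detected, PF reset issued",
        "[ 1111.432231] i40e 0000:02:00.0: TX driver issue detected, PF reset issued",
    ]
    counts, first_seen = count_pf_resets(sample)
    for dev in sorted(counts):
        print(f"{dev}: {counts[dev]} resets, first at {first_seen[dev]:.6f}s")
```

In practice the lines would come from saved `dmesg` output or a kernel log file rather than an inline list.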

Dan Streetman (ddstreet) wrote on 2017-10-12:

#8

> however we're still experiencing problems with the newest kernel

Well, I was afraid of that. As this problem is the NIC firmware complaining but not actually telling us what it's unhappy with, there's a bit of trial and error here in figuring out what exactly it's complaining about.

Since this bug is already 'fix released', I opened a new bug 1723127 to track continuing work on this. Let's move the discussion over there.

Stefan Kooman (stefan-n1) wrote on 2018-01-19:

#9

Hi there. I can confirm this problem still exists in the newest kernels and with the latest Intel drivers as of today:

Jan 19 16:05:19 osd9 kernel: [511271.581413] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
Jan 19 16:09:08 osd9 kernel: [511500.919380] i40e 0000:02:00.0: TX driver issue detected, PF reset issued

driver: i40e-2.4.3 (and xenial / 4.13 shipped driver: 2.1.14-k)
kernel: 4.13.0-25-generic #29~16.04.2-Ubuntu SMP Tue Jan 9 12:16:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux. Kernel loaded with nopti noibrs noibpb (Meltdown / Spectre mitigations disabled).

We can trigger the issue with high load (benchmarking Ceph cluster with fio: 4 clients, 8 threads, iodepth 256, 100% random write, 64K block size).

Only when we use relatively large block size (64K) do we hit this problem. With 4K blocks we do not hit this issue. We haven't tested large random reads (that test is still to be done).

When using openvswitch port-channel (as we do) with jumbo frames ... this port-channel will not come back online after the reset. rmmod i40e / modprobe i40e does the trick though.
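
For others trying to reproduce this, the fio parameters described above could be assembled roughly as in the Python sketch below. This is an assumption-laden illustration, not the exact job that was run. The target path, the `libaio` engine, `--direct=1`, the runtime, and the mapping of "8 threads" to `--numjobs=8` are all guesses, and the function name is hypothetical. The sketch only builds and prints the command line, and does not execute fio.

```python
import shlex
from typing import List


def fio_randwrite_argv(target: str, block_size: str = "64k",
                       threads: int = 8, iodepth: int = 256,
                       runtime_s: int = 300) -> List[str]:
    """Build an fio command line for a 100% random-write test.

    WARNING: running the resulting command overwrites data on `target`.
    Engine, direct I/O, and runtime are assumptions, not from the report.
    """
    return [
        "fio",
        "--name=i40e-repro",          # hypothetical job name
        f"--filename={target}",       # hypothetical target device or file
        "--rw=randwrite",             # 100% random write
        f"--bs={block_size}",         # 64K blocks trigger the issue; 4K do not
        f"--iodepth={iodepth}",
        f"--numjobs={threads}",       # assumed mapping of "8 threads"
        "--ioengine=libaio",          # assumption
        "--direct=1",                 # assumption
        "--time_based",
        f"--runtime={runtime_s}",     # assumption
        "--group_reporting",
    ]


if __name__ == "__main__":
    argv = fio_randwrite_argv("/dev/example-test-device")
    # Print the command instead of running it, so nothing is overwritten.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing `block_size="4k"` gives the comparison case that did not trigger the reset.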

Dan Streetman (ddstreet) wrote on 2018-01-19:

#10

@stefan-n1, please move discussion over to bug 1723127, no more comments should be added to this bug.

Dan Streetman (ddstreet) on 2018-05-25

Changed in linux (Ubuntu):
status:	In Progress → Fix Released

Affects          Status         Importance   Assigned to     Milestone
linux (Ubuntu)   Fix Released   Medium       Dan Streetman
Xenial           Fix Released   Undecided    Unassigned
