thunderx nics fail to establish link

Bug #1597867 reported by dann frazier on 2016-06-30
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
High
dann frazier
Xenial
High
dann frazier

Bug Description

[Impact]
When connected to certain switches, Cavium ThunderX nodes will occasionally fail to establish a link. On one setup (using a Cisco 10G switch), we're seeing a failure rate of about 20%.

Manually reloading the driver modules - or rebooting - is required to recover.

A fix is now available in linux-next that greatly reduces (though it doesn't 100% eliminate) the frequency of occurrences (2% vs. 20%, in my case). Investigation continues to identify a resolution for the remaining cases.

[Test Case]
Connect Cavium ThunderX systems to a known bad switch, and put them in a reboot loop. In my test, I use the maas cli to release/acquire/deploy systems, and wait for a node to enter the deployment failure state.

[Regression Risk]
The fix is upstream, and internal to a specific driver only used on Cavium ThunderX systems.

dann frazier (dannf) on 2016-06-30
Changed in linux (Ubuntu Xenial):
status: New → In Progress
assignee: nobody → dann frazier (dannf)
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :
Download full text (14.6 KiB)

This bug was fixed in the package linux - 4.4.0-33.52

---------------
linux (4.4.0-33.52) xenial; urgency=low

  [ Seth Forshee ]

  * Release Tracking Bug
    - LP: #1605709

  * [regression] NFS client: access problems after updating to kernel
    4.4.0-31-generic (LP: #1603719)
    - SAUCE: (namespace) Bypass sget() capability check for nfs

linux (4.4.0-32.51) xenial; urgency=low

  [ Seth Forshee ]

  * Release Tracking Bug
    - LP: #1604443

  * thinkpad yoga 260 wacom touchscreen not working (LP: #1603975)
    - HID: wacom: break out parsing of device and registering of input
    - HID: wacom: Initialize hid_data.inputmode to -1
    - HID: wacom: Support switching from vendor-defined device mode on G9 and G11

  * changelog: add CVEs as first class citizens (LP: #1604344)
    - use CVE numbers in changelog

  * [Xenial] Include Huawei PCIe SSD hio kernel driver (LP: #1603483)
    - SAUCE: import Huawei ES3000_V2 (2.1.0.23)
    - SAUCE: hio: bio_endio() no longer takes errors arg
    - SAUCE: hio: blk_queue make_request_fn now returns a blk_qc_t
    - SAUCE: hio: use alloc_cpumask_var to avoid -Wframe-larger-than
    - SAUCE: hio: fix mask maybe-uninitialized warning
    - [config] enable CONFIG_HIO (Huawei ES3000_V2 PCIe SSD driver)
    - SAUCE: hio: Makefile and Kconfig

  * CVE-2016-5243 (LP: #1589036)
    - tipc: fix an infoleak in tipc_nl_compat_link_dump
    - tipc: fix nl compat regression for link statistics

  * CVE-2016-4470
    - KEYS: potential uninitialized variable

  * integer overflow in xt_alloc_table_info (LP: #1555353)
    - netfilter: x_tables: check for size overflow

  * CVE-2016-3135:
    - Revert "UBUNTU: SAUCE: (noup) netfilter: x_tables: check for size overflow"

  * CVE-2016-4440 (LP: #1584192)
    - kvm:vmx: more complete state update on APICv on/off

  * the system hangs in the dma driver when reboot or shutdown on a baytrail-m
    laptop (LP: #1602579)
    - dmaengine: dw: platform: power on device on shutdown
    - ACPI / LPSS: override power state for LPSS DMA device

  * Add proper palm detection support for MS Precision Touchpad (LP: #1593124)
    - Revert "HID: multitouch: enable palm rejection if device implements
      confidence usage"
    - HID: multitouch: enable palm rejection for Windows Precision Touchpad

  * Add support for Intel 8265 Bluetooth ([8087:0A2B]) (LP: #1599068)
    - Bluetooth: Add support for Intel Bluetooth device 8265 [8087:0a2b]

  * CVE-2016-4794 (LP: #1581871)
    - percpu: fix synchronization between chunk->map_extend_work and chunk
      destruction
    - percpu: fix synchronization between synchronous map extension and chunk
      destruction

  * Xenial update to v4.4.15 stable release (LP: #1601952)
    - net_sched: fix pfifo_head_drop behavior vs backlog
    - net: Don't forget pr_fmt on net_dbg_ratelimited for CONFIG_DYNAMIC_DEBUG
    - sit: correct IP protocol used in ipip6_err
    - esp: Fix ESN generation under UDP encapsulation
    - netem: fix a use after free
    - ipmr/ip6mr: Initialize the last assert time of mfc entries.
    - Bridge: Fix ipv6 mc snooping if bridge has no ipv6 address
    - sock_diag: do not broadcast raw socket destruction
    - bpf, perf...

Changed in linux (Ubuntu):
status: In Progress → Fix Released
Seth Forshee (sforshee) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Michael Reed (mreed8855) on 2016-08-01
tags: added: verification-xenial-completed
removed: verification-needed-xenial
tags: added: verified-xenial
removed: verification-xenial-completed
Michael Reed (mreed8855) on 2016-08-01
tags: added: verified-xenial-done
removed: verified-xenial
Michael Reed (mreed8855) on 2016-08-01
tags: added: verification-done-xenial
removed: verified-xenial-done
Launchpad Janitor (janitor) wrote :
Download full text (15.0 KiB)

This bug was fixed in the package linux - 4.4.0-34.53

---------------
linux (4.4.0-34.53) xenial; urgency=low

  [ Seth Forshee ]

  * Release Tracking Bug
    - LP: #1606960

  * [APL][SAUCE] Slow system response time due to a monitor bug (LP: #1606147)
    - x86/cpu/intel: Introduce macros for Intel family numbers
    - SAUCE: x86/cpu: Add workaround for MONITOR instruction erratum on Goldmont
      based CPUs

linux (4.4.0-33.52) xenial; urgency=low

  [ Seth Forshee ]

  * Release Tracking Bug
    - LP: #1605709

  * [regression] NFS client: access problems after updating to kernel
    4.4.0-31-generic (LP: #1603719)
    - SAUCE: (namespace) Bypass sget() capability check for nfs

linux (4.4.0-32.51) xenial; urgency=low

  [ Seth Forshee ]

  * Release Tracking Bug
    - LP: #1604443

  * thinkpad yoga 260 wacom touchscreen not working (LP: #1603975)
    - HID: wacom: break out parsing of device and registering of input
    - HID: wacom: Initialize hid_data.inputmode to -1
    - HID: wacom: Support switching from vendor-defined device mode on G9 and G11

  * changelog: add CVEs as first class citizens (LP: #1604344)
    - use CVE numbers in changelog

  * [Xenial] Include Huawei PCIe SSD hio kernel driver (LP: #1603483)
    - SAUCE: import Huawei ES3000_V2 (2.1.0.23)
    - SAUCE: hio: bio_endio() no longer takes errors arg
    - SAUCE: hio: blk_queue make_request_fn now returns a blk_qc_t
    - SAUCE: hio: use alloc_cpumask_var to avoid -Wframe-larger-than
    - SAUCE: hio: fix mask maybe-uninitialized warning
    - [config] enable CONFIG_HIO (Huawei ES3000_V2 PCIe SSD driver)
    - SAUCE: hio: Makefile and Kconfig

  * CVE-2016-5243 (LP: #1589036)
    - tipc: fix an infoleak in tipc_nl_compat_link_dump
    - tipc: fix nl compat regression for link statistics

  * CVE-2016-4470
    - KEYS: potential uninitialized variable

  * integer overflow in xt_alloc_table_info (LP: #1555353)
    - netfilter: x_tables: check for size overflow

  * CVE-2016-3135:
    - Revert "UBUNTU: SAUCE: (noup) netfilter: x_tables: check for size overflow"

  * CVE-2016-4440 (LP: #1584192)
    - kvm:vmx: more complete state update on APICv on/off

  * the system hangs in the dma driver when reboot or shutdown on a baytrail-m
    laptop (LP: #1602579)
    - dmaengine: dw: platform: power on device on shutdown
    - ACPI / LPSS: override power state for LPSS DMA device

  * Add proper palm detection support for MS Precision Touchpad (LP: #1593124)
    - Revert "HID: multitouch: enable palm rejection if device implements
      confidence usage"
    - HID: multitouch: enable palm rejection for Windows Precision Touchpad

  * Add support for Intel 8265 Bluetooth ([8087:0A2B]) (LP: #1599068)
    - Bluetooth: Add support for Intel Bluetooth device 8265 [8087:0a2b]

  * CVE-2016-4794 (LP: #1581871)
    - percpu: fix synchronization between chunk->map_extend_work and chunk
      destruction
    - percpu: fix synchronization between synchronous map extension and chunk
      destruction

  * Xenial update to v4.4.15 stable release (LP: #1601952)
    - net_sched: fix pfifo_head_drop behavior vs backlog
    - net: Don't forget pr_fmt on net_dbg_ratelimited for CONFIG_DYNAMIC...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers