Occasional crash in APM xgene enet driver on kernels prior to v3.19

Bug #1425576 reported by Craig Magina
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Triaged       Medium      Unassigned
Trusty          Fix Released  Medium      Unassigned
Utopic          Fix Released  Medium      Unassigned

Bug Description

[Impact]
System panics every few hours under constant network load

[Test Case]
On the X-Gene system, start an iperf server:
  $ iperf -s
Connect to it from another system (1Gbps is sufficient)
  $ iperf -t 100000 -c <xgene host>

On failure, you'll get a panic within several hours.

[Regression Risk]
The proposed patch only touches the impacted driver. It adds a memory barrier, albeit a lightweight one, so there is a risk of performance degradation. Indeed, when measured using a 60s iperf run, I do see a performance drop, but of less than 0.5%.
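
For anyone repeating the measurement, the drop quoted above is a relative one: the throughput difference divided by the unpatched throughput. A minimal sketch of that arithmetic follows. The function name and the sample readings are hypothetical, chosen only for illustration; they are not the measurements taken for this bug.

```python
def relative_drop_percent(baseline_mbps: float, patched_mbps: float) -> float:
    """Return the throughput drop of the patched kernel relative to the baseline, in percent.

    A negative result means the patched kernel was faster.
    """
    if baseline_mbps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline_mbps - patched_mbps) / baseline_mbps * 100.0


# Hypothetical 60s iperf readings in Mbits/sec (NOT data from this bug).
baseline = 940.0
patched = 936.0
print(f"relative drop: {relative_drop_percent(baseline, patched):.2f}%")
```

A single 60s run is noisy, so averaging several runs per kernel before comparing is probably worthwhile.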

Revision history for this message
Craig Magina (craig.magina) wrote :

Here is a direct link to the upstream submission of the crash:

http://www.spinics.net/lists/arm-kernel/msg399023.html

The upstream commit that fixes it is:

commit ecf6ba83d76e0c78e89401750dc527008e14faa2
Author: Iyappan Subramanian <isubramanian@xxxxxxx>
Date: Thu Jan 29 14:38:23 2015 -0800
drivers: net: xgene: fix: Out of order descriptor bytes read

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1425576

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Triaged
Changed in linux (Ubuntu Trusty):
status: New → Triaged
Changed in linux (Ubuntu Utopic):
status: New → Triaged
Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
Changed in linux (Ubuntu Utopic):
importance: Undecided → Medium
tags: added: kernel-da-key trusty utopic
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Commit ecf6ba83d wasn't cc'd to stable. I built a Utopic test kernel with a cherry pick of the commit. It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1425576/

Can you test this kernel to see if it resolves the bug? If it does, we can send a request to the upstream stable maintainers to also pick this commit.

Note that the test kernel requires both the linux-image and linux-image-extra .deb packages.

Revision history for this message
Craig Magina (craig.magina) wrote :

I just looked at your packages and noticed they are for amd64; this issue is an xgene (arm64) platform bug. If you give me a pointer to the git tree, I can build the packages myself and test. Thanks.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sorry for building amd64. I now have arm64 packages available with a cherry pick of commit ecf6ba83d. They can be downloaded from the same location:

http://kernel.ubuntu.com/~jsalisbury/lp1425576/

If you also need my git tree, I'll move it to a public spot on zinc (kernel.ubuntu.com). But it's just the Utopic tree with a cherry-pick.

Revision history for this message
Craig Magina (craig.magina) wrote :

One final issue is that the kernel version you used to test this fix contains a bug that prevents it from booting on arm64 systems. The fix is in the utopic-stable branch, to be pulled into the next SRU cycle. The fix was pushed into the latest trusty kernel for a one-off release, 3.13.0-46.76. You can read more about the issue here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1426043.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

That test kernel should have upstream commit: 0b46b8a, which is the following commit in Utopic:

commit e16bd2b7e29c5434505929deda4c6bb4436104d3
Author: Sonny Rao <email address hidden>
Date: Sun Nov 23 23:02:44 2014 -0800

    clocksource: arch_timer: Fix code to use physical timers when requested

git describe --contains e16bd2b
Ubuntu-3.16.0-31.41~254

Commit 0b46b8a was in mainline as of 3.19-rc1. Vivid has since been rebased to 3.19 final, so it will have the fix for the other bug. I'll apply the fix for this bug to a Vivid kernel and post it for testing.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hmm, right: commit ecf6ba83, the fix for this bug, is in mainline as of 3.19 final, so Vivid should already have the fix for both bugs.

Maybe I'll build a Utopic test kernel using the master-next branch.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a Utopic test kernel from the master-next branch, with a cherry-pick of commit ecf6ba83d.

It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1425576/

Can you test this kernel to see if it resolves the bug and does not exhibit bug 1426043 ?

Revision history for this message
Craig Magina (craig.magina) wrote :

I have not been able to find a reliable way to reproduce this issue, but the fix did not cause any issues in my testing.

Revision history for this message
Ming Lei (tom-leiming) wrote :

According to the original report, the issue has been present since 3.17, so I am closing it for trusty, because we can't reproduce it on trusty either.

    I've seen this with mainline since somewhere in v3.17 and on several
    hardware boards stress testing KVM by running workloads in VMs.

Changed in linux (Ubuntu Trusty):
status: Triaged → Invalid
Revision history for this message
dann frazier (dannf) wrote :

I can't confirm that this is 100% the same issue since the backtrace isn't identical, but I was able to hit the following panic on trusty's 3.13 just by running iperf to another system for a few hours:

[ 2613.269450] BUG: Bad page state in process iperf pfn:41d5c57
[ 2613.275172] page:ffffffbce66c3308 count:-1 mapcount:0 mapping: (nul0
[ 2613.283219] page flags: 0x4000000000000000()
[ 2613.288129] skbuff: skb_over_panic: text:ffffffbffc00a3d0 len:2994 put:1514 >
[ 2613.301642] Kernel panic - not syncing: BUG!
[ 2613.305892] CPU: 0 PID: 1499 Comm: iperf Tainted: G B 3.13.0-55-gu
[ 2613.314370] Call trace:
[ 2613.316805] [<ffffffc00008848c>] dump_backtrace+0x0/0x16c
[ 2613.322175] [<ffffffc000088608>] show_stack+0x10/0x1c
[ 2613.327199] [<ffffffc00060c750>] dump_stack+0x74/0x94
[ 2613.332223] [<ffffffc000606984>] panic+0xe0/0x20c
[ 2613.336901] [<ffffffc00060dad4>] skb_panic+0x78/0x7c
[ 2613.341839] [<ffffffc000510b5c>] skb_put+0x88/0x8c
[ 2613.346610] [<ffffffbffc00a3cc>] xgene_enet_process_ring+0xdc/0x388 [xgene_e]
[ 2613.353967] [<ffffffbffc00a798>] xgene_enet_napi+0x1c/0x50 [xgene_enet]
[ 2613.360546] [<ffffffc00051f8c8>] net_rx_action+0x15c/0x240
[ 2613.366002] [<ffffffc0000aae24>] __do_softirq+0x124/0x29c
[ 2613.371372] [<ffffffc0000ab29c>] irq_exit+0x8c/0xc0
[ 2613.376223] [<ffffffc0000850ac>] handle_IRQ+0x5c/0xc8
[ 2613.381248] [<ffffffc000081290>] gic_handle_irq+0x38/0x7c
[ 2613.386618] Exception stack(0xffffffc1ee3db7d0 to 0xffffffc1ee3db8f0)
[ 2613.393025] b7c0: e8204200 ffffffc1 fff81
[ 2613.401159] b7e0: ee3db910 ffffffc1 0060ec5c ffffffc0 00007964 00000000 ee3d1
[ 2613.409293] b800: 00000000 00000000 00000002 00000000 00000000 00000000 00000
[ 2613.417429] b820: 00000001 00000000 fff833d8 ffffffc1 fff833b0 ffffffc1 f0ce1
[ 2613.425563] b840: 00000000 00000000 00000001 00000000 00000000 00000000 00000
[ 2613.433697] b860: 00000000 00000000 000005a8 00000000 00000014 00000000 a4250
[ 2613.441831] b880: 00000000 00000000 e8204200 ffffffc1 fff83340 ffffffc1 e8201
[ 2613.449965] b8a0: ee3d8000 ffffffc1 008ec340 ffffffc0 009d7000 ffffffc0 e8201
[ 2613.458099] b8c0: 008ec340 ffffffc0 ff697000 00000001 008ec340 ffffffc0 ee3d1
[ 2613.466232] b8e0: 0060e6e0 ffffffc0 ee3db910 ffffffc1
[ 2613.471257] [<ffffffc0000845a8>] el1_irq+0x68/0xc0
[ 2613.476023] [<ffffffc00060ed1c>] schedule+0x24/0x78
[ 2613.480875] [<ffffffc00060e040>] schedule_timeout+0x180/0x248
[ 2613.486590] [<ffffffc00050a78c>] sk_wait_data+0xcc/0xd8
[ 2613.491788] [<ffffffc0005645ac>] tcp_recvmsg+0x59c/0x9f4
[ 2613.497072] [<ffffffc00058a038>] inet_recvmsg+0x48/0x90
...
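
When the iperf test runs unattended for hours, it can help to check a captured console log for this signature automatically. The following is a minimal sketch, assuming the console output has been saved to a text file. The function name and the choice of three marker strings are illustrative assumptions taken from the trace above, not part of any Ubuntu tooling.

```python
import re

# Marker strings taken from the panic shown above. All three must appear
# for the log to count as a match.
SIGNATURE = (
    re.compile(r"skb_over_panic"),
    re.compile(r"Kernel panic - not syncing"),
    re.compile(r"xgene_enet_process_ring"),
)


def matches_xgene_rx_panic(log_text: str) -> bool:
    """Return True if the log contains every marker of the xgene RX panic."""
    return all(pattern.search(log_text) for pattern in SIGNATURE)


def first_marker_line(log_text: str):
    """Return the first log line containing any marker, or None if there is none."""
    for line in log_text.splitlines():
        if any(pattern.search(line) for pattern in SIGNATURE):
            return line
    return None
```

A typical use would be to read the saved console log with open(path).read() and pass the text to matches_xgene_rx_panic. As noted above, backtraces can differ between kernels, so a miss does not prove the bug is absent.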

Changed in linux (Ubuntu Trusty):
status: Invalid → New
Revision history for this message
dann frazier (dannf) wrote :

After testing with the proposed backport[1], I was able to run the iperf test overnight on a trusty kernel w/o any crashes.

[1]http://kernel.ubuntu.com/git/ming/ubuntu-utopic.git/commit/?h=arm64-net-backport&id=d46456e3ed9153e743f2add6e01d06c56eb2ceb7

Changed in linux (Ubuntu Trusty):
status: New → In Progress
Changed in linux (Ubuntu Utopic):
status: Triaged → In Progress
dann frazier (dannf)
description: updated
Revision history for this message
Ming Lei (tom-leiming) wrote :

I finally figured out a way to reproduce it quickly; see the attached log.

Brad Figg (brad-figg)
Changed in linux (Ubuntu Utopic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
tags: added: verification-needed-utopic
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-utopic' to 'verification-done-utopic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Revision history for this message
Ming Lei (tom-leiming) wrote :

ubuntu@am2:~$ iperf -c 10.228.0.2 -P 8 -t 120
------------------------------------------------------------
Client connecting to 10.228.0.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 10] local 10.228.66.98 port 59722 connected with 10.228.0.2 port 5001
[ 4] local 10.228.66.98 port 59717 connected with 10.228.0.2 port 5001
[ 3] local 10.228.66.98 port 59715 connected with 10.228.0.2 port 5001
[ 5] local 10.228.66.98 port 59716 connected with 10.228.0.2 port 5001
[ 6] local 10.228.66.98 port 59718 connected with 10.228.0.2 port 5001
[ 7] local 10.228.66.98 port 59719 connected with 10.228.0.2 port 5001
[ 9] local 10.228.66.98 port 59721 connected with 10.228.0.2 port 5001
[ 8] local 10.228.66.98 port 59720 connected with 10.228.0.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-120.0 sec 1.71 GBytes 122 Mbits/sec
[ 7] 0.0-120.0 sec 1.64 GBytes 117 Mbits/sec
[ 4] 0.0-120.0 sec 1.56 GBytes 112 Mbits/sec
[ 5] 0.0-120.0 sec 1.54 GBytes 110 Mbits/sec
[ 10] 0.0-120.0 sec 1.70 GBytes 121 Mbits/sec
[ 6] 0.0-120.0 sec 1.71 GBytes 123 Mbits/sec
[ 9] 0.0-120.0 sec 1.72 GBytes 123 Mbits/sec
[ 8] 0.0-120.0 sec 1.58 GBytes 113 Mbits/sec
[SUM] 0.0-120.0 sec 13.2 GBytes 942 Mbits/sec
ubuntu@am2:~$ uname -aa
Linux am2 3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 03:00:52 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
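
For repeated verification runs, the per-stream figures can be cross-checked against the [SUM] line with a short script. This is a minimal sketch that assumes the iperf 2 client report layout shown above, with every rate reported in Mbits/sec. The regular expression and function name are my own and handle nothing beyond that layout.

```python
import re

# Matches report lines such as "[  3] 0.0-120.0 sec 1.71 GBytes 122 Mbits/sec"
# and "[SUM] 0.0-120.0 sec 13.2 GBytes 942 Mbits/sec". Connection lines and
# the "[ ID]" header do not match.
LINE_RE = re.compile(
    r"^\[\s*(\d+|SUM)\]\s+[\d.]+-\s*[\d.]+\s+sec\s+[\d.]+\s+\w+\s+([\d.]+)\s+Mbits/sec"
)


def summarize(report: str):
    """Return ({stream_id: Mbits/sec}, reported SUM in Mbits/sec or None)."""
    streams = {}
    total = None
    for line in report.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        ident, rate = match.group(1), float(match.group(2))
        if ident == "SUM":
            total = rate
        else:
            streams[int(ident)] = rate
    return streams, total


# Report lines copied from the trusty run above.
SAMPLE = """\
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-120.0 sec 1.71 GBytes 122 Mbits/sec
[ 7] 0.0-120.0 sec 1.64 GBytes 117 Mbits/sec
[ 4] 0.0-120.0 sec 1.56 GBytes 112 Mbits/sec
[ 5] 0.0-120.0 sec 1.54 GBytes 110 Mbits/sec
[ 10] 0.0-120.0 sec 1.70 GBytes 121 Mbits/sec
[ 6] 0.0-120.0 sec 1.71 GBytes 123 Mbits/sec
[ 9] 0.0-120.0 sec 1.72 GBytes 123 Mbits/sec
[ 8] 0.0-120.0 sec 1.58 GBytes 113 Mbits/sec
[SUM] 0.0-120.0 sec 13.2 GBytes 942 Mbits/sec
"""

streams, total = summarize(SAMPLE)
print(len(streams), "streams; per-stream sum:", sum(streams.values()), "reported SUM:", total)
```

The per-stream rates are rounded in the report, so their sum can differ from the [SUM] line by about 1 Mbit/sec without indicating a problem.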

tags: added: verification-done-trusty
removed: verification-needed-trusty
Revision history for this message
Ming Lei (tom-leiming) wrote :

ubuntu@ms10-40-mcdivittA3:~$ uname -a
Linux ms10-40-mcdivittA3 3.16.0-44-generic #59-Ubuntu SMP Tue Jul 7 02:18:58 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@ms10-40-mcdivittA3:~$ iperf -c 10.229.0.101 -P 8 -t 120
------------------------------------------------------------
Client connecting to 10.229.0.101, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 10] local 10.229.65.140 port 48807 connected with 10.229.0.101 port 5001
[ 5] local 10.229.65.140 port 48802 connected with 10.229.0.101 port 5001
[ 3] local 10.229.65.140 port 48801 connected with 10.229.0.101 port 5001
[ 4] local 10.229.65.140 port 48800 connected with 10.229.0.101 port 5001
[ 6] local 10.229.65.140 port 48803 connected with 10.229.0.101 port 5001
[ 7] local 10.229.65.140 port 48804 connected with 10.229.0.101 port 5001
[ 8] local 10.229.65.140 port 48806 connected with 10.229.0.101 port 5001
[ 9] local 10.229.65.140 port 48805 connected with 10.229.0.101 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-120.0 sec 3.46 GBytes 247 Mbits/sec
[ 5] 0.0-120.0 sec 192 MBytes 13.4 Mbits/sec
[ 6] 0.0-120.0 sec 2.94 GBytes 210 Mbits/sec
[ 7] 0.0-120.0 sec 2.83 GBytes 203 Mbits/sec
[ 8] 0.0-120.0 sec 3.19 GBytes 228 Mbits/sec
[ 9] 0.0-120.0 sec 192 MBytes 13.4 Mbits/sec
[ 10] 0.0-120.1 sec 191 MBytes 13.3 Mbits/sec
[ 4] 0.0-120.1 sec 191 MBytes 13.3 Mbits/sec
[SUM] 0.0-120.1 sec 13.2 GBytes 942 Mbits/sec

tags: added: verification-done-utopic
removed: verification-needed-utopic
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.16.0-44.59

---------------
linux (3.16.0-44.59) utopic; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1472030

  [ Iyappan Subramanian ]

  * SAUCE: (no-up) drivers: net: xgene: fix: Out of order descriptor bytes
    read
    - LP: #1425576

  [ Upstream Kernel Changes ]

  * Revert "tools/vm: fix page-flags build"
    - LP: #1471170
  * NVMe: Add shutdown timeout as module parameter.
    - LP: #1465136
  * Drivers: hv: vmbus: Add support for VMBus panic notifier handler
    - LP: #1463584
  * Drivers: hv: vmbus: Correcting truncation error for constant
    HV_CRASH_CTL_CRASH_NOTIFY
    - LP: #1463584
  * KVM: nVMX: fix lifetime issues for vmcs02
    - LP: #1448269
  * KVM: nVMX: Fix nested vmexit ack intr before load vmcs01
    - LP: #1448269
  * mm/slab_common: support the slub_debug boot option on specific object
    size
    - LP: #1456952
  * kvm: x86: fix kvm_apic_has_events to check for NULL pointer
  * cpuidle: powernv: Populate cpuidle state details by querying the
    device-tree
    - LP: #1470404
  * cpuidle: powernv: Read target_residency value of idle states from DT if
    available
    - LP: #1470404
  * cpuidle: powernv: Avoid endianness conversions while parsing DT
    - LP: #1470404
  * cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state
    - LP: #1470404
  * iio: adis16400: Report pressure channel scale
    - LP: #1471170
  * iio: adis16400: Use != channel indices for the two voltage channels
    - LP: #1471170
  * iio: adis16400: Compute the scan mask from channel indices
    - LP: #1471170
  * iio: adis16400: Remove unused variable
    - LP: #1471170
  * iio: adis16400: Fix burst mode
    - LP: #1471170
  * iio: adis16400: Fix burst transfer for adis16448
    - LP: #1471170
  * USB: serial: ftdi_sio: Add support for a Motion Tracker Development
    Board
    - LP: #1471170
  * iio: adc: twl6030-gpadc: Fix modalias
    - LP: #1471170
  * serial: imx: Fix DMA handling for IDLE condition aborts
    - LP: #1471170
  * usb: dwc3: gadget: Fix incorrect DEPCMD and DGCMD status macros
    - LP: #1471170
  * ALSA: usb-audio: Add mic volume fix quirk for Logitech Quickcam Fusion
    - LP: #1471170
  * n_tty: Fix auditing support for cannonical mode
    - LP: #1471170
  * drm/i915/hsw: Fix workaround for server AUX channel clock divisor
    - LP: #1471170
  * x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers
    - LP: #1471170
  * lib: Fix strnlen_user() to not touch memory after specified maximum
    - LP: #1471170
  * Input: elantech - fix detection of touchpads where the revision matches
    a known rate
    - LP: #1471170
  * ALSA: hda/realtek - Add a fixup for another Acer Aspire 9420
    - LP: #1471170
  * ALSA: usb-audio: add MAYA44 USB+ mixer control names
    - LP: #1471170
  * ALSA: usb-audio: fix missing input volume controls in MAYA44 USB(+)
    - LP: #1471170
  * USB: cp210x: add ID for HubZ dual ZigBee and Z-Wave dongle
    - LP: #1471170
  * Input: elantech - add new icbody type
    - LP: #1471170
  * MIPS: Fix enabling of DEBUG_STACKOVERFLOW
    - LP: #1471170
  * xfrm: fix a race in xfrm_state_lookup_byspi
    ...

Changed in linux (Ubuntu Utopic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-58.97

---------------
linux (3.13.0-58.97) trusty; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1472453

  [ Upstream Kernel Changes ]

  * vm: Fix incomplete backport of VM_FAULT_SIGSEGV handling support
    - LP: #1471892

linux (3.13.0-58.96) trusty; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1471991

  [ Iyappan Subramanian ]

  * SAUCE: (no-up): drivers: net: xgene: fix: Out of order descriptor bytes
    read
    - LP: #1425576

  [ Upstream Kernel Changes ]

  * NVMe: Add shutdown timeout as module parameter.
    - LP: #1465136
  * Drivers: hv: vmbus: Add support for VMBus panic notifier handler
    - LP: #1463584
  * Drivers: hv: vmbus: Correcting truncation error for constant
    HV_CRASH_CTL_CRASH_NOTIFY
    - LP: #1463584
  * netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
    - LP: #1466135
  * lpfc: Add iotag memory barrier
    - LP: #1468416
  * mm/slab_common: support the slub_debug boot option on specific object
    size
    - LP: #1456952
  * pipe: iovec: Fix memory corruption when retrying atomic copy as
    non-atomic
    - CVE-2015-1805
  * kvm: x86: fix kvm_apic_has_events to check for NULL pointer
  * staging, rtl8192e, LLVMLinux: Change extern inline to static inline
    - LP: #1471233
  * kernel: use the gnu89 standard explicitly
    - LP: #1471233
  * staging, rtl8192e, LLVMLinux: Remove unused inline prototype
    - LP: #1471233
  * staging: rtl8712, rtl8712: avoid lots of build warnings
    - LP: #1471233
  * qla2xxx: remove redundant declaration in 'qla_gbl.h'
    - LP: #1471233
  * staging: wlags49_h2: fix extern inline functions
    - LP: #1471233
  * ARM: 8307/1: psci: move psci firmware calls out of line
    - LP: #1471233
  * kconfig: Fix warning "‘jump’ may be used uninitialized"
    - LP: #1471233
  * scripts/sortextable: suppress warning: `relocs_size' may be used
    uninitialized
    - LP: #1471233
  * ASoC: dapm: Enable autodisable on SOC_DAPM_SINGLE_TLV_AUTODISABLE
    - LP: #1471233
  * ALSA: hda - Fix mute-LED fixed mode
    - LP: #1471233
  * ALSA: emu10k1: Fix card shortname string buffer overflow
    - LP: #1471233
  * ALSA: emux: Fix mutex deadlock at unloading
    - LP: #1471233
  * drm/radeon: add SI DPM quirk for Sapphire R9 270 Dual-X 2G GDDR5
    - LP: #1471233
  * SCSI: add 1024 max sectors black list flag
    - LP: #1471233
  * 3w-sas: fix command completion race
    - LP: #1471233
  * 3w-xxxx: fix command completion race
    - LP: #1471233
  * 3w-9xxx: fix command completion race
    - LP: #1471233
  * serial: xilinx: Use platform_get_irq to get irq description structure
    - LP: #1471233
  * serial: of-serial: Remove device_type = "serial" registration
    - LP: #1471233
  * tty/serial: at91: maxburst was missing for dma transfers
    - LP: #1471233
  * ALSA: emux: Fix mutex deadlock in OSS emulation
    - LP: #1471233
  * ALSA: emu10k1: Emu10k2 32 bit DMA mode
    - LP: #1471233
  * rbd: end I/O the entire obj_request on error
    - LP: #1471233
  * powerpc/pseries: Correct cpu affinity for dlpar added cpus
    - LP: #1471233
  * bridge/mdb: remove wrong use of NLM_F_MULT...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released