vmxnet3 driver could cause kernel panic with v4.4 if LRO enabled.

Bug #1650635 reported by Eric Desrochers
This bug affects 2 people
Affects          Status         Importance   Assigned to       Milestone
linux (Ubuntu)   Fix Released   Medium       Unassigned
  Xenial         Fix Released   Medium       Eric Desrochers

Bug Description

[Impact]

It has been brought to my attention that a Trusty VMware virtual machine running kernel v4.4.0-36 crashed with the following stack trace:

PANIC: "kernel BUG at /build/linux-lts-xenial-QiVniY/linux-lts-xenial-4.4.0/drivers/net/vmxnet3/vmxnet3_drv.c:1353!"
...
#0 [ffff88042d683aa0] machine_kexec at ffffffff8105987c
#1 [ffff88042d683af8] crash_kexec at ffffffff81105d23
#2 [ffff88042d683bc0] oops_end at ffffffff81030a79
#3 [ffff88042d683be8] die at ffffffff81030f7b
#4 [ffff88042d683c18] do_trap at ffffffff8102e04d
#5 [ffff88042d683c68] do_error_trap at ffffffff8102e5a7
#6 [ffff88042d683d20] do_invalid_op at ffffffff8102e840
#7 [ffff88042d683d30] invalid_op at ffffffff817f900e
[exception RIP: vmxnet3_rq_rx_complete+3016]
RIP: ffffffffc004e448 RSP: ffff88042d683de8 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff880424099668 RCX: 0000000000000000
RDX: 00000000000005f2 RSI: 00000000000005f2 RDI: ffff88042a61f400
RBP: ffff88042d683e50 R8: 0000000000000000 R9: 0000000000000000
R10: ffff88042902b470 R11: ffff8804293406a8 R12: ffff880424098840
R13: ffff880424099580 R14: ffff88042a61ec00 R15: ffff88042933ae00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#8 [ffff88042d683de0] vmxnet3_rq_rx_complete at ffffffffc004dcfa [vmxnet3]
#9 [ffff88042d683e58] vmxnet3_poll_rx_only at ffffffffc004e60a [vmxnet3]
#10 [ffff88042d683e90] net_rx_action at ffffffff816f3544
#11 [ffff88042d683f00] __do_softirq at ffffffff81081e7d
#12 [ffff88042d683f68] irq_exit at ffffffff81082255
#13 [ffff88042d683f78] do_IRQ at ffffffff817f9ee6
--- <IRQ stack> ---
#14 [ffff880426c73f30] ret_from_intr at ffffffff817f7fc2
[exception RIP: unknown or invalid address]
RIP: fffffffffffffffb RSP: 00007fe17e59bf48 RFLAGS: 00000001
RAX: 00007fe18564ed58 RBX: 00007fe2064ce848 RCX: 00007fe185612d60
RDX: 00007fe20b47eb30 RSI: 00007fe185640d38 RDI: 00007fe18564ed50
RBP: ffffffff817f7fe5 R8: 00007fe185100068 R9: 0000000000037ce0
R10: 0000000000134ad8 R11: 00007fe17e4b7028 R12: 00007fe185100068
R13: 00007fe185632380 R14: 0000000000000000 R15: ffffffff81003a64
ORIG_RAX: 0000000000000001 CS: 7fe185640d38 SS: ffffffffffffff91
bt: WARNING: possibly bogus exception frame
RIP: 00000000004e92bb RSP: 00007fe20b47ea40 RFLAGS: 00000283
RAX: 0000000000000001 RBX: 00007fe18564ed58 RCX: fffffffffffffffb
RDX: 00007fe185640d38 RSI: 0000000000000001 RDI: 00007fe17e59bf48
RBP: 00007fe185100068 R8: 00007fe18564ed50 R9: 00007fe185640d38
R10: 00007fe20b47eb30 R11: 00007fe185612d60 R12: 0000000000037ce0
R13: 0000000000134ad8 R14: 00007fe17e4b7028 R15: 00007fe2064ce848
ORIG_RAX: ffffffffffffff91 CS: 0033 SS: 002b

[Test Case]

 * There is no reliable reproducer. The crash occurs at random when the device reports segCnt == 1 for an LRO packet, on a Trusty VMware virtual machine running the Xenial kernel with LRO enabled in the VMware environment.

[Regression Potential]

 * None expected.
 * The commit is in upstream linux stable.
 * The Yakkety and Zesty kernels already have the patch.

[Other Info]

 * Upstream commit :
   5021953 vmxnet3: segCnt can be 1 for LRO packets


Eric Desrochers (slashd) wrote :

Note that the affected system has LRO turned on.

The system crashed at:

#7 [ffff88042d683d30] invalid_op at ffffffff817f900e
[exception RIP: vmxnet3_rq_rx_complete+3016]

which corresponds to line 1353 in "drivers/net/vmxnet3/vmxnet3_drv.c":

0xffffffffc004e448 is in vmxnet3_rq_rx_complete (drivers/net/vmxnet3/vmxnet3_drv.c:1353).
1348 rcd->type == VMXNET3_CDTYPE_RXCOMP_LRO) {
1349 struct Vmxnet3_RxCompDescExt *rcdlro;
1350 rcdlro = (struct Vmxnet3_RxCompDescExt *)rcd;
1351
1352 segCnt = rcdlro->segCnt;
==> 1353 BUG_ON(segCnt <= 1);
1354 mss = rcdlro->mss;
1355 if (unlikely(segCnt <= 1))
1356 segCnt = 0;
1357 } else {

BUG_ON(condition) is a debugging aid: when the condition is true, the kernel raises an oops, which on this system resulted in the panic above.

Here, BUG_ON fires if segCnt is less than or equal to (<=) 1.
segCnt is the "Number of aggregated packets":

# drivers/net/vmxnet3/vmxnet3_defs.h
u8 segCnt; /* Number of aggregated packets */

Looking at the crash dump, I can confirm that segCnt was 1 at the moment of the crash:

crash> * Vmxnet3_RxCompDescExt.segCnt ffff88042933ae00
segCnt = 1 '\001'

According to commit "50219538ffc0493a2b451a3aa0191138ef8bfe9d", segCnt can be 1 for LRO packets. The commit introduces the following change:

- BUG_ON(segCnt <= 1);
+ WARN_ON_ONCE(segCnt == 0);
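
To illustrate the effect of that one-line change, here is a minimal userspace sketch. It is not the driver code. The helper names (warn_on_once, prepatch_would_bug, postpatch_seg_cnt) and the warn_count counter are hypothetical, and warn_on_once only approximates the kernel's WARN_ON_ONCE() by printing a message the first time its condition holds. The reset of segCnt to 0 mirrors lines 1355-1356 of the listing above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for WARN_ON_ONCE(): report the first
 * time the condition is true, then stay silent. Not the kernel macro. */
static bool warned_once;
static unsigned int warn_count;

static void warn_on_once(bool cond, const char *what)
{
    if (cond && !warned_once) {
        warned_once = true;
        warn_count++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
}

/* Pre-patch check: returns true when BUG_ON(segCnt <= 1) would fire. */
static bool prepatch_would_bug(uint8_t seg_cnt)
{
    return seg_cnt <= 1;
}

/* Post-patch handling: warn once if segCnt is 0, then treat
 * segCnt <= 1 as "no aggregation" by resetting it to 0. */
static uint8_t postpatch_seg_cnt(uint8_t seg_cnt)
{
    warn_on_once(seg_cnt == 0, "segCnt == 0");
    if (seg_cnt <= 1)
        seg_cnt = 0;
    return seg_cnt;
}
```

With segCnt == 1, the value seen in the crash dump, prepatch_would_bug() returns true, while postpatch_seg_cnt() returns 0 without any warning. Only segCnt == 0 produces a (single) warning after the patch.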

[2] - commit 50219538ffc0493a2b451a3aa0191138ef8bfe9d
--
Author: Shrikrishna Khare <email address hidden>
Date: Wed Jun 8 07:40:53 2016 -0700

vmxnet3: segCnt can be 1 for LRO packets

The device emulation may send segCnt of 1 for LRO packets.

Signed-off-by: Shrikrishna Khare <email address hidden>
---

description: updated
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
assignee: nobody → Eric Desrochers (slashd)
Eric Desrochers (slashd)
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Eric Desrochers (slashd)
summary: - vmxnet3 driver causes kernel panic w/ kernel v4.4
+ vmxnet3 driver could causes kernel panic with v4.4 if LRO enabled.
Changed in linux (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → Medium
Eric Desrochers (slashd)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Eric Desrochers (slashd)
Changed in linux (Ubuntu):
assignee: Eric Desrochers (slashd) → nobody
status: In Progress → Fix Released
Eric Desrochers (slashd) wrote :

The patch was submitted to the Ubuntu kernel team earlier today.

Eric Desrochers (slashd)
description: updated
description: updated
Luis Henriques (henrix)
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Eric Desrochers (slashd)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-59.80

---------------
linux (4.4.0-59.80) xenial; urgency=low

  [ John Donnelly ]

  * Release Tracking Bug
    - LP: #1654282

  * [2.1.1] MAAS has nvme0n1 set as boot disk, curtin fails (LP: #1651602)
    - (fix) nvme: only require 1 interrupt vector, not 2+

linux (4.4.0-58.79) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1651402

  * Support ACPI probe for IIO sensor drivers from ST Micro (LP: #1650123)
    - SAUCE: iio: st_sensors: match sensors using ACPI handle
    - SAUCE: iio: st_accel: Support sensor i2c probe using acpi
    - SAUCE: iio: st_pressure: Support i2c probe using acpi
    - [Config] CONFIG_HTS221=m, CONFIG_HTS221_I2C=m, CONFIG_HTS221_SPI=m

  * Fix channel data parsing in ST Micro sensor IIO drivers (LP: #1650189)
    - SAUCE: iio: common: st_sensors: fix channel data parsing

  * ST Micro lng2dm 3-axis "femto" accelerometer support (LP: #1650112)
    - SAUCE: iio: st-accel: add support for lis2dh12
    - SAUCE: iio: st_sensors: support active-low interrupts
    - SAUCE: iio: accel: Add support for the h3lis331dl accelerometer
    - SAUCE: iio: st_sensors: verify interrupt event to status
    - SAUCE: iio: st_sensors: support open drain mode
    - SAUCE: iio:st_sensors: fix power regulator usage
    - SAUCE: iio: st_sensors: switch to a threaded interrupt
    - SAUCE: iio: accel: st_accel: Add lis3l02dq support
    - SAUCE: iio: st_sensors: fix scale configuration for h3lis331dl
    - SAUCE: iio: accel: st_accel: add support to lng2dm
    - SAUCE: iio: accel: st_accel: inline per-sensor data
    - SAUCE: Documentation: dt: iio: accel: add lng2dm sensor device binding

  * ST Micro hts221 relative humidity sensor support (LP: #1650116)
    - SAUCE: iio: humidity: add support to hts221 rh/temp combo device
    - SAUCE: Documentation: dt: iio: humidity: add hts221 sensor device binding
    - SAUCE: iio: humidity: remove
    - SAUCE: iio: humidity: Support acpi probe for hts211

  * crypto : tolerate new crypto hardware for z Systems (LP: #1644557)
    - s390/zcrypt: Introduce CEX6 toleration

  * Acer, Inc ID 5986:055a is useless after 14.04.2 installed. (LP: #1433906)
    - uvcvideo: uvc_scan_fallback() for webcams with broken chain

  * vmxnet3 driver could causes kernel panic with v4.4 if LRO enabled.
    (LP: #1650635)
    - vmxnet3: segCnt can be 1 for LRO packets

  * system freeze when swapping to encrypted swap partition (LP: #1647400)
    - mm, oom: rework oom detection
    - mm: throttle on IO only when there are too many dirty and writeback pages

  * Kernel Fixes to get TCMU File Backed Optical to work (LP: #1646204)
    - target/user: Use sense_reason_t in tcmu_queue_cmd_ring
    - target/user: Return an error if cmd data size is too large
    - target/user: Fix comments to not refer to data ring
    - SAUCE: (no-up) target/user: Fix use-after-free of tcmu_cmds if they are
      expired

  * CVE-2016-9756
    - KVM: x86: drop error recovery in em_jmp_far and em_ret_far

  * Dell Precision 5520 & 3520 freezes at login screent (LP: #1650054)
    - ACPI / blacklist: add _REV quirks for Dell Precision 5520 and 3520

  * CVE-2016-979...


Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released