vmxnet3 driver could cause a kernel panic with v4.4 if LRO is enabled.

Bug #1650635 reported by Eric Desrochers on 2016-12-16
This bug affects 2 people
Affects          Status         Importance   Assigned to       Milestone
linux (Ubuntu)   Fix Released   Medium       Unassigned
Xenial           Fix Released   Medium       Eric Desrochers

Bug Description

[Impact]

It has been brought to my attention that a Trusty VMware virtual machine running kernel v4.4.0-36 crashed with the following stack trace:

PANIC: "kernel BUG at /build/linux-lts-xenial-QiVniY/linux-lts-xenial-4.4.0/drivers/net/vmxnet3/vmxnet3_drv.c:1353!"
...
#0 [ffff88042d683aa0] machine_kexec at ffffffff8105987c
#1 [ffff88042d683af8] crash_kexec at ffffffff81105d23
#2 [ffff88042d683bc0] oops_end at ffffffff81030a79
#3 [ffff88042d683be8] die at ffffffff81030f7b
#4 [ffff88042d683c18] do_trap at ffffffff8102e04d
#5 [ffff88042d683c68] do_error_trap at ffffffff8102e5a7
#6 [ffff88042d683d20] do_invalid_op at ffffffff8102e840
#7 [ffff88042d683d30] invalid_op at ffffffff817f900e
[exception RIP: vmxnet3_rq_rx_complete+3016]
RIP: ffffffffc004e448 RSP: ffff88042d683de8 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff880424099668 RCX: 0000000000000000
RDX: 00000000000005f2 RSI: 00000000000005f2 RDI: ffff88042a61f400
RBP: ffff88042d683e50 R8: 0000000000000000 R9: 0000000000000000
R10: ffff88042902b470 R11: ffff8804293406a8 R12: ffff880424098840
R13: ffff880424099580 R14: ffff88042a61ec00 R15: ffff88042933ae00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#8 [ffff88042d683de0] vmxnet3_rq_rx_complete at ffffffffc004dcfa [vmxnet3]
#9 [ffff88042d683e58] vmxnet3_poll_rx_only at ffffffffc004e60a [vmxnet3]
#10 [ffff88042d683e90] net_rx_action at ffffffff816f3544
#11 [ffff88042d683f00] __do_softirq at ffffffff81081e7d
#12 [ffff88042d683f68] irq_exit at ffffffff81082255
#13 [ffff88042d683f78] do_IRQ at ffffffff817f9ee6
--- <IRQ stack> ---
#14 [ffff880426c73f30] ret_from_intr at ffffffff817f7fc2
[exception RIP: unknown or invalid address]
RIP: fffffffffffffffb RSP: 00007fe17e59bf48 RFLAGS: 00000001
RAX: 00007fe18564ed58 RBX: 00007fe2064ce848 RCX: 00007fe185612d60
RDX: 00007fe20b47eb30 RSI: 00007fe185640d38 RDI: 00007fe18564ed50
RBP: ffffffff817f7fe5 R8: 00007fe185100068 R9: 0000000000037ce0
R10: 0000000000134ad8 R11: 00007fe17e4b7028 R12: 00007fe185100068
R13: 00007fe185632380 R14: 0000000000000000 R15: ffffffff81003a64
ORIG_RAX: 0000000000000001 CS: 7fe185640d38 SS: ffffffffffffff91
bt: WARNING: possibly bogus exception frame
RIP: 00000000004e92bb RSP: 00007fe20b47ea40 RFLAGS: 00000283
RAX: 0000000000000001 RBX: 00007fe18564ed58 RCX: fffffffffffffffb
RDX: 00007fe185640d38 RSI: 0000000000000001 RDI: 00007fe17e59bf48
RBP: 00007fe185100068 R8: 00007fe18564ed50 R9: 00007fe185640d38
R10: 00007fe20b47eb30 R11: 00007fe185612d60 R12: 0000000000037ce0
R13: 0000000000134ad8 R14: 00007fe17e4b7028 R15: 00007fe2064ce848
ORIG_RAX: ffffffffffffff91 CS: 0033 SS: 002b

[Test Case]

 * There is no reliable reproducer. The problem occurred randomly, when segCnt == 1, on a Trusty VMware virtual machine running the Xenial kernel with LRO enabled in the VMware environment.

[Regression Potential]

 * none expected
 * Commit can be found in upstream linux stable
 * Yakkety and Zesty kernels already have the patch

[Other Info]

 * Upstream commit :
   5021953 vmxnet3: segCnt can be 1 for LRO packets

Eric Desrochers (slashd) wrote :

Note that the affected system has LRO turned on.

The system crashed on :

#7 [ffff88042d683d30] invalid_op at ffffffff817f900e
[exception RIP: vmxnet3_rq_rx_complete+3016]

which is referring to line 1353 in "drivers/net/vmxnet3/vmxnet3_drv.c" :

0xffffffffc004e448 is in vmxnet3_rq_rx_complete (drivers/net/vmxnet3/vmxnet3_drv.c:1353).
1348 rcd->type == VMXNET3_CDTYPE_RXCOMP_LRO) {
1349 struct Vmxnet3_RxCompDescExt *rcdlro;
1350 rcdlro = (struct Vmxnet3_RxCompDescExt *)rcd;
1351
1352 segCnt = rcdlro->segCnt;
==> 1353 BUG_ON(segCnt <= 1);
1354 mss = rcdlro->mss;
1355 if (unlikely(segCnt <= 1))
1356 segCnt = 0;
1357 } else {

BUG_ON(condition) is a debugging aid: when the condition is true, the kernel reports a BUG at that line, which is what the PANIC message above shows.

Here, BUG_ON fires if segCnt is less than or equal to (<=) 1.
segCnt is the "Number of aggregated packets":

# drivers/net/vmxnet3/vmxnet3_defs.h
u8 segCnt; /* Number of aggregated packets */

Looking at the crash dump, I can confirm that segCnt was set to 1 at the moment of the crash:

crash> * Vmxnet3_RxCompDescExt.segCnt ffff88042933ae00
segCnt = 1 '\001'

According to commit "50219538ffc0493a2b451a3aa0191138ef8bfe9d", segCnt can be 1 for LRO packets. The commit introduces the following change:

- BUG_ON(segCnt <= 1);
+ WARN_ON_ONCE(segCnt == 0);

[2] - commit 50219538ffc0493a2b451a3aa0191138ef8bfe9d
--
Author: Shrikrishna Khare <email address hidden>
Date: Wed Jun 8 07:40:53 2016 -0700

vmxnet3: segCnt can be 1 for LRO packets

The device emulation may send segCnt of 1 for LRO packets.

Signed-off-by: Shrikrishna Khare <email address hidden>
---
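
To make the effect of the one-line change easier to see, here is a minimal user-space sketch in C. It is not the driver code: the enum and the function names (check_before_fix, check_after_fix, normalize_seg_cnt, seg_cnt_self_check) are hypothetical stand-ins. They only model the BUG_ON / WARN_ON_ONCE conditions quoted above and the "segCnt <= 1" handling from lines 1355-1356 of the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome labels. In the kernel these would correspond to
 * a BUG (crash), a one-time warning in the log, or no action at all. */
enum seg_check { SEG_OK, SEG_WARN, SEG_BUG };

/* Models the pre-fix assertion: BUG_ON(segCnt <= 1); */
enum seg_check check_before_fix(uint8_t seg_cnt)
{
	return (seg_cnt <= 1) ? SEG_BUG : SEG_OK;
}

/* Models the post-fix assertion: WARN_ON_ONCE(segCnt == 0); */
enum seg_check check_after_fix(uint8_t seg_cnt)
{
	return (seg_cnt == 0) ? SEG_WARN : SEG_OK;
}

/* Models lines 1355-1356 of the listing:
 *   if (unlikely(segCnt <= 1)) segCnt = 0; */
uint8_t normalize_seg_cnt(uint8_t seg_cnt)
{
	return (seg_cnt <= 1) ? 0 : seg_cnt;
}

/* Self-check for the value seen in the crash dump (segCnt = 1). */
void seg_cnt_self_check(void)
{
	assert(check_before_fix(1) == SEG_BUG); /* old check would crash */
	assert(check_after_fix(1) == SEG_OK);   /* new check lets it through */
	assert(normalize_seg_cnt(1) == 0);      /* then treated as 0 segments */
}
```

With segCnt = 1, as seen in the crash dump, the old condition matches and the new one does not, so under this model the packet reaches the existing "segCnt <= 1" branch instead of stopping the kernel.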

description: updated
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
assignee: nobody → Eric Desrochers (slashd)
Eric Desrochers (slashd) on 2016-12-16
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Eric Desrochers (slashd) on 2016-12-16
summary: - vmxnet3 driver causes kernel panic w/ kernel v4.4
+ vmxnet3 driver could causes kernel panic with v4.4 if LRO enabled.
Changed in linux (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → Medium
Eric Desrochers (slashd) on 2016-12-16
Changed in linux (Ubuntu Xenial):
assignee: nobody → Eric Desrochers (slashd)
Changed in linux (Ubuntu):
assignee: Eric Desrochers (slashd) → nobody
status: In Progress → Fix Released
Eric Desrochers (slashd) wrote :

Patch has been submitted to Ubuntu kernel team earlier today.

Eric Desrochers (slashd) on 2016-12-17
description: updated
description: updated
Luis Henriques (henrix) on 2016-12-19
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Eric Desrochers (slashd) on 2016-12-27
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-59.80

---------------
linux (4.4.0-59.80) xenial; urgency=low

  [ John Donnelly ]

  * Release Tracking Bug
    - LP: #1654282

  * [2.1.1] MAAS has nvme0n1 set as boot disk, curtin fails (LP: #1651602)
    - (fix) nvme: only require 1 interrupt vector, not 2+

linux (4.4.0-58.79) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1651402

  * Support ACPI probe for IIO sensor drivers from ST Micro (LP: #1650123)
    - SAUCE: iio: st_sensors: match sensors using ACPI handle
    - SAUCE: iio: st_accel: Support sensor i2c probe using acpi
    - SAUCE: iio: st_pressure: Support i2c probe using acpi
    - [Config] CONFIG_HTS221=m, CONFIG_HTS221_I2C=m, CONFIG_HTS221_SPI=m

  * Fix channel data parsing in ST Micro sensor IIO drivers (LP: #1650189)
    - SAUCE: iio: common: st_sensors: fix channel data parsing

  * ST Micro lng2dm 3-axis "femto" accelerometer support (LP: #1650112)
    - SAUCE: iio: st-accel: add support for lis2dh12
    - SAUCE: iio: st_sensors: support active-low interrupts
    - SAUCE: iio: accel: Add support for the h3lis331dl accelerometer
    - SAUCE: iio: st_sensors: verify interrupt event to status
    - SAUCE: iio: st_sensors: support open drain mode
    - SAUCE: iio:st_sensors: fix power regulator usage
    - SAUCE: iio: st_sensors: switch to a threaded interrupt
    - SAUCE: iio: accel: st_accel: Add lis3l02dq support
    - SAUCE: iio: st_sensors: fix scale configuration for h3lis331dl
    - SAUCE: iio: accel: st_accel: add support to lng2dm
    - SAUCE: iio: accel: st_accel: inline per-sensor data
    - SAUCE: Documentation: dt: iio: accel: add lng2dm sensor device binding

  * ST Micro hts221 relative humidity sensor support (LP: #1650116)
    - SAUCE: iio: humidity: add support to hts221 rh/temp combo device
    - SAUCE: Documentation: dt: iio: humidity: add hts221 sensor device binding
    - SAUCE: iio: humidity: remove
    - SAUCE: iio: humidity: Support acpi probe for hts211

  * crypto : tolerate new crypto hardware for z Systems (LP: #1644557)
    - s390/zcrypt: Introduce CEX6 toleration

  * Acer, Inc ID 5986:055a is useless after 14.04.2 installed. (LP: #1433906)
    - uvcvideo: uvc_scan_fallback() for webcams with broken chain

  * vmxnet3 driver could causes kernel panic with v4.4 if LRO enabled.
    (LP: #1650635)
    - vmxnet3: segCnt can be 1 for LRO packets

  * system freeze when swapping to encrypted swap partition (LP: #1647400)
    - mm, oom: rework oom detection
    - mm: throttle on IO only when there are too many dirty and writeback pages

  * Kernel Fixes to get TCMU File Backed Optical to work (LP: #1646204)
    - target/user: Use sense_reason_t in tcmu_queue_cmd_ring
    - target/user: Return an error if cmd data size is too large
    - target/user: Fix comments to not refer to data ring
    - SAUCE: (no-up) target/user: Fix use-after-free of tcmu_cmds if they are
      expired

  * CVE-2016-9756
    - KVM: x86: drop error recovery in em_jmp_far and em_ret_far

  * Dell Precision 5520 & 3520 freezes at login screent (LP: #1650054)
    - ACPI / blacklist: add _REV quirks for Dell Precision 5520 and 3520

  * CVE-2016-979...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released