kernel BUG at /build/linux-_qw1uB/linux-4.8.0/drivers/net/vmxnet3/vmxnet3_drv.c:1413

Bug #1654319 reported by Mikołaj Kowalski on 2017-01-05
This bug affects 2 people
Affects: Linux | Status: Unknown | Importance: Unknown
Affects: linux (Ubuntu) | Importance: Medium | Assigned to: Unassigned

Bug Description

Hello everyone,

I have Jenkins installed with an Apache frontend. Sometimes, after a few GET requests, the whole system crashes and restarts.

ESXi host version 6.5, virtual hardware version 13.

I've found a similar bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=191201

[ 6544.559660] ------------[ cut here ]------------
[ 6544.559694] kernel BUG at /build/linux-_qw1uB/linux-4.8.0/drivers/net/vmxnet3/vmxnet3_drv.c:1413!
[ 6544.559728] invalid opcode: 0000 [#1] SMP
[ 6544.559745] Modules linked in: xt_multiport vmw_vsock_vmci_transport vsock ppdev vmw_balloon joydev coretemp input_leds serio_raw nfit shpchp i2c_piix4 vmw_vmci parport_pc parport mac_hid ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns ib_iser rdma_cm nf_conntrack_broadcast iw_cm nf_nat_ftp ib_cm nf_nat ib_core nf_conntrack_ftp nf_conntrack iptable_filter configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic
[ 6544.560172] usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmwgfx ttm aesni_intel aes_x86_64 lrw glue_helper ablk_helper cryptd drm_kms_helper syscopyarea sysfillrect psmouse sysimgblt fb_sys_fops vmxnet3 ahci libahci mptspi drm mptscsih mptbase scsi_transport_spi pata_acpi fjes
[ 6544.560332] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-32-generic #34-Ubuntu
[ 6544.560360] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[ 6544.560399] task: ffffa172f62a0d00 task.stack: ffffa172f62ac000
[ 6544.560422] RIP: 0010:[<ffffffffc02e6701>] [<ffffffffc02e6701>] vmxnet3_rq_rx_complete+0x8d1/0xeb0 [vmxnet3]
[ 6544.560465] RSP: 0018:ffffa172ffc83dc8 EFLAGS: 00010297
[ 6544.560487] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffa172efb8a000
[ 6544.560513] RDX: 0000000000000040 RSI: 0000000000000001 RDI: 0000000000000040
[ 6544.560540] RBP: ffffa172ffc83e40 R08: 0000000000000002 R09: 0000000000000030
[ 6544.560566] R10: 0000000000000000 R11: ffffa172ec950880 R12: ffffa172f35cc280
[ 6544.560593] R13: ffffa172ec951400 R14: ffffa172f1c83000 R15: ffffa172f2b1c000
[ 6544.560621] FS: 0000000000000000(0000) GS:ffffa172ffc80000(0000) knlGS:0000000000000000
[ 6544.560650] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6544.560671] CR2: 00007f742a78da30 CR3: 000000022f962000 CR4: 00000000000006e0
[ 6544.560752] Stack:
[ 6544.560766] ffffa172ec950880 ffffa172ec950880 0000000000000000 ffffa172ec950880
[ 6544.560803] 000000000000002d ffffa172ec951420 0000000000000002 0000000100000040
[ 6544.560837] ffffa172ec9514e8 0000000000000000 ffffa172ec950880 ffffa172ec951420
[ 6544.560872] Call Trace:
[ 6544.561677] <IRQ>
[ 6544.561689] [<ffffffffc02e6e3a>] vmxnet3_poll_rx_only+0x3a/0xb0 [vmxnet3]
[ 6544.563233] [<ffffffff9917fa28>] net_rx_action+0x238/0x380
[ 6544.564004] [<ffffffff9929dddd>] __do_softirq+0x10d/0x298
[ 6544.564757] [<ffffffff98a88d93>] irq_exit+0xa3/0xb0
[ 6544.565499] [<ffffffff9929db24>] do_IRQ+0x54/0xd0
[ 6544.566224] [<ffffffff9929bc02>] common_interrupt+0x82/0x82
[ 6544.566928] <EOI>
[ 6544.566940] [<ffffffff98a64236>] ? native_safe_halt+0x6/0x10
[ 6544.568313] [<ffffffff98a37e60>] default_idle+0x20/0xd0
[ 6544.568978] [<ffffffff98a385cf>] arch_cpu_idle+0xf/0x20
[ 6544.569625] [<ffffffff98ac77fa>] default_idle_call+0x2a/0x40
[ 6544.570255] [<ffffffff98ac7afc>] cpu_startup_entry+0x2ec/0x350
[ 6544.570871] [<ffffffff98a518a1>] start_secondary+0x151/0x190
[ 6544.571475] Code: 90 88 45 98 4c 89 55 a0 e8 4d 70 e9 d8 0f b6 45 98 4c 8b 5d 90 4c 8b 55 a0 49 c7 85 60 01 00 00 00 00 00 00 89 c6 e9 91 f8 ff ff <0f> 0b 0f 0b 49 83 85 b8 01 00 00 01 49 c7 85 60 01 00 00 00 00
[ 6544.573327] RIP [<ffffffffc02e6701>] vmxnet3_rq_rx_complete+0x8d1/0xeb0 [vmxnet3]
[ 6544.573926] RSP <ffffa172ffc83dc8>
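
For triage, the key fields of a trace like the one above (source file, line number, faulting function, module) can be pulled out with a short script. This is only a sketch: the function name `summarize_oops` and the regular expressions are assumptions written against the format of this particular log, not a general oops parser.

```python
import re

# Matches e.g. "kernel BUG at /path/to/file.c:1413!"
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")

# Matches the final "RIP [<addr>] func+0xoff/0xsize [module]" line.
# The earlier "RIP: 0010:[<...>]" register line is skipped on purpose,
# because "RIP" is followed by a colon there rather than whitespace.
RIP_RE = re.compile(
    r"RIP\s+\[<[0-9a-f]+>\]\s+(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def summarize_oops(log_text):
    """Return source path, line, function and module from an oops log (hypothetical helper)."""
    bug = BUG_RE.search(log_text)
    rip = RIP_RE.search(log_text)
    return {
        "source": bug.group("path") if bug else None,
        "line": int(bug.group("line")) if bug else None,
        "function": rip.group("func") if rip else None,
        "module": rip.group("module") if rip else None,
    }
```

Fields that cannot be found are returned as `None`, so the helper can be run over partial logs as well.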

BR,
Mikolaj Kowalski

ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: linux-image-4.8.0-32-generic 4.8.0-32.34
ProcVersionSignature: Ubuntu 4.8.0-32.34-generic 4.8.11
Uname: Linux 4.8.0-32-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 sty 3 17:47 seq
 crw-rw---- 1 root audio 116, 33 sty 3 17:47 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.3-0ubuntu8.2
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Wed Jan 4 15:24:16 2017
HibernationDevice: RESUME=/dev/mapper/jenkins--vg-swap_1
InstallationDate: Installed on 2016-11-24 (40 days ago)
InstallationMedia: Ubuntu-Server 16.10 "Yakkety Yak" - Release amd64 (20161012.1)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
 Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: VMware, Inc. VMware Virtual Platform
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=pl_PL.UTF-8
 SHELL=/bin/bash
ProcFB: 0 svgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.8.0-32-generic root=/dev/mapper/hostname--vg-root ro crashkernel=384M-:128M
RelatedPackageVersions:
 linux-restricted-modules-4.8.0-32-generic N/A
 linux-backports-modules-4.8.0-32-generic N/A
 linux-firmware 1.161.1
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 04/05/2016
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: 6.00
dmi.board.name: 440BX Desktop Reference Platform
dmi.board.vendor: Intel Corporation
dmi.board.version: None
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 1
dmi.chassis.vendor: No Enclosure
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd04/05/2016:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:
dmi.product.name: VMware Virtual Platform
dmi.product.version: None
dmi.sys.vendor: VMware, Inc.

Mikołaj Kowalski (cmosek) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.10 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Mikołaj Kowalski (cmosek) wrote :

This is a fresh installation.

I've tested
Linux version 4.10.0-041000rc4-generic (kernel@gloin) (gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #201701152031 SMP Mon Jan 16 01:33:39 UTC 2017

First of all, I had a problem booting this kernel on VMware: the screen goes black after a few seconds. I've attached dmesg. Appending nomodeset to the kernel command line helps, though. You can file another bug if you want to.

Second, I think I was able to reproduce the bug on the mentioned kernel. I ran a test using JMeter, and the system hung and restarted. I am not sure, because I do not have a kernel dump. Where can I find debug symbols for the kernel you wanted me to test?

sty 19 17:54:25 jenkins kdump-tools[1259]: Starting kdump-tools: * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201701191754/dump-incomplete
sty 19 17:54:25 jenkins kdump-tools[1259]: check_release: Can't get the kernel version.
sty 19 17:54:25 jenkins kdump-tools[1259]: The kernel version is not supported.
sty 19 17:54:25 jenkins kdump-tools[1259]: The makedumpfile operation may be incomplete.
sty 19 17:54:25 jenkins kdump-tools[1259]: makedumpfile Failed.
sty 19 17:54:25 jenkins kdump-tools[1259]: * kdump-tools: makedumpfile failed, falling back to 'cp'
sty 19 17:54:25 jenkins kdump-tools[1348]: makedumpfile failed, falling back to 'cp'
sty 19 17:54:52 jenkins kernel: hpet_rtc_timer_reinit: 37 callbacks suppressed
sty 19 17:54:53 jenkins kernel: hpet1: lost 1725 rtc interrupts
sty 19 17:55:04 jenkins kdump-tools[1259]: * kdump-tools: saved vmcore in /var/crash/201701191754
sty 19 17:55:04 jenkins kernel: hpet1: lost 759 rtc interrupts
sty 19 17:55:04 jenkins kdump-tools[1362]: saved vmcore in /var/crash/201701191754
sty 19 17:55:04 jenkins kdump-tools[1259]: * running makedumpfile --dump-dmesg /proc/vmcore /var/crash/201701191754/dmesg.201701191754
sty 19 17:55:04 jenkins kernel: hpet1: lost 14 rtc interrupts
sty 19 17:55:04 jenkins kdump-tools[1259]: check_release: Can't get the kernel version.
sty 19 17:55:04 jenkins kdump-tools[1259]: The kernel version is not supported.
sty 19 17:55:04 jenkins kdump-tools[1259]: The makedumpfile operation may be incomplete.
sty 19 17:55:04 jenkins kdump-tools[1259]: makedumpfile Failed.
sty 19 17:55:04 jenkins kdump-tools[1365]: makedumpfile --dump-dmesg failed. dmesg content will be unavailable
sty 19 17:55:04 jenkins kdump-tools[1259]: * kdump-tools: makedumpfile --dump-dmesg failed. dmesg content will be unavailable
sty 19 17:55:04 jenkins kdump-tools[1259]: * kdump-tools: failed to save dmesg content in /var/crash/201701191754
sty 19 17:55:04 jenkins kdump-tools[1366]: failed to save dmesg content in /var/crash/201701191754
sty 19 17:55:04 jenkins kdump-tools[1259]: Thu, 19 Jan 2017 17:55:04 +0100

I can provide the vmcore if you want me to.

tags: added: kernel-bug-exists-upstream
Mikołaj Kowalski (cmosek) wrote :

Is this still incomplete? I also think that the importance should be higher.

cryptz (cryptz) wrote :

I'm seeing this as well. From other posts I've seen, the following is true:

1. It doesn't exist in kernel 4.4.
2. It only affects ESXi 6.5 with HW version 13. HW version 11 is fine.

Shrikrishna Khare (skhare-k) wrote :

I could not repro this issue locally despite running traffic for several hours.

If you hit the issue again, could you please generate a vmss file while the VM is in the panic state? It can be obtained by suspending the VM in the panic state and then locating the vmss file in the directory of the virtual machine on the ESX host. More details here:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2005831
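
Once the VM has been suspended, the files to collect can be located with a few lines of Python. This is only a sketch under stated assumptions: `find_suspend_files` is a hypothetical helper, the VM directory path must be supplied by the caller, and it assumes the memory image (`.vmem`) sits next to the `.vmss` file in that directory.

```python
from pathlib import Path


def find_suspend_files(vm_dir):
    """Return the .vmss and .vmem files found directly in vm_dir, sorted by name."""
    wanted = {".vmss", ".vmem"}
    return sorted(
        p for p in Path(vm_dir).iterdir()
        if p.is_file() and p.suffix in wanted
    )
```

The function only lists files; copying them off the host is left to whatever transfer method is available.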

cryptz (cryptz) wrote :

Just to confirm your test: are you running a HW version 13 VM with kernel ~4.8? I am also using the driver provided in VMware Tools 10.1.5.

It is my understanding that HW version 11 is fine.

Of note, the system can run for a day, but generally speaking it only runs for an hour or two before the lockup.

I will get you the requested info when it happens again tomorrow.

My VMs are light use/haproxy, so while I do have VRRP in the mix, I see too many other people with the issue and no mention of keepalived. I just wanted to point out that you do not need to run large amounts of traffic to the system. It's actually possible that an idle system may be better for reproducing the issue.

cryptz (cryptz) wrote :

Also, I saw your other thread where the number of CPUs is mentioned as a factor. In my case I have 1 socket with 2 CPUs assigned to each VM. I have 2 identical VMs running haproxy. For whatever reason the primary typically locks up 2-3 times a day, while the secondary locks up only once or twice a week.

cryptz (cryptz) wrote :

The vmss file is posted here: http://98.115.67.171/ex.zip

I had to zip it; the webserver wouldn't serve the file otherwise.

Shrikrishna Khare (skhare-k) wrote :

My VMs have been running HW version 13 on 6.5 for several days now, but I haven't hit this issue yet.

Thanks for the vmss, cryptz.

To analyze this, I also need access to ex.vmem. Could you please provide that? It should have been created in the same directory as the vmss file when you suspended the VM.

Do you know how I can get the vmlinux (debug symbols) corresponding to the distro version (16.10?) and kernel you were running?

Shrikrishna Khare (skhare-k) wrote :

This is an issue in the vmxnet3 device emulation (ESX 6.5) and will be fixed in the next update.

In the meantime, suggested workaround:
 - disable rx data ring: ethtool -G eth? rx-mini 0

The issue should not be hit if using HW version 12 or older (with any kernel) or with kernel older than 4.8 (any HW version).
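
The conditions and workaround above can be sketched in Python. This is a minimal illustration, not VMware or Ubuntu tooling: the function names are hypothetical, the HW version has to be supplied by the caller (for example, read from the VM's settings), and the interface name stands in for the `eth?` placeholder. The last function only builds the command; it does not run it.

```python
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '4.8.0-32-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def may_be_affected(kernel_release, hw_version):
    """Apply the rule from the comment above: HW version 12 or older, or a
    kernel older than 4.8, should not hit the issue. Assumes an ESX 6.5 host."""
    return hw_version >= 13 and kernel_major_minor(kernel_release) >= (4, 8)


def workaround_command(interface):
    """Build (but do not run) the ethtool call that disables the rx data ring."""
    return ["ethtool", "-G", interface, "rx-mini", "0"]
```

For example, `may_be_affected("4.8.0-32-generic", 13)` is true for the reporter's setup, and `workaround_command("eth0")` yields the argument list that could be passed to `subprocess.run` as root.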

Thanks,
Shri

siside (siside) wrote :

Could you please confirm which ESX release or build contains the fix?

Thanks,
