linux 4.4.0-93 triggers call traces

Bug #1713787 reported by Simon Déziel
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Incomplete
Importance: High
Assigned to: Unassigned

Bug Description

== Regression details ==

Since rolling out 4.4.0-93, we have been getting many nearly identical call traces that look like this:

------------[ cut here ]------------
WARNING: CPU: 0 PID: XXXX at /build/linux-YyUNAI/linux-4.4.0/net/core/dev.c:2445 skb_warn_bad_offload+0xd1/0x120()
bond0: caps=(0x000000000fd97ba9, 0x0000000000000000) len=1487 data_len=0 gso_size=1440 gso_type=6 ip_summed=0
Modules linked in: binfmt_misc nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_filter ip_tables x_tables 8021q garp mrp bonding ipmi_devintf dcdbas intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul ipmi_ssif crc32_pclmul bridge ghash_clmulni_intel stp llc aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd vhost_net vhost macvtap macvlan sb_edac edac_core kvm_intel kvm ipmi_si 8250_fintek ipmi_msghandler lpc_ich irqbypass shpchp mei_me mei acpi_power_meter mac_hid drbd lru_cache autofs4 btrfs xor raid6_pq bnx2x mxm_wmi vxlan ip6_udp_tunnel tg3 udp_tunnel ahci mdio ptp libcrc32c libahci megaraid_sas pps_core fjes wmi
CPU: 0 PID: XXXX Comm: qemu-system-x86 Not tainted 4.4.0-93-generic #116-Ubuntu
Hardware name: Dell Inc. PowerEdge R830/0VVT0H, BIOS 1.3.4 11/09/2016
 0000000000000286 f5a0655ded96de87 ffff88bd7f2038c0 ffffffff813f9f83
 ffff88bd7f203908 ffffffff81d6f780 ffff88bd7f2038f8 ffffffff810812f2
 ffff88b6e118e000 ffff8a3d17f52000 0000000000000006 0000000000000081
Call Trace:
 <IRQ> [<ffffffff813f9f83>] dump_stack+0x63/0x90
 [<ffffffff810812f2>] warn_slowpath_common+0x82/0xc0
 [<ffffffff8108138c>] warn_slowpath_fmt+0x5c/0x80
 [<ffffffff814000a2>] ? ___ratelimit+0xa2/0xe0
 [<ffffffff81735cd1>] skb_warn_bad_offload+0xd1/0x120
 [<ffffffff817393dd>] __skb_gso_segment+0xfd/0x110
 [<ffffffff8173971d>] validate_xmit_skb.isra.97.part.98+0x10d/0x2b0
 [<ffffffff8173a352>] __dev_queue_xmit+0x582/0x590
 [<ffffffff8173a370>] dev_queue_xmit+0x10/0x20
 [<ffffffffc03fd848>] vlan_dev_hard_start_xmit+0x98/0x120 [8021q]
 [<ffffffff81739b69>] dev_hard_start_xmit+0x249/0x3d0
 [<ffffffff8173a2f6>] __dev_queue_xmit+0x526/0x590
 [<ffffffff8173a370>] dev_queue_xmit+0x10/0x20
 [<ffffffffc05b0f2a>] br_dev_queue_push_xmit+0x7a/0x140 [bridge]
 [<ffffffffc05b107d>] br_forward_finish+0x3d/0xb0 [bridge]
 [<ffffffffc05b2860>] ? br_handle_frame_finish+0x3a0/0x620 [bridge]
 [<ffffffffc05b1196>] __br_forward+0xa6/0x130 [bridge]
 [<ffffffffc05b1727>] br_forward+0x87/0x90 [bridge]
 [<ffffffffc05b2860>] br_handle_frame_finish+0x3a0/0x620 [bridge]
 [<ffffffff817374a4>] ? __netif_receive_skb_core+0x364/0xa60
 [<ffffffff81738400>] ? dev_gro_receive+0x1c0/0x3c0
 [<ffffffffc05b2c54>] br_handle_frame+0x174/0x2b0 [bridge]
 [<ffffffffc00d73db>] ? tg3_poll_work+0x4cb/0xf10 [tg3]
 [<ffffffff81737bb8>] ? __netif_receive_skb+0x18/0x60
 [<ffffffff817374a4>] __netif_receive_skb_core+0x364/0xa60
 [<ffffffff81737e0a>] ? napi_gro_flush+0x5a/0x80
 [<ffffffff81737bb8>] __netif_receive_skb+0x18/0x60
 [<ffffffff817389a8>] process_backlog+0xa8/0x150
 [<ffffffff817380fe>] net_rx_action+0x21e/0x360
 [<ffffffff81085dd1>] __do_softirq+0x101/0x290
 [<ffffffff81844f0c>] do_softirq_own_stack+0x1c/0x30
 <EOI> [<ffffffff81085818>] do_softirq.part.19+0x38/0x40
 [<ffffffff81085fcd>] do_softirq+0x1d/0x20
 [<ffffffff817367b3>] netif_rx_ni+0x33/0x80
 [<ffffffff816063e6>] tun_get_user+0x506/0x880
 [<ffffffff81606827>] tun_chr_write_iter+0x57/0x70
 [<ffffffff8120ecfc>] do_iter_readv_writev+0x6c/0xa0
 [<ffffffff8120f87f>] do_readv_writev+0x18f/0x230
 [<ffffffff8120f9a9>] vfs_writev+0x39/0x50
 [<ffffffff812106d9>] SyS_writev+0x59/0xf0
 [<ffffffff818431f2>] entry_SYSCALL_64_fastpath+0x16/0x71
---[ end trace 9fe5c984ff319e6a ]---
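
To see whether every occurrence carries the same offload parameters, the detail line that follows the WARNING can be pulled out of dmesg programmatically. The following is a minimal sketch, assuming the line keeps the layout shown in the trace above; the function name parse_bad_offload and the group names caps_a/caps_b are hypothetical labels chosen for illustration, not kernel terminology.

```python
import re

# Layout assumed from the trace above:
# "<dev>: caps=(0x..., 0x...) len=N data_len=N gso_size=N gso_type=N ip_summed=N"
BAD_OFFLOAD_RE = re.compile(
    r"(?P<dev>\S+): caps=\((?P<caps_a>0x[0-9a-fA-F]+), (?P<caps_b>0x[0-9a-fA-F]+)\)"
    r" len=(?P<len>\d+) data_len=(?P<data_len>\d+)"
    r" gso_size=(?P<gso_size>\d+) gso_type=(?P<gso_type>\d+)"
    r" ip_summed=(?P<ip_summed>\d+)"
)

INT_FIELDS = ("len", "data_len", "gso_size", "gso_type", "ip_summed")


def parse_bad_offload(line):
    """Return the fields of one detail line as a dict, or None if no match."""
    match = BAD_OFFLOAD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = ("bond0: caps=(0x000000000fd97ba9, 0x0000000000000000) "
              "len=1487 data_len=0 gso_size=1440 gso_type=6 ip_summed=0")
    print(parse_bad_offload(sample))
```

Feeding each line of a saved dmesg extract through parse_bad_offload and skipping the None results gives one dict per warning, which makes it easy to group the traces by device or by gso_type.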

For now this doesn't seem to have any user-visible impact, but we have not been watching latency or packet loss closely.

Discovered in version: linux-image-4.4.0-93-generic 4.4.0-93.116
Last known good version: linux-image-4.4.0-92-generic 4.4.0-92.115

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-93-generic 4.4.0-93.116
ProcVersionSignature: Ubuntu 4.4.0-93.116-generic 4.4.79
Uname: Linux 4.4.0-93-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Aug 28 12:27 seq
 crw-rw---- 1 root audio 116, 33 Aug 28 12:27 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: [Errno 2] No such file or directory: 'fuser'
Date: Tue Aug 29 13:46:12 2017
HibernationDevice: RESUME=/dev/mapper/vghost-swap
InstallationDate: Installed on 2017-04-01 (149 days ago)
InstallationMedia: Ubuntu-Server 16.04.2 LTS "Xenial Xerus" - Release amd64 (20170215.8)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 002: ID 8087:8002 Intel Corp.
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 413c:a001 Dell Computer Corp. Hub
 Bus 001 Device 002: ID 8087:800a Intel Corp.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. PowerEdge R830
PciMultimedia:

ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-93-generic.efi.signed root=/dev/mapper/vghost-root ro panic=300
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-93-generic N/A
 linux-backports-modules-4.4.0-93-generic N/A
 linux-firmware 1.157.11
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 11/09/2016
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.3.4
dmi.board.name: 0VVT0H
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.3.4:bd11/09/2016:svnDellInc.:pnPowerEdgeR830:pvr:rvnDellInc.:rn0VVT0H:rvrA01:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R830
dmi.sys.vendor: Dell Inc.

Revision history for this message
Simon Déziel (sdeziel) wrote :

So far those messages were only noticed on 2 of our hypervisor nodes. I'm attaching the relevant extracts of dmesg on those 2 machines.

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Simon Déziel (sdeziel) wrote :

I'm now seeing similar traces on a wildly different machine that's not using the bonding module. All machines are KVM hypervisors with VHOST_NET_ENABLED=1. I'm attaching multiple dmesg extracts from 3 machines: 2 PowerEdge R830s and 1 old Core 2 Duo/Xeon.

summary: - linux 4.4.0-93 triggers call traces related to the bonding module
+ linux 4.4.0-93 triggers call traces
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.13 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc7

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: needs-bisect
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Simon Déziel (sdeziel) wrote :

This really looks like https://patchwork.ozlabs.org/patch/799015/ which fortunately made it into 4.4.0-94.117. I'll give that -proposed kernel a test run.

Revision history for this message
Simon Déziel (sdeziel) wrote :

4.4.0-94.117 has solved the problem, so I'll mark this as a duplicate of LP: #1711535 (update to stable kernel 4.4.82).
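
For anyone auditing a fleet, the versions named in this report (4.4.0-92 good, 4.4.0-93 affected, 4.4.0-94 fixed) can be checked against the running kernel. This is only a sketch under that assumption: the helper names xenial_abi and has_reported_regression are hypothetical, and the check says nothing about kernels outside the 4.4.0 series.

```python
import platform
import re


def xenial_abi(release):
    """Extract the ABI number from a release string such as '4.4.0-93-generic'.

    Returns None for anything that is not a 4.4.0-N-<flavour> kernel.
    """
    match = re.match(r"4\.4\.0-(\d+)-", release)
    return int(match.group(1)) if match else None


def has_reported_regression(release):
    """True only for the one ABI this report identifies as affected (93)."""
    return xenial_abi(release) == 93


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    if has_reported_regression(running):
        print(running + ": affected by the call traces in this report")
    else:
        print(running + ": not the kernel this report identifies as affected")
```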
