Kernel panic skb_segment+0x5d7/0x980

Bug #1377851 reported by Frederik Kriewitz on 2014-10-06
This bug affects 1 person
Affects           Status   Importance   Assigned to   Milestone
linux (Ubuntu)             Undecided    Unassigned
  Trusty                   High         Unassigned
  Utopic                   Undecided    Unassigned

Bug Description

On two Ubuntu 14.04 amd64 servers with tg3 NICs acting as OpenVPN gateways, we recently had a lot of trouble with kernel panics (linux-image-3.13.0-36-generic 3.13.0-36.63).
The panics occurred seemingly at random, sometimes during the boot process and sometimes a couple of hours later.
The boxes had been running perfectly fine for a couple of months before. We believe some kind of "special" packet triggered the bug.

Potentially related bugs:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1331219
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1313591

Upgrading to linux-image-3.16.3-031603-generic (3.16.3-031603.201409171435) solved the problem for us.

[ 6076.726520] BUG: unable to handle kernel NULL pointer dereference at 000000000000006c
[ 6076.737716] IP: [<ffffffff81616787>] skb_segment+0x5d7/0x980
[ 6076.745780] PGD 0
[ 6076.748641] Oops: 0000 [#1] SMP
[ 6076.753268] Modules linked in: btrfs ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c cdc_ether usbnet mii mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase ipmi_si ipmi_devintf dell_rbu gpio_ich intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp dcdbas kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd 8021q garp stp mrp llc sb_edac edac_core shpchp joydev pl2303 usbserial lpc_ich wmi mei_me mei mac_hid acpi_power_meter ioatdma nf_conntrack dca lp parport raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 tg3 ahci hid_generic raid0 ptp usbhid multipath hid libahci pps_core linear [last unloaded: ipmi_si]
[ 6076.850743] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.13.0-36-generic #63-Ubuntu
[ 6076.861485] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 1.6.0 03/07/2013
[ 6076.872104] task: ffff880223841800 ti: ffff880223848000 task.ti: ffff880223848000
[ 6076.882722] RIP: 0010:[<ffffffff81616787>] [<ffffffff81616787>] skb_segment+0x5d7/0x980
[ 6076.894243] RSP: 0018:ffff880227263790 EFLAGS: 00010246
[ 6076.901769] RAX: 0000000000000646 RBX: ffff88021f03f000 RCX: ffff8801ed4fff00
[ 6076.911893] RDX: 0000000000000646 RSI: 00000000000000c2 RDI: ffffea0007f6de00
[ 6076.971665] RBP: ffff880227263858 R08: 000000000000fff6 R09: 0000000000000001
[ 6077.031463] R10: ffff88021f03e800 R11: 0000000000010552 R12: ffff8801fdb1fc80
[ 6077.091313] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000646
[ 6077.151619] FS: 0000000000000000(0000) GS:ffff880227260000(0000) knlGS:0000000000000000
[ 6077.262424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6077.320083] CR2: 000000000000006c CR3: 0000000001c0e000 CR4: 00000000000407e0
[ 6077.379854] Stack:
[ 6077.433555] ffffffff811a48f9 ffff8802272772c0 000000000000fff6 ffffffffffff000a
[ 6077.546280] ffffffff00010552 000000000000006a ffff88021f03e800 0000000100000020
[ 6077.658915] ffffffffffffffe4 0000000000010012 0000001c0000055c ffff88021f03f000
[ 6077.771400] Call Trace:
[ 6077.824494] <IRQ>
[ 6077.827251] [<ffffffff811a48f9>] ? __kmalloc_node_track_caller+0xb9/0x290
[ 6077.933211] [<ffffffff8168149d>] tcp_gso_segment+0x10d/0x3f0
[ 6077.988641] [<ffffffff81691822>] inet_gso_segment+0x132/0x360
[ 6078.043154] [<ffffffff810a5db2>] ? enqueue_task_fair+0x422/0x6c0
[ 6078.097358] [<ffffffff81623ffc>] skb_mac_gso_segment+0x9c/0x180
[ 6078.150464] [<ffffffff816a0fb4>] gre_gso_segment+0x134/0x370
[ 6078.202321] [<ffffffff8109828d>] ? ttwu_do_activate.constprop.74+0x5d/0x70
[ 6078.255348] [<ffffffff81691822>] inet_gso_segment+0x132/0x360
[ 6078.306129] [<ffffffff8109a800>] ? try_to_wake_up+0x240/0x2c0
[ 6078.355712] [<ffffffff81623ffc>] skb_mac_gso_segment+0x9c/0x180
[ 6078.404660] [<ffffffff8162413d>] __skb_gso_segment+0x5d/0xb0
[ 6078.452918] [<ffffffff8162444a>] dev_hard_start_xmit+0x18a/0x560
[ 6078.501057] [<ffffffff8164360e>] sch_direct_xmit+0xee/0x1c0
[ 6078.548821] [<ffffffff81624a50>] __dev_queue_xmit+0x230/0x500
[ 6078.596793] [<ffffffff81624d30>] dev_queue_xmit+0x10/0x20
[ 6078.644041] [<ffffffff8162be31>] neigh_direct_output+0x11/0x20
[ 6078.691822] [<ffffffff8165d370>] ip_finish_output+0x1b0/0x3b0
[ 6078.739211] [<ffffffff8165e8d8>] ip_output+0x58/0x90
[ 6078.784448] [<ffffffff8165a84b>] ip_forward_finish+0x8b/0x170
[ 6078.830211] [<ffffffff8165ac85>] ip_forward+0x355/0x410
[ 6078.874484] [<ffffffff8165899d>] ip_rcv_finish+0x7d/0x350
[ 6078.918046] [<ffffffff816592e8>] ip_rcv+0x298/0x3d0
[ 6078.959829] [<ffffffff81622bb6>] __netif_receive_skb_core+0x666/0x840
[ 6079.003064] [<ffffffff8101b200>] ? flush_ptrace_hw_breakpoint+0x30/0x60
[ 6079.045844] [<ffffffff81622da8>] __netif_receive_skb+0x18/0x60
[ 6079.086823] [<ffffffff81622e13>] netif_receive_skb+0x23/0x90
[ 6079.126412] [<ffffffff81622f24>] napi_gro_complete+0xa4/0xe0
[ 6079.164669] [<ffffffff816234a0>] dev_gro_receive+0x210/0x2d0
[ 6079.203193] [<ffffffff816237e5>] napi_gro_receive+0x25/0xb0
[ 6079.242062] [<ffffffffa00d8c2b>] tg3_poll_work+0xc2b/0xf30 [tg3]
[ 6079.281003] [<ffffffffa00d8f6b>] tg3_poll_msix+0x3b/0x140 [tg3]
[ 6079.319178] [<ffffffff81623192>] net_rx_action+0x152/0x250
[ 6079.356843] [<ffffffff8106cbac>] __do_softirq+0xec/0x2c0
[ 6079.394099] [<ffffffff8106d0f5>] irq_exit+0x105/0x110
[ 6079.430844] [<ffffffff817312d6>] do_IRQ+0x56/0xc0
[ 6079.466888] [<ffffffff81726a6d>] common_interrupt+0x6d/0x6d
[ 6079.503660] <EOI>
[ 6079.506415] [<ffffffff815d11bf>] ? cpuidle_enter_state+0x4f/0xc0
[ 6079.573717] [<ffffffff815d12e9>] cpuidle_idle_call+0xb9/0x1f0
[ 6079.611376] [<ffffffff8101cede>] arch_cpu_idle+0xe/0x30
[ 6079.648243] [<ffffffff810bed95>] cpu_startup_entry+0xc5/0x290
[ 6079.685771] [<ffffffff81041018>] start_secondary+0x218/0x2c0
[ 6079.723121] Code: 4c 24 60 eb 21 0f 1f 80 00 00 00 00 41 83 c5 01 49 83 c4 10 48 83 c1 10 41 39 c3 0f 86 83 01 00 00 41 89 c7 89 c2 45 39 e9 7f 37 <41> 8b 46 6c 41 39 46 68 0f 85 75 03 00 00 45 8b a6 cc 00 00 00
[ 6079.844842] RIP [<ffffffff81616787>] skb_segment+0x5d7/0x980
[ 6079.885283] RSP <ffff880227263790>
[ 6079.922830] CR2: 000000000000006c
[ 6080.024035] ---[ end trace 6e658236aae2d239 ]---
[ 6080.065884] Kernel panic - not syncing: Fatal exception in interrupt
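
For readers who want to pull the call chain out of an oops like the one above, here is a minimal Python sketch. It keeps only the frames printed without a leading `?`; a `?` marks an address that was found on the stack but that the unwinder could not confirm as part of the chain. The regular expression and the names `FRAME`, `reliable_frames`, and `SAMPLE` are illustrative and not part of any kernel tooling; `SAMPLE` is six lines copied from the trace.

```python
import re

# One call-trace line: "[ time] [<address>] [? ]function+0xoff/0xsize [module]"
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+\[<[0-9a-f]+>\]\s+"
    r"(?P<guess>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def reliable_frames(log_text):
    """Return (function, module) pairs for frames not marked with '?'."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME.match(line)
        if m and not m.group("guess"):
            frames.append((m.group("func"), m.group("module")))
    return frames


# Six lines copied verbatim from the trace in this report.
SAMPLE = """\
[ 6077.933211] [<ffffffff8168149d>] tcp_gso_segment+0x10d/0x3f0
[ 6077.988641] [<ffffffff81691822>] inet_gso_segment+0x132/0x360
[ 6078.043154] [<ffffffff810a5db2>] ? enqueue_task_fair+0x422/0x6c0
[ 6078.097358] [<ffffffff81623ffc>] skb_mac_gso_segment+0x9c/0x180
[ 6078.150464] [<ffffffff816a0fb4>] gre_gso_segment+0x134/0x370
[ 6079.242062] [<ffffffffa00d8c2b>] tg3_poll_work+0xc2b/0xf30 [tg3]
"""

for func, module in reliable_frames(SAMPLE):
    print(func if module is None else f"{func} [{module}]")
```

On the sample, the `enqueue_task_fair` line is dropped and the remaining five frames are printed innermost first, which makes the nested GSO path (TCP inside GRE) easier to see.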

ethtool -k port1
Features for port1:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on [fixed]
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
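
The feature list above can be turned into something easier to scan. The following is a minimal Python sketch, assuming the `ethtool -k` output has been saved as text; `parse_ethtool_features` and `SAMPLE` are hypothetical names used only for illustration. It separates features that are on and changeable from those marked `[fixed]`. In this report GSO and GRO are on while `tx-gre-segmentation` is off and fixed, which seems consistent with the GRE segmentation being done in software in the call trace, though that reading is an inference and not something confirmed in this thread.

```python
def parse_ethtool_features(text):
    """Map feature name -> (enabled, fixed) from `ethtool -k` style output."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        words = value.split()
        if not sep or not words or words[0] not in ("on", "off"):
            continue  # header line such as "Features for port1:"
        features[name] = (words[0] == "on", "[fixed]" in words)
    return features


# A subset of the lines from the output in this report.
SAMPLE = """\
Features for port1:
scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
tx-gre-segmentation: off [fixed]
"""

features = parse_ethtool_features(SAMPLE)
# Features that are currently on and not locked by the driver.
changeable_on = sorted(n for n, (on, fixed) in features.items() if on and not fixed)
print(changeable_on)
```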

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1377851

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Frederik Kriewitz (freddy436) wrote :

backtrace already posted

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream stable kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.13 stable kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11.8-trusty/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
Frederik Kriewitz (freddy436) wrote :

We can't test with the 3.13 mainline kernel, as these are production servers and we can't risk another outage at this time.
As mentioned in the first post, it appears to be fixed in the 3.16 mainline kernel.

tags: added: kernel-unable-to-test-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Chris J Arges (arges) on 2014-10-07
Changed in linux (Ubuntu Trusty):
status: New → Confirmed
importance: Undecided → High
Changed in linux (Ubuntu Utopic):
status: Confirmed → Fix Released
importance: High → Undecided
Chris J Arges (arges) wrote :

I strongly suspect this patch:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c3caf1192f904de2f1381211f564537235d50de3

Regardless, this patch should be applied to 3.13.y, so I'll send an email to have it included.

Luis Henriques (henrix) on 2014-10-09
Changed in linux (Ubuntu Trusty):
status: Confirmed → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Chris J Arges (arges) wrote :

This patch fixes an issue with a currently applied patch to the ubuntu-trusty kernel. If the patch was applied to 3.13.11.y, I would suggest it go through the stable process and be applied there (and in fact it is applied to other stables). Overall it would be best if the original reporter could verify if this patch does fix the issue; but regardless I think this patch should be applied.

Frederik Kriewitz (freddy436) wrote :

We installed 3.13.0-38-generic from proposed during maintenance today on one of the initially affected servers.
We'll provide an update soon.

Frederik Kriewitz (freddy436) wrote :

No crashes so far with 3.13.0-38-generic, looks like it's fixed.

tags: added: verification-done-trusty
removed: verification-needed-trusty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-39.66

---------------
linux (3.13.0-39.66) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1386629

  [ Upstream Kernel Changes ]

  * KVM: x86: Check non-canonical addresses upon WRMSR
    - LP: #1384539
    - CVE-2014-3610
  * KVM: x86: Prevent host from panicking on shared MSR writes.
    - LP: #1384539
    - CVE-2014-3610
  * KVM: x86: Improve thread safety in pit
    - LP: #1384540
    - CVE-2014-3611
  * KVM: x86: Fix wrong masking on relative jump/call
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Warn if guest virtual address space is not 48-bits
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Emulator fixes for eip canonical checks on near branches
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: emulating descriptor load misses long-mode case
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Handle errors when RIP is set during far jumps
    - LP: #1384545
    - CVE-2014-3647
  * kvm: vmx: handle invvpid vm exit gracefully
    - LP: #1384544
    - CVE-2014-3646
  * Input: synaptics - gate forcepad support by DMI check
    - LP: #1381815

linux (3.13.0-38.65) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1379244

  [ Andy Whitcroft ]

  * Revert "SAUCE: scsi: hyper-v storsvc switch up to SPC-3"
    - LP: #1354397
  * [Config] linux-image-extra is additive to linux-image
    - LP: #1375310
  * [Config] linux-image-extra postrm is not needed on purge
    - LP: #1375310

  [ Upstream Kernel Changes ]

  * Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
    - LP: #1377564
  * Revert "USB: option,zte_ev: move most ZTE CDMA devices to zte_ev"
    - LP: #1377564
  * aufs: bugfix, stop calling security_mmap_file() again
    - LP: #1371316
  * ipvs: fix ipv6 hook registration for local replies
    - LP: #1349768
  * Drivers: add blist flags
    - LP: #1354397
  * sd: fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout
    - LP: #1354397
  * drm/i915/bdw: Add 42ms delay for IPS disable
    - LP: #1374389
  * drm/i915: add null render states for gen6, gen7 and gen8
    - LP: #1374389
  * drm/i915/bdw: 3D_CHICKEN3 has write mask bits
    - LP: #1374389
  * drm/i915/bdw: Disable idle DOP clock gating
    - LP: #1374389
  * drm/i915: call lpt_init_clock_gating on BDW too
    - LP: #1374389
  * drm/i915: shuffle panel code
    - LP: #1374389
  * drm/i915: extract backlight minimum brightness from VBT
    - LP: #1374389
  * drm/i915: respect the VBT minimum backlight brightness
    - LP: #1374389
  * drm/i915/bdw: Apply workarounds in render ring init function
    - LP: #1374389
  * drm/i915/bdw: Cleanup pre prod workarounds
    - LP: #1374389
  * drm/i915: Replace hardcoded cacheline size with macro
    - LP: #1374389
  * drm/i915: Refactor Broadwell PIPE_CONTROL emission into a helper.
    - LP: #1374389
  * drm/i915: Add the WaCsStallBeforeStateCacheInvalidate:bdw workaround.
    - LP: #1374389
  * drm/i915/bdw: Remove BDW preproduction W/As until C stepping.
    - LP: #1374389
  * mptfusion: enable no_write_same for vmware scsi disks
    - LP: #1371591
  * iommu/amd: Fix cleanup_domai...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
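
For anyone checking a fleet of Trusty machines against this report, here is a rough Python sketch that classifies a kernel release string. The threshold is an assumption: it treats 3.13.0-38 as the first fixed ABI solely because the reporter saw no crashes with 3.13.0-38-generic from -proposed, and Launchpad lists the fix as released in 3.13.0-39.66. The names `FIRST_VERIFIED_ABI` and `trusty_has_fix` are hypothetical.

```python
import re

# Assumption: 3.13.0-38 is the first Trusty ABI taken to carry the fix,
# based only on the reporter's test of 3.13.0-38-generic from -proposed.
FIRST_VERIFIED_ABI = 38


def trusty_has_fix(release):
    """Return True/False for 3.13.0-N releases, None for anything else."""
    m = re.match(r"^3\.13\.0-(\d+)(?:-|$)", release)
    if m is None:
        return None  # different series (e.g. a mainline build): no claim made
    return int(m.group(1)) >= FIRST_VERIFIED_ABI


for release in ("3.13.0-36-generic", "3.13.0-38-generic", "3.16.3-031603-generic"):
    print(release, trusty_has_fix(release))
```

On a live system the release string could be taken from `platform.release()` in the standard library. The function deliberately returns `None` for kernels outside the 3.13.0 series, since comparing ABI numbers across series would not be meaningful.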