IPv6 related kernel panic following upgrade to 3.13.0-43

Bug #1404558 reported by Stéphane Graber on 2014-12-20
This bug affects 7 people
Affects: linux (Ubuntu); Importance: Critical; Assigned to: Andy Whitcroft
Affects: Trusty; Importance: Critical; Assigned to: Andy Whitcroft

Bug Description

After updating a dozen machines from 3.13.0-40 to 3.13.0-43, all of them kernel panicked within the next 24 hours.

I managed to pull the console from one over an IP KVM and it shows a panic related to IPv6 networking:
https://dl.stgraber.org/panic-3.13-43.png

All affected machines had native IPv6 connectivity to the Internet.

Downgrading to 3.13.0-40 resolved the issue (so it's clearly a regression) and upgrading to lts-utopic 3.16.0-28-generic also appeared to do the trick.

A friend also just reported seeing the exact same problem on his server which also has native IPv6 connectivity so the issue appears pretty widespread.

Stéphane Graber (stgraber) wrote :

Might be worth mentioning that all affected hosts are x86 64bit Intel.

For those I've got access to, the issue happened on:
 - 2x Xeon E3-1245v2
 - 1x Xeon E5-2620v2
 - 1x Atom C2750
 - 1x Atom D2500
 - 1x Core i5 750

All running on pretty standard Intel boards, so the usual set of Intel chipsets for their generation.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1404558

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: utopic
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Critical
Joseph Salisbury (jsalisbury) wrote :

Hi Stephane,

Can you see if this issue was already fixed in the latest 3.13 upstream kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11-ckt13-trusty/

If it was not, we can bisect the issue.

It might also be worth testing the latest mainline kernel, to see if this issue came down from upstream:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19-rc2-vivid/

tags: added: regression-update
tags: added: kernel-key trusty
Andy Whitcroft (apw) wrote :

Could you also give us a flavour for how reproducible this is?

If this is seemingly fixed in the lts-utopic, the -ckt13 test is likely the most informative. There is also a kernel sitting in -proposed which might be worth a shot too.

Stéphane Graber (stgraber) wrote :

I don't have a reproducer other than install the kernel and wait 24h as that's how long it took for some systems to panic...

From a very quick look, those kernels are mainline kernels. Unfortunately, all my hosts are LXC hosts using unprivileged containers with overlayfs, so I need kernels with the Ubuntu patchset applied for them to be usable.

For now, I'll go test the current -proposed kernel on my least critical system, see if I can get that one to panic.

Stéphane Graber (stgraber) wrote :

Rebooted the Xeon E5-2620v2 system on linux-image-3.13.0-44-generic now.

Stéphane Graber (stgraber) wrote :

Reproduced the panic with -44, same stack trace, screenshot attached. Booting the machine back on -40 now.
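
For anyone triaging several hosts, the running kernel can be compared against the builds named in this thread so far. A minimal sketch, assuming only what is reported here (panics on 3.13.0-43 and 3.13.0-44, none on 3.13.0-40). The set `REPORTED_PANIC_ABIS` and both function names are hypothetical, and a `False` result means "not reported in this thread", not "known good":

```python
import platform
import re

# ABI numbers of 3.13.0 builds reported to panic in this thread (assumption:
# nothing is known here about builds that are not mentioned).
REPORTED_PANIC_ABIS = {43, 44}


def trusty_abi(release):
    """Return the ABI number of a 3.13.0-NN release string, else None."""
    match = re.match(r"^3\.13\.0-(\d+)", release)
    return int(match.group(1)) if match else None


def reported_in_thread(release):
    """True if this release string is one of the builds reported to panic."""
    return trusty_abi(release) in REPORTED_PANIC_ABIS


if __name__ == "__main__":
    running = platform.release()  # e.g. "3.13.0-44-generic"
    print(running, "reported to panic:", reported_in_thread(running))
```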

Andy Whitcroft (apw) wrote :

This might be related to a backport issue in the upstream stable patch below:

  commit 4fab9071950c2021d846e18351e0f46a1cffd67b
  Author: Neal Cardwell <email address hidden>
  Date: Thu Aug 14 12:40:05 2014 -0400

    tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()

I have produced some test kernels with the backport corrected. Could you try them to confirm whether this is indeed the underlying issue? Kernels are at the URL below:

    http://people.canonical.com/~apw/lp1404558-trusty/

Please report any testing back here.
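
The commit title above describes selecting the mtu_reduced() handler through the socket's address family instead of calling one fixed handler. The following is a toy model of that idea only. All names (`AfOps`, `Sock`, `release_cb`, the two handlers) are hypothetical, and it reproduces neither the kernel code nor the trusty backport:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AfOps:
    """Per-address-family callbacks (toy stand-in, not a kernel structure)."""
    mtu_reduced: Callable[..., str]


@dataclass
class Sock:
    """A socket that carries the ops table of its own address family."""
    name: str
    af_ops: AfOps


def v4_mtu_reduced(sk):
    return f"{sk.name}: IPv4 MTU handling"


def v6_mtu_reduced(sk):
    return f"{sk.name}: IPv6 MTU handling"


IPV4_OPS = AfOps(mtu_reduced=v4_mtu_reduced)
IPV6_OPS = AfOps(mtu_reduced=v6_mtu_reduced)


def release_cb(sk):
    # Dispatch through the socket's own ops table, so each family gets
    # its own handler rather than one hard-coded function.
    return sk.af_ops.mtu_reduced(sk)


if __name__ == "__main__":
    print(release_cb(Sock("web", IPV4_OPS)))
    print(release_cb(Sock("web6", IPV6_OPS)))
```

In this model, wiring the wrong handler into an ops table sends one family's sockets through the other family's code path, which is the kind of mistake a backport can introduce.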

Andy Whitcroft (apw) on 2015-01-06
Changed in linux (Ubuntu):
assignee: nobody → Andy Whitcroft (apw)
milestone: none → ubuntu-15.01
Andy Whitcroft (apw) on 2015-01-06
Changed in linux (Ubuntu Trusty):
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Andy Whitcroft (apw)
tags: added: kernel-da-key
removed: kernel-key
Stéphane Graber (stgraber) wrote :

Almost 24 hours and no kernel panic so far!

Stéphane Graber (stgraber) wrote :

Still no panic after 48h, let's call it good.

Andy Whitcroft (apw) wrote :

Patch pushed up to kernel-team@ for SRU.

Brad Figg (brad-figg) on 2015-01-09
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Andy Whitcroft (apw) wrote :

This was a backport-specific issue; trusty alone is affected.

Changed in linux (Ubuntu):
status: Confirmed → Invalid
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Stéphane Graber (stgraber) wrote :

Booted the new kernel and so far so good. Since there is no real reproducer for this bug, I'll mark this as verification-done and will come flip it back to verification-failed if the box panics by the time you push this kernel to updates.

tags: added: verification-done-trusty
removed: verification-needed-trusty
Stephen Frost (sfrost) wrote :

I've been running the new kernel across all of my systems and have not had any problems for the past 2 days, so I'd call this good. Previously, failures were happening pretty quickly and always within a day. Thanks!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-45.74

---------------
linux (3.13.0-45.74) trusty; urgency=low

  [ Seth Forshee ]

  * Release Tracking Bug
    - LP: #1410384

  [ Jesse Barnes ]

  * SAUCE: drm/i915/vlv: assert and de-assert sideband reset at boot and
    resume v3
    - LP: #1401963

  [ K. Y. Srinivasan ]

  * SAUCE: storvsc: force SPC-3 compliance on win8 and win8 r2 hosts
    - LP: #1406867

  [ Timo Aaltonen ]

  * SAUCE: Switch VLV/BYT to use i915_bdw.
    - LP: #1401963

  [ Upstream Kernel Changes ]

  * Revert "xhci: clear root port wake on bits if controller isn't wake-up
    capable"
    - LP: #1408779
  * KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
    - LP: #1400209
  * e1000e: Fix no connectivity when driver loaded with cable out
    - LP: #1400365
  * net/mlx4_core: Enable CQE/EQE stride support
    - LP: #1400127
  * net/mlx4_core: Cache line EQE size support
    - LP: #1400127
  * net/mlx4_en: Add mlx4_en_get_cqe helper
    - LP: #1400127
  * net/mlx4_core: Introduce mlx4_get_module_info for cable module info
    reading
    - LP: #1400127
  * ethtool, net/mlx4_en: Cable info, get_module_info/eeprom ethtool
    support
    - LP: #1400127
  * net/mlx4_core: Introduce ACCESS_REG CMD and eth_prot_ctrl dev cap
    - LP: #1400127
  * net/mlx4_core: Add ethernet backplane autoneg device capability
    - LP: #1400127
  * ethtool, net/mlx4_en: Add 100M, 20G, 56G speeds ethtool reporting
    support
    - LP: #1400127
  * net/mlx4_en: Use PTYS register to query ethtool settings
    - LP: #1400127
  * net/mlx4_en: Use PTYS register to set ethtool settings (Speed)
    - LP: #1400127
  * net/mlx4_en: Add support for setting rxvlan offload OFF/ON
    - LP: #1400127
  * net/mlx4_en: Add ethtool support for [rx|tx]vlan offload set to OFF/ON
    - LP: #1400127
  * net/mlx4_core: Prevent VF from changing port configuration
    - LP: #1400127
  * net/mlx4_en: mlx4_en_set_settings() always fails when autoneg is set
    - LP: #1400127
  * ipv4: fix nexthop attlen check in fib_nh_match
    - LP: #1408779
  * vxlan: fix a use after free in vxlan_encap_bypass
    - LP: #1408779
  * vxlan: using pskb_may_pull as early as possible
    - LP: #1408779
  * vxlan: fix a free after use
    - LP: #1408779
  * ipv4: fix a potential use after free in ip_tunnel_core.c
    - LP: #1408779
  * ax88179_178a: fix bonding failure
    - LP: #1408779
  * tcp: md5: do not use alloc_percpu()
    - LP: #1408779
  * ipv4: dst_entry leak in ip_send_unicast_reply()
    - LP: #1408779
  * drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
    - LP: #1408779
  * drivers/net: macvtap and tun depend on INET
    - LP: #1408779
  * ip6_tunnel: Use ip6_tnl_dev_init as the ndo_init function.
    - LP: #1408779
  * vti6: Use vti6_dev_init as the ndo_init function.
    - LP: #1408779
  * sit: Use ipip6_tunnel_init as the ndo_init function.
    - LP: #1408779
  * gre6: Move the setting of dev->iflink into the ndo_init functions.
    - LP: #1408779
  * vxlan: Do not reuse sockets for a different address family
    - LP: #1408779
  * net: sctp: fix memory leak in auth key management
    - LP: #1408779
  * smsc911x: power-...


Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Richard van der Hoff (richvdh) wrote :

I was previously seeing this problem; I've now upgraded to 3.13.0-46 and am seeing a similar, but slightly different, panic (see attached).

Stéphane Graber (stgraber) wrote :

I can confirm that something's broken with the recent kernel update...

Stéphane Graber (stgraber) wrote :

Let's file a new bug for that one.

Adam Conrad (adconrad) on 2015-03-02
Changed in linux (Ubuntu Trusty):
status: Fix Released → Fix Committed
tags: removed: verification-done-trusty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-46.77

---------------
linux (3.13.0-46.77) trusty; urgency=low

  [ Seth Forshee ]

  * Revert "ipv6: fix swapped ipv4/ipv6 mtu_reduced callbacks"
    - LP: #1404558
  * Release Tracking Bug
    - LP: #1427292
 -- Seth Forshee <email address hidden> Mon, 02 Mar 2015 11:33:20 -0600

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Oliver Weis (oliver-c) wrote :

Short feedback on the "fix". I have two KVM-based 14.04 64-bit systems which were not affected by this bug UNTIL the 3.13.0-46.77 fix was released. On one of the two systems everything seems fine; on the other, IPv6 is not available directly after booting, so e.g. nginx cannot bind to the configured IPv6 address.

On BOTH systems I see this in kern.log:

Mar 3 07:14:56 sigma kernel: [ 8.449276] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready

Judging by the old kern.log files, though, it has been this way for a long time, so this is not the issue.

On the system which works fine anyway, I see this:

Mar 3 07:11:43 omega kernel: [ 21.556845] NFSD: starting 90-second grace period (net ffffffff81cdaa00)
Mar 3 07:11:44 omega kernel: [ 22.909888] random: nonblocking pool is initialized
Mar 3 07:11:45 omega kernel: [ 23.566021] ip_tables: (C) 2000-2006 Netfilter Core Team
Mar 3 07:11:45 omega kernel: [ 23.573073] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Mar 3 07:11:45 omega kernel: [ 23.637709] ip6_tables: (C) 2000-2006 Netfilter Core Team

So after 23 seconds ip_tables and ip6_tables are loaded and IPv6 is working.

On the system where IPv6 remains unavailable for 2 minutes and thus nginx refuses to start, I see this:

Mar 3 07:14:57 sigma kernel: [ 9.748740] NFSD: starting 90-second grace period (net ffffffff81cdaa00)
Mar 3 07:16:26 sigma kernel: [ 102.269932] ip_tables: (C) 2000-2006 Netfilter Core Team
Mar 3 07:16:26 sigma kernel: [ 102.279642] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Mar 3 07:16:26 sigma kernel: [ 102.304003] ip6_tables: (C) 2000-2006 Netfilter Core Team
Mar 3 07:16:50 sigma kernel: [ 126.135981] perf samples too long (2621 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Mar 3 07:23:11 sigma kernel: [ 507.343615] perf samples too long (5025 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Mar 3 08:07:24 sigma kernel: [ 3159.511565] perf samples too long (10036 > 10000), lowering kernel.perf_event_max_sample_rate to 12500

This second system is an Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz KVM based system with 12GB and 4 Cores, the first system which works fine is an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz based system with 8GB and 2 Cores.

As you can see, it takes about 2 minutes until ip_tables and ip6_tables start on the second system. This causes nginx to not start up at all, so I have to restart it manually once IPv6 is available.

The "perf samples too long" message is also new; I have not seen it in the logs before. I am unsure whether it is related to this problem or to changes made from 3.13.0-46.75 -> 3.13.0-46.76 a few days ago.
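
The comparison between the two hosts can be made directly from the bracketed seconds-since-boot stamps in kern.log. A rough sketch, assuming the syslog line layout quoted above; the function name `seconds_until` and the regular expression are illustrative, and the sample lines are copied from this comment:

```python
import re

# Captures the seconds-since-boot stamp and the message that follows it,
# e.g. "kernel: [  102.304003] ip6_tables: ...".
STAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]\s*(.*)")


def seconds_until(lines, needle):
    """Return the boot-relative time of the first message containing needle."""
    for line in lines:
        match = STAMP.search(line)
        if match and needle in match.group(2):
            return float(match.group(1))
    return None


omega = [
    "Mar 3 07:11:43 omega kernel: [ 21.556845] NFSD: starting 90-second grace period (net ffffffff81cdaa00)",
    "Mar 3 07:11:45 omega kernel: [ 23.566021] ip_tables: (C) 2000-2006 Netfilter Core Team",
    "Mar 3 07:11:45 omega kernel: [ 23.637709] ip6_tables: (C) 2000-2006 Netfilter Core Team",
]
sigma = [
    "Mar 3 07:14:57 sigma kernel: [ 9.748740] NFSD: starting 90-second grace period (net ffffffff81cdaa00)",
    "Mar 3 07:16:26 sigma kernel: [ 102.269932] ip_tables: (C) 2000-2006 Netfilter Core Team",
    "Mar 3 07:16:26 sigma kernel: [ 102.304003] ip6_tables: (C) 2000-2006 Netfilter Core Team",
]

if __name__ == "__main__":
    for host, lines in (("omega", omega), ("sigma", sigma)):
        print(host, seconds_until(lines, "ip6_tables"))
```

Running the same function over a full kern.log from each boot would show whether the delay on the second system is consistent across reboots.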
