IPv6 related kernel panic following upgrade to 3.13.0-43
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux (Ubuntu) | | Critical | Andy Whitcroft | |
| Trusty | | Critical | Andy Whitcroft | |
Bug Description
After updating a dozen machines from 3.13.0-40 to 3.13.0-43, they all kernel panic within the next 24 hours.
I managed to pull the console from one over an IP KVM and it shows a panic related to IPv6 networking:
https:/
All affected machines had native IPv6 connectivity to the Internet.
Downgrading to 3.13.0-40 resolved the issue (so it's clearly a regression) and upgrading to lts-utopic 3.16.0-28-generic also appeared to do the trick.
A friend also just reported seeing the exact same problem on his server which also has native IPv6 connectivity so the issue appears pretty widespread.
| Stéphane Graber (stgraber) wrote : | #1 |
This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:
apport-collect 1404558
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.
| Changed in linux (Ubuntu): | |
| status: | New → Incomplete |
| tags: | added: utopic |
| Changed in linux (Ubuntu): | |
| status: | Incomplete → Confirmed |
| Changed in linux (Ubuntu): | |
| importance: | Undecided → Critical |
| Joseph Salisbury (jsalisbury) wrote : | #3 |
Hi Stephane,
Can you see if this issue was already fixed in the latest 3.13 upstream kernel:
http://
If it was not, we can bisect the issue.
It might also be worth testing the latest mainline kernel, to see if this issue came down from upstream:
http://
| tags: | added: regression-update |
| tags: | added: kernel-key trusty |
| Andy Whitcroft (apw) wrote : | #4 |
Could you also give us a flavour for how reproducible this is?
If this is seemingly fixed in the lts-utopic, the -ckt13 test is likely the most informative. There is also a kernel sitting in -proposed which might be worth a shot too.
| Stéphane Graber (stgraber) wrote : | #5 |
I don't have a reproducer other than install the kernel and wait 24h as that's how long it took for some systems to panic...
From a very quick look, those kernels are mainline kernels, unfortunately all my hosts are LXC hosts using unprivileged containers with overlayfs, so I need kernels with the Ubuntu patchset applied for them to be usable.
For now, I'll go test the current -proposed kernel on my least critical system, see if I can get that one to panic.
| Stéphane Graber (stgraber) wrote : | #6 |
Rebooted the Xeon E5-2620v2 system on linux-image-
| Stéphane Graber (stgraber) wrote : | #7 |
Reproduced the panic with -44, same stack trace, screenshot attached. Booting the machine back on -40 now.
| Andy Whitcroft (apw) wrote : | #8 |
This might be related to a backport issue in the upstream stable patch below:
commit 4fab9071950c202
Author: Neal Cardwell <email address hidden>
Date: Thu Aug 14 12:40:05 2014 -0400
tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()
I have produced some test kernels with the backport corrected. Could you try them to confirm whether this is indeed the underlying issue? Kernels are at the URL below:
http://
Please report any testing back here.
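The commit title above talks about dispatching mtu_reduced() via the address family. A toy userspace model of that idea is sketched below. It is not kernel code: the class names, the table names, and the returned strings are all hypothetical, and whether the deliberately mis-wired table matches the actual backport error is an assumption (the later changelog entry only names "swapped ipv4/ipv6 mtu_reduced callbacks").

```python
from dataclasses import dataclass
from typing import Callable


def mtu_reduced_v4(sock: "Sock") -> str:
    # Stand-in for the IPv4 handler (hypothetical).
    return "handled as IPv4"


def mtu_reduced_v6(sock: "Sock") -> str:
    # Stand-in for the IPv6 handler (hypothetical).
    return "handled as IPv6"


@dataclass
class AfOps:
    """Per-address-family callback table (illustrative only)."""
    name: str
    mtu_reduced: Callable[["Sock"], str]


IPV4_OPS = AfOps("ipv4", mtu_reduced_v4)
IPV6_OPS = AfOps("ipv6", mtu_reduced_v6)
# Deliberately mis-wired table: an IPv6 entry pointing at the IPv4 handler.
SWAPPED_IPV6_OPS = AfOps("ipv6", mtu_reduced_v4)


@dataclass
class Sock:
    af_ops: AfOps


def release_cb(sock: Sock) -> str:
    # Dispatch through the socket's address-family table
    # rather than calling one fixed handler.
    return sock.af_ops.mtu_reduced(sock)


if __name__ == "__main__":
    print(release_cb(Sock(IPV6_OPS)))
    print(release_cb(Sock(SWAPPED_IPV6_OPS)))
```

In this model a swapped table entry silently routes an IPv6 socket to the IPv4 handler, which is the kind of mistake that only shows up once the callback actually fires.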
| Changed in linux (Ubuntu): | |
| assignee: | nobody → Andy Whitcroft (apw) |
| milestone: | none → ubuntu-15.01 |
| Changed in linux (Ubuntu Trusty): | |
| status: | New → In Progress |
| importance: | Undecided → Critical |
| assignee: | nobody → Andy Whitcroft (apw) |
| tags: | added: kernel-da-key removed: kernel-key |
| Stéphane Graber (stgraber) wrote : | #9 |
Almost 24 hours and no kernel panic so far!
| Stéphane Graber (stgraber) wrote : | #10 |
Still no panic after 48h, let's call it good.
| Andy Whitcroft (apw) wrote : | #11 |
Patch pushed up to kernel-team@ for SRU.
| Changed in linux (Ubuntu Trusty): | |
| status: | In Progress → Fix Committed |
| Andy Whitcroft (apw) wrote : | #12 |
This was a backport-specific issue; Trusty alone is affected.
| Changed in linux (Ubuntu): | |
| status: | Confirmed → Invalid |
| Brad Figg (brad-figg) wrote : | #13 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
| tags: | added: verification-needed-trusty |
| Stéphane Graber (stgraber) wrote : | #14 |
Booted the new kernel and so far so good. Since there is no real reproducer for this bug, I'll mark this as verification-done and will come flip it back to verification-failed if the box panics by the time you push this kernel to updates.
| tags: | added: verification-done-trusty removed: verification-needed-trusty |
| Stephen Frost (sfrost) wrote : | #15 |
I've been running the new kernel across all of my systems and have not had any problems for the past 2 days, so I'd call this good. Previously failures were happening pretty quickly and always within a day. Thanks!
| Launchpad Janitor (janitor) wrote : | #16 |
This bug was fixed in the package linux - 3.13.0-45.74
---------------
linux (3.13.0-45.74) trusty; urgency=low
[ Seth Forshee ]
* Release Tracking Bug
- LP: #1410384
[ Jesse Barnes ]
* SAUCE: drm/i915/vlv: assert and de-assert sideband reset at boot and
resume v3
- LP: #1401963
[ K. Y. Srinivasan ]
* SAUCE: storvsc: force SPC-3 compliance on win8 and win8 r2 hosts
- LP: #1406867
[ Timo Aaltonen ]
* SAUCE: Switch VLV/BYT to use i915_bdw.
- LP: #1401963
[ Upstream Kernel Changes ]
* Revert "xhci: clear root port wake on bits if controller isn't wake-up
capable"
- LP: #1408779
* KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
- LP: #1400209
* e1000e: Fix no connectivity when driver loaded with cable out
- LP: #1400365
* net/mlx4_core: Enable CQE/EQE stride support
- LP: #1400127
* net/mlx4_core: Cache line EQE size support
- LP: #1400127
* net/mlx4_en: Add mlx4_en_get_cqe helper
- LP: #1400127
* net/mlx4_core: Introduce mlx4_get_
reading
- LP: #1400127
* ethtool, net/mlx4_en: Cable info, get_module_
support
- LP: #1400127
* net/mlx4_core: Introduce ACCESS_REG CMD and eth_prot_ctrl dev cap
- LP: #1400127
* net/mlx4_core: Add ethernet backplane autoneg device capability
- LP: #1400127
* ethtool, net/mlx4_en: Add 100M, 20G, 56G speeds ethtool reporting
support
- LP: #1400127
* net/mlx4_en: Use PTYS register to query ethtool settings
- LP: #1400127
* net/mlx4_en: Use PTYS register to set ethtool settings (Speed)
- LP: #1400127
* net/mlx4_en: Add support for setting rxvlan offload OFF/ON
- LP: #1400127
* net/mlx4_en: Add ethtool support for [rx|tx]vlan offload set to OFF/ON
- LP: #1400127
* net/mlx4_core: Prevent VF from changing port configuration
- LP: #1400127
* net/mlx4_en: mlx4_en_
- LP: #1400127
* ipv4: fix nexthop attlen check in fib_nh_match
- LP: #1408779
* vxlan: fix a use after free in vxlan_encap_bypass
- LP: #1408779
* vxlan: using pskb_may_pull as early as possible
- LP: #1408779
* vxlan: fix a free after use
- LP: #1408779
* ipv4: fix a potential use after free in ip_tunnel_core.c
- LP: #1408779
* ax88179_178a: fix bonding failure
- LP: #1408779
* tcp: md5: do not use alloc_percpu()
- LP: #1408779
* ipv4: dst_entry leak in ip_send_
- LP: #1408779
* drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
- LP: #1408779
* drivers/net: macvtap and tun depend on INET
- LP: #1408779
* ip6_tunnel: Use ip6_tnl_dev_init as the ndo_init function.
- LP: #1408779
* vti6: Use vti6_dev_init as the ndo_init function.
- LP: #1408779
* sit: Use ipip6_tunnel_init as the ndo_init function.
- LP: #1408779
* gre6: Move the setting of dev->iflink into the ndo_init functions.
- LP: #1408779
* vxlan: Do not reuse sockets for a different address family
- LP: #1408779
* net: sctp: fix memory leak in auth key management
- LP: #1408779
* smsc911x: power-...
| Changed in linux (Ubuntu Trusty): | |
| status: | Fix Committed → Fix Released |
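For anyone triaging a fleet, the kernel versions named in this thread can be turned into a quick lookup. The sketch below is only a summary of what commenters reported here, not an authoritative list of affected builds: the table contents, the function names, and the status strings are my own, and any ABI number nobody mentioned is simply reported as unknown.

```python
import re

# ABI numbers of the 3.13.0 series as reported in this thread.
REPORTED = {
    40: "no panic reported",
    43: "panic reported",
    44: "panic reported",
    45: "fix released per changelog",
    46: "regression reported before 3.13.0-46.77",
}

# Accepts forms such as "3.13.0-43-generic" or "3.13.0-45.74".
RELEASE = re.compile(r"3\.13\.0-(\d+)(?:\.\d+)?(?:-\w+)?")


def abi_number(release):
    """Return the ABI number of a 3.13.0 release string, else None."""
    match = RELEASE.fullmatch(release)
    return int(match.group(1)) if match else None


def classify(release):
    abi = abi_number(release)
    if abi is None:
        return "not a 3.13.0 kernel"
    return REPORTED.get(abi, "not reported in this thread")


if __name__ == "__main__":
    import platform
    # platform.release() returns the running kernel release on Linux.
    print(platform.release(), "->", classify(platform.release()))
```

The lookup works on the ABI number only, so it cannot tell individual uploads of the same ABI apart.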
| Richard van der Hoff (richvdh) wrote : | #17 |
I was previously seeing this problem; I've now upgraded to 3.13.0-46 and am seeing a similar, but slightly different, panic (see attached).
| Stéphane Graber (stgraber) wrote : | #18 |
I can confirm that something's broken with the recent kernel update...
| Stéphane Graber (stgraber) wrote : | #19 |
Let's file a new bug for that one.
| Changed in linux (Ubuntu Trusty): | |
| status: | Fix Released → Fix Committed |
| tags: | removed: verification-done-trusty |
| Launchpad Janitor (janitor) wrote : | #21 |
This bug was fixed in the package linux - 3.13.0-46.77
---------------
linux (3.13.0-46.77) trusty; urgency=low
[ Seth Forshee ]
* Revert "ipv6: fix swapped ipv4/ipv6 mtu_reduced callbacks"
- LP: #1404558
* Release Tracking Bug
- LP: #1427292
-- Seth Forshee <email address hidden> Mon, 02 Mar 2015 11:33:20 -0600
| Changed in linux (Ubuntu Trusty): | |
| status: | Fix Committed → Fix Released |
| Oliver Weis (oliver-c) wrote : | #23 |
Short feedback on the "fix". I have two KVM-based 14.04 64-bit systems which were not affected by this bug UNTIL the 3.13.0-46.77 fix was released. On one of the two systems everything seems fine; on the other, IPv6 is not available directly after booting, so e.g. nginx cannot bind to the configured IPv6 address.
On BOTH systems I see this in kern.log:
Mar 3 07:14:56 sigma kernel: [ 8.449276] IPv6: ADDRCONF(
But checking old kern.log files, it has been this way for a long time, so this is not the issue.
On the system which works fine anyway I see this:
Mar 3 07:11:43 omega kernel: [ 21.556845] NFSD: starting 90-second grace period (net ffffffff81cdaa00)
Mar 3 07:11:44 omega kernel: [ 22.909888] random: nonblocking pool is initialized
Mar 3 07:11:45 omega kernel: [ 23.566021] ip_tables: (C) 2000-2006 Netfilter Core Team
Mar 3 07:11:45 omega kernel: [ 23.573073] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Mar 3 07:11:45 omega kernel: [ 23.637709] ip6_tables: (C) 2000-2006 Netfilter Core Team
So after 23 seconds ip_tables and ip6_tables are running and IPv6 is working.
On the system where IPv6 remains unavailable for 2 minutes, and thus nginx refuses to start, I see this:
Mar 3 07:14:57 sigma kernel: [ 9.748740] NFSD: starting 90-second grace period (net ffffffff81cdaa00)
Mar 3 07:16:26 sigma kernel: [ 102.269932] ip_tables: (C) 2000-2006 Netfilter Core Team
Mar 3 07:16:26 sigma kernel: [ 102.279642] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Mar 3 07:16:26 sigma kernel: [ 102.304003] ip6_tables: (C) 2000-2006 Netfilter Core Team
Mar 3 07:16:50 sigma kernel: [ 126.135981] perf samples too long (2621 > 2500), lowering kernel.
Mar 3 07:23:11 sigma kernel: [ 507.343615] perf samples too long (5025 > 5000), lowering kernel.
Mar 3 08:07:24 sigma kernel: [ 3159.511565] perf samples too long (10036 > 10000), lowering kernel.
This second system is an Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz KVM-based system with 12GB and 4 cores; the first system, which works fine, is an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz based system with 8GB and 2 cores.
As you can see, it takes about 2 minutes until ip_tables and ip6_tables start on the second system. This causes nginx to not start up at all, and I have to restart it manually once IPv6 is available.
The "perf samples too long" message is also new; I have not seen it in the logs before. I am unsure whether it is related to this problem or to the changes made from 3.13.0-46.75 -> 3.13.0-46.76 a few days ago.
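To put a number on the delay, the bracketed uptime stamps in the kern.log lines quoted above can be compared directly. The sketch below is only a measuring aid under the assumption that the lines keep the `kernel: [ seconds] message` shape shown here; the helper names are hypothetical and it says nothing about what causes the gap.

```python
import re

# Matches "kernel: [   9.748740] message" and captures uptime and message.
LINE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (.*)")


def uptime_of(lines, needle):
    """Uptime in seconds of the first kernel message containing needle."""
    for line in lines:
        match = LINE.search(line)
        if match and needle in match.group(2):
            return float(match.group(1))
    return None


def gap(lines, first, second):
    """Seconds between two messages, or None if either is missing."""
    start, end = uptime_of(lines, first), uptime_of(lines, second)
    if start is None or end is None:
        return None
    return end - start


# Lines copied from the comment above.
SIGMA = [
    "Mar 3 07:14:57 sigma kernel: [ 9.748740] NFSD: starting 90-second grace period (net ffffffff81cdaa00)",
    "Mar 3 07:16:26 sigma kernel: [ 102.304003] ip6_tables: (C) 2000-2006 Netfilter Core Team",
]
OMEGA = [
    "Mar 3 07:11:43 omega kernel: [ 21.556845] NFSD: starting 90-second grace period (net ffffffff81cdaa00)",
    "Mar 3 07:11:45 omega kernel: [ 23.637709] ip6_tables: (C) 2000-2006 Netfilter Core Team",
]

if __name__ == "__main__":
    for host, lines in (("sigma", SIGMA), ("omega", OMEGA)):
        print(host, f"{gap(lines, 'NFSD', 'ip6_tables'):.1f} s")
```

On the quoted lines this gives roughly 92.6 s between the NFSD message and ip6_tables on sigma, against roughly 2.1 s on omega.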


Might be worth mentioning that all affected hosts are x86 64bit Intel.
For those I've got access to, the issue happened on:
- 2x Xeon E3-1245v2
- 1x Xeon E5-2620v2
- 1x Atom C2750
- 1x Atom D2500
- 1x Core i5 750
All running on pretty standard Intel boards, so the usual set of Intel chipsets for their generation.