4.4.0-145-generic Kernel Panic ip6_expire_frag_queue

Bug #1824687 reported by Dirk
This bug affects 2 people
Affects         Status        Importance  Assigned to   Milestone
linux (Ubuntu)  Invalid       High        Unassigned
Xenial          Fix Released  High        Stefan Bader
Cosmic          Invalid       High        Unassigned
Disco           Won't Fix     High        Unassigned

Bug Description

[SRU Justification]

== Impact ==

Since 05c0b86b96 "ipv6: frags: rewrite ip6_expire_frag_queue()" the 16.04/4.4 kernel crashes whenever that function gets called (on busy systems this can be every 3-4 hours). While this potentially affects Cosmic and later, too, the fix differs on later kernels (Bionic is not yet affected as it does not yet carry the updates to the frags handling).

== Fix ==

For Xenial and Cosmic, the proposed fix would be additional changes to ip6_expire_frag_queue(), taken from follow-up changes to ip_expire().
For Disco, I would hold back because we have a backlog of stable patches there and depending on what got backported to 5.0.y there would be a simpler fix.
For current development kernels, one just needs to ensure that the following upstream change is included: 47d3d7fdb10a "ip6: fix skb leak in ip6frag_expire_frag_queue()".

== Testcase ==

Unfortunately this could not be reproduced locally. However, a test kernel with the proposed fix applied tested well in production (see comments #37 and #38).

== Risk of Regression ==

The modified function is only called in rare cases and the positive testing in production would cover this. So I would consider it low.

---

Description: Ubuntu 16.04.6 LTS
Release: 16.04

After upgrading our server to this kernel we experience frequent kernel panics (see attachment),
about every 3 hours.
Our machine has a throughput of about 600 Mbit/s.
The panics occur around ip6_expire_frag_queue.

  __pskb_pull_tail
  ip6_dst_lookup_tail
  _decode_session6
  __xfrm_decode_session
  icmpv6_route_lookup
  icmp6_send

It seems similar to a bug reported in Debian:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=922488

According to the reporter of that bug, it also first occurred after moving to a kernel that contains the change
"rewrite ip6_expire_frag_queue()".

Intermediate solution: we disabled IPv6 on this machine to avoid further panics.
Please let me know what information is missing. The "ubuntu-bug linux" report was sent, and I hope it is attached to this report.

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-145-generic 4.4.0-145.171
ProcVersionSignature: Ubuntu 4.4.0-145.171-generic 4.4.176
Uname: Linux 4.4.0-145-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
Date: Sun Apr 14 11:40:11 2019
InstallationDate: Installed on 2018-03-18 (391 days ago)
InstallationMedia: Ubuntu-Server 16.04.4 LTS "Xenial Xerus" - Release amd64 (20180228)
ProcEnviron:
 LANGUAGE=en_GB:en
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed
UpgradeStatus: Upgraded to xenial on 2018-10-21 (174 days ago)
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Apr 12 21:04 seq
 crw-rw---- 1 root audio 116, 33 Apr 12 21:04 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=/dev/mapper/tor3--vg-swap_1
InstallationDate: Installed on 2018-03-18 (393 days ago)
InstallationMedia: Ubuntu-Server 16.04.4 LTS "Xenial Xerus" - Release amd64 (20180228)
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
 Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 0557:2221 ATEN International Co., Ltd Winbond Hermon
 Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 LANGUAGE=en_GB:en
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-145-generic root=/dev/mapper/hostname--vg-root ro
ProcVersionSignature: Ubuntu 4.4.0-145.171-generic 4.4.176
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-145-generic N/A
 linux-backports-modules-4.4.0-145-generic N/A
 linux-firmware 1.157.21
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial xenial
Uname: Linux 4.4.0-145-generic x86_64
UpgradeStatus: Upgraded to xenial on 2018-10-21 (176 days ago)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 10/08/2012
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1.0c
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: X9SRE/X9SRE-3F/X9SRi/X9SRi-3F
dmi.board.vendor: Supermicro
dmi.board.version: 1.2
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1.0c:bd10/08/2012:svnSupermicro:pnX9SRE/X9SRE-3F/X9SRi/X9SRi-3F:pvr0123456789:rvnSupermicro:rnX9SRE/X9SRE-3F/X9SRi/X9SRi-3F:rvr1.2:cvnSupermicro:ct3:cvr0123456789:
dmi.product.name: X9SRE/X9SRE-3F/X9SRi/X9SRi-3F
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

Dirk (iggs) wrote :
Stefan Bader (smb)
affects: linux-signed (Ubuntu) → linux (Ubuntu)
Stefan Bader (smb) wrote :

Which kernel version was used before (and did not show this crash)? Can you reproduce the issue on a non-production server (which would allow experimenting with the HWE (4.15) kernel)?

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1824687

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Xenial):
status: New → Incomplete
Dirk (iggs) wrote : CRDA.txt

apport information

tags: added: apport-collected
description: updated
Dirk (iggs) wrote : CurrentDmesg.txt

apport information

Dirk (iggs) wrote : Lspci.txt

apport information

Dirk (iggs) wrote : ProcCpuinfo.txt

apport information

Dirk (iggs) wrote : ProcCpuinfoMinimal.txt

apport information

Dirk (iggs) wrote : ProcInterrupts.txt

apport information

Dirk (iggs) wrote : ProcModules.txt

apport information

Dirk (iggs) wrote : UdevDb.txt

apport information

Dirk (iggs) wrote : WifiSyslog.txt

apport information

Dirk (iggs) wrote :

Added the logs via apport-collect 1824687 and changed the status of the bug to 'Confirmed'.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Dirk (iggs) wrote :

Regarding #2: I do not know which kernel ran before.

I assume linux-image-4.4.0-143, based on the following apt history logs.
However, I do not know whether we rebooted.

I have a different type of server here running Ubuntu. I can try to stress it and see.
But I doubt I will get the same kind of traffic, since the other machine is a Tor exit.

Start-Date: 2019-04-03 06:09:30
Commandline: /usr/bin/unattended-upgrade
Install: linux-modules-extra-4.4.0-145-generic:amd64 (4.4.0-145.171, automatic), linux-modules-4.4.0-145-generic:amd64 (4.4.0-145.171, automatic), linux-headers-4.4.0-145:amd64 (4.4.0-145.171, automatic), linux-image-4.4.0-145-generic:amd64 (4.4.0-145.171, automatic), linux-headers-4.4.0-145-generic:amd64 (4.4.0-145.171, automatic)
Upgrade: linux-headers-generic:amd64 (4.4.0.143.151, 4.4.0.145.153), linux-image-generic:amd64 (4.4.0.143.151, 4.4.0.145.153), linux-generic:amd64 (4.4.0.143.151, 4.4.0.145.153)
End-Date: 2019-04-03 06:10:20

Start-Date: 2019-03-17 06:21:38
Commandline: /usr/bin/unattended-upgrade
Install: linux-modules-4.4.0-143-generic:amd64 (4.4.0-143.169, automatic), linux-headers-4.4.0-143:amd64 (4.4.0-143.169, automatic), linux-image-4.4.0-143-generic:amd64 (4.4.0-143.169, automatic), linux-headers-4.4.0-143-generic:amd64 (4.4.0-143.169, automatic), linux-modules-extra-4.4.0-143-generic:amd64 (4.4.0-143.169, automatic)
Upgrade: linux-headers-generic:amd64 (4.4.0.142.148, 4.4.0.143.151), linux-image-generic:amd64 (4.4.0.142.148, 4.4.0.143.151), linux-generic:amd64 (4.4.0.142.148, 4.4.0.143.151)
End-Date: 2019-03-17 06:22:23

Stefan Bader (smb) wrote :

Knowing which was the last good kernel would be good to minimize the delta of changes. Note that if you are able to interact with the grub loader at boot, you can go back to at least the previous kernel before the reboot.
For the trace it would be good to capture the full message. If the server has IPMI capabilities you could add a console= argument to the kernel command line to have messages observable through SOL.

Dirk (iggs) wrote :

I spent the better part of 2 hours installing Java on an old Windows machine to satisfy the IPMI needs.
I was able to start SOL, but it is just a black window displaying nothing.
I give up on this. Sorry, I do not know how to produce a better screenshot.
(Yes, I googled.) IPMI Viewer produced the same bad results.

Regarding grub, does this help:
root@XXXX:~# update-grub2
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.4.0-145-generic
Found initrd image: /boot/initrd.img-4.4.0-145-generic
Found linux image: /boot/vmlinuz-4.4.0-143-generic
Found initrd image: /boot/initrd.img-4.4.0-143-generic
done

Stefan Bader (smb) wrote :

The latter means that you still have 4.4.0-143 around and could select that if you had any way of interfacing with the booting server. So you could go back and confirm the regression happened between 143 and 145.

About IPMI, I don't know how one would do that with Windows, but using a Linux box, there is a package (name in Ubuntu, might vary on other distros) called ipmitool which can be used to do the SOL session without any java and from a terminal window. Of course in any case to see anything you have to figure out which ttyS# on the server is mapped to the SOL session (ttyS0 or ttyS1 usually). And then something like "console=ttyS#,115200n8" has to be added to the default arguments in /etc/default/grub to tell the kernel to re-direct the console to that serial port.

Just for completeness, the command to initiate a SOL session would be:
ipmitool -Ilanplus -H<ip/name of ipmi interface> -U<ipmi user> -P<ipmi password> sol activate
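
The /etc/default/grub change described above can be sketched as a small text transformation. This is a minimal illustration in Python, assuming the default kernel arguments live in GRUB_CMDLINE_LINUX_DEFAULT and that SOL is mapped to ttyS1. The function name add_console and the default port are placeholders, so check which ttyS# the board actually uses.

```python
import re


def add_console(grub_defaults, port="ttyS1", speed=115200):
    """Return the text of /etc/default/grub with a serial console argument added."""
    console = "console={},{}n8".format(port, speed)
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def patch(match):
        args = match.group(2).split()
        # Do not add the argument twice if the file was already edited.
        if console not in args:
            args.append(console)
        return match.group(1) + " ".join(args) + match.group(3)

    return pattern.sub(patch, grub_defaults)
```

The result still has to be written back to /etc/default/grub and followed by update-grub before it takes effect on the next boot.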

Heikki Hannikainen (hessu) wrote :

I have had this crash, with the ip6_expire_frag_queue stack trace, more than 18 times since 2019-04-16 on more than 10 different servers in 8 different countries. There have been some more crashes, but from these ones the panic dump managed to go out to a remote syslog server where it's easy to grep. Crash count by kernel version; these are on both trusty and xenial:

2 crashes: 4.4.0-144-generic #170~14.04.1-Ubuntu
8 crashes: 4.4.0-145-generic #171-Ubuntu
8 crashes: 4.4.0-146-generic #172-Ubuntu

Downgrading to 4.4.0-143 now, as that build does not seem to have the "ipv6: frags: rewrite ip6_expire_frag_queue()" change; it first appears in 4.4.0-144-generic image. I think by tomorrow it's clear whether that kernel is stable as we're now having multiple crashes per day (last crash 50 minutes ago).

These are routers running NAT & firewall & some applications, with substantial IPv6 traffic.

Interestingly the crashes only happen on bare hardware. We have a much
larger number of VMs doing the same thing, most of them now running
4.4.0-146, and none of them have crashed like this. The hardware instances
do have a larger number of CPU cores, the VMs only have 2 or 4.

I am also seeing crashes on 4.15.0-48-generic hwe kernel running on xenial,
but no stack trace to show yet.

Attaching kernel stack trace file containing several crashes on various servers (hessu-ipv6_expire_frag_queue-crashes.txt).

Heikki Hannikainen (hessu) wrote :
Heikki Hannikainen (hessu) wrote :

kernel.org bug ticket, showing similar crashes on 4.9 and 4.19 kernels: https://bugzilla.kernel.org/show_bug.cgi?id=202669

Stefan Bader (smb) wrote :

Thanks for the stack traces. Those help a lot to pinpoint the problem. Will be taking a look.

Changed in linux (Ubuntu Xenial):
assignee: nobody → Stefan Bader (smb)
importance: Undecided → High
status: Incomplete → Triaged
Stefan Bader (smb) wrote :

The issue is a check which causes an oops/crash when a socket buffer (skb) is referenced more than once at the time pskb_expand_head() is called. As mentioned in comment #18, this seems to have been introduced by a series of patches modifying the way fragments are handled.

The networking code is quite complex, so I am not sure whether a detail I found is actually causing this issue (one backport claims to drop some extraneous initialization in ipv6 which was not done in the ipv4 counterpart), but I created a test kernel to see what happens. If someone could give http://people.canonical.com/~smb/lp1824687/ a try and let me know, I would highly appreciate it.

Dirk (iggs) wrote :

Thanks for the updated Kernel.
Sorry for the late reply. My IPMI changes were without success.

However I could install and boot your kernel
Linux version 4.4.0-144-generic (smb@kathleen) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #170+lp1824687v1 SMP Tue Apr 30 11:18:53 UTC 2019 (Ubuntu 4.4.0-144.170+lp1824687v1-generic 4.4.176)

I activated IPv6 Traffic. We should see if the machine will panic.

Dirk (iggs) wrote :

OK, tested with the kernel provided. It does not improve the situation.
The machine crashed for the first time after about 2 hours, with the same error as always.
I rebooted it, and it took 2-3 hours until the next crash.

Stefan Bader (smb) wrote :

As a status update: thanks for testing. A pity it did not help. So far I have looked through all related changes in that set but could not find anything that immediately stuck out. Thinking more about the crash stack trace: it is a netfilter conntrack timer expiring which causes a call into ip6_expire_frag_queue(), and that function got rewritten in "ipv6: frags: rewrite ip6_expire_frag_queue()" to use the first entry in the frag list for sending an ICMP message. Before doing that, it calls skb_get(), which increments the user refcount. That might actually be the issue, but it is still done that way in every upstream kernel since v4.18. It could be that nobody is using those under heavy IPv6 traffic yet. Since I am not that familiar with the network stack, I would like to reach out to upstream with that question.
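
To make the suspected mechanism easier to follow, here is a user-space toy model in Python. It is not kernel code: the names mirror the kernel functions mentioned in this report, the ICMPv6 path is reduced to a single call, and the "detached head" variant is only a hypothetical illustration of handing over the queue's own reference, not the patch that went into the test kernels.

```python
class SharedSkbError(Exception):
    """Stands in for the kernel BUG() hit when a shared skb is about to be resized."""


class Skb:
    def __init__(self):
        self.users = 1  # the fragment queue holds the only reference


def skb_get(skb):
    # Take an additional reference on the buffer.
    skb.users += 1
    return skb


def skb_shared(skb):
    return skb.users != 1


def pskb_expand_head(skb):
    # Reallocating the header is refused while someone else references the skb.
    if skb_shared(skb):
        raise SharedSkbError("skb has %d users" % skb.users)


def icmpv6_send(skb):
    # Simplified: in the reported trace the ICMPv6 path reaches
    # __pskb_pull_tail(), which may need to expand the header.
    pskb_expand_head(skb)


def expire_with_extra_reference(queue):
    head = skb_get(queue[0])  # head stays on the queue, users becomes 2
    icmpv6_send(head)         # trips the shared-skb check


def expire_with_detached_head(queue):
    head = queue.pop(0)       # hypothetical: take over the queue's reference
    icmpv6_send(head)         # users is still 1, the check passes
    return head
```

In this model expire_with_extra_reference() always raises, while the real kernel only hits the check when the header actually has to be expanded, which would fit crashes that appear only every few hours under heavy IPv6 traffic.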

Heikki Hannikainen (hessu) wrote :

There is the kernel.org bug ticket which describes similar oopsing through ip6_expire_frag_queue() in 4.9 and 4.19 kernels: https://bugzilla.kernel.org/show_bug.cgi?id=202669

I also saw crashes on 4.15.0-48-generic on a server running the same task; I don't have stack traces to show yet since they didn't get out to the remote syslog server.

Stefan Bader (smb) wrote :

From the upstream discussion thread it looks like I was on the right track (https://marc.info/?l=linux-netdev&m=155688404826002&w=2). For confirmation I am building another set of test kernel packages and once this can be confirmed will proceed to SRU this into the other series. This looks to have remained unnoticed so far, so anything after 4.18 and all the older kernels which have backported those changes would be affected.

Changed in linux (Ubuntu Cosmic):
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu Disco):
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Stefan Bader (smb) wrote :

Ok, http://people.canonical.com/~smb/lp1824687/ has been updated with a v2 set which has the upstream patch backported.

Heikki Hannikainen (hessu) wrote :

Thank you! I can test this on Monday, weekend is starting here in 2 minutes and this is not the greatest moment to start testing. :)

Stefan Bader (smb) wrote :

Just a reminder for the test kernel. If this can be tested soon, it could make it into the next update cycle which starts next week. But for that it has to be submitted before end of Wednesday.

Heikki Hannikainen (hessu) wrote :

Sorry for the delay. I'm back in the office now and deploying the test kernel today to a few servers, and to additional ones tomorrow if it's OK on the first ones.

Heikki Hannikainen (hessu) wrote :

Unfortunately 4.4.0-144-generic #170+lp1824687v2 testing kernel still crashes. I have 4 hardware instances running it now, there were 2 panics (Australia, Sweden) within 24 hours. I installed linux-crashdump on them after the first crash to get the panic logs reliably. Attached a log from the second panic.

Stefan Bader (smb) wrote :

Thanks, quickly glancing at this it looks to be different as in crashing now at a different occasion (when releasing a buffer). I will have to take a closer look but probably not today.

Stefan Bader (smb) wrote :

Spent a little more time on this yesterday. While it is somewhat clear that this results from fixing the original issue (now it crashes when releasing memory a little later), my past experience of looking at network issues like this is that memory dumps are of rather limited use, as the reasons lie in the past and by the time the crash happens all the interesting state is already lost.
On the other hand I also would rather avoid making experiments in production environments (if that can be avoided). But I am not sure how much chance there is for that.

Heikki Hannikainen (hessu) wrote :

I've now got 6 crashes within past 24 hours on the #170+lp1824687v2 testing kernel on a *single* server. It's a production environment, so I'll roll back for now. Two latest backtraces:

[ 6251.834160] Call Trace:
[ 6251.834166] <IRQ>
[ 6251.834174] [<ffffffff8173d130>] skb_release_head_state+0x90/0xb0
[ 6251.834189] [<ffffffff8173dd62>] skb_release_all+0x12/0x30
[ 6251.834203] [<ffffffff8173ddd2>] kfree_skb+0x32/0xa0
[ 6251.834798] [<ffffffff817dca2e>] inet_frag_destroy+0x7e/0x100
[ 6251.835388] [<ffffffffc04b7260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[ 6251.835979] [<ffffffff8182b522>] ip6_expire_frag_queue+0x102/0x110
[ 6251.836562] [<ffffffffc04b727f>] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
[ 6251.837154] [<ffffffff810f3b57>] call_timer_fn+0x37/0x140
[ 6251.837746] [<ffffffffc04b7260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[ 6251.838350] [<ffffffff810f5464>] run_timer_softirq+0x234/0x330
[ 6251.838961] [<ffffffff8108a339>] __do_softirq+0x109/0x2b0
[ 6251.839574] [<ffffffff8108a655>] irq_exit+0xa5/0xb0
[ 6251.840192] [<ffffffff818660c0>] smp_apic_timer_interrupt+0x50/0x70
[ 6251.843313] [<ffffffff8186383c>] apic_timer_interrupt+0xcc/0xe0
[ 6251.843944] <EOI>
[ 6251.843952] [<ffffffff8173cf29>] ? kfree_skbmem+0x59/0x60
[ 6251.845153] [<ffffffff8126047d>] ? __fsnotify_parent+0x5d/0x130
[ 6251.845744] [<ffffffff8121c0ab>] vfs_read+0xfb/0x130
[ 6251.846316] [<ffffffff8121cd85>] SyS_read+0x55/0xc0
[ 6251.846868] [<ffffffff8186281b>] entry_SYSCALL_64_fastpath+0x22/0xcb

[ 1037.665436] Call Trace:
[ 1037.665442] <IRQ>
[ 1037.665452] [<ffffffffc05131d7>] nf_skb_free+0x17/0x20 [nf_defrag_ipv6]
[ 1037.665469] [<ffffffff817dca23>] inet_frag_destroy+0x73/0x100
[ 1037.665484] [<ffffffffc0513260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[ 1037.665501] [<ffffffff8182b522>] ip6_expire_frag_queue+0x102/0x110
[ 1037.665516] [<ffffffffc051327f>] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
[ 1037.665534] [<ffffffff810f3b57>] call_timer_fn+0x37/0x140
[ 1037.665548] [<ffffffffc0513260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[ 1037.665569] [<ffffffff810f5464>] run_timer_softirq+0x234/0x330
[ 1037.665585] [<ffffffff8108a339>] __do_softirq+0x109/0x2b0
[ 1037.665598] [<ffffffff8108a655>] irq_exit+0xa5/0xb0
[ 1037.666290] [<ffffffff818660c0>] smp_apic_timer_interrupt+0x50/0x70
[ 1037.666929] [<ffffffff8186383c>] apic_timer_interrupt+0xcc/0xe0
[ 1037.667566] <EOI>
[ 1037.667578] [<ffffffff813af1e0>] ? audit_unix_sk_addr+0x40/0x40
[ 1037.669394] [<ffffffff817cfc20>] ? inet_recvmsg+0xb0/0xb0
[ 1037.670423] [<ffffffff817cfc42>] ? inet_sendmsg+0x22/0xa0
[ 1037.671441] [<ffffffff81735b7e>] sock_sendmsg+0x3e/0x50
[ 1037.672440] [<ffffffff81735c15>] sock_write_iter+0x85/0xf0
[ 1037.673409] [<ffffffff8121b6bf>] do_iter_readv_writev+0x6f/0xa0
[ 1037.674353] [<ffffffff8121c40f>] do_readv_writev+0x18f/0x230
[ 1037.675273] [<ffffffff8121b8c9>] ? __vfs_read+0x29/0x40
[ 1037.676167] [<ffffffff8121c539>] vfs_writev+0x39/0x50
[ 1037.677035] [<ffffffff8121d269>] SyS_writev+0x59/0xf0
[ 1037.677873] [<ffffffff8186281b>] entry_SYSCALL_64_fastpath+0x22/0xcb

Stefan Bader (smb) wrote :

So far I have not been successful in triggering the code path which leads to the crashes on my test system. I have, however, been able to extend the patch I had in v2 in a way that makes me a bit more hopeful that it might get us somewhere. Potentially not the most optimized handling, but that could wait. The problem is a bit that all the changes come from a set of changes where I am not sure upstream really tested the intermediate steps too well. Anyhow, you will find the new debs again at http://people.canonical.com/~smb/lp1824687/
I know it sucks, but I would appreciate it if we could put that into production stress again.

Heikki Hannikainen (hessu) wrote :

Thanks, I deployed the v4 debs on one server which was particularly unstable, and it's still up after 1 day and 8 hours now. I'll deploy more widely on Monday and Tuesday.

Heikki Hannikainen (hessu) wrote :

I've now got the v4 debs on 5 servers, and not a single crash since they were installed on each. Looks good to me. Thank you!

Stefan Bader (smb)
description: updated
Stefan Bader (smb)
Changed in linux (Ubuntu Xenial):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: Triaged → Fix Committed
Stefan Bader (smb)
Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Incomplete
Stefan Bader (smb) wrote :

I reverted the changes to Cosmic because that needs at least a different approach. In that version the rbtree usage is not yet present and the IPv4 expire function does exactly the same thing (increment the refcount of the skb), and we have no hard evidence this actually causes crashes in the 4.18 kernel. So for now we only keep the Xenial change.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Dirk (iggs) wrote :

I tested with Kernel:
Linux tor3 4.4.0-149-generic #175+lp1824687v4 SMP Mon May 27 17:21:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

IPv6 is enabled and the system is under usual load.
No crashes in 24h.
For me this is a clear indication that the problem is fixed. Before, there were crashes every 2-4 hours.

Therefore verified for Xenial.

How can I add tags?

Dirk (iggs) wrote :

Verified fix by reporter

tags: added: verification-done-xenial
removed: amd64 apport-bug apport-collected verification-needed-xenial xenial
Heikki Hannikainen (hessu) wrote :

I deployed the actual -proposed kernel 4.4.0-152.179 on 4 servers, and it is stable for us. Previously there were multiple crashes per day. Confirming, verification done. Thank you!

tags: added: amd64 apport-bug apport-collected xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-157.185

---------------
linux (4.4.0-157.185) xenial; urgency=medium

  * linux: 4.4.0-157.185 -proposed tracker (LP: #1837476)

  * systemd 229-4ubuntu21.22 ADT test failure with linux 4.4.0-156.183 (storage)
    (LP: #1837235)
    - Revert "block/bio: Do not zero user pages"
    - Revert "block: Clear kernel memory before copying to user"
    - Revert "bio_copy_from_iter(): get rid of copying iov_iter"

linux (4.4.0-156.183) xenial; urgency=medium

  * linux: 4.4.0-156.183 -proposed tracker (LP: #1836880)

  * BCM43602 802.11ac Wireless regression - PCI ID 14e4:43ba (LP: #1836801)
    - brcmfmac: add eth_type_trans back for PCIe full dongle

linux (4.4.0-155.182) xenial; urgency=medium

  * linux: 4.4.0-155.182 -proposed tracker (LP: #1834918)

  * Geneve tunnels don't work when ipv6 is disabled (LP: #1794232)
    - geneve: correctly handle ipv6.disable module parameter

  * Kernel modules generated incorrectly when system is localized to a non-
    English language (LP: #1828084)
    - scripts: override locale from environment when running recordmcount.pl

  * Handle overflow in proc_get_long of sysctl (LP: #1833935)
    - sysctl: handle overflow in proc_get_long

  * Xenial update: 4.4.181 upstream stable release (LP: #1832661)
    - x86/speculation/mds: Revert CPU buffer clear on double fault exit
    - x86/speculation/mds: Improve CPU buffer clear documentation
    - ARM: exynos: Fix a leaked reference by adding missing of_node_put
    - crypto: vmx - fix copy-paste error in CTR mode
    - crypto: crct10dif-generic - fix use via crypto_shash_digest()
    - crypto: x86/crct10dif-pcl - fix use via crypto_shash_digest()
    - ALSA: usb-audio: Fix a memory leak bug
    - ALSA: hda/hdmi - Consider eld_valid when reporting jack event
    - ALSA: hda/realtek - EAPD turn on later
    - ASoC: max98090: Fix restore of DAPM Muxes
    - ASoC: RT5677-SPI: Disable 16Bit SPI Transfers
    - mm/mincore.c: make mincore() more conservative
    - ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
    - mfd: da9063: Fix OTP control register names to match datasheets for
      DA9063/63L
    - tty/vt: fix write/write race in ioctl(KDSKBSENT) handler
    - ext4: actually request zeroing of inode table after grow
    - ext4: fix ext4_show_options for file systems w/o journal
    - Btrfs: do not start a transaction at iterate_extent_inodes()
    - bcache: fix a race between cache register and cacheset unregister
    - bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
    - ipmi:ssif: compare block number correctly for multi-part return messages
    - crypto: gcm - Fix error return code in crypto_gcm_create_common()
    - crypto: gcm - fix incompatibility between "gcm" and "gcm_base"
    - crypto: chacha20poly1305 - set cra_name correctly
    - crypto: salsa20 - don't access already-freed walk.iv
    - crypto: arm/aes-neonbs - don't access already-freed walk.iv
    - writeback: synchronize sync(2) against cgroup writeback membership switches
    - fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going
      into workqueue when umount
    - ALSA: hda/realtek - Fix for Lenovo B...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc
Terry Rudd (terrykrudd)
Changed in linux (Ubuntu Cosmic):
status: Incomplete → Invalid
Steve Langasek (vorlon)
Changed in linux (Ubuntu Disco):
status: Triaged → Won't Fix
Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu):
status: Triaged → Invalid