Kernel Panic with linux-image-4.15.0-60-generic when specifying nameserver in docker-compose

Bug #1842447 reported by Patrik Kernstock
This bug affects 47 people
Affects          Status         Importance   Assigned to                      Milestone
linux (Ubuntu)   Confirmed      Undecided    Unassigned
Bionic           Fix Released   Critical     Thadeu Lima de Souza Cascardo

Bug Description

[Impact]
Some fragmentation+NAT workloads will cause a kernel BUG/Oops.

[Test case]
sudo iptables -t nat -I POSTROUTING -j MASQUERADE
sudo hping3 192.168.122.1 -s 1000 -p 2000 -d 60000

[Regression potential]
This could make fragmented packets stop flowing. So, make sure fragmented pings still work.
ping 192.168.122.1 -s 60000 still works, even with the above nat rule.
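
As a rough check on why both commands exercise the fragmentation path, the number of IPv4 fragments can be estimated with shell arithmetic. This is only a sketch: the function name frag_count is made up for illustration, and it assumes a 1500-byte MTU, a 20-byte IP header without options, an 8-byte ICMP header for ping, and a 20-byte TCP header for hping3 in its default TCP mode.

```shell
# frag_count PAYLOAD L4_HEADER MTU  (hypothetical helper, not part of any tool)
# Estimates how many IPv4 fragments a datagram needs, assuming a 20-byte
# IP header without options.
frag_count() {
  payload=$1
  l4_header=$2
  mtu=$3
  # Every fragment except the last carries a multiple of 8 bytes of data.
  per_frag=$(( (mtu - 20) / 8 * 8 ))
  total=$(( payload + l4_header ))
  # Ceiling division: round up to a whole number of fragments.
  echo $(( (total + per_frag - 1) / per_frag ))
}

frag_count 60000 8 1500    # ICMP echo, as in "ping -s 60000"
frag_count 60000 20 1500   # TCP without options, as assumed for "hping3 -d 60000"
```

Under these assumptions both datagrams are split into dozens of fragments, so each packet passes through the fragmentation code while the MASQUERADE rule is active.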

--------------------------------------------

Hello,

there are multiple inquiries in the mailcow GitHub issues over at https://github.com/mailcow/mailcow-dockerized/issues/2904 reporting that the latest kernel, linux-image-4.15.0-60-generic, causes kernel panics when the "dns" setting is used within the docker-compose.yml file, for reasons that are not yet clear.

Multiple users on different systems (e.g. virtualized ones on VMware ESXi and KVM) were able to reproduce the same issue. I was also able to reproduce it consistently on a freshly deployed Ubuntu 18.04 VM (KVM) with a fresh mailcow installation.

Steps to reproduce:
1. Install a clean Ubuntu 18.04(.03) machine
2. Upgrade the installation to linux-image-4.15.0-60-generic
3. Set up mailcow as instructed at https://mailcow.github.io/mailcow-dockerized-docs/i_u_m_install/ (takes less than a minute, easy to reproduce)
4. Start mailcow with the "dns" settings specified in the docker-compose file (make sure to use the older docker-compose.yml version that still has the dns settings: https://raw.githubusercontent.com/mailcow/mailcow-dockerized/a1403b7a5969637df23001d05c59c2a20774fbb5/docker-compose.yml)
5. Wait a few minutes, then the kernel crash appears

Using this workaround it appears to be stable again: https://github.com/mailcow/mailcow-dockerized/commit/dc6eea5142c063e26408a685b66fbb7754408ec2

I've attached the apport file to this bug. Please let me know if you need any further information. (As this is my first bug report here, I hope I have included all the information required to help you find the cause.)

Kind regards,
Patrik

Patrik Kernstock (pkernstock) wrote :
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
summary: - Kernel Panic with linux-image-4.15.0-60-generic when using docker-
- compose with dns setting
+ Kernel Panic with linux-image-4.15.0-60-generic when specifying
+ nameserver in docker-compose
jkrakenda (krakenda)
no longer affects: linux
Fingerless Gloves (fingerlessgloves) wrote :

Tested with the original docker-compose.yml, in which every container except unbound uses the dns setting.

Test1: Crashed (took less than 1 minute to crash)
Fully updated ubuntu, using Scaleway's ubuntu image. (4.15.0-60-generic)
Crash: https://gist.github.com/FingerlessGlov3s/f829642cfc6390c975ca89be3b0d685a

Test2: Crashed (took less than 5 seconds to crash)
Fully updated ubuntu, using Scaleway's ubuntu image. (4.15.0-60-generic)
Also turned off AppArmor; SELinux is not installed (systemctl disable apparmor.service, then rebooted)
Crash: https://gist.github.com/FingerlessGlov3s/fe3b5e6b89986b397e20bc5474049584

Test3: Working
Fully updated ubuntu, using Scaleway's ubuntu image, but with the old kernel (4.15.0-58-generic)

Just ask if you want any more information or testing.

Wouter Horré (wouterh) wrote :

We were just affected by this bug on one of our servers. I was able to get some information from the kernel log (see attachment).

It appears to crash at net/ipv4/ip_output.c:636:

Sep 4 07:54:04 gonzo kernel: [ 145.912955] kernel BUG at /build/linux-5mCauq/linux-4.15.0/net/ipv4/ip_output.c:636!
Sep 4 07:54:04 gonzo kernel: [ 146.005576] invalid opcode: 0000 [#1] SMP PTI
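
For anyone checking their own logs for the same signature, a filter along these lines may help. It is a minimal sketch: the function name kernel_bug_locations is invented here, and the log sources in the usage comments are assumptions about a typical Ubuntu setup.

```shell
# Hypothetical helper: read kernel log lines on stdin and print the
# source location (file:line) of every "kernel BUG at" message.
kernel_bug_locations() {
  sed -n 's/.*kernel BUG at \([^!]*\)!.*/\1/p'
}

# Possible usage (paths and tools are assumptions, adjust to the system):
#   kernel_bug_locations < /var/log/kern.log
#   journalctl -k | kernel_bug_locations
```

A line ending in net/ipv4/ip_output.c:636 would match the crash reported here.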

Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux (Ubuntu Bionic):
status: Confirmed → In Progress
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
importance: Undecided → Critical
description: updated
description: updated
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Andreas Perhab (ap-wtioit) wrote :

I was affected by this (or a very similar bug), but it only happened on the first HTTP request to a docker-compose managed docker container.

André Peters (andryyy) wrote :

Thanks!!!

Fingerless Gloves (fingerlessgloves) wrote :

Thank you for fixing this, most appreciated.

When will the fix be available in the repos? With the next kernel release?

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
André Peters (andryyy) wrote :

Fixed it for me (via 4.15.0-62-generic).

Patrik Kernstock (pkernstock) wrote :

I can also confirm that it's now working. Thanks a lot for the fast fix, @cascardo!

tags: added: verification-done-bionic
removed: verification-needed-bionic
Changed in linux (Ubuntu Bionic):
status: Fix Committed → In Progress
status: In Progress → Fix Committed
André Peters (andryyy) wrote :

Oh. I was able to set the status to released? :-D Sorry. Not released yet.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Bionic):
status: Fix Released → Fix Committed
Em House (ehouse) wrote :

We updated last night (4.15.0-60) and started having issues this morning. The issue went away after updating to 5.0.0.27 HWE. We're up and running again, but are trying to figure out where the issue stems from. Is it caused by a bug or by an attack?

Jason A. Donenfeld (zx2c4) wrote :

It's possible this same issue is responsible for this crash in WireGuard: https://lists.zx2c4.com/pipermail/wireguard/2019-September/004495.html

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Oliver Mueller (oliver-vpr) wrote :

The last status change happened by accident. The fix has not been released yet. I cannot change the status back to "Fix Committed", though. Can anyone help with this?

Changed in linux (Ubuntu Bionic):
status: Fix Released → Fix Committed
Julian Alexander (julalx) wrote :

This bug needs to be handled with a higher degree of urgency. It impacts any machine that performs any type of NAT; it is not application specific.

Ewen McNeill (ewen) wrote :

I agree with Taher (in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1842447/comments/15), this bug seems to impact a lot of systems (my colo host was kernel panic restarting about every 75-90 minutes, all weekend). It has a NAT firewall on it (for the hosted VMs), but no Docker/Wireguard, etc. My guess for the 75-90 minutes is that is how long it took to dirty enough memory that the relevant value just happened not to already be 0 (NULL). (Curiously I installed -60 on Thursday last week, and the first issue didn't happen until Friday, so it's been worse over the weekend than the first 24 hours. I enabled kernel.panic=15 after the first issue, to automate recovery, and was *extremely* glad I did so.)

Honestly I'd suggest withdrawing -60 as it's very unstable in a lot of common configurations. And also suggest expediting the release of -62, which AFAICT just contains the one line fix for the bug in -60.

Now that it's Monday morning (and thus I can get into the colo if needed), I've upgraded the colo system to the proposed -62 version, and am crossing my fingers that the system is more stable as a result.

In case it helps others, I found I needed to:

(a) follow https://wiki.ubuntu.com/Testing/EnableProposed (changing "xenial" to "bionic" for 18.04 LTS, and including the low priority pin of bionic-proposed); and

(b) sudo apt-get install linux-generic/bionic-proposed linux-signed-generic/bionic-proposed linux-headers-generic/bionic-proposed

(without at least two of those three, the proposed update metapackages wouldn't install due to conflicts; I'm not sure if linux-signed-generic is needed, but it's still installed, so I chose to keep it in sync.)

That list of packages was found by looking for 4.15.0-60 versioned packages that didn't have that version in their package name (i.e., to find the generic metapackages).
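
For reference, the low priority pin mentioned in (a) is an apt preferences stanza. The sketch below only prints such a stanza; the helper name proposed_pin is invented, and both the priority value 400 and the target file /etc/apt/preferences.d/proposed-updates are what I recall from the wiki page, so please check them against the page itself.

```shell
# Hypothetical helper: print an apt preferences stanza that gives the
# -proposed pocket of the given release a low priority.
# The value 400 is an assumption taken from memory of the wiki page.
proposed_pin() {
  release=$1
  printf 'Package: *\nPin: release a=%s-proposed\nPin-Priority: 400\n' "$release"
}

proposed_pin bionic
# Possible usage (file name is an assumption):
#   proposed_pin bionic | sudo tee /etc/apt/preferences.d/proposed-updates
```

With a pin like this in place, packages from -proposed are only installed when requested explicitly, as in (b).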

Ewen

PS: Reboots (due to kernel panic, and kernel.panic=15 sysctl) over the weekend:

-=- cut here -=-
ewen@naosr620:~$ last | grep reboot
reboot system boot 4.15.0-62-generi Mon Sep 9 10:43 still running
reboot system boot 4.15.0-60-generi Mon Sep 9 10:14 - 10:39 (00:25)
reboot system boot 4.15.0-60-generi Mon Sep 9 08:48 - 10:09 (01:21)
reboot system boot 4.15.0-60-generi Mon Sep 9 07:33 - 10:09 (02:36)
reboot system boot 4.15.0-60-generi Mon Sep 9 06:18 - 10:09 (03:51)
reboot system boot 4.15.0-60-generi Mon Sep 9 05:03 - 10:09 (05:06)
reboot system boot 4.15.0-60-generi Mon Sep 9 03:48 - 10:09 (06:21)
reboot system boot 4.15.0-60-generi Mon Sep 9 02:33 - 10:09 (07:36)
reboot system boot 4.15.0-60-generi Mon Sep 9 01:13 - 10:09 (08:56)
reboot system boot 4.15.0-60-generi Sun Sep 8 23:58 - 10:09 (10:11)
reboot system boot 4.15.0-60-generi Sun Sep 8 22:43 - 10:09 (11:26)
reboot system boot 4.15.0-60-generi Sun Sep 8 21:28 - 10:09 (12:41)
reboot system boot 4.15.0-60-generi Sun Sep 8 20:08 - 10:09 (14:01)
reboot system boot 4.15.0-60-generi Sun Sep 8 18:53 - 10:09 (15:16)
reboot system boot 4.15.0-60-generi Sun Sep 8 17:38 - 10:09 (16:31)
reboot system boot 4.15.0-60-generi Sun Sep 8 16:23 - 10:09 (17:46)
reboot system boot 4.15.0-60-generi Sun Sep 8...
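
The interval between panics can be read off a listing like this with a little awk. This is a rough sketch: the function name boot_gaps is made up, and it assumes the column layout shown above (newest boot first, day of month in field 7, HH:MM in field 8) with all entries falling in the same month.

```shell
# Hypothetical helper: read "last" output on stdin and print the number
# of minutes between each pair of consecutive boots.
# Assumes newest-first ordering and that all boots are in one month.
boot_gaps() {
  awk '$1 == "reboot" {
    split($8, t, ":")
    # Minutes since the start of the month.
    now = $7 * 1440 + t[1] * 60 + t[2]
    if (seen) print prev - now
    prev = now
    seen = 1
  }'
}

# Possible usage:
#   last | grep reboot | boot_gaps
```

For the entries shown above, the gaps come out between 75 and 86 minutes.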


Ewen McNeill (ewen) wrote :

FTR, I think this is the fix in -62:

https://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/commit/?h=master-next&id=b502cfeffec81be8564189e5498fd3f252b27900

and it appears to be the only change from -60 to -62:

-=- cut here -=-
ewen@naosr620:~$ zcat /usr/share/doc/linux-headers-4.15.0-62-generic/changelog.Debian.gz | head -9
linux (4.15.0-62.69) bionic; urgency=medium

  * bionic/linux: 4.15.0-62.69 -proposed tracker (LP: #1842746)

  * Kernel Panic with linux-image-4.15.0-60-generic when specifying nameserver
    in docker-compose (LP: #1842447)
    - ip: frags: fix crash in ip_do_fragment()

 -- Khalid Elmously <email address hidden> Wed, 04 Sep 2019 16:11:43 -0400
ewen@naosr620:~$
-=- cut here -=-

It's a one line fix.
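
To check whether some other installed kernel package carries the same fix, its changelog can be searched for this bug's number. A minimal sketch, with an invented function name (changelog_mentions_bug); it only tests whether the "(LP: #...)" reference appears and says nothing about whether that kernel is the one currently running.

```shell
# Hypothetical helper: succeed if the changelog text on stdin contains
# a reference to the given Launchpad bug number, e.g. "(LP: #1842447)".
changelog_mentions_bug() {
  grep -qF "(LP: #$1)"
}

# Possible usage, with the changelog path taken from the output above:
#   zcat /usr/share/doc/linux-headers-4.15.0-62-generic/changelog.Debian.gz \
#     | changelog_mentions_bug 1842447 && echo "fix listed"
```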

Ewen

Ewen McNeill (ewen) wrote :

FTR, 4.15.0-62 seems *much* better than 4.15.0-60. With 4.15.0-60 this system was kernel panic restarting every 75-90 minutes; now it's been up since I installed 4.15.0-62, over 5 hours ago:

-=- cut here -=-
ewen@naosr620:~$ uname -r
4.15.0-62-generic
ewen@naosr620:~$ uptime
 16:09:54 up 5:26, 1 user, load average: 0.24, 0.25, 0.24
ewen@naosr620:~$
-=- cut here -=-

That one line fix seems important :-)

Ewen

Julian Alexander (julalx) wrote :

Could you please release 4.15.0-62 into the main repo? It's still in -proposed. Thank you.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-62.69

---------------
linux (4.15.0-62.69) bionic; urgency=medium

  * bionic/linux: 4.15.0-62.69 -proposed tracker (LP: #1842746)

  * Kernel Panic with linux-image-4.15.0-60-generic when specifying nameserver
    in docker-compose (LP: #1842447)
    - ip: frags: fix crash in ip_do_fragment()

 -- Khalid Elmously <email address hidden> Wed, 04 Sep 2019 16:11:43 -0400

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Brian Detweiler (bldetwei) wrote :

Will this be available for xenial soon?

Gerard Krupa (uxian) wrote :

I'm seeing the same kernel panic on linux-aws 4.15.0.1048.47 as well

Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, @uxian.

linux-aws 4.15.0-1047 has the bug, and I can reproduce it. linux-aws 4.15.0-1048, however, has the fix and should not present the bug. I can confirm I wasn't able to reproduce the issue. Can you send any logs?

Thanks.
Cascardo.

Gerard Krupa (uxian) wrote :

Disregard. It looks like 4.15.0.1048 is installed, but running uname -a shows 4.15.0.1047, so the box just hasn't been rebooted since the update. Since it's in an AWS ASG, it reverts to the 1047 version from the AMI every time it panics and AWS re-creates the instance. Time to rebuild the AMI with the new kernel.
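
A quick way to catch this kind of mismatch is to compare the running release with the one that was expected after the update. The sketch below is only illustrative: the function name needs_reboot is invented, and the "-aws" release strings in the usage comment are assumptions about how uname -r reports these kernels.

```shell
# Hypothetical helper: compare the running kernel release with the
# release that is expected to be running after an update.
# Both arguments are plain strings; nothing is queried from the system.
needs_reboot() {
  running=$1
  expected=$2
  if [ "$running" = "$expected" ]; then
    echo "running kernel matches expected kernel"
  else
    echo "reboot needed: running $running, expected $expected"
  fi
}

# Possible usage (the expected release string is an assumption):
#   needs_reboot "$(uname -r)" "4.15.0-1048-aws"
```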

Olivier Febwin (febcrash) wrote :

4.15.0-62.69~16.04.1 is available on -updates https://kernel.ubuntu.com/sru/sru-report.html

Steve Beattie (sbeattie) wrote :

For the xenial/linux-hwe kernel, 4.15.0-62.69~16.04.1 with this fix has been published. Are you seeing this issue in xenial's 4.4 kernel?
