Samba mount/umount in docker container triggers kernel Oops

Bug #1729637 reported by Fabian Holler on 2017-11-02
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Joseph Salisbury
Xenial           Medium       Joseph Salisbury
Zesty            Medium       Joseph Salisbury

Bug Description

== SRU Justification ==
This bug causes Samba mounts and umounts in a docker container to trigger
a kernel Oops. When running two docker containers, one as a Samba server and
the other as a Samba client that mounts and umounts an SMB share, a kernel
Oops can be triggered.

This bug happens in Xenial and Zesty, and is fixed by the following two commits:
76da0704507b ("ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER")
12d94a804946 ("ipv6: fix NULL dereference in ip6_route_dev_notify()")

Both commits are clean cherry picks. 76da0704507b is in mainline as of v4.12.
Commit 12d94a804946 is in mainline as of v4.13-rc6.

== Fixes ==
commit 76da0704507bbc51875013f6557877ab308cfd0a
Author: WANG Cong <xiyou.wangcong at gmail.com>
Date: Tue Jun 20 11:42:27 2017 -0700

    ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER

commit 12d94a804946af291e24b80fc53ec86264765781
Author: Eric Dumazet <edumazet at google.com>
Date: Tue Aug 15 04:09:51 2017 -0700

    ipv6: fix NULL dereference in ip6_route_dev_notify()

== Regression Potential ==
Both commits are specific to ipv6 and fix a regression introduced into Xenial and Zesty.

== Test Case ==
A test kernel was built with these patches and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

The kernel message:
  unregister_netdevice: waiting for lo to become free. Usage count = 1
shows up; some minutes later the Oops and/or warnings occur.
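
Because the message above precedes the Oops by some minutes, a saved kernel log can be scanned for it as an early warning. The following is a minimal Python sketch, not part of the original report: the function name `find_refcount_waits` is hypothetical, and the only thing taken from the report is the message text itself. It assumes the log has been saved as plain text, for example from dmesg output.

```python
import re

# Message text as quoted in this report. The device name and the count are
# captured so that devices other than "lo" are reported as well.
WAITING_RE = re.compile(
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free\. "
    r"Usage count = (?P<count>\d+)"
)


def find_refcount_waits(log_text):
    """Return a (device, usage_count) tuple for each matching log line."""
    hits = []
    for line in log_text.splitlines():
        match = WAITING_RE.search(line)
        if match:
            hits.append((match.group("dev"), int(match.group("count"))))
    return hits
```

Calling `find_refcount_waits()` on the contents of a saved log returns one entry per occurrence, so a growing list for the same device suggests the reference is not being released.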

The scripts to trigger the kernel Oops can be found at: https://github.com/fho/docker-samba-loop
I was able to reproduce kernel Oopses on a clean Ubuntu 16.04 installation with:

- linux-image-4.4.0-93-generic=4.4.0-93.116~14.04.1
- linux-image-4.10.0-32-generic=4.10.0-32.36~16.04.1
- linux-image-4.11.0-14-generic=4.11.0-14.20~16.04.1

In a different scenario, where Ubuntu 16.04 servers were running multiple docker containers with Nginx or small network applications in parallel, I was also able to reproduce the kernel Oopses on:

- linux-image-4.10.0-1004-gcp
- linux-image-4.12.10-041210-generic=4.12.10-041210.20170830

I haven't tried again to reproduce it with those kernels on a clean Ubuntu
installation and unfortunately didn't keep the kernel logs.

The "unregister_netdevice: waiting for lo to become free. Usage count = 1" messages are related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407, which is handled as a separate issue.

According to https://github.com/moby/moby/issues/35068 the crash is fixed by:
https://patchwork.ozlabs.org/patch/801533/
https://patchwork.ozlabs.org/patch/778449/
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Nov 3 09:51 seq
 crw-rw---- 1 root audio 116, 33 Nov 3 09:51 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Google Google Compute Engine
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=rxvt-unicode-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.10.0-37-generic root=UUID=bf9a017a-931d-4191-84bc-b8434dbba527 ro scsi_mod.use_blk_mq=Y console=ttyS0
ProcVersionSignature: Ubuntu 4.10.0-37.41~16.04.1-generic 4.10.17
RelatedPackageVersions:
 linux-restricted-modules-4.10.0-37-generic N/A
 linux-backports-modules-4.10.0-37-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial uec-images xenial uec-images
Uname: Linux 4.10.0-37-generic x86_64
UnreportableReason: The report belongs to a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: False
dmi.bios.date: 01/01/2011
dmi.bios.vendor: Google
dmi.bios.version: Google
dmi.board.asset.tag: A3DDBB61-646B-C60C-3999-1F1D7B7A334A
dmi.board.name: Google Compute Engine
dmi.board.vendor: Google
dmi.chassis.type: 1
dmi.chassis.vendor: Google
dmi.modalias: dmi:bvnGoogle:bvrGoogle:bd01/01/2011:svnGoogle:pnGoogleComputeEngine:pvr:rvnGoogle:rnGoogleComputeEngine:rvr:cvnGoogle:ct1:cvr:
dmi.product.name: Google Compute Engine
dmi.sys.vendor: Google

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1729637

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected uec-images xenial
description: updated

Fabian Holler (fh0) wrote : Lspci.txt

Fabian Holler (fh0) wrote :

The collected apport information is from a fresh boot of the machine, because the bug causes the machine to crash.

Attached are the logs of a kernel crash that was triggered by the described method.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
status: New → Triaged
Changed in linux (Ubuntu Zesty):
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
Changed in linux (Ubuntu Zesty):
importance: Undecided → Medium
Joseph Salisbury (jsalisbury) wrote :

I built Xenial and Zesty test kernels with the following two commits:

12d94a804946 ("ipv6: fix NULL dereference in ip6_route_dev_notify()")
76da0704507b ("ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER")

These commits are already in the Artful kernel.

The test kernels can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1729637/

Can you test these kernels and see if they resolve this bug?

Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
Changed in linux (Ubuntu Zesty):
status: Triaged → In Progress
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Zesty):
assignee: nobody → Joseph Salisbury (jsalisbury)
Fabian Holler (fh0) wrote :

Thanks for the fast reply.

I tried both kernels and was not able to trigger an Oops.

On the 4.4.0-98.121~lp1729637-generic kernel a hung task warning happened:

[ +0.750497] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ +0.992665] aufs au_opts_verify:1597:dockerd[1620]: dirperm1 breaks the protection by the permission bits on the lower branch
[ +0.016910] aufs au_opts_verify:1597:dockerd[1620]: dirperm1 breaks the protection by the permission bits on the lower branch
[ +0.015247] aufs au_opts_verify:1597:dockerd[1592]: dirperm1 breaks the protection by the permission bits on the lower branch
[ +0.006387] device veth7d3bee3 entered promiscuous mode
[ +0.000923] IPv6: ADDRCONF(NETDEV_UP): veth7d3bee3: link is not ready
[ +9.051406] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ +10.067531] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ +10.083551] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Nov 3 18:53] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ +8.055656] INFO: task exe:2868 blocked for more than 120 seconds.
[ +0.006363] Not tainted 4.4.0-98-generic #121~lp1729637
[ +0.005970] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.008347] exe D ffff88012a0e3cb8 0 2868 1 0x00000000
[ +0.000006] ffff88012a0e3cb8 ffffffff821db9a0 ffffffff81e11500 ffff8800ba25f000
[ +0.000002] ffff88012a0e4000 ffffffff81ef7a64 ffff8800ba25f000 00000000ffffffff
[ +0.000002] ffffffff81ef7a68 ffff88012a0e3cd0 ffffffff818405d5 ffffffff81ef7a60
[ +0.000002] Call Trace:
[ +0.000011] [<ffffffff818405d5>] schedule+0x35/0x80
[ +0.000004] [<ffffffff8184087e>] schedule_preempt_disabled+0xe/0x10
[ +0.000003] [<ffffffff818424b9>] __mutex_lock_slowpath+0xb9/0x130
[ +0.000002] [<ffffffff8184254f>] mutex_lock+0x1f/0x30
[ +0.000007] [<ffffffff8172ea8e>] copy_net_ns+0x6e/0x120
[ +0.000010] [<ffffffff810a172b>] create_new_namespaces+0x11b/0x1d0
[ +0.000001] [<ffffffff810a184d>] copy_namespaces+0x6d/0xa0
[ +0.000005] [<ffffffff8107f1d2>] copy_process+0x8e2/0x1b30
[ +0.000003] [<ffffffff810805b0>] _do_fork+0x80/0x360
[ +0.000002] [<ffffffff81080939>] SyS_clone+0x19/0x20
[ +0.000004] [<ffffffff818446f2>] entry_SYSCALL_64_fastpath+0x16/0x71
[ +1.999139] unregister_netdevice: waiting for lo to become free. Usage count = 1
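
The call path of the blocked task in the trace above can be extracted mechanically, which makes it easier to compare against other reports of the hung task. This is a minimal Python sketch under one assumption: frames are printed as `[<address>] symbol+0xoffset/0xsize`, as they are in this 4.4 log. The name `trace_symbols` is hypothetical.

```python
import re

# One stack frame as printed in the log above:
#   [<ffffffff8172ea8e>] copy_net_ns+0x6e/0x120
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_symbols(log_text):
    """Return the function names of all stack frames, innermost first."""
    return [match.group("symbol") for match in FRAME_RE.finditer(log_text)]
```

Applied to the trace above, the resulting list shows the task waiting on a mutex inside `copy_net_ns`, reached from `SyS_clone`.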

Joseph Salisbury (jsalisbury) wrote :

The test kernel fixes the original bug, but introduces this new call trace?

Do you happen to know if this new call trace also happens with artful?

Fabian Holler (fh0) wrote :

No, the hung task warning is bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407

Without the patches the kernel probably crashed during my tests before the hung task could happen.

On artful the hung task warning does not happen.

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Fabian. I can SRU those two commits to Zesty and Xenial.

Stefan Bader (smb) on 2017-11-20
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Committed
Khaled El Mously (kmously) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-zesty
Khaled El Mously (kmously) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Fabian Holler (fh0) wrote :

I could not reproduce the bug with 4.4.0-102-generic or 4.10.0-41-generic
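
For checking whether a given machine runs a kernel at or past the verified versions, a release string such as the one printed by `uname -r` can be compared against them. This is a minimal Python sketch with stated assumptions: the thresholds come from the changelogs in this report (4.4.0-102.125 for Xenial, 4.10.0-41.45 for Zesty), only the `generic` flavour is covered because other flavours such as `gcp` are assumed to use their own ABI numbering, and the name `has_fix` is hypothetical. Locally built test kernels are not recognised by this check.

```python
import re

# First ABI number per series that carries both ipv6 commits, taken from the
# changelogs in this report. Assumption: applies to the generic flavour only.
FIXED_ABI = {"4.4.0": 102, "4.10.0": 41}

# Matches release strings such as "4.4.0-102-generic".
RELEASE_RE = re.compile(r"^(\d+\.\d+\.\d+)-(\d+)-(\S+)$")


def has_fix(release):
    """Return True or False for a known series, None when it cannot be judged."""
    match = RELEASE_RE.match(release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    base, abi, flavour = match.group(1), int(match.group(2)), match.group(3)
    if flavour != "generic" or base not in FIXED_ABI:
        return None  # Different flavour or series: no threshold known here.
    return abi >= FIXED_ABI[base]
```

A `None` result means the sketch has no information about that kernel, not that the kernel is unaffected.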

tags: added: verification-done-xenial verification-done-zesty
removed: verification-needed-xenial verification-needed-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-42.46

---------------
linux (4.10.0-42.46) zesty; urgency=low

  * linux: 4.10.0-42.46 -proposed tracker (LP: #1736152)

  * CVE-2017-1000405
    - mm, thp: Do not make page table dirty unconditionally in touch_p[mu]d()

  * CVE-2017-16939
    - ipsec: Fix aborted xfrm policy dump crash

linux (4.10.0-41.45) zesty; urgency=low

  * linux: 4.10.0-41.45 -proposed tracker (LP: #1733524)

  * tar -x sometimes fails on overlayfs (LP: #1728489)
    - ovl: check if all layers are on the same fs
    - ovl: persistent inode number for directories

  * CVE-2017-12146
    - driver core: platform: fix race condition with driver_override

  * NVMe timeout is too short (LP: #1729119)
    - nvme: update timeout module parameter type

  * Set PANIC_TIMEOUT=10 on Power Systems (LP: #1730660)
    - [Config]: Set PANIC_TIMEOUT=10 on ppc64el

  * Cannot pair BLE remote devices when using combo BT SoC (LP: #1731467)
    - Bluetooth: increase timeout for le auto connections

  * Plantronics P610 does not support sample rate reading (LP: #1719853)
    - ALSA: usb-audio: Add sample rate quirk for Plantronics P610

  * Invalid btree pointer causes the kernel NULL pointer dereference
    (LP: #1729256)
    - xfs: reinit btree pointer on attr tree inactivation walk

  * Samba mount/umount in docker container triggers kernel Oops (LP: #1729637)
    - ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
    - ipv6: fix NULL dereference in ip6_route_dev_notify()

  * Device hotplugging with MPT SAS cannot work for VMWare ESXi (LP: #1730852)
    - scsi: mptsas: Fixup device hotplug for VMWare ESXi

  * Boot/Installation crash of Ubuntu-16.04.3 HWE kernel on R940 (LP: #1719697)
    - Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"

 -- Stefan Bader <email address hidden> Mon, 04 Dec 2017 15:04:01 +0100

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-103.126

---------------
linux (4.4.0-103.126) xenial; urgency=low

  * linux: 4.4.0-103.126 -proposed tracker (LP: #1736181)

  * CVE-2017-1000405
    - mm, thp: Do not make page table dirty unconditionally in touch_p[mu]d()

  * CVE-2017-16939
    - netlink: add a start callback for starting a netlink dump
    - ipsec: Fix aborted xfrm policy dump crash

linux (4.4.0-102.125) xenial; urgency=low

  * linux: 4.4.0-102.125 -proposed tracker (LP: #1733541)

  * tar -x sometimes fails on overlayfs (LP: #1728489)
    - ovl: check if all layers are on the same fs
    - ovl: persistent inode number for directories

  * NVMe timeout is too short (LP: #1729119)
    - nvme: update timeout module parameter type

  * Set PANIC_TIMEOUT=10 on Power Systems (LP: #1730660)
    - [Config]: Set PANIC_TIMEOUT=10 on ppc64el

  * Cannot pair BLE remote devices when using combo BT SoC (LP: #1731467)
    - Bluetooth: increase timeout for le auto connections

  * CIFS errors on 4.4.0-98, but not on 4.4.0-97 with same config (LP: #1729337)
    - SMB3: Validate negotiate request must always be signed

  * Plantronics P610 does not support sample rate reading (LP: #1719853)
    - ALSA: usb-audio: Add sample rate quirk for Plantronics P610

  * Invalid btree pointer causes the kernel NULL pointer dereference
    (LP: #1729256)
    - xfs: reinit btree pointer on attr tree inactivation walk

  * Samba mount/umount in docker container triggers kernel Oops (LP: #1729637)
    - ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
    - ipv6: fix NULL dereference in ip6_route_dev_notify()

  * [kernel] tty/hvc: Use opal irqchip interface if available (LP: #1728098)
    - tty/hvc: Use opal irqchip interface if available

  * Device hotplugging with MPT SAS cannot work for VMWare ESXi (LP: #1730852)
    - scsi: mptsas: Fixup device hotplug for VMWare ESXi

  * NMI watchdog: BUG: soft lockup on Guest upon boot (KVM) (LP: #1727331)
    - KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread

  * Attempt to map rbd image from ceph jewel/luminous hangs (LP: #1728739)
    - crush: ensure bucket id is valid before indexing buckets array
    - crush: ensure take bucket value is valid
    - crush: add chooseleaf_stable tunable
    - crush: decode and initialize chooseleaf_stable
    - libceph: advertise support for TUNABLES5
    - libceph: MOSDOpReply v7 encoding

  * Xenial update to 4.4.98 stable release (LP: #1732698)
    - adv7604: Initialize drive strength to default when using DT
    - video: fbdev: pmag-ba-fb: Remove bad `__init' annotation
    - PCI: mvebu: Handle changes to the bridge windows while enabled
    - xen/netback: set default upper limit of tx/rx queues to 8
    - drm: drm_minor_register(): Clean up debugfs on failure
    - KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
    - iommu/arm-smmu-v3: Clear prior settings when updating STEs
    - powerpc/corenet: explicitly disable the SDHC controller on kmcoge4
    - ARM: omap2plus_defconfig: Fix probe errors on UARTs 5 and 6
    - crypto: vmx - disable preemption to enable vsx in aes_ctr.c
    - iio: trigger: free trigger...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released