Bug #1711407 “unregister_netdevice: waiting for lo to become fre...” : Bugs : linux package : Ubuntu

Revision history for this message

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2017-08-17: Missing required logs.

#1

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1711407

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-08-17:

#2

Can anyone experiencing this issue please provide details such as:

-what release are you using (trusty/xenial/zesty)?

-what kernel version are you using?

-do you have specific steps to reproduce the problem?

description:	updated
Changed in linux (Ubuntu):
assignee:	nobody → Dan Streetman (ddstreet)
importance:	Undecided → Medium
status:	Incomplete → In Progress

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17:

#3

Download full text (5.6 KiB)

Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

cat /etc/apt/sources.list.d/docker.list
deb [arch=amd64] https://apt.dockerproject.org/repo ubuntu-xenial main
# deb-src [arch=amd64] https://apt.dockerproject.org/repo ubuntu-xenial main

cat /etc/apt/sources.list.d/runner_gitlab-ci-multi-runner.list
# this file was generated by packages.gitlab.com for
# the repository at https://packages.gitlab.com/runner/gitlab-ci-multi-runner

deb https://packages.gitlab.com/runner/gitlab-ci-multi-runner/ubuntu/ xenial main
deb-src https://packages.gitlab.com/runner/gitlab-ci-multi-runner/ubuntu/ xenial main

dpkg -l | grep gitlab
ii gitlab-ci-multi-runner 9.4.2 amd64 GitLab Runner

dpkg -l | grep docker
ii docker-engine 17.05.0~ce-0~ubuntu-xenial amd64 Docker: the open-source application container engine

Seems to only occur when running docker with privileged inside of gitlab-runner. On "close" of the docker container, the process fails with the following:

Aug 17 17:53:04 WIS-DOCKERPHY kernel: [ 1579.245473] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:15 WIS-DOCKERPHY kernel: [ 1589.356182] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:25 WIS-DOCKERPHY kernel: [ 1599.466898] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:35 WIS-DOCKERPHY kernel: [ 1609.577597] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:45 WIS-DOCKERPHY kernel: [ 1619.692335] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:54 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:54.994924776-05:00" level=error msg="containerd: start container" error="containerd: container did not start before the specified timeout" id=2ef6677cc840f3597e5b38024cfb337856adfaccdcc1a2739f78413979c0fa2b
Aug 17 17:53:54 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:54.995769254-05:00" level=error msg="Create container failed with error: containerd: container did not start before the specified timeout"
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.366310] docker0: port 1(veth9f03d7e) entered disabled state
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.367889] device veth9f03d7e left promiscuous mode
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.367891] docker0: port 1(veth9f03d7e) entered disabled state
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.449744] auplink[25254]: segfault at 7ffdd8ddb0d8 ip 00007facdfff1579 sp 00007ffdd8ddb0e0 error 6 in libc-2.23.so[7facdfef8000+1c0000]
Aug 17 17:53:55 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:55.215290996-05:00" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/90bdc718c5e518ef7b03df0e1769ab4638642ad0e847a38dbf16c6e600783f00: signal: segmentation fault (core dumped)"
Aug 17 17:53:55 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:55.215372764-05:00" level=error msg="Error unmounting container 2ef6677cc840f3597e5b38024cfb337856adfaccdcc1a2739f78413979c0fa2b: device or resource...

Distributor ID: Ubuntu
Description:    Ubuntu 16.04.3 LTS
Release:        16.04
Codename:       xenial

cat /etc/apt/sources.list.d/docker.list
deb [arch=amd64] https://apt.dockerproject.org/repo ubuntu-xenial main
# deb-src [arch=amd64] https://apt.dockerproject.org/repo ubuntu-xenial main

cat /etc/apt/sources.list.d/runner_gitlab-ci-multi-runner.list
# this file was generated by packages.gitlab.com for
# the repository at https://packages.gitlab.com/runner/gitlab-ci-multi-runner

deb https://packages.gitlab.com/runner/gitlab-ci-multi-runner/ubuntu/ xenial main
deb-src https://packages.gitlab.com/runner/gitlab-ci-multi-runner/ubuntu/ xenial main

dpkg -l | grep gitlab
ii  gitlab-ci-multi-runner                   9.4.2                                      amd64        GitLab Runner

dpkg -l | grep docker
ii  docker-engine                            17.05.0~ce-0~ubuntu-xenial                 amd64        Docker: the open-source application container engine

Seems to only occur when running docker with privileged inside of gitlab-runner. On "close" of the docker container, the process fails with the following:

Aug 17 17:53:04 WIS-DOCKERPHY kernel: [ 1579.245473] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:15 WIS-DOCKERPHY kernel: [ 1589.356182] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:25 WIS-DOCKERPHY kernel: [ 1599.466898] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:35 WIS-DOCKERPHY kernel: [ 1609.577597] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:45 WIS-DOCKERPHY kernel: [ 1619.692335] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:54 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:54.994924776-05:00" level=error msg="containerd: start container" error="containerd: container did not start before the specified timeout" id=2ef6677cc840f3597e5b38024cfb337856adfaccdcc1a2739f78413979c0fa2b
Aug 17 17:53:54 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:54.995769254-05:00" level=error msg="Create container failed with error: containerd: container did not start before the specified timeout"
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.366310] docker0: port 1(veth9f03d7e) entered disabled state
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.367889] device veth9f03d7e left promiscuous mode
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.367891] docker0: port 1(veth9f03d7e) entered disabled state
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.449744] auplink[25254]: segfault at 7ffdd8ddb0d8 ip 00007facdfff1579 sp 00007ffdd8ddb0e0 error 6 in libc-2.23.so[7facdfef8000+1c0000]
Aug 17 17:53:55 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:55.215290996-05:00" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/90bdc718c5e518ef7b03df0e1769ab4638642ad0e847a38dbf16c6e600783f00: signal: segmentation fault (core dumped)"
Aug 17 17:53:55 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:55.215372764-05:00" level=error msg="Error unmounting container 2ef6677cc840f3597e5b38024cfb337856adfaccdcc1a2739f78413979c0fa2b: device or resource busy"
Aug 17 17:53:55 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:55.243281045-05:00" level=error msg="Handler for POST /v1.18/containers/2ef6677cc840f3597e5b38024cfb337856adfaccdcc1a2739f78413979c0fa2b/start returned error: containerd: container did not start before the specified timeout"
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.862984] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:54:05 WIS-DOCKERPHY kernel: [ 1640.005751] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:54:15 WIS-DOCKERPHY kernel: [ 1650.084470] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:54:25 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:54:25.123304254-05:00" level=error msg="containerd: start container" error="containerd: container did not start before the specified timeout" id=891adb44118b9114d73a600748e52160a250ec0dd7f94a7ad1fc2352d492252d
Aug 17 17:54:25 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:54:25.124077481-05:00" level=error msg="Create container failed with error: containerd: container did not start before the specified timeout"
Aug 17 17:54:25 WIS-DOCKERPHY kernel: [ 1659.490959] docker0: port 2(vetha61fb77) entered disabled state
Aug 17 17:54:25 WIS-DOCKERPHY kernel: [ 1659.492532] device vetha61fb77 left promiscuous mode
Aug 17 17:54:25 WIS-DOCKERPHY kernel: [ 1659.492534] docker0: port 2(vetha61fb77) entered disabled state
Aug 17 17:54:25 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:54:25.249728323-05:00" level=error msg="Error unmounting container 891adb44118b9114d73a600748e52160a250ec0dd7f94a7ad1fc2352d492252d: device or resource busy"
Aug 17 17:54:25 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:54:25.271272174-05:00" level=error msg="Handler for POST /v1.18/containers/891adb44118b9114d73a600748e52160a250ec0dd7f94a7ad1fc2352d492252d/start returned error: containerd: container did not start before the specified timeout"
Aug 17 17:54:25 WIS-DOCKERPHY kernel: [ 1660.163205] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:54:35 WIS-DOCKERPHY kernel: [ 1670.241938] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:54:45 WIS-DOCKERPHY kernel: [ 1680.320664] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:54:56 WIS-DOCKERPHY kernel: [ 1690.399418] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:55:06 WIS-DOCKERPHY kernel: [ 1700.478157] unregister_netdevice: waiting for lo to become free. Usage count = 1

Changed in linux (Ubuntu):
status:	In Progress → Confirmed
tags:	added: apport-collected xenial

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: apport information

#4

AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Aug 17 17:27 seq
crw-rw---- 1 root audio 116, 33 Aug 17 17:27 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=UUID=7356cc51-3c3e-4d0d-bb9c-9d8daf436e5c
InstallationDate: Installed on 2016-09-20 (331 days ago)
InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Release amd64 (20160719)
IwConfig: Error: [Errno 2] No such file or directory
MachineType: Dell Inc. PowerEdge R710
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
TERM=xterm
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.11.0-14-generic root=UUID=35a35d02-8a6c-402f-86c5-c39085c81bd3 ro
ProcVersionSignature: Ubuntu 4.11.0-14.20~16.04.1-generic 4.11.12
RelatedPackageVersions:
linux-restricted-modules-4.11.0-14-generic N/A
linux-backports-modules-4.11.0-14-generic N/A
linux-firmware 1.157.11
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial
Uname: Linux 4.11.0-14-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 07/23/2013
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 6.4.0
dmi.board.name: 00NH4P
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr6.4.0:bd07/23/2013:svnDellInc.:pnPowerEdgeR710:pvr:rvnDellInc.:rn00NH4P:rvrA00:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R710
dmi.sys.vendor: Dell Inc.

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: CRDA.txt

#5

CRDA.txt Edit (420 bytes, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: CurrentDmesg.txt

#6

CurrentDmesg.txt Edit (110.0 KiB, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: JournalErrors.txt

#7

JournalErrors.txt Edit (20.9 KiB, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: Lspci.txt

#8

Lspci.txt Edit (87.6 KiB, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: Lsusb.txt

#9

Lsusb.txt Edit (579 bytes, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: ProcCpuinfo.txt

#10

ProcCpuinfo.txt Edit (21.0 KiB, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: ProcCpuinfoMinimal.txt

#11

ProcCpuinfoMinimal.txt Edit (897 bytes, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: ProcInterrupts.txt

#12

ProcInterrupts.txt Edit (17.8 KiB, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: ProcModules.txt

#13

ProcModules.txt Edit (5.8 KiB, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: UdevDb.txt

#14

UdevDb.txt Edit (169.9 KiB, text/plain)

apport information

Revision history for this message

Andrew Reis (areis422) wrote on 2017-08-17: WifiSyslog.txt

#15

WifiSyslog.txt Edit (149.4 KiB, text/plain)

apport information

Revision history for this message

Fabian Holler (fh0) wrote on 2017-08-31:

#16

kernel-logs.txt Edit (10.4 KiB, text/plain)

With https://github.com/fho/docker-samba-loop I was able to reproduce kernel Oopses on a clean Ubuntu 16.0.4 installation with:

- linux-image-4.10.0-32-generic=4.10.0-32.36~16.04.1
- linux-image-4.11.0-14-generic=4.11.0-14.20~16.04.1

On 4.11.0-14 it was much harder to reproduce. Sometimes only a warning
happened, 1x time I was able to produce an Oops with
https://github.com/fho/docker-samba-loop/commits/more_stress
But I don't know if the changes in the branch made it more likely to happen or
if it was only a coincidence.

The kernel message:
unregister_netdevice: waiting for lo to become free. Usage count = 1
shows up, some minutes later the oops and/or warning happens.

In a different scenario were the Ubuntu 16.04 servers were running multiple docker containers with Nginx or small network applications in parallel, I was also able to reproduce the kernel Oopses also on:

- linux-image-4.10.0-1004-gcp
- linux-image-4.4.0-93-generic
- linux-image-4.11.0-14-generic
- linux-image-4.12.10-041210-generic=4.12.10-041210.20170830

I haven't tried again to reproduce it with those kernels on a clean Ubuntu
installation and unfortunately didn't kept the kernel logs.

Kernel logs can be found in the attached file.

Revision history for this message

Fabian Holler (fh0) wrote on 2017-08-31:

#17

kernoops-4.4.0-93.txt Edit (19.0 KiB, text/plain)

Attached are the logs for an Oops on Ubuntu 14.04 on kernel linux-image-4.4.0-93-generic=4.4.0-93.116~14.04.1

Revision history for this message

Thomas Morin (tmmorin-orange) wrote on 2017-09-07:

#18

bug 1715660 seems related

( Lots of "waiting for lo to become free" message, then a "task ip:1358 blocked for more than 120 seconds." with a backtrace )

Revision history for this message

Fabian Holler (fh0) wrote on 2017-09-07:

#19

I could not reproduce the bug with the described method with kernel 4.4.0-81-generic and neither with 4.13.0-041300rc7-generic. 4.4.0-81 logged a hung tasks but does not Oops.

So the bug might have been reintroduced between 4.4.0-82 and 4.4.0-93 and 4.13 seems to contain a fix.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-09-12:

#20

Thank you for the reproducer, I have been able to reproduce this using docker-samba-loop scripts. This should help me to dig into the kernel to see where the dst leak is coming from.

description:

updated

Revision history for this message

Charles (bityard) wrote on 2017-09-29:

#21

We're seeing this on our build servers. These are VMWare virtual machines which run docker containers in privileged mode. Inside the containers are Jenkins agents and a build environment. Part of the build process uses unshare to create these namespaces: ipc uts mount pid net. If you have any questions about this arrangement, I can try to provide more detail and maybe example code.

The kernel is 4.4.0-79-generic x86_64. We are using docker 17.05.0-ce.

In one case, the whole VM hung when this was encountered.

Are there any workarounds for 16.04 users beyond rebooting the host?

Revision history for this message

Jens Elkner (jelmd) wrote on 2017-10-02:

#22

Since xenial we have this very annoying behavior (unregister_netdevice: waiting for lo to become free. Usage count = 1) as well, which in turn makes a container reboot taking several minutes instead of seconds e.g. on vivid (lxc-ls -f also hangs for that time). E.g.:

[ +10.244888] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Oct 2 17:16] INFO: task lxc-autostart:5321 blocked for more than 120 seconds.
[ +0.007147] Tainted: P OE 4.4.0-96-generic #119-Ubuntu
[ +0.006796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.007913] lxc-autostart D ffff883fb99fbcb8 0 5321 1 0x00000000
[ +0.000007] ffff883fb99fbcb8 ffffffffffffffff ffff883ff2a3aa00 ffff881fed2fe200
[ +0.000004] ffff883fb99fc000 ffffffff81ef79a4 ffff881fed2fe200 00000000ffffffff
[ +0.000004] ffffffff81ef79a8 ffff883fb99fbcd0 ffffffff8183f165 ffffffff81ef79a0
[ +0.000004] Call Trace:
[ +0.000014] [<ffffffff8183f165>] schedule+0x35/0x80
[ +0.000005] [<ffffffff8183f40e>] schedule_preempt_disabled+0xe/0x10
[ +0.000016] [<ffffffff81841049>] __mutex_lock_slowpath+0xb9/0x130
[ +0.000014] [<ffffffff818410df>] mutex_lock+0x1f/0x30
[ +0.000009] [<ffffffff8172da4e>] copy_net_ns+0x6e/0x120
[ +0.000009] [<ffffffff810a179b>] create_new_namespaces+0x11b/0x1d0
[ +0.000003] [<ffffffff810a18bd>] copy_namespaces+0x6d/0xa0
[ +0.000007] [<ffffffff8107f1d2>] copy_process+0x8e2/0x1b30
[ +0.000011] [<ffffffff810805b0>] _do_fork+0x80/0x360
[ +0.000015] [<ffffffff81003226>] ? exit_to_usermode_loop+0xa6/0xd0
[ +0.000004] [<ffffffff81080939>] SyS_clone+0x19/0x20
[ +0.000009] [<ffffffff81843272>] entry_SYSCALL_64_fastpath+0x16/0x71
[ +4.882232] unregister_netdevice: waiting for lo to become free. Usage count = 1
...

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-10-11:

#23

Download full text (4.9 KiB)

Hi everyone,

so this is an interesting problem. There are 2 parts to this:

1. in the container being destroyed, some socket(s) remain open for a period of time, which prevents the container from fully exits until all its sockets have exited. While this happens you will see the 'waiting for lo to become free' message repeated.

2. while the previous container is waiting for its sockets to exit, so it can exit, any new containers are blocked from starting.

For issue #2, I'm not clear just yet on what is blocking the new containers from starting. I thought it could be the rtnl_lock since the exiting container does loop in the rtnl_unlock path, however while it's looping it also calls __rtnl_unlock/rtnl_lock during its periods of waiting, so the mutex should not be held the entire time. I'm still investigating that part of this issue to find what exactly is blocking the new containers.

For issue #1, the sockets are lingering because - at least for the reproducer case from comment 16 - there is a TCP connection that never is closed, so the kernel continues to probe the other end until it times out, which takes around 2 minutes. For this specific reproducer, there are 2 easy ways to work around this. First for background, the reproducer uses docker to create a CIFS mount, which uses a TCP connection. This is the script run in the 'client' side, where the hang happens:

    date ; \
    mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
    date ; \
    ls -la /mnt ; \
    umount /mnt ; \
    echo "umount ok"

this is repeated 50 times in the script, but the 2nd (and later) calls hang (for 2 minutes each, if you wait that long).

A. The reason the TCP connection to the CIFS server lingers is beacuse it is never closed; the reason it isn't closed is because the script exits the container immediately after unmounting. The kernel cifs driver, during fs unmount, closes the TCP socket to the CIFS server, which queues a TCP FIN to the server. Normally, this FIN will reach the server who will respond and the TCP connection will close. However, since the container starts exiting immediately, that FIN never reaches the CIFS server - the interface is taken down too quickly. So, the TCP connection remains. To avoid this, simply add a short sleep between unmounting and exiting the container, e.g. (new line prefixed with +):

    date ; \
    mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
    date ; \
    ls -la /mnt ; \
    umount /mnt ; \
+ sleep 1 ; \
    echo "umount ok"

that avoids the problem, and allows the container to exit immediately and the next container to start immediately.

B. Instead of delaying enough to allow the TCP connection to exit, the kernel's TCP configuration can be changed to more quickly timeout. The specific setting to change is tcp_orphan_retries, which controls how many times the kernel keeps trying to talk to a remote host over a closed socket. Normally, this defaults to 0 - this will cause the kernel to actually use a value of 8 (retries). So, change it from 0 to 1, to actually reduce it from 8 to 1 (confusing, I know). Like (n...

Hi everyone,

so this is an interesting problem.  There are 2 parts to this:

1. in the container being destroyed, some socket(s) remain open for a period of time, which prevents the container from fully exits until all its sockets have exited.  While this happens you will see the 'waiting for lo to become free' message repeated.

2. while the previous container is waiting for its sockets to exit, so it can exit, any new containers are blocked from starting.

For issue #2, I'm not clear just yet on what is blocking the new containers from starting.  I thought it could be the rtnl_lock since the exiting container does loop in the rtnl_unlock path, however while it's looping it also calls __rtnl_unlock/rtnl_lock during its periods of waiting, so the mutex should not be held the entire time.  I'm still investigating that part of this issue to find what exactly is blocking the new containers.

For issue #1, the sockets are lingering because - at least for the reproducer case from comment 16 - there is a TCP connection that never is closed, so the kernel continues to probe the other end until it times out, which takes around 2 minutes.  For this specific reproducer, there are 2 easy ways to work around this.  First for background, the reproducer uses docker to create a CIFS mount, which uses a TCP connection.  This is the script run in the 'client' side, where the hang happens:

date ; \
    mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
    date ; \
    ls -la /mnt ; \
    umount /mnt ; \
    echo "umount ok"

this is repeated 50 times in the script, but the 2nd (and later) calls hang (for 2 minutes each, if you wait that long).

A. The reason the TCP connection to the CIFS server lingers is beacuse it is never closed; the reason it isn't closed is because the script exits the container immediately after unmounting.  The kernel cifs driver, during fs unmount, closes the TCP socket to the CIFS server, which queues a TCP FIN to the server.  Normally, this FIN will reach the server who will respond and the TCP connection will close.  However, since the container starts exiting immediately, that FIN never reaches the CIFS server - the interface is taken down too quickly.  So, the TCP connection remains.  To avoid this, simply add a short sleep between unmounting and exiting the container, e.g. (new line prefixed with +):

date ; \
    mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
    date ; \
    ls -la /mnt ; \
    umount /mnt ; \
+    sleep 1 ; \
    echo "umount ok"

that avoids the problem, and allows the container to exit immediately and the next container to start immediately.

B. Instead of delaying enough to allow the TCP connection to exit, the kernel's TCP configuration can be changed to more quickly timeout.  The specific setting to change is tcp_orphan_retries, which controls how many times the kernel keeps trying to talk to a remote host over a closed socket.  Normally, this defaults to 0 - this will cause the kernel to actually use a value of 8 (retries).  So, change it from 0 to 1, to actually reduce it from 8 to 1 (confusing, I know).  Like (new lines prefixed with +):

date ; \
+    echo 1 > /proc/sys/net/ipv4/tcp_orphan_retries ; \
+    grep -H . /proc/sys/net/ipv4/tcp_orphan_retries ; \
    mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
    date ; \
    ls -la /mnt ; \
    umount /mnt ; \
    echo "umount ok"

the 'grep' added there is just so you can verify the value was correctly changed.  Note this method will take slightly longer than method A, since this does perform 1 retry, while method A closes the TCP connection correctly and does no retries.

To extrapolate this generally to situations beside this specific reproducer, here are some suggestions:

1. close all your connections before closing the container.  Especially kernel connections (like the CIFS TCP connection).

2. if you can't close the connections, add a delay as the last action the container takes, after bringing down all its interfaces (except lo) but before exiting - like sleep for up to 2 minutes.  The container will take up a bit more resources during that time but it should allow enough time for all TCP connections to close.

3. if you can't do either of those workarounds, reduce the number of TCP retries that will be done after the container starts exiting by changing the tcp_orphan_retries param (from inside the container - don't change this value in your host system).  Change the parameter right before you close the container, if possible.

Those are simply workarounds - I'm still investigating what is blocking new containers from starting, as well as looking at ways to forcibly destroy a container's sockets when the container is exiting (e.g. maybe sock_diag_destroy)

Please let me know if any of those, or similar, workarounds helps anyone who is seeing this problem.

Revision history for this message

Charles (bityard) wrote on 2017-10-24:

#24

Is there any way for a non-kernel-dev type to see exactly which resource unregister_netdevice is waiting for? (Or does it only keep track of usage count?)

Revision history for this message

Fabian Holler (fh0) wrote on 2017-10-25:

#25

Hello Dan,

thanks for the analysis!

That the startup of containers is delayed is annoying.
The much bigger issue is that it can reproducible cause a kernel Oops and crash a whole machine.

Revision history for this message

Fabian Holler (fh0) wrote on 2017-11-02:

#26

According to https://github.com/moby/moby/issues/35068 the crash is fixed by:

https://patchwork.ozlabs.org/patch/801533/
https://patchwork.ozlabs.org/patch/778449/

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-11-02:

#27

Download full text (3.6 KiB)

My analysis so far of the problem:

1. container A has an outstanding TCP connection (thus, a socket and dst which hold reference on the "lo" interface from the container). When the container is stopped, the TCP connection takes ~2 minutes to timeout (with default settings).

2. when container A is being removed, its net namespace is removed via net/core/net_namespace.c function cleanup_net(). This takes two locks - first, the net_mutex, and then the rtnl mutex (i.e. rtnl_lock). It then cleans up the net ns and calls rtnl_unlock(). However, rtnl_unlock() internally waits for all the namespace's interfaces to be freed, which requires all their references to reach 0. This must wait until the above TCP connection times out, before it releases its reference. So, at this point the container is hung inside rtnl_unlock() waiting, and it still holds the net_mutex. It doesn't release the net_mutex until its lo interface finally is destroyed after the TCP connection times out.

3. When a new container is started, part of its startup is to call net/core/net_namespace.c function copy_net_ns() to copy the caller's net namespace to the new container net namespace. However, this function locks the net_mutex. Since the previous container still is holding the net_mutex as explaned in #2 above, this new container creation blocks until #2 releases the mutex.

There are a few ways to possibly address this:

a) when a container is being removed, all its TCP connections should abort themselves. Currently, TCP connections don't directly register for interface unregister events - they explictly want to stick around, so if an interface is taken down and then brought back up, the TCP connection remains, and the communication riding on top of the connection isn't interrupted. However, the way TCP does this is to move all its dst references off the interface that is unregistering, to the loopback interface. This works for the initial network namespace, where the loopback interface is always present and never removed. However, this doesn't work for non-default network namespaces - like containers - where the loopback interface is unregistered when the container is being removed. So this aspect of TCP may need to change, to correctly handle containers. This also may not cover all causes of this hang, since sockets handle more than just tcp connections.

b) when a container is being removed, instead of holding the net_mutex while the cleaning up (and calling rtnl_unlock), it could release the net_mutex first (after removing all netns marked for cleanup from the pernet list), then call rtnl_unlock. This needs examination to make sure it would not introduce any races or other problems.

c) rtnl_unlock could be simplified - currently, it includes significant side-effects, which include the long delay while waiting to actually remove all references to the namespace's interfaces (including loopback). Instead of blocking rtnl_unlock() to do all this cleanup, the cleanup could be deferred. This also would need investigation to make sure no caller is expecting to be able to free resources that may be accessed later from the cleanup action (which I believe is the case).

As...

My analysis so far of the problem:

1. container A has an outstanding TCP connection (thus, a socket and dst which hold reference on the "lo" interface from the container). When the container is stopped, the TCP connection takes ~2 minutes to timeout (with default settings).

2. when container A is being removed, its net namespace is removed via net/core/net_namespace.c function cleanup_net(). This takes two locks - first, the net_mutex, and then the rtnl mutex (i.e. rtnl_lock). It then cleans up the net ns and calls rtnl_unlock(). However, rtnl_unlock() internally waits for all the namespace's interfaces to be freed, which requires all their references to reach 0. This must wait until the above TCP connection times out, before it releases its reference. So, at this point the container is hung inside rtnl_unlock() waiting, and it still holds the net_mutex. It doesn't release the net_mutex until its lo interface finally is destroyed after the TCP connection times out.

3. When a new container is started, part of its startup is to call net/core/net_namespace.c function copy_net_ns() to copy the caller's net namespace to the new container net namespace. However, this function locks the net_mutex. Since the previous container still is holding the net_mutex as explaned in #2 above, this new container creation blocks until #2 releases the mutex.

There are a few ways to possibly address this:

a) when a container is being removed, all its TCP connections should abort themselves. Currently, TCP connections don't directly register for interface unregister events - they explictly want to stick around, so if an interface is taken down and then brought back up, the TCP connection remains, and the communication riding on top of the connection isn't interrupted. However, the way TCP does this is to move all its dst references off the interface that is unregistering, to the loopback interface. This works for the initial network namespace, where the loopback interface is always present and never removed. However, this doesn't work for non-default network namespaces - like containers - where the loopback interface is unregistered when the container is being removed. So this aspect of TCP may need to change, to correctly handle containers.  This also may not cover all causes of this hang, since sockets handle more than just tcp connections.

b) when a container is being removed, instead of holding the net_mutex while the cleaning up (and calling rtnl_unlock), it could release the net_mutex first (after removing all netns marked for cleanup from the pernet list), then call rtnl_unlock. This needs examination to make sure it would not introduce any races or other problems.

c) rtnl_unlock could be simplified - currently, it includes significant side-effects, which include the long delay while waiting to actually remove all references to the namespace's interfaces (including loopback). Instead of blocking rtnl_unlock() to do all this cleanup, the cleanup could be deferred. This also would need investigation to make sure no caller is expecting to be able to free resources that may be accessed later from the cleanup action (which I believe is the case).

As this is a complex problem there are likely other options to fix it as well.  This issue also has been around ever since namespaces were introduced to the kernel, as far as I can tell, but it's not a commonly seen issue because in most cases socket connections are shut down before stopping the container.  The socket connection causing the problem here is a kernel socket, which is different than normal userspace-created sockets; that may make a difference though I haven't investigated that angle yet.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-11-02:

#28

> Is there any way for a non-kernel-dev type to see exactly which resource
> unregister_netdevice is waiting for? (Or does it only keep track of usage count?)

not really, no. It's simply waiting for its reference count to drop to 0, and just broadcasts unregister events periodically hoping that everyone with a reference is listening and does whatever they need to do to drop their references. In this case the socket's dst actually is listening, but the way the dst handles it assumes that the loopback interface will never, ever unregister - the dst shifts its reference count from the interface going down, to its netns's loopback interface. If the interface going down is the netns's loopback interface, then that reference shift doesn't help, of course.

So no, there is no simple way for you to tell what it's waiting for.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-11-02:

#29

> That the startup of containers is delayed is annoying.
> The much bigger issue is that it can reproducible cause a kernel Oops and
> crash a whole machine.

Yes as you found this uncovers other bugs in the kernel, but those are separate and looks like they are getting fixed.

Revision history for this message

Fabian Holler (fh0) wrote on 2017-11-02:

#30

Ok, seems that I'm than in the wrong ticket with my issues. :-)
Should separate Ubuntu bug reports be created (if they don't exist already) regarding the kernel
crashes?

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-11-02:

#31

> Should separate Ubuntu bug reports be created (if they don't exist already) regarding
> the kernel crashes

This bug is only to fix the hang/delay in new container creation. Please open a new bug for kernel crashes.

Dan Streetman (ddstreet) on 2017-11-09

Changed in linux (Ubuntu Artful):
status:	Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
status:	New → In Progress
Changed in linux (Ubuntu Zesty):
status:	New → In Progress
Changed in linux (Ubuntu Xenial):
status:	New → In Progress
Changed in linux (Ubuntu Bionic):
importance:	Undecided → Medium
assignee:	nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Zesty):
importance:	Undecided → Medium
Changed in linux (Ubuntu Trusty):
assignee:	nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Xenial):
assignee:	nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Trusty):
status:	New → In Progress
Changed in linux (Ubuntu Zesty):
assignee:	nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Xenial):
importance:	Undecided → Medium
Changed in linux (Ubuntu Trusty):
importance:	Undecided → Medium

Revision history for this message

Jiri Horky (jiri-horky) wrote on 2017-11-27:

#32

his information might be relevant.

We are able to reproduce the problem with unregister_netdevice: waiting for lo to become free. Usage count = 1 with 4.14.0-rc3 kernel with CONFIG_PREEMPT_NONE=y and running only on one CPU with following boot kernel options:

BOOT_IMAGE=/boot/vmlinuz-4.14.0-rc3 root=/dev/mapper/vg0-root ro quiet vsyscall=emulate nosmp
Once we hit this state, it stays in this state and reboot is needed. No more containers can be spawned. We reproduce it by running images doing ipsec/openvpn connections + downloading a small file inside the tunnels. Then the instances exist (usually they run < 10s). We run 10s of such containers a minute on one machine. With the abovementioned settings (only 1cpu), the machine hits it in ~2 hours.

Another reproducer with the same kernel, but without limiting number of CPUs, is to jus run iperf in UDP mode for 3 seconds inside the container (so there is no TCP communication at all). If we run 10 of such containers in parallel, wait for all of them to finish and do it again, we hit the trouble in less than 10 minutes (on 40 cores machine).

In both of our reproducers, we added "ip route flush table all; ifconfig down; sleep 10" before existing from containers. It does not seem to have any effect.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-12-07:

#33

> In both of our reproducers, we added "ip route flush table all;

that won't help

> ifconfig down;

that's not a valid cmdline

> sleep 10" before existing from containers.

sleeping without closing existing open sockets doesn't help

what are you using for container mgmt? docker? lxc? lxd?

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-12-07:

#34

simpler reproducer for this using LXD instead of Docker:

1. install/setup LXD
$ sudo apt install lxd
$ sudo lxd init

2. create two containers
$ lxc launch ubuntu:xenial server
$ lxc launch ubuntu:xenial client

3. make the client privileged (so it can mount smb/cifs)
$ lxc config set client security.privileged true
$ lxc stop client
$ lxc start client

4. set up samba server in container
$ lxc exec server apt update
$ lxc exec server apt install cifs-utils samba
$ lxc exec server mkdir /dummy
$ lxc exec server touch /dummy/emptyfile
$ lxc exec server vim /etc/samba/smb.conf

add this section at the end of the container's smb.conf:

[dummy]
  path = /dummy
  browseable = yes
  read only = yes
  guest ok = yes

$ lxc stop server
$ lxc start server

5. note server container's ip
$ lxc list server

e.g.
| server | RUNNING | 10.109.201.68 (eth0)

6. setup client container
$ lxc exec client mkdir /dummy
$ lxc exec apt update
$ lxc exec apt install cifs-utils

7. stop client container
$ lxc stop client

8. full cycle of client container start through stop
$ lxc start client ; sleep 2 ; lxc exec client -- mount -o guest //10.109.201.68/dummy /dummy ; lxc exec client -- ls -l /dummy ; lxc exec client umount /dummy ; lxc stop -f client

9. repeat client cycle; it should hang due to this bug
$ lxc start client ; sleep 2 ; lxc exec client -- mount -o guest //10.109.201.68/dummy /dummy ; lxc exec client -- ls -l /dummy ; lxc exec client umount /dummy ; lxc stop -f client

Note this force-stops the container, instead of normally stopping it; if the container is allowed to exit normally, the problem isn't reproduced (with these steps, at least). Also note the 'sleep 2' is there to give enough time for the client container network to come up, it may need to be increased if your client container takes longer to assign its ip address (you can edit the client container nw config to use a static address if you want, which should be faster).

Revision history for this message

Stefan Rinkes (darinkes) wrote on 2017-12-13:

#35

Hi,

In a test-setup I can easily reproduce this issue very quickly.
At my scenario it looks like the issue is just triggered if I change the IP inside the docker container
(Needed to test StrongSwan's MOBIKE feature), send some data (ICMP) and then kill the docker container.

So if there is something to test or debug, I would be happy to help fixing this issue.

Also already tried the latest kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/:
Linux docker 4.15.0-041500rc3-generic #201712110230 SMP Mon Dec 11 02:31:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-12-13:

#36

> So if there is something to test or debug, I would be happy to help fixing this issue.

yes, I'm hoping to have a kernel build with some added debug to track/list exactly what socket(s) are still holding lo device references when the container is killed/shutdown, so testing with that would help clarify the specifics of this problem, once I have the debug kernel ready and built. I will update this bug once I have something ready to test.

> Also already tried the latest kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/:

and I assume that did not help, correct? This problem still exists upstream (assuming we're all encountering the same problem that I'm debugging) and will require an upstream patch.

Revision history for this message

Daniel Borkmann (dansel) wrote on 2017-12-20:

#37

Hi Dan,

the test kernel you were using had this one [1] included? It's upstream commit f186ce61bb82 ("Fix an intermittent pr_emerg warning about lo becoming free."). Meaning, it's a rather a dst leak somewhere in latest upstream as well?

Thanks,
Daniel

[1] https://patchwork.ozlabs.org/patch/773480/

Revision history for this message

Dan Streetman (ddstreet) wrote on 2017-12-20:

#38

> the test kernel you were using had this one [1] included? It's upstream commit f186ce61bb82

no that commit doesn't help, because the issue isn't one of dst refcount leaks or delayed dst gc, the problem is kernel sockets that haven't closed yet holding a reference to the netns lo interface device via their dsts.

I do think I have a possible fix for this though, and I have some test kernels building in this ppa (please note they are currently building and will take a few more hours to finish):
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

can anyone please test with any of those kernels (they all contain the patch to maybe fix this)? In my testing it eliminated the problem completely, but I'd like to get some more diverse real-world testing before starting an upstream discussion on the patch.

Revision history for this message

Stefan Rinkes (darinkes) wrote on 2017-12-20:

#39

Ok, waiting for the build to finish and then will test&report ASAP.

Revision history for this message

Stefan Rinkes (darinkes) wrote on 2017-12-20:

#40

$ sudo add-apt-repository ppa:ddstreet/lp1711407
Cannot add PPA: 'ppa:~ddstreet/ubuntu/lp1711407'.
ERROR: '~ddstreet' user or team does not exist.

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"

Dan Streetman (ddstreet) on 2018-04-12

Changed in linux (Ubuntu Zesty):
status:	In Progress → Won't Fix
assignee:	Dan Streetman (ddstreet) → nobody

Revision history for this message

Jiri Horky (jiri-horky) wrote on 2018-04-12:

#70

Hi Dan,

how should I read that it got to "Won't fix" state? No more time to debug it?

Thanks
Jirka H.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-04-12:

#71

Hi Jiri,

no the "Won't Fix" is only for the Zesty release, as that's EOL now.

I actually have a test kernel building in the ppa:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407/+packages

it's version 4.13.0-38.43+hf1711407v20180412b2; it hasn't finished building yet, probably will take another ~4 hours or so (building the kernel in ppas is sloooooow)

this patch isn't quite finished, I still have to redo the netns cleanup workqueue because the cleanup work can now hang (indefinitely) so it can't be a single-threaded workqueue with the same work queued. But, it should at least verify if it works around the container create/destroy hangs.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-04-13:

#72

Download full text (4.5 KiB)

Just to re-summarize this:

The problem here is when a net namespace is being cleaned up, the thread cleaning it up gets hung inside net/core/dev.c netdev_wait_allrefs() function. This is called every time anything calls rtnl_unlock(), which happens a lot, including during net namespace cleanup. This function waits for all netdev device references to be released.

The thing(s) holding dev references here are dst objects from sockets (at least, that's what is failing to release the dev refs, in my debugging). When a net namespace is cleaned up, all its sockets' call dst_dev_put() for their dsts, which moves the netdev refs from the actual net namespace interfaces to the netns loopback device (this is why the message is typically 'waiting for lo to become free', but sometimes it waits for non-lo to become free if the dst_dev_put hasn't been called for a dst).

The dsts are then put, which should reduce their ref count to 0 and free them, which then also reduces the lo device's ref count. In correct operation, this reduces the lo device's ref count to 0, which allows netdev_wait_allrefs() to complete.

This is where the problem arrives:

1. the first possibility a kernel socket that hasn't closed, and still has dst(s). This is possible because while normal (user) sockets take a reference to their net namespace, and the net namespace does not begin cleanup until all its references are released (meaning, no user sockets are still open), kernel sockets do not take a reference to their net namespace. So, net namespace cleanup begins while kernel sockets are still open. While the net namespace is cleaning up, it ifdown's all its interfaces, which cause the kernel sockets to close, and free their dsts which releases all netdev interface refs, which allows the net namespace to close. If, however, one of the kernel sockets doesn't pay attention to the net namespace cleaning up, it may hang around, failing to free its dsts, and failing to release the netdev interface reference(s). That causes this bug. In some cases (maybe most or all, I'm not sure) the socket eventually times out, and does release its dsts, allowing the net namespace to finish cleaning up and exiting.

My recent patch 4ee806d51176ba7b8ff1efd81f271d7252e03a1d fixes *one* case of this happening - a TCP kernel socket remaining open, trying in vain to complete its TCP FIN sequence (which will never complete since all net namespace interfaces are down and will never come back up, since the net namespace is cleaning up). The TCP socket eventually would time out the FIN sequence failure, and close, allowing the net namespace to finish cleaning up, but it takes ~2 minutes or so. There very well could be more cases of this happening with other kernel sockets.

2. alternately, even if all kernel sockets behave correctly and cleanup immediately during net namespace cleanup, it's possible for a dst object to leak a reference. This has happened before, there are likely existing dst leaks, and there will likely be dst leaks accidentally introduced in the future. When a dst leaks, its refcount never reaches 0, and so it never releases its reference on its interface (which remember, is ...

Just to re-summarize this:

The problem here is when a net namespace is being cleaned up, the thread cleaning it up gets hung inside net/core/dev.c netdev_wait_allrefs() function.  This is called every time anything calls rtnl_unlock(), which happens a lot, including during net namespace cleanup.  This function waits for all netdev device references to be released.

The thing(s) holding dev references here are dst objects from sockets (at least, that's what is failing to release the dev refs, in my debugging).  When a net namespace is cleaned up, all its sockets' call dst_dev_put() for their dsts, which moves the netdev refs from the actual net namespace interfaces to the netns loopback device (this is why the message is typically 'waiting for lo to become free', but sometimes it waits for non-lo to become free if the dst_dev_put hasn't been called for a dst).

The dsts are then put, which should reduce their ref count to 0 and free them, which then also reduces the lo device's ref count.  In correct operation, this reduces the lo device's ref count to 0, which allows netdev_wait_allrefs() to complete.

This is where the problem arrives:

1. the first possibility a kernel socket that hasn't closed, and still has dst(s).  This is possible because while normal (user) sockets take a reference to their net namespace, and the net namespace does not begin cleanup until all its references are released (meaning, no user sockets are still open), kernel sockets do not take a reference to their net namespace.  So, net namespace cleanup begins while kernel sockets are still open.  While the net namespace is cleaning up, it ifdown's all its interfaces, which cause the kernel sockets to close, and free their dsts which releases all netdev interface refs, which allows the net namespace to close.  If, however, one of the kernel sockets doesn't pay attention to the net namespace cleaning up, it may hang around, failing to free its dsts, and failing to release the netdev interface reference(s).  That causes this bug.  In some cases (maybe most or all, I'm not sure) the socket eventually times out, and does release its dsts, allowing the net namespace to finish cleaning up and exiting.

My recent patch 4ee806d51176ba7b8ff1efd81f271d7252e03a1d fixes *one* case of this happening - a TCP kernel socket remaining open, trying in vain to complete its TCP FIN sequence (which will never complete since all net namespace interfaces are down and will never come back up, since the net namespace is cleaning up).  The TCP socket eventually would time out the FIN sequence failure, and close, allowing the net namespace to finish cleaning up, but it takes ~2 minutes or so.  There very well could be more cases of this happening with other kernel sockets.

2. alternately, even if all kernel sockets behave correctly and cleanup immediately during net namespace cleanup, it's possible for a dst object to leak a reference.  This has happened before, there are likely existing dst leaks, and there will likely be dst leaks accidentally introduced in the future.  When a dst leaks, its refcount never reaches 0, and so it never releases its reference on its interface (which remember, is moved to the net namespace loopback interface when its socket is closing).  Unfortunately, a dst that has leaked a reference will never free itself, and will never release its reference to the loopback interface.  Thus, netdev_wait_allrefs() will hang forever.

Actually fixing either cause above is the correct thing to do, of course, but that's a long-term thing, as even if all causes currently in the kernel are fixed, more leaks probably will be accidentally introduced in the future.  And while mem leaks are not good, the more serious problem caused by this is that the netdev_wait_allrefs() hang happens while the thread is holding the global net mutex - that prevents the creation of any new net namespaces, and prevents the cleanup of any existing net namespaces (it also blocks registering/unregistering any pernet subsystem or pernet device).  This can render the entire system unusable, and requires a reboot to correct.

Starting from the assumption that it's unsafe to ever ignore the interface's refcount, and just free it anyway after some time period, I have to allow that when a dst leaks, its net namespace will also have to leak, since the dst will always hold a reference to the net namespace loopback (or other) interface.  Given that, what we need to do is move the call to netdev_wait_allrefs() that is hanging, to outside of the net mutex locked section.  I'm working on a patch to do just that currently.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-04-13:

#73

Jiri,

I just added a new artful version to the ppa, version 4.13.0-38.43+hf1711407v20180413b1; it's building right now. It should avoid the hang just like the last version from yesterday; but the last one blocked freeing any future net namespaces after one hangs; this one fixes that so even after a net namespace hangs, future ones will correctly clean up and get freed.

I also added debug to this new build that will show each net namespace alloc and free, like:

[ 207.625857] net_namespace: alloced net ffff8d2eb3cae000
...
[ 208.152882] net_namespace: freeing net ffff8d2eb3cae000

and it has the same dbg during reproduction of this bug, where it hangs, like:

[ 1363.377355] unregister_netdevice: possible namespace leak (netns ffff96ddf1c10000), have been waiting for 131032 seconds
[ 1363.377393] (netns ffff96ddf1c10000): open sk ffff96ddf1e77380 family 2 state 4 dst_cache (null) rx_dst ffff96ddf532d800 creator generic_ip_connect+0x257/0x4f0 [cifs]
[ 1363.377406] (netns ffff96ddf1c10000): dst ffff96ddf532d800 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_input_rcu+0x6b3/0xd80

If you're able to test with today's build, let me know if you are able to reproduce the problem (see the 'namespace leak' message) and if it has worked around the container hang correctly. Thanks!

Revision history for this message

Jiri Horky (jiri-horky) wrote on 2018-04-13:

#74

Hi there, great to hear the progress!

I have just reproduced it using the previous #43+hf1711407v20180412b2" kernel:

Apr 13 19:12:14 prg41-004 kernel: [13621.398319] unregister_netdevice: likely namespace leak (netns ffff8c7a056f0000), have been waiting for 1973452 seconds
Apr 13 19:12:24 prg41-004 kernel: [13631.478266] unregister_netdevice: likely namespace leak (netns ffff8c7a056f0000), have been waiting for 1983532 seconds

The seconds seems to be rather milliseconds.

Waiting here for the newest kernel and will try it right away.

Jiri Horky

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-04-13:

#75

> I have just reproduced it using the previous #43+hf1711407v20180412b2" kernel:

Excellent. And just to clarify again, I do *not* expect the kernel to fix/avoid the problem - I fully expect it to happen - the change in these kernels is strictly to make sure that the system is still usable after the problem happens. Can you verify that after reproducing? I assume that before, when you reproduced the problem, you were no longer able to create new (or destroy old) containers (i.e., new net namespaces) - that should now be fixed, and you should be able to keep creating new containers with no issues.

> The seconds seems to be rather milliseconds.

yep sorry, kernel building now should fix that to print seconds instead of ms.

> Waiting here for the newest kernel and will try it right away.

thanks!

Revision history for this message

Jiri Horky (jiri-horky) wrote on 2018-04-14:

#76

Download full text (3.6 KiB)

Hi Dan,

so here are the information for the latest kernel (4.13.0-38.43+hf1711407v20180413b1):

The bug got reproduced:

[ 1085.626663] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 60 seconds
[ 1085.626696] (netns ffff8d2dc3d34800): dst ffff8d2dd006db00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0

If I grep the netns: ffff8d2dc3d34800, it is visible that during first 10 minutes, it outputs more info, then until ~ 1 hour, it says the same message more frequently and then only outputs it ever hour.

[ 901.854919] net_namespace: alloced net ffff8d2dc3d34800
[ 1085.626663] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 60 seconds
[ 1085.626696] (netns ffff8d2dc3d34800): dst ffff8d2dd006db00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0
[ 1095.706659] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 70 seconds
[ 1095.706692] (netns ffff8d2dc3d34800): dst ffff8d2dd006db00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0
[ 1105.786651] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 80 seconds
[ 1105.786669] (netns ffff8d2dc3d34800): dst ffff8d2dd006db00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0
[ 1115.866654] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 90 seconds
....
[ 1631.003906] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 605 seconds
[ 1641.147950] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 616 seconds
[ 1651.291981] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 626 seconds
[ 1661.372025] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 636 seconds
[ 1671.452071] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 646 seconds
[ 1681.564110] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 656 seconds
...

[ 4628.301622] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 3603 seconds
[ 8252.383668] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 7227 seconds
[11877.333159] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 10852 seconds
[15502.281254] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 14477 seconds
[19127.230600] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 18102 seconds
[22752.179625] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 21727 seconds
[26377.129256] unregister_netdevice: namespace leak (netns ffff8d2dc3d3...

Hi Dan,

so here are the information for the latest kernel (4.13.0-38.43+hf1711407v20180413b1):

The bug got reproduced:

[ 1085.626663] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 60 seconds
[ 1085.626696]  (netns ffff8d2dc3d34800): dst ffff8d2dd006db00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0

If I grep the netns: ffff8d2dc3d34800, it is visible that during first 10 minutes, it outputs more info, then until ~ 1 hour, it says the same message more frequently and then only outputs it ever hour.

[  901.854919] net_namespace: alloced net ffff8d2dc3d34800
[ 1085.626663] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 60 seconds
[ 1085.626696] 	(netns ffff8d2dc3d34800): dst ffff8d2dd006db00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0
[ 1095.706659] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 70 seconds
[ 1095.706692] 	(netns ffff8d2dc3d34800): dst ffff8d2dd006db00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0
[ 1105.786651] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 80 seconds
[ 1105.786669] 	(netns ffff8d2dc3d34800): dst ffff8d2dd006db00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0
[ 1115.866654] unregister_netdevice: possible namespace leak (netns ffff8d2dc3d34800), have been waiting for 90 seconds
....
[ 1631.003906] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 605 seconds
[ 1641.147950] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 616 seconds
[ 1651.291981] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 626 seconds
[ 1661.372025] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 636 seconds
[ 1671.452071] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 646 seconds
[ 1681.564110] unregister_netdevice: likely namespace leak (netns ffff8d2dc3d34800), have been waiting for 656 seconds
...

[ 4628.301622] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 3603 seconds
[ 8252.383668] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 7227 seconds
[11877.333159] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 10852 seconds
[15502.281254] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 14477 seconds
[19127.230600] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 18102 seconds
[22752.179625] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 21727 seconds
[26377.129256] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 25352 seconds
[30002.078342] unregister_netdevice: namespace leak (netns ffff8d2dc3d34800), have been waiting for 28977 seconds

The machine continue to run and the docker is still able to create new instances, i.e. there is no complete lock out.

The machine continued to run further, and after an ~hour , 3 namespaces leaked. The machine now has load "3", which seems to be an artifact of this. Other than that, it seems to continue to run!

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-04-16:

#77

> The machine continue to run and the docker is still able to create new instances,
> i.e. there is no complete lock out.

Great!

> The machine continued to run further, and after an ~hour , 3 namespaces leaked.
> The machine now has load "3", which seems to be an artifact of this. Other than that,
> it seems to continue to run!

Hmm, can you check 'top' or 'htop' or similar to see if the kworker thread(s) are actually using cpu time? They should be sleeping almost all the time, not spinning on anything.

I'll work on getting that patch upstream (but likely not nearly as verbose, probably just a single WARNING for leaked namespace).

After that, I'll keep working on adding dst leak debugging to actually try to find where your dst leak is coming from.

Revision history for this message

Jiri Horky (jiri-horky) wrote on 2018-04-18:

#78

Hi Dan,

the CPU usage is zero, there is nothing popping up. The tasks causing the high load value are in D state. If I do "echo w > /proc/sysrq-trigger", I will get for every leaked NS:

[392149.562095] kworker/u81:40 D 0 4290 2 0x80000000
[392149.562112] Workqueue: netns cleanup_net
[392149.562125] Call Trace:
[392149.562135] __schedule+0x291/0x8a0
[392149.562146] schedule+0x2c/0x80
[392149.562157] schedule_timeout+0x182/0x360
[392149.562167] ? call_timer_fn+0x130/0x130
[392149.562177] msleep+0x2d/0x40
[392149.562188] ? msleep+0x2d/0x40
[392149.562199] netdev_run_todo+0x15e/0x5d0
[392149.562210] rtnl_unlock+0xe/0x10
[392149.562221] default_device_exit_batch+0x156/0x180
[392149.562233] ? do_wait_intr_irq+0x90/0x90
[392149.562245] cleanup_net+0x14e/0x2a0
[392149.562257] process_one_work+0x1ec/0x410
[392149.562270] worker_thread+0x32/0x410
[392149.562281] kthread+0x128/0x140
[392149.562292] ? process_one_work+0x410/0x410
[392149.562300] ? kthread_create_on_node+0x70/0x70
[392149.562305] ? do_syscall_64+0x67/0x130
[392149.562310] ? SyS_exit_group+0x14/0x20
[392149.562316] ret_from_fork+0x35/0x40

I am not sure how fine is to have e.g. thousands of such leaked namespaces - if I can't run out of some resources (kworker threads?). I am going to keep the machine running + generate even higher load there to see if I hit any such limit.

Thanks
JH

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-04-23:

#79

> the CPU usage is zero, there is nothing popping up. The tasks causing the high load
> value are in D state

ok that seems expected then. This doesn't 'fix' the problem of the leaked dst (and namespace), just prevents blocking further net namespace creation/deletion.

> if I can't run out of some resources (kworker threads?)

there is a limit, yes, and once it's hit namespace cleanup will essentially stop, and all further net namespaces will be leaked. However, there's really just no way to avoid that without fixing the actual dst leaks, which I'd still like to try to add debug to help identify.

I've been busy elsewhere last week but I'll prepare the patch to send upstream this week, and then look at adding debug to actually find the dst leak.

Thanks!

Revision history for this message

Gonzalo Servat (gservat) wrote on 2018-04-29:

#80

Hi Dan,

We run a bunch of instances both on AWS and GCP, and we run a significant number of containers on both. We've only ever seen this problem on GCP and never on AWS (it's baffling!). The kernels are as-close-as-possible and the rest (Ubuntu version / Docker version / etc) are identical. To answer your questions:

-what release are you using (trusty/xenial/zesty)?

Xenial

-what kernel version are you using?

4.13.0-1008-gcp

-do you have specific steps to reproduce the problem?

Unfortunately we don't. It just happens on its own after 1/2 weeks. The following shows in dmesg:

[1015401.681728] unregister_netdevice: waiting for veth251ecfe to become free. Usage count = 1
[1015411.761772] unregister_netdevice: waiting for veth251ecfe to become free. Usage count = 1
[1015421.841740] unregister_netdevice: waiting for veth251ecfe to become free. Usage count = 1
[1015431.953729] unregister_netdevice: waiting for veth251ecfe to become free. Usage count = 1

etc etc

We've tried a few things including upgrading kernel and disabling IPv6 to no avail.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-04-30:

#81

Hi @gservat,

> To answer your questions:

There's actually been quite a lot of discussion/work in this bug since comment 2, so I don't actually need that info anymore (except for reproduction steps, that's always welcome).

> It just happens on its own after 1/2 weeks. The following shows in dmesg:

Are you having any problems with the system *other* than just the log messages?

Revision history for this message

Gonzalo Servat (gservat) wrote on 2018-04-30:

#82

Hi Dan,

> There's actually been quite a lot of discussion/work in this bug since comment 2, so I don't
> actually need that info anymore (except for reproduction steps, that's always welcome).

Fair enough. I wish I had some reproduction steps but unfortunately it just happens randomly and not because of any one action.

> Are you having any problems with the system *other* than just the log messages?

Apologies here. I should have mentioned how it affects us. The symptoms seen are a number of docker commands that just hang indefinitely (e.g. docker ps, spinning up new containers). Occasionally, after a long time, we have seen it come back and continue working OK (like whatever was holding the lock released it finally), but most of the time it hangs indefinitely. On occasion we can also use the -n parameter (e.g. docker ps -n #) and it may or may not return a partial listing of running containers. We've also found (again, not always) that we might be able to stop a container if we use the container name or ID. When we strace the process, we just see:

write(5, "GET /v1.24/containers/json?limit"..., 97) = 97
futex(0xc8200fa590, FUTEX_WAKE, 1) = 1
futex(0x21fce50, FUTEX_WAIT, 0, NULL

... and that's as far as it goes. I do have a Docker stacktrace if you want to see that.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-04-30:

#83

> The symptoms seen are a number of docker commands that just hang indefinitely (e.g.
> docker ps, spinning up new containers). Occasionally, after a long time, we have
> seen it come back and continue working OK (like whatever was holding the lock
> released it finally), but most of the time it hangs indefinitely

ok, great. You didn't mention what release you're using on gcp, but I have a 4.13 test kernel in this ppa, with patches to specifically fix/workaround the hanging problem:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

Note that *only* version 4.13.0-38.43+hf1711407v20180413b1 in that ppa has the patches to workaround the hanging. If you need a different version, let me know and I can apply the patches and build for whatever release you need (just give me the 'uname -a' output).

The details of the cause of this problem is back in comment 72, if you want that info. I'm preparing to send the workaround patch upstream; after that, there will be (likely ongoing) work to debug the various dst/object leaks and kernel socket "leaks" that are actually causing the symptoms.

Revision history for this message

Gonzalo Servat (gservat) wrote on 2018-05-07: Re: [Bug 1711407] Re: unregister_netdevice: waiting for lo to become free

#84

Hi Dan,

We're using a newer kernel:

4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 x86_64 x86_64
x86_64 GNU/Linux

Would you be able to build a test kernel against this version?

Thanks very much, Dan.

On Tue, May 1, 2018 at 12:14 AM, Dan Streetman <email address hidden>
wrote:

> > The symptoms seen are a number of docker commands that just hang
> indefinitely (e.g.
> > docker ps, spinning up new containers). Occasionally, after a long time,
> we have
> > seen it come back and continue working OK (like whatever was holding the
> lock
> > released it finally), but most of the time it hangs indefinitely
>
> ok, great. You didn't mention what release you're using on gcp, but I
> have a 4.13 test kernel in this ppa, with patches to specifically
> fix/workaround the hanging problem:
> https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407
>
> Note that *only* version 4.13.0-38.43+hf1711407v20180413b1 in that ppa
> has the patches to workaround the hanging. If you need a different
> version, let me know and I can apply the patches and build for whatever
> release you need (just give me the 'uname -a' output).
>
> The details of the cause of this problem is back in comment 72, if you
> want that info. I'm preparing to send the workaround patch upstream;
> after that, there will be (likely ongoing) work to debug the various
> dst/object leaks and kernel socket "leaks" that are actually causing the
> symptoms.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1711407
>
> Title:
> unregister_netdevice: waiting for lo to become free
>
> Status in linux package in Ubuntu:
> In Progress
> Status in linux source package in Trusty:
> In Progress
> Status in linux source package in Xenial:
> In Progress
> Status in linux source package in Zesty:
> Won't Fix
> Status in linux source package in Artful:
> In Progress
> Status in linux source package in Bionic:
> In Progress
>
> Bug description:
> This is a "continuation" of bug 1403152, as that bug has been marked
> "fix released" and recent reports of failure may (or may not) be a new
> bug. Any further reports of the problem should please be reported
> here instead of that bug.
>
> --
>
> [Impact]
>
> When shutting down and starting containers the container network
> namespace may experience a dst reference counting leak which results
> in this message repeated in the logs:
>
> unregister_netdevice: waiting for lo to become free. Usage count =
> 1
>
> This can cause issues when trying to create net network namespace and
> thus block a user from creating new containers.
>
> [Test Case]
>
> See comment 16, reproducer provided at https://github.com/fho/docker-
> samba-loop
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/
> 1711407/+subscriptions
>

Hi Dan,

We're using a newer kernel:

4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 x86_64 x86_64
x86_64 GNU/Linux

Would you be able to build a test kernel against this version?

Thanks very much, Dan.

On Tue, May 1, 2018 at 12:14 AM, Dan Streetman <dan.streetman@canonical.com>
wrote:

> > The symptoms seen are a number of docker commands that just hang
> indefinitely (e.g.
> > docker ps, spinning up new containers). Occasionally, after a long time,
> we have
> > seen it come back and continue working OK (like whatever was holding the
> lock
> > released it finally), but most of the time it hangs indefinitely
>
> ok, great.  You didn't mention what release you're using on gcp, but I
> have a 4.13 test kernel in this ppa, with patches to specifically
> fix/workaround the hanging problem:
> https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407
>
> Note that *only* version 4.13.0-38.43+hf1711407v20180413b1 in that ppa
> has the patches to workaround the hanging.  If you need a different
> version, let me know and I can apply the patches and build for whatever
> release you need (just give me the 'uname -a' output).
>
> The details of the cause of this problem is back in comment 72, if you
> want that info.  I'm preparing to send the workaround patch upstream;
> after that, there will be (likely ongoing) work to debug the various
> dst/object leaks and kernel socket "leaks" that are actually causing the
> symptoms.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1711407
>
> Title:
>   unregister_netdevice: waiting for lo to become free
>
> Status in linux package in Ubuntu:
>   In Progress
> Status in linux source package in Trusty:
>   In Progress
> Status in linux source package in Xenial:
>   In Progress
> Status in linux source package in Zesty:
>   Won't Fix
> Status in linux source package in Artful:
>   In Progress
> Status in linux source package in Bionic:
>   In Progress
>
> Bug description:
>   This is a "continuation" of bug 1403152, as that bug has been marked
>   "fix released" and recent reports of failure may (or may not) be a new
>   bug.  Any further reports of the problem should please be reported
>   here instead of that bug.
>
>   --
>
>   [Impact]
>
>   When shutting down and starting containers the container network
>   namespace may experience a dst reference counting leak which results
>   in this message repeated in the logs:
>
>       unregister_netdevice: waiting for lo to become free. Usage count =
>   1
>
>   This can cause issues when trying to create net network namespace and
>   thus block a user from creating new containers.
>
>   [Test Case]
>
>   See comment 16, reproducer provided at https://github.com/fho/docker-
>   samba-loop
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/
> 1711407/+subscriptions
>

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-05-18:

#85

> Would you be able to build a test kernel against this version?

very sorry, i've been quite busy the last couple weeks. I'll get a test kernel built for you next week.

Revision history for this message

Gonzalo Servat (gservat) wrote on 2018-05-19:

#86

Thanks for the ping, Dan. No worries. Looking forward to it as this issue
seems to be biting us more frequently nowadays so really interested to see
if this kernel helps.

On Sat, May 19, 2018 at 7:19 AM, Dan Streetman <email address hidden>
wrote:

> > Would you be able to build a test kernel against this version?
>
> very sorry, i've been quite busy the last couple weeks. I'll get a test
> kernel built for you next week.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1711407
>
> Title:
> unregister_netdevice: waiting for lo to become free
>
> Status in linux package in Ubuntu:
> In Progress
> Status in linux source package in Trusty:
> In Progress
> Status in linux source package in Xenial:
> In Progress
> Status in linux source package in Zesty:
> Won't Fix
> Status in linux source package in Artful:
> In Progress
> Status in linux source package in Bionic:
> In Progress
>
> Bug description:
> This is a "continuation" of bug 1403152, as that bug has been marked
> "fix released" and recent reports of failure may (or may not) be a new
> bug. Any further reports of the problem should please be reported
> here instead of that bug.
>
> --
>
> [Impact]
>
> When shutting down and starting containers the container network
> namespace may experience a dst reference counting leak which results
> in this message repeated in the logs:
>
> unregister_netdevice: waiting for lo to become free. Usage count =
> 1
>
> This can cause issues when trying to create net network namespace and
> thus block a user from creating new containers.
>
> [Test Case]
>
> See comment 16, reproducer provided at https://github.com/fho/docker-
> samba-loop
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/
> 1711407/+subscriptions
>

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-05-25:

#87

> We're using a newer kernel:
> 4.4.0-109-generic

ok, i backported the workaround to the xenial 4.4 kernel, and built it at the same ppa:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

it's version 4.4.0-127.153+hf1711407v20180524b3

can you test with that and let me know if it works around the hangs? Note that it is still expected to show errors in the logs when the problem is reproduced; this workaround is strictly to avoid internal hangs when the error happens, so you should be able to still create/destroy containers (net namespaces).

Revision history for this message

Gonzalo Servat (gservat) wrote on 2018-05-28:

#88

Many thanks, Dan! Going to roll it out to a few test boxes and try it out over the next few days. Will report back as soon as I have some feedback.

Revision history for this message

lirongqing (lirongqing) wrote on 2018-05-30:

#89

I have the same issue, and I dump the vmcore, and find dst cache, hope it has some help.
this leaked dst is in dst_busy_list, except dst_busy_list, nowhere I can find it in
==============================
First case:

crash> rtable 0xffff880036fbba00 -x
struct rtable {
  dst = {
    callback_head = {
      next = 0xffff880036fbb300,
      func = 0x0
    },
    child = 0x0,
    dev = 0xffff8817f15c3000,
    ops = 0xffffffff8191f2c0 <ipv4_dst_ops>,
    _metrics = 0xffffffff815ef7d1,
    expires = 0x0,
    path = 0xffff880036fbba00,
    from = 0x0,
    xfrm = 0x0,
    input = 0xffffffff8149f777 <dst_discard>,
    {
      output = 0xffffffff8149f762 <dst_discard_sk>,
      __UNIQUE_ID_rh_kabi_hide29 = {
        output = 0xffffffff8149f762 <dst_discard_sk>
      },
      {<No data fields>}
    },
    flags = 0x0,
    pending_confirm = 0x0,
    error = 0x0,
    obsolete = 0x2,
    header_len = 0x0,
    trailer_len = 0x0,
    tclassid = 0x0,
    __pad_to_align_refcnt = {0xe, 0x8006000000000},
    __refcnt = {
      counter = 0x1
    },
    __use = 0x0,
    lastuse = 0x1000860b0,
    {
      next = 0xffff882fdaaa9800,
      rt_next = 0xffff882fdaaa9800,
      rt6_next = 0xffff882fdaaa9800,
      dn_next = 0xffff882fdaaa9800
    },
    rh_reserved1 = 0x2,
    rh_reserved2 = 0x85987845,
    rh_reserved3 = 0x0,
    rh_reserved4 = 0x0
  },
  rt_genid = 0x11,
  rt_flags = 0x80000000,
  rt_type = 0x2,
  rt_is_input = 0x1,
  rt_uses_gateway = 0x0,
  rt_iif = 0x0,
  rt_gateway = 0x0,
  rt_pmtu = 0x0,
  rt_uncached = {
    next = 0xffff880036fbbac0,
    prev = 0xffff880036fbbac0
  }
}
crash>
=========================
seconds case:
crash> rtable ffff8807ec0d9a00 -x
struct rtable {
  dst = {
    callback_head = {
      next = 0xffff8807ec0db900,
      func = 0xffffffff8119e250 <file_free_rcu>
    },
    child = 0x0,
    dev = 0xffff8817ed384800,
    ops = 0xffffffff81b18bc0 <ipv4_dst_ops>,
    _metrics = 0xffffffff817ca681,
    expires = 0x0,
    path = 0xffff8807ec0d9a00,
    from = 0x0,
    xfrm = 0x0,
    input = 0xffffffff8162b200 <dst_discard>,
    {
      output = 0xffffffff8162b1e0 <dst_discard_sk>,
      __UNIQUE_ID_rh_kabi_hide29 = {
        output = 0xffffffff8162b1e0 <dst_discard_sk>
      },
      {<No data fields>}
    },
    flags = 0x6,
    pending_confirm = 0x0,
    error = 0x0,
    obsolete = 0x2,
    header_len = 0x0,
    trailer_len = 0x0,
    tclassid = 0x0,
    __pad_to_align_refcnt = {0xffff880f6df8fec0, 0x0},
    __refcnt = {
      counter = 0x1
    },
    __use = 0x0,
    lastuse = 0x1fcce9056,
    {
      next = 0xffff880404f49400,
      rt_next = 0xffff880404f49400,
      rt6_next = 0xffff880404f49400,
      dn_next = 0xffff880404f49400
    },
    rh_reserved1 = 0x0,
    rh_reserved2 = 0x0,
    rh_reserved3 = 0x0,
    rh_reserved4 = 0x0
  },
  rt_genid = 0x16,
  rt_flags = 0x80000000,
  rt_type = 0x2,
  rt_is_input = 0x0,
  rt_uses_gateway = 0x0,
  rt_iif = 0x0,
  rt_gateway = 0x0,
  rt_pmtu = 0x0,
  rt_uncached = {
    next = 0xffff8807ec0d9ac0,
    prev = 0xffff8807ec0d9ac0
  }
}
crash>

I have the same issue, and I dump the vmcore, and find dst cache, hope it has some help.
this leaked dst is in dst_busy_list, except dst_busy_list, nowhere I can find it in
==============================
First case:

crash> rtable 0xffff880036fbba00 -x
struct rtable {
  dst = {
    callback_head = {
      next = 0xffff880036fbb300, 
      func = 0x0
    }, 
    child = 0x0, 
    dev = 0xffff8817f15c3000, 
    ops = 0xffffffff8191f2c0 <ipv4_dst_ops>, 
    _metrics = 0xffffffff815ef7d1, 
    expires = 0x0, 
    path = 0xffff880036fbba00, 
    from = 0x0, 
    xfrm = 0x0, 
    input = 0xffffffff8149f777 <dst_discard>, 
    {
      output = 0xffffffff8149f762 <dst_discard_sk>, 
      __UNIQUE_ID_rh_kabi_hide29 = {
        output = 0xffffffff8149f762 <dst_discard_sk>
      }, 
      {<No data fields>}
    }, 
    flags = 0x0, 
    pending_confirm = 0x0, 
    error = 0x0, 
    obsolete = 0x2, 
    header_len = 0x0, 
    trailer_len = 0x0, 
    tclassid = 0x0, 
    __pad_to_align_refcnt = {0xe, 0x8006000000000}, 
    __refcnt = {
      counter = 0x1
    }, 
    __use = 0x0, 
    lastuse = 0x1000860b0, 
    {
      next = 0xffff882fdaaa9800, 
      rt_next = 0xffff882fdaaa9800, 
      rt6_next = 0xffff882fdaaa9800, 
      dn_next = 0xffff882fdaaa9800
    }, 
    rh_reserved1 = 0x2, 
    rh_reserved2 = 0x85987845, 
    rh_reserved3 = 0x0, 
    rh_reserved4 = 0x0
  }, 
  rt_genid = 0x11, 
  rt_flags = 0x80000000, 
  rt_type = 0x2, 
  rt_is_input = 0x1, 
  rt_uses_gateway = 0x0, 
  rt_iif = 0x0, 
  rt_gateway = 0x0, 
  rt_pmtu = 0x0, 
  rt_uncached = {
    next = 0xffff880036fbbac0, 
    prev = 0xffff880036fbbac0
  }
}
crash> 
=========================
seconds case:
crash> rtable  ffff8807ec0d9a00 -x
struct rtable {
  dst = {
    callback_head = {
      next = 0xffff8807ec0db900, 
      func = 0xffffffff8119e250 <file_free_rcu>
    }, 
    child = 0x0, 
    dev = 0xffff8817ed384800, 
    ops = 0xffffffff81b18bc0 <ipv4_dst_ops>, 
    _metrics = 0xffffffff817ca681, 
    expires = 0x0, 
    path = 0xffff8807ec0d9a00, 
    from = 0x0, 
    xfrm = 0x0, 
    input = 0xffffffff8162b200 <dst_discard>, 
    {
      output = 0xffffffff8162b1e0 <dst_discard_sk>, 
      __UNIQUE_ID_rh_kabi_hide29 = {
        output = 0xffffffff8162b1e0 <dst_discard_sk>
      }, 
      {<No data fields>}
    }, 
    flags = 0x6, 
    pending_confirm = 0x0, 
    error = 0x0, 
    obsolete = 0x2, 
    header_len = 0x0, 
    trailer_len = 0x0, 
    tclassid = 0x0, 
    __pad_to_align_refcnt = {0xffff880f6df8fec0, 0x0}, 
    __refcnt = {
      counter = 0x1
    }, 
    __use = 0x0, 
    lastuse = 0x1fcce9056, 
    {
      next = 0xffff880404f49400, 
      rt_next = 0xffff880404f49400, 
      rt6_next = 0xffff880404f49400, 
      dn_next = 0xffff880404f49400
    }, 
    rh_reserved1 = 0x0, 
    rh_reserved2 = 0x0, 
    rh_reserved3 = 0x0, 
    rh_reserved4 = 0x0
  }, 
  rt_genid = 0x16, 
  rt_flags = 0x80000000, 
  rt_type = 0x2, 
  rt_is_input = 0x0, 
  rt_uses_gateway = 0x0, 
  rt_iif = 0x0, 
  rt_gateway = 0x0, 
  rt_pmtu = 0x0, 
  rt_uncached = {
    next = 0xffff8807ec0d9ac0, 
    prev = 0xffff8807ec0d9ac0
  }
}
crash>

Revision history for this message

lirongqing (lirongqing) wrote on 2018-05-30:

#90

is it possibility that sk->sk_dst_cache is overwritten? like in __sk_dst_check,
when tcp timer tries to resend a packet, at the same time, tcp_close is called, and a reset packet will send, and ip_queue_xmit will be called concurrent；

cpu 1 cpu 2
tcp_close
   tcp_send_active_reset
      ip_queue_xmit
         __sk_dst_check {
            dst = __sk_dst_get(sk);
                                             tcp timer fire
                                             tcp_retransmit_timer
                                                   __tcp_retransmit_skb
                                                      ip_queue_xmit
                                                         sk_setup_caps
                                                            sk->sk_dst_cache is changed, like dst222

            RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
                                                            dst222 is leaked
            dst_release(dst);

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-06-04:

#91

> is it possibility that sk->sk_dst_cache is overwritten?

maybe ;-)

> like in __sk_dst_check,
> when tcp timer tries to resend a packet, at the same time, tcp_close is called

tcp_close() locks the sock; tcp_retransmit_timer() is also called with the sock locked. So they should not be operating on the same sock concurrently.

Revision history for this message

Dan Streetman (ddstreet) wrote on 2018-06-04:

#92

> So they should not be operating on the same sock concurrently.

But, of course, all this is caused by bugs, so yes it's certainly possible a bug allowed what you said to happen, but you'll need to dig in deeper to find where such a bug is (if any), since the locking design is correct as far as I can tell.

Revision history for this message

Andy Whitcroft (apw) wrote on 2018-07-24: Closing unsupported series nomination.

#94

This bug was nominated against a series that is no longer supported, ie artful. The bug task representing the artful nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Artful):
status:	In Progress → Won't Fix

Revision history for this message

TomaszChmielewski (mangoo-wpkg) wrote on 2018-08-01:

#95

I'm hitting this bug frequently on AWS with Ubuntu 16.04.5 LTS:

Linux uni02.sys.timedoctor.com 4.4.0-1063-aws #72-Ubuntu SMP Fri Jul 13 07:23:34 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message

TomaszChmielewski (mangoo-wpkg) wrote on 2018-08-05:

#96

Also seeing it on Ubuntu 18.04 on AWS:

# uname -a
Linux uni09.sys.timedoctor.com 4.15.0-1016-aws #16-Ubuntu SMP Wed Jul 18 09:20:54 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message

Gonzalo Servat (gservat) wrote on 2018-08-08:

#97

Dan,

We had been running 80-100 instances on the hotfix kernel and I was calling victory on the workaround as it has been a month without this issue popping up, but I just had an instance lockup on, say, "docker ps" and dmesg shows:

[2924480.806202] unregister_netdevice: waiting for vethd078a31 (netns ffffffff81efab00) to become free. Usage count = 2
[2924490.870209] unregister_netdevice: waiting for vethd078a31 (netns ffffffff81efab00) to become free. Usage count = 2
[2924501.006177] unregister_netdevice: waiting for vethd078a31 (netns ffffffff81efab00) to become free. Usage count = 2
[2924511.134204] unregister_netdevice: waiting for vethd078a31 (netns ffffffff81efab00) to become free. Usage count = 2
[2924521.218189] unregister_netdevice: waiting for vethd078a31 (netns ffffffff81efab00) to become free. Usage count = 2

:-(

Revision history for this message

Gonzalo Servat (gservat) wrote on 2018-08-08:

#98

Using:

linux-image-4.4.0-127-generic 4.4.0-127.153+hf1711407v20180524b3

Revision history for this message

Dan Streetman (ddstreet) wrote on 2019-04-04:

#99

unfortunately I've moved away from kernel work, for now at least, and no longer have time to continue this bug; if anyone else would like to pick up the work, I'm happy to answer questions.

Changed in linux (Ubuntu Xenial):
assignee:	Dan Streetman (ddstreet) → nobody
Changed in linux (Ubuntu Artful):
assignee:	Dan Streetman (ddstreet) → nobody
Changed in linux (Ubuntu):
assignee:	Dan Streetman (ddstreet) → nobody

Dan Streetman (ddstreet) on 2019-04-08

Changed in linux (Ubuntu Trusty):
assignee:	Dan Streetman (ddstreet) → nobody
Changed in linux (Ubuntu Bionic):
assignee:	Dan Streetman (ddstreet) → nobody

Brad Figg (brad-figg) on 2019-07-24

tags:

added: cscc

Revision history for this message

Jim Heald (steelcowboy1) wrote on 2020-02-23:

#100

Hi! I'm on Debian and I'm getting this issue with the following kernel:
4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26)

I feel like this isn't the best place to report this, but can someone point me to the best place to report this issue? Can't find any discussion on it from the past year but I don't know how I could have triggered it...

Revision history for this message

Thiago Vincenzi Conrado (tconrado28) wrote on 2020-02-23:

#101

hey guys, we create our namespace using ip netns add.
Each nameSpace has it's ipv6 and ipv4.
We notice that the ipv6 ping continues to report, but the namespace at /run/netns does not exist anymore.

is it possible to find the namespace using the ipv6 info?

Revision history for this message

Thiago Vincenzi Conrado (tconrado28) wrote on 2020-02-24:

#102

4.15.0-54-generic did work great in the past
we moved to 5.3.0-28-generic and this bug is notable...

funny enough one of our machines was able to recover... both have the very same OS version

Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
5.3.0-28-generic

maybe a downgrade de to Ubuntu 18.04.2 is a "solution"

Revision history for this message

Thiago Vincenzi Conrado (tconrado28) wrote on 2020-02-24:

#103

4.15.0-54-generic is bug free
5.3.0-28-generic has the bug

funny enough one of our machines was able to recover... both have the very same OS version

Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
5.3.0-28-generic

maybe a downgrade de to Ubuntu 18.04.2 is a "solution"

Revision history for this message

Sandro (supersandro2000) wrote on 2020-08-21:

#104

Bug also happens on 5.4.0-42-generic.

Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal

Revision history for this message

Nivedita Singhvi (niveditasinghvi) wrote on 2021-10-06:

#105

Is anyone still seeing a similar issue on current mainline?

tags:

added: sts

Revision history for this message

Mariano López (justanotherboy) wrote on 2021-10-27:

#106

Our serves are affected by this bug too in focal and it triggered today in one of our servers.

Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal

We have seen this in:
- Ubuntu 5.4.0-89.100-generic
- Ubuntu 5.4.0-88.99-generic

Revision history for this message

Nivedita Singhvi (niveditasinghvi) wrote on 2021-12-22:

#107

We are seeing definitely a problem on kernels after 4.15.0-159-generic,
which is the last known good kernel. 5.3* kernels are affected, but I
do not have data on most recent upstream.

Revision history for this message

Nivedita Singhvi (niveditasinghvi) wrote on 2021-12-23:

#108

As several different forums are discussing this issue,
I'm using this LP bug to continue investigation in to
current manifestation of this bug (after 4.15 kernel).

I suspect it's in one of the other places not fixed, as
my colleague Dan stated a while ago.

Revision history for this message

Mariano López (justanotherboy) wrote on 2022-01-31:

#109

Hi Nivedita,

Can you point me to the threads discussing this problem? I'm aware of this one, https://github.com/moby/moby/issues/5618 and some very old ones that were solved in the 4.15 kernel. I would like to try the solutions and provide feedback.

Thanks!

Ubuntu
linux package

unregister_netdevice: waiting for lo to become free

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

	Status	Importance	Assigned to
linux (Ubuntu)	In Progress	Medium	Unassigned
Trusty	In Progress	Medium	Unassigned
Xenial	In Progress	Medium	Unassigned
Zesty	Won't Fix	Medium	Unassigned
Artful	Won't Fix	Medium	Unassigned
Bionic	In Progress	Medium	Unassigned

Ubuntulinux package

unregister_netdevice: waiting for lo to become free

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
linux package