unregister_netdevice: waiting for lo to become free

Bug #1711407 reported by Dan Streetman on 2017-08-17
This bug affects 10 people
Affects: linux (Ubuntu), status tracked in Bionic

  Series    Importance    Assigned to
  Trusty    Medium        Dan Streetman
  Xenial    Medium        Dan Streetman
  Zesty     Medium        Dan Streetman
  Artful    Medium        Dan Streetman
  Bionic    Medium        Dan Streetman

Bug Description

This is a "continuation" of bug 1403152, as that bug has been marked "fix released" and recent reports of failure may (or may not) be a new bug. Any further reports of the problem should please be reported here instead of that bug.

--

[Impact]

When shutting down and starting containers the container network namespace may experience a dst reference counting leak which results in this message repeated in the logs:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

This can cause issues when trying to create a new network namespace, and thus block a user from creating new containers.
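A quick way to confirm a host is affected (a minimal check, not part of the original report text) is to look for the repeated message in the kernel log:

    dmesg | grep 'unregister_netdevice: waiting for lo'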

[Test Case]

See comment 16, reproducer provided at https://github.com/fho/docker-samba-loop
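The linked reproducer drives a Samba server container and repeatedly runs a privileged client container that mounts and unmounts the share over CIFS (see comment 16 and the analysis below). A rough sketch of that shape, with hypothetical image names standing in for the repo's real scripts:

    # start a samba server container (image name is a placeholder)
    docker run -d --name samba samba-server-image
    # repeatedly run a privileged client that mounts the share, lists it and unmounts;
    # the legacy --link provides the SAMBA_PORT_445_TCP_ADDR environment variable
    for i in $(seq 1 50); do
        docker run --rm --privileged --link samba:samba cifs-client-image sh -c \
            'mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt -o vers=3.0,user=nobody,password= && ls -la /mnt && umount /mnt'
    done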

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1711407

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Dan Streetman (ddstreet) wrote :

Can anyone experiencing this issue please provide details such as:

-what release are you using (trusty/xenial/zesty)?

-what kernel version are you using?

-do you have specific steps to reproduce the problem?

description: updated
Changed in linux (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
importance: Undecided → Medium
status: Incomplete → In Progress
Andrew Reis (drew-reis) wrote :

Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial

cat /etc/apt/sources.list.d/docker.list
deb [arch=amd64] https://apt.dockerproject.org/repo ubuntu-xenial main
# deb-src [arch=amd64] https://apt.dockerproject.org/repo ubuntu-xenial main

cat /etc/apt/sources.list.d/runner_gitlab-ci-multi-runner.list
# this file was generated by packages.gitlab.com for
# the repository at https://packages.gitlab.com/runner/gitlab-ci-multi-runner

deb https://packages.gitlab.com/runner/gitlab-ci-multi-runner/ubuntu/ xenial main
deb-src https://packages.gitlab.com/runner/gitlab-ci-multi-runner/ubuntu/ xenial main

dpkg -l | grep gitlab
ii gitlab-ci-multi-runner 9.4.2 amd64 GitLab Runner

dpkg -l | grep docker
ii docker-engine 17.05.0~ce-0~ubuntu-xenial amd64 Docker: the open-source application container engine

Seems to only occur when running docker with privileged inside of gitlab-runner. On "close" of the docker container, the process fails with the following:

Aug 17 17:53:04 WIS-DOCKERPHY kernel: [ 1579.245473] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:15 WIS-DOCKERPHY kernel: [ 1589.356182] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:25 WIS-DOCKERPHY kernel: [ 1599.466898] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:35 WIS-DOCKERPHY kernel: [ 1609.577597] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:45 WIS-DOCKERPHY kernel: [ 1619.692335] unregister_netdevice: waiting for lo to become free. Usage count = 1
Aug 17 17:53:54 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:54.994924776-05:00" level=error msg="containerd: start container" error="containerd: container did not start before the specified timeout" id=2ef6677cc840f3597e5b38024cfb337856adfaccdcc1a2739f78413979c0fa2b
Aug 17 17:53:54 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:54.995769254-05:00" level=error msg="Create container failed with error: containerd: container did not start before the specified timeout"
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.366310] docker0: port 1(veth9f03d7e) entered disabled state
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.367889] device veth9f03d7e left promiscuous mode
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.367891] docker0: port 1(veth9f03d7e) entered disabled state
Aug 17 17:53:55 WIS-DOCKERPHY kernel: [ 1629.449744] auplink[25254]: segfault at 7ffdd8ddb0d8 ip 00007facdfff1579 sp 00007ffdd8ddb0e0 error 6 in libc-2.23.so[7facdfef8000+1c0000]
Aug 17 17:53:55 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:55.215290996-05:00" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/90bdc718c5e518ef7b03df0e1769ab4638642ad0e847a38dbf16c6e600783f00: signal: segmentation fault (core dumped)"
Aug 17 17:53:55 WIS-DOCKERPHY dockerd[18386]: time="2017-08-17T17:53:55.215372764-05:00" level=error msg="Error unmounting container 2ef6677cc840f3597e5b38024cfb337856adfaccdcc1a2739f78413979c0fa2b: device or resource...


Changed in linux (Ubuntu):
status: In Progress → Confirmed
tags: added: apport-collected xenial

AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Aug 17 17:27 seq
 crw-rw---- 1 root audio 116, 33 Aug 17 17:27 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=UUID=7356cc51-3c3e-4d0d-bb9c-9d8daf436e5c
InstallationDate: Installed on 2016-09-20 (331 days ago)
InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Release amd64 (20160719)
IwConfig: Error: [Errno 2] No such file or directory
MachineType: Dell Inc. PowerEdge R710
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.11.0-14-generic root=UUID=35a35d02-8a6c-402f-86c5-c39085c81bd3 ro
ProcVersionSignature: Ubuntu 4.11.0-14.20~16.04.1-generic 4.11.12
RelatedPackageVersions:
 linux-restricted-modules-4.11.0-14-generic N/A
 linux-backports-modules-4.11.0-14-generic N/A
 linux-firmware 1.157.11
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial
Uname: Linux 4.11.0-14-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 07/23/2013
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 6.4.0
dmi.board.name: 00NH4P
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr6.4.0:bd07/23/2013:svnDellInc.:pnPowerEdgeR710:pvr:rvnDellInc.:rn00NH4P:rvrA00:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R710
dmi.sys.vendor: Dell Inc.

Fabian Holler (fh0) wrote :

With https://github.com/fho/docker-samba-loop I was able to reproduce kernel Oopses on a clean Ubuntu 16.04 installation with:

- linux-image-4.10.0-32-generic=4.10.0-32.36~16.04.1
- linux-image-4.11.0-14-generic=4.11.0-14.20~16.04.1

On 4.11.0-14 it was much harder to reproduce. Sometimes only a warning happened; once I was able to produce an Oops with https://github.com/fho/docker-samba-loop/commits/more_stress, but I don't know if the changes in that branch made it more likely to happen or if it was only a coincidence.

The kernel message:
  unregister_netdevice: waiting for lo to become free. Usage count = 1
shows up, and some minutes later the Oops and/or warning happens.

In a different scenario, where Ubuntu 16.04 servers were running multiple docker containers with Nginx or small network applications in parallel, I was also able to reproduce the kernel Oopses on:

- linux-image-4.10.0-1004-gcp
- linux-image-4.4.0-93-generic
- linux-image-4.11.0-14-generic
- linux-image-4.12.10-041210-generic=4.12.10-041210.20170830

I haven't tried again to reproduce it with those kernels on a clean Ubuntu
installation and unfortunately didn't keep the kernel logs.

Kernel logs can be found in the attached file.

Fabian Holler (fh0) wrote :

Attached are the logs for an Oops on Ubuntu 14.04 on kernel linux-image-4.4.0-93-generic=4.4.0-93.116~14.04.1

Thomas Morin (tmmorin-orange) wrote :

bug 1715660 seems related

( Lots of "waiting for lo to become free" message, then a "task ip:1358 blocked for more than 120 seconds." with a backtrace )

Fabian Holler (fh0) wrote :

I could not reproduce the bug with the described method on kernel 4.4.0-81-generic, nor with 4.13.0-041300rc7-generic. 4.4.0-81 logged hung tasks but did not Oops.

So the bug might have been reintroduced between 4.4.0-82 and 4.4.0-93, and 4.13 seems to contain a fix.

Dan Streetman (ddstreet) wrote :

Thank you for the reproducer; I have been able to reproduce this using the docker-samba-loop scripts. This should help me dig into the kernel to see where the dst leak is coming from.

description: updated
Charles (charles-ubuntu-m) wrote :

We're seeing this on our build servers. These are VMWare virtual machines which run docker containers in privileged mode. Inside the containers are Jenkins agents and a build environment. Part of the build process uses unshare to create these namespaces: ipc uts mount pid net. If you have any questions about this arrangement, I can try to provide more detail and maybe example code.

The kernel is 4.4.0-79-generic x86_64. We are using docker 17.05.0-ce.

In one case, the whole VM hung when this was encountered.

Are there any workarounds for 16.04 users beyond rebooting the host?

Jens Elkner (jelmd) wrote :

Since xenial we have had this very annoying behavior (unregister_netdevice: waiting for lo to become free. Usage count = 1) as well, which in turn makes a container reboot take several minutes instead of seconds as it did e.g. on vivid (lxc-ls -f also hangs for that time). E.g.:

 [ +10.244888] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Oct 2 17:16] INFO: task lxc-autostart:5321 blocked for more than 120 seconds.
[ +0.007147] Tainted: P OE 4.4.0-96-generic #119-Ubuntu
[ +0.006796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.007913] lxc-autostart D ffff883fb99fbcb8 0 5321 1 0x00000000
[ +0.000007] ffff883fb99fbcb8 ffffffffffffffff ffff883ff2a3aa00 ffff881fed2fe200
[ +0.000004] ffff883fb99fc000 ffffffff81ef79a4 ffff881fed2fe200 00000000ffffffff
[ +0.000004] ffffffff81ef79a8 ffff883fb99fbcd0 ffffffff8183f165 ffffffff81ef79a0
[ +0.000004] Call Trace:
[ +0.000014] [<ffffffff8183f165>] schedule+0x35/0x80
[ +0.000005] [<ffffffff8183f40e>] schedule_preempt_disabled+0xe/0x10
[ +0.000016] [<ffffffff81841049>] __mutex_lock_slowpath+0xb9/0x130
[ +0.000014] [<ffffffff818410df>] mutex_lock+0x1f/0x30
[ +0.000009] [<ffffffff8172da4e>] copy_net_ns+0x6e/0x120
[ +0.000009] [<ffffffff810a179b>] create_new_namespaces+0x11b/0x1d0
[ +0.000003] [<ffffffff810a18bd>] copy_namespaces+0x6d/0xa0
[ +0.000007] [<ffffffff8107f1d2>] copy_process+0x8e2/0x1b30
[ +0.000011] [<ffffffff810805b0>] _do_fork+0x80/0x360
[ +0.000015] [<ffffffff81003226>] ? exit_to_usermode_loop+0xa6/0xd0
[ +0.000004] [<ffffffff81080939>] SyS_clone+0x19/0x20
[ +0.000009] [<ffffffff81843272>] entry_SYSCALL_64_fastpath+0x16/0x71
[ +4.882232] unregister_netdevice: waiting for lo to become free. Usage count = 1
...

Dan Streetman (ddstreet) wrote :

Hi everyone,

so this is an interesting problem. There are 2 parts to this:

1. in the container being destroyed, some socket(s) remain open for a period of time, which prevents the container from fully exiting until all its sockets have closed. While this happens you will see the 'waiting for lo to become free' message repeated.

2. while the previous container is waiting for its sockets to close so it can exit, any new containers are blocked from starting.

For issue #2, I'm not clear just yet on what is blocking the new containers from starting. I thought it could be the rtnl_lock since the exiting container does loop in the rtnl_unlock path, however while it's looping it also calls __rtnl_unlock/rtnl_lock during its periods of waiting, so the mutex should not be held the entire time. I'm still investigating that part of this issue to find what exactly is blocking the new containers.

For issue #1, the sockets are lingering because - at least for the reproducer case from comment 16 - there is a TCP connection that is never closed, so the kernel continues to probe the other end until it times out, which takes around 2 minutes. For this specific reproducer, there are 2 easy ways to work around this. First, for background: the reproducer uses docker to create a CIFS mount, which uses a TCP connection. This is the script run on the 'client' side, where the hang happens:

    date ; \
    mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
    date ; \
    ls -la /mnt ; \
    umount /mnt ; \
    echo "umount ok"

this is repeated 50 times in the script, but the 2nd (and later) calls hang (for 2 minutes each, if you wait that long).

A. The reason the TCP connection to the CIFS server lingers is because it is never closed; the reason it isn't closed is that the script exits the container immediately after unmounting. The kernel cifs driver, during fs unmount, closes the TCP socket to the CIFS server, which queues a TCP FIN to the server. Normally, this FIN will reach the server, which will respond, and the TCP connection will close. However, since the container starts exiting immediately, that FIN never reaches the CIFS server - the interface is taken down too quickly. So, the TCP connection remains. To avoid this, simply add a short sleep between unmounting and exiting the container, e.g. (new line prefixed with +):

    date ; \
    mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
    date ; \
    ls -la /mnt ; \
    umount /mnt ; \
+ sleep 1 ; \
    echo "umount ok"

that avoids the problem, and allows the container to exit immediately and the next container to start immediately.

B. Instead of delaying enough to allow the TCP connection to exit, the kernel's TCP configuration can be changed to more quickly timeout. The specific setting to change is tcp_orphan_retries, which controls how many times the kernel keeps trying to talk to a remote host over a closed socket. Normally, this defaults to 0 - this will cause the kernel to actually use a value of 8 (retries). So, change it from 0 to 1, to actually reduce it from 8 to 1 (confusing, I know). Like (n...

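For reference, a minimal sketch of workaround B, lowering the orphan retry count via sysctl (the exact commands in the truncated text above are not shown; the value 1 follows the description):

    # reduce how long the kernel keeps probing over a closed (orphaned) socket
    sudo sysctl -w net.ipv4.tcp_orphan_retries=1
    # optionally persist across reboots (file name is arbitrary)
    echo 'net.ipv4.tcp_orphan_retries = 1' | sudo tee /etc/sysctl.d/90-tcp-orphan.conf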

Charles (charles-ubuntu-m) wrote :

Is there any way for a non-kernel-dev type to see exactly which resource unregister_netdevice is waiting for? (Or does it only keep track of usage count?)

Fabian Holler (fh0) wrote :

Hello Dan,

thanks for the analysis!

That the startup of containers is delayed is annoying.
The much bigger issue is that it can reproducibly cause a kernel Oops and crash a whole machine.

Dan Streetman (ddstreet) wrote :

My analysis so far of the problem:

1. container A has an outstanding TCP connection (thus, a socket and dst which hold reference on the "lo" interface from the container). When the container is stopped, the TCP connection takes ~2 minutes to timeout (with default settings).

2. when container A is being removed, its net namespace is removed via net/core/net_namespace.c function cleanup_net(). This takes two locks - first, the net_mutex, and then the rtnl mutex (i.e. rtnl_lock). It then cleans up the net ns and calls rtnl_unlock(). However, rtnl_unlock() internally waits for all the namespace's interfaces to be freed, which requires all their references to reach 0. This must wait until the above TCP connection times out, before it releases its reference. So, at this point the container is hung inside rtnl_unlock() waiting, and it still holds the net_mutex. It doesn't release the net_mutex until its lo interface finally is destroyed after the TCP connection times out.

3. When a new container is started, part of its startup is to call net/core/net_namespace.c function copy_net_ns() to copy the caller's net namespace to the new container net namespace. However, this function locks the net_mutex. Since the previous container is still holding the net_mutex as explained in #2 above, this new container creation blocks until #2 releases the mutex.

There are a few ways to possibly address this:

a) when a container is being removed, all its TCP connections should abort themselves. Currently, TCP connections don't directly register for interface unregister events - they explicitly want to stick around, so if an interface is taken down and then brought back up, the TCP connection remains, and the communication riding on top of the connection isn't interrupted. However, the way TCP does this is to move all its dst references off the interface that is unregistering, to the loopback interface. This works for the initial network namespace, where the loopback interface is always present and never removed. However, this doesn't work for non-default network namespaces - like containers - where the loopback interface is unregistered when the container is being removed. So this aspect of TCP may need to change, to correctly handle containers. This also may not cover all causes of this hang, since sockets handle more than just tcp connections.

b) when a container is being removed, instead of holding the net_mutex while cleaning up (and calling rtnl_unlock), it could release the net_mutex first (after removing all netns marked for cleanup from the pernet list), then call rtnl_unlock. This needs examination to make sure it would not introduce any races or other problems.

c) rtnl_unlock could be simplified - currently, it includes significant side-effects, which include the long delay while waiting to actually remove all references to the namespace's interfaces (including loopback). Instead of blocking rtnl_unlock() to do all this cleanup, the cleanup could be deferred. This also would need investigation to make sure no caller is expecting to be able to free resources that may be accessed later from the cleanup action (which I believe is the case).

As...

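If new container creation hangs as described in point 3 above, one way to confirm the task is blocked in copy_net_ns on the net_mutex (a suggestion, not from the original comment) is to inspect the hung task's kernel stack:

    # <pid> is the hung lxc/docker/unshare process; a blocked task typically shows
    # copy_net_ns and mutex_lock, matching the hung-task backtrace posted earlier
    sudo cat /proc/<pid>/stack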

Dan Streetman (ddstreet) wrote :

> Is there any way for a non-kernel-dev type to see exactly which resource
> unregister_netdevice is waiting for? (Or does it only keep track of usage count?)

not really, no. It's simply waiting for its reference count to drop to 0, and just broadcasts unregister events periodically hoping that everyone with a reference is listening and does whatever they need to do to drop their references. In this case the socket's dst actually is listening, but the way the dst handles it assumes that the loopback interface will never, ever unregister - the dst shifts its reference count from the interface going down, to its netns's loopback interface. If the interface going down is the netns's loopback interface, then that reference shift doesn't help, of course.

So no, there is no simple way for you to tell what it's waiting for.

Dan Streetman (ddstreet) wrote :

> That the startup of containers is delayed is annoying.
> The much bigger issue is that it can reproducible cause a kernel Oops and
> crash a whole machine.

Yes, as you found, this uncovers other bugs in the kernel, but those are separate and it looks like they are getting fixed.

Fabian Holler (fh0) wrote :

Ok, seems that I'm then in the wrong ticket with my issues. :-)
Should separate Ubuntu bug reports be created (if they don't exist already) regarding the kernel
crashes?

Dan Streetman (ddstreet) wrote :

> Should separate Ubuntu bug reports be created (if they don't exist already) regarding
> the kernel crashes

This bug is only to fix the hang/delay in new container creation. Please open a new bug for kernel crashes.

Dan Streetman (ddstreet) on 2017-11-09
Changed in linux (Ubuntu Artful):
status: Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Zesty):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Zesty):
importance: Undecided → Medium
Changed in linux (Ubuntu Trusty):
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Trusty):
status: New → In Progress
Changed in linux (Ubuntu Zesty):
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
Jiri Horky (jiri-horky) wrote :

This information might be relevant.

We are able to reproduce the problem ("unregister_netdevice: waiting for lo to become free. Usage count = 1") with a 4.14.0-rc3 kernel with CONFIG_PREEMPT_NONE=y, running on only one CPU with the following kernel boot options:

BOOT_IMAGE=/boot/vmlinuz-4.14.0-rc3 root=/dev/mapper/vg0-root ro quiet vsyscall=emulate nosmp
Once we hit this state, it stays that way and a reboot is needed; no more containers can be spawned. We reproduce it by running images doing ipsec/openvpn connections plus downloading a small file inside the tunnels. Then the instances exit (usually they run < 10s). We run tens of such containers a minute on one machine. With the abovementioned settings (only 1 CPU), the machine hits it in ~2 hours.

Another reproducer with the same kernel, but without limiting the number of CPUs, is to just run iperf in UDP mode for 3 seconds inside the container (so there is no TCP communication at all). If we run 10 such containers in parallel, wait for all of them to finish and do it again, we hit the trouble in less than 10 minutes (on a 40-core machine).
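A rough sketch of the iperf-based reproducer described above (the container image name and the iperf server address are placeholders):

    # run 10 short-lived containers in parallel, each doing 3 seconds of UDP iperf,
    # wait for all of them to finish, then repeat
    while true; do
        for i in $(seq 1 10); do
            docker run --rm iperf-client-image iperf -u -c 192.0.2.1 -t 3 &
        done
        wait
    done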

In both of our reproducers, we added "ip route flush table all; ifconfig down; sleep 10" before exiting from the containers. It does not seem to have any effect.

Dan Streetman (ddstreet) wrote :

> In both of our reproducers, we added "ip route flush table all;

that won't help

> ifconfig down;

that's not a valid cmdline

> sleep 10" before exiting from containers.

sleeping without closing existing open sockets doesn't help

what are you using for container mgmt? docker? lxc? lxd?

Dan Streetman (ddstreet) wrote :

simpler reproducer for this using LXD instead of Docker:

1. install/setup LXD
$ sudo apt install lxd
$ sudo lxd init

2. create two containers
$ lxc launch ubuntu:xenial server
$ lxc launch ubuntu:xenial client

3. make the client privileged (so it can mount smb/cifs)
$ lxc config set client security.privileged true
$ lxc stop client
$ lxc start client

4. set up samba server in container
$ lxc exec server apt update
$ lxc exec server apt install cifs-utils samba
$ lxc exec server mkdir /dummy
$ lxc exec server touch /dummy/emptyfile
$ lxc exec server vim /etc/samba/smb.conf

add this section at the end of the container's smb.conf:

[dummy]
  path = /dummy
  browseable = yes
  read only = yes
  guest ok = yes

$ lxc stop server
$ lxc start server

5. note server container's ip
$ lxc list server

e.g.
| server | RUNNING | 10.109.201.68 (eth0)

6. setup client container
$ lxc exec client mkdir /dummy
$ lxc exec client apt update
$ lxc exec client apt install cifs-utils

7. stop client container
$ lxc stop client

8. full cycle of client container start through stop
$ lxc start client ; sleep 2 ; lxc exec client -- mount -o guest //10.109.201.68/dummy /dummy ; lxc exec client -- ls -l /dummy ; lxc exec client umount /dummy ; lxc stop -f client

9. repeat client cycle; it should hang due to this bug
$ lxc start client ; sleep 2 ; lxc exec client -- mount -o guest //10.109.201.68/dummy /dummy ; lxc exec client -- ls -l /dummy ; lxc exec client umount /dummy ; lxc stop -f client

Note this force-stops the container, instead of normally stopping it; if the container is allowed to exit normally, the problem isn't reproduced (with these steps, at least). Also note the 'sleep 2' is there to give enough time for the client container network to come up; it may need to be increased if your client container takes longer to assign its ip address (you can edit the client container network config to use a static address if you want, which should be faster).

Stefan Rinkes (darinkes) wrote :

Hi,

In a test-setup I can easily reproduce this issue very quickly.
In my scenario it looks like the issue is only triggered if I change the IP inside the docker container
(needed to test StrongSwan's MOBIKE feature), send some data (ICMP), and then kill the docker container.

So if there is something to test or debug, I would be happy to help fix this issue.

Also already tried the latest kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/:
Linux docker 4.15.0-041500rc3-generic #201712110230 SMP Mon Dec 11 02:31:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Dan Streetman (ddstreet) wrote :

> So if there is something to test or debug, I would be happy to help fixing this issue.

yes, I'm hoping to have a kernel build with some added debug to track/list exactly what socket(s) are still holding lo device references when the container is killed/shutdown, so testing with that would help clarify the specifics of this problem, once I have the debug kernel ready and built. I will update this bug once I have something ready to test.

> Also already tried the latest kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/:

and I assume that did not help, correct? This problem still exists upstream (assuming we're all encountering the same problem that I'm debugging) and will require an upstream patch.

Daniel Borkmann (dansel) wrote :

Hi Dan,

did the test kernel you were using have this one [1] included? It's upstream commit f186ce61bb82 ("Fix an intermittent pr_emerg warning about lo becoming free."). Meaning, it's rather a dst leak somewhere in latest upstream as well?

Thanks,
Daniel

  [1] https://patchwork.ozlabs.org/patch/773480/

Dan Streetman (ddstreet) wrote :

> the test kernel you were using had this one [1] included? It's upstream commit f186ce61bb82

no that commit doesn't help, because the issue isn't one of dst refcount leaks or delayed dst gc; the problem is kernel sockets that haven't closed yet holding a reference to the netns lo interface device via their dsts.

I do think I have a possible fix for this though, and I have some test kernels building in this ppa (please note they are currently building and will take a few more hours to finish):
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

can anyone please test with any of those kernels (they all contain the patch to maybe fix this)? In my testing it eliminated the problem completely, but I'd like to get some more diverse real-world testing before starting an upstream discussion on the patch.
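For anyone wanting to test, a typical way to install a kernel from that PPA (the image package name below is only an example; pick the one matching your release from the PPA page):

    sudo add-apt-repository ppa:ddstreet/lp1711407
    sudo apt-get update
    # example package name only
    sudo apt-get install linux-image-4.4.0-104-generic
    sudo reboot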

Stefan Rinkes (darinkes) wrote :

Ok, waiting for the build to finish and then will test&report ASAP.

Stefan Rinkes (darinkes) wrote :

$ sudo add-apt-repository ppa:ddstreet/lp1711407
Cannot add PPA: 'ppa:~ddstreet/ubuntu/lp1711407'.
ERROR: '~ddstreet' user or team does not exist.

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"

Stefan Rinkes (darinkes) wrote :

nvm, forgot to configure https_proxy :)

Stefan Rinkes (darinkes) wrote :

Looks good so far. Docker is still usable after 5000 runs, no "waiting for lo to become free" in syslog.

Linux docker 4.4.0-104-generic #127+hf1711407v20171220b1-Ubuntu SMP Wed Dec 20 12:31:13 UTC 201 x86_64 x86_64 x86_64 GNU/Linux

$ docker version
Client:
 Version: 17.09.1-ce
 API version: 1.32
 Go version: go1.8.3
 Git commit: 19e2cf6
 Built: Thu Dec 7 22:24:23 2017
 OS/Arch: linux/amd64

Server:
 Version: 17.09.1-ce
 API version: 1.32 (minimum version 1.12)
 Go version: go1.8.3
 Git commit: 19e2cf6
 Built: Thu Dec 7 22:23:00 2017
 OS/Arch: linux/amd64
 Experimental: false

Dan Streetman (ddstreet) wrote :

One note on the test kernels: I'm still updating the patch, as the change in the current test kernels will leave sockets open, which leaks memory due to the netns not fully closing/exiting. I'll update this bug when I have an updated patch and test kernels.

I would not recommend the current test kernels for long-term use on systems where containers (or namespaces) are created/destroyed regularly.

Daniel Graña (danielgrana) wrote :

Hi Dan,

How do we know when we hit the bug with your patched kernel? Is it still logging "waiting for lo to become free", or is there any other signature we can use to detect the original problem and still say "Wow! it works, Dan rocks"?

I have been scavenging the internet looking for your linux git fork and the commits for this ticket without luck: nothing in your github fork, nor at http://kernel.ubuntu.com/git/ or https://launchpad.net/linux. Is there a public place where we can take a look at the patch?

thanks!

Dan Streetman (ddstreet) wrote :

Happy Holidays!

After some more investigation I've changed the patch to what I think is a more appropriate fix, though it will only fix cases where the container hang is due to a TCP kernel socket hanging while trying to close itself (waiting for FIN close sequence, which will never complete since the container interfaces are shut down).

I also added some kernel-owned socket debug, so in case of another 'waiting for lo to become free' hang it will print out all open kernel sockets along with their family, state, and creating function. That should hopefully help debug further, assuming my TCP patch doesn't cover all cases.

The same ppa has been updated with the new patches:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

And if you would like to see the patches they're in my lp git repo:
https://code.launchpad.net/~ddstreet/+git/linux/+ref/lp1711407-tcp-check-xenial

Can those able to reproduce this bug please test with any of the kernels from that ppa? There is some debug output to let you track container net namespace lifetimes; you should see log entries like:

[ 229.134701] net_alloc: created netns ffff8800b99cc980
...
[ 438.930538] net_free: freed netns ffff8800b99cc980

you can track the alloc and free of each container that way, to make sure none of them are remaining open (i.e. leaking). Note that netns addresses may be re-used, so you may see something like:

[ 228.797533] net_free: freed netns ffff8800baab1880
...
[ 352.000520] net_alloc: created netns ffff8800baab1880

that's fine and not a problem.
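One way to spot leaked namespaces with these debug kernels (a suggestion, not from the original comment) is to pair up the net_alloc/net_free lines:

    # count allocs minus frees per netns address; anything left positive was never freed
    dmesg | awk '/net_alloc: created netns/ {a[$NF]++}
                 /net_free: freed netns/   {a[$NF]--}
                 END {for (n in a) if (a[n] > 0) print n}'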

Also, if you see this line in the logs:

[ 22.149194] TCP: our netns is exiting

that indicates the patch to fix this problem detected the problem and closed the hanging TCP socket.

If you can reproduce the issue with any of those test kernels, please post one of the socket debug output sections that look like:

[ 410.258505] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 410.261385] netdev_wait_allrefs: waiting on sk ffff8800b9662800 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[ 410.261390] netdev_wait_allrefs: waiting on sk ffff8800b9662000 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[ 410.261408] netdev_wait_allrefs: waiting on sk ffff8800b9661000 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[ 410.261411] netdev_wait_allrefs: waiting on sk ffff8800baeb4000 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[ 410.261416] netdev_wait_allrefs: waiting on sk ffff8800b9699680 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[ 410.261420] netdev_wait_allrefs: waiting on sk ffff8800bb3743c0 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[ 410.261423] netdev_wait_allrefs: waiting on sk ffff8800bb374780 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[ 410.261426] netdev_wait_allrefs: waiting on sk ffff8800bb374b40 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[ 410.261430] netdev_wait_allrefs: waiting on sk ffff8800bb374f00 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[ 410.261433] netdev_wait_allrefs: waiting on sk ffff8800bb3752c0 family 2 state 7 creator inet_ctl_sock_create+0x35/0x8...


Dan Streetman (ddstreet) wrote :

I've updated the patch and submitted upstream:
https://marc.info/?l=linux-netdev&m=151631015108584&w=2

The PPA also contains the latest ubuntu kernels patched for the issue, if anyone has time to test:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

> How do we know when we hit the bug with your patched kernel?

I didn't include any output in the latest kernels to indicate this issue had been detected and fixed. If anyone is able to reproduce it with one of the test kernels, I can provide a debug kernel to gather more details of why it's still failing, though.

Jiri Horky (jiri-horky) wrote :

Hi Dan,

first of all thanks for looking into this long lasting, irritating issue.

I have tried your kernel (4.13.0-30-generic #33+hf1711407v20180118b2-Ubuntu SMP) on Ubuntu 17.10 and was able to get "unregister_netdevice: waiting for lo to become free. Usage count = 1" in less than 5 minutes :-(
Our environment spawns about 1 container a second that does either openvpn (both udp and tcp) or ipsec tunnel testing.
I don't see any debug output in the kernel log; the lines are just the old ordinary:

[ 910.690908] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 920.806382] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 930.945869] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 941.025341] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 951.104797] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 961.184286] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 971.263749] unregister_netdevice: waiting for lo to become free. Usage count = 1

I don't even see any "net_free:" or "net_alloc:" messages in the log (and I checked the referenced git and your recent patch, and it does not seem to be there either).

Please let me know how to enable the debug messages, so we can finally trace it down.

Thanks
Jiri Horky

Dan Streetman (ddstreet) wrote :

> was able to get "unregister_netdevice: waiting for lo to become free. Usage count = 1"
> in less than 5 minutes :-(

well, that's unfortunate that you still see it but fortunate that you're able to reproduce it quickly. The issue that I fixed - TCP kernel socket staying open - is certainly not the only way to encounter the problem.

> Our environment spawns about 1 container a second that does either openvpn (both udp and tcp)
> or ipsec tunnel testing.

That's good to know; my guess would be ipsec, but of course will require investigation. I'll try reproducing separately with openvpn and ipsec also.

> PLease let me know how to enable the debug messages

the previous builds did have some dbg messages, but I removed them for the last build, so I'll need to add dbg again and spin new kernel builds. I'll update this bug once dbg kernel(s) are ready.

Jiri Horky (jiri-horky) wrote :

Ok, waiting for the kernel with dbg messages. I should be able to test it right away.

Dan Streetman (ddstreet) wrote :

I uploaded a new debug kernel to the ppa:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

I've only updated the 4.13 kernel currently, on Artful, so if you need a Xenial HWE 4.13 kernel let me know and I can build that as well. It will likely take hours to finish building but should be ready tomorrow.

With the debug kernel, when you hit the bug, you should see additional debug output, like:

[ 87.028053] unregister_netdevice: waiting for lo (netns ffff9af471fa8000) to become free. Usage count = 1
[ 87.031461] (netns ffff9af471fa8000): open sk ffff9af479983000 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
[ 87.031464] (netns ffff9af471fa8000): open sk ffff9af479982000 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
[ 87.031467] (netns ffff9af471fa8000): open sk ffff9af479982800 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
[ 87.031470] (netns ffff9af471fa8000): open sk ffff9af479985000 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
[ 87.031475] (netns ffff9af471fa8000): open sk ffff9af476541540 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031479] (netns ffff9af471fa8000): open sk ffff9af471096940 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031482] (netns ffff9af471fa8000): open sk ffff9af471094f00 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031485] (netns ffff9af471fa8000): open sk ffff9af4710970c0 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031488] (netns ffff9af471fa8000): open sk ffff9af471094000 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031491] (netns ffff9af471fa8000): open sk ffff9af4710961c0 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031494] (netns ffff9af471fa8000): open sk ffff9af471095680 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031497] (netns ffff9af471fa8000): open sk ffff9af471096580 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031501] (netns ffff9af471fa8000): open sk ffff9af4710943c0 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
[ 87.031504] (netns ffff9af471fa8000): open sk ffff9af479987000 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
[ 87.031507] (netns ffff9af471fa8000): open sk ffff9af479981800 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
[ 87.031510] (netns ffff9af471fa8000): o...


Jiri Horky (jiri-horky) wrote :

Hi Dan,

Here is the output of the kernel when the "unregister_netdevice" bug was hit for the first time on the reproducer machine.

I hope it helps. Let me know if I can provide you more debug output or try to reproduce it once more.

Thanks
Jiri Horky

Jan 31 08:42:26 prg41-004 kernel: [ 663.049461] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
Jan 31 08:42:26 prg41-004 kernel: [ 663.049477] (netns ffff88cd026e0000): open sk ffff88cd0120a000 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
Jan 31 08:42:26 prg41-004 kernel: [ 663.049481] (netns ffff88cd026e0000): open sk ffff88cd0120d800 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
Jan 31 08:42:26 prg41-004 kernel: [ 663.049486] (netns ffff88cd026e0000): open sk ffff88cd0120d000 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
Jan 31 08:42:26 prg41-004 kernel: [ 663.049490] (netns ffff88cd026e0000): open sk ffff88cd0120c000 family 16 state 7 dst_cache (null) rx_dst (null) creator __netlink_kernel_create+0x6c/0x260
Jan 31 08:42:26 prg41-004 kernel: [ 663.049495] (netns ffff88cd026e0000): open sk ffff88cd18dc2200 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
Jan 31 08:42:26 prg41-004 kernel: [ 663.049498] (netns ffff88cd026e0000): open sk ffff88cd10e58f00 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
Jan 31 08:42:26 prg41-004 kernel: [ 663.049501] (netns ffff88cd026e0000): open sk ffff88cd10e5da00 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
Jan 31 08:42:26 prg41-004 kernel: [ 663.049504] (netns ffff88cd026e0000): open sk ffff88cd10e5b0c0 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
Jan 31 08:42:26 prg41-004 kernel: [ 663.049506] (netns ffff88cd026e0000): open sk ffff88cd10e5c380 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
Jan 31 08:42:26 prg41-004 kernel: [ 663.049509] (netns ffff88cd026e0000): open sk ffff88cd10e5b480 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
Jan 31 08:42:26 prg41-004 kernel: [ 663.049512] (netns ffff88cd026e0000): open sk ffff88cd10e5a1c0 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
Jan 31 08:42:26 prg41-004 kernel: [ 663.049514] (netns ffff88cd026e0000): open sk ffff88cd10e5b840 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
Jan 31 08:42:26 prg41-004 kernel: [ 663.049517] (netns ffff88cd026e0000): open sk ffff88cd10e5cb00 family 2 state 7 dst_cache (null) rx_dst (null) creator inet_ctl_sock_create+0x37/0x80
...

Dan Streetman (ddstreet) wrote :

> Here is the output of the kernel when the "unregister_netdevice" bug was hit

Thanks! And just to verify, that's the full output of all those lines before the next "waiting..." line, right? If so, it looks like the problem in your case is not a kernel socket holding the netns open; it looks like a good old dst leak, or something else holding a dst, so I'll need to add debug tracking those now...

Also, when you reproduce the problem, if you leave it for a while (like, 5 minutes or so), does it resolve itself? Or stay hung and printing the "waiting..." messages forever? Make sure to check the netns addr in the "waiting..." message to verify it's stuck waiting for the *same* netns to close forever, not just reproducing the problem over and over with different netns.

I'll get another kernel ready with additional dst debug. I think there is at least one dst leak patch upstream too, I'll check for those.
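A quick way to check whether the same netns stays stuck forever or different ones come and go (a suggestion, not from the original comment):

    # one address repeated many times means the same netns is stuck
    dmesg | grep 'waiting for lo' | grep -o 'netns [0-9a-f]*' | sort | uniq -c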

Jiri Horky (jiri-horky) wrote :

Hi Dan,

to answer your questions:
1) Yes, it was full output before next "waiting for..." message
2) If I leave the box running (and stop generating the load), it actually does not recover itself at all. It keeps outputting the "waiting for..." message forever with the same device:

# dmesg | grep waiti
[35684.449735] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35694.529742] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35704.609742] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35714.689748] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35724.769751] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35734.849759] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35744.929763] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35755.009758] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35765.089759] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35775.169756] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35785.249761] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35795.329763] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35805.409758] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35815.489753] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35825.569749] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35835.649747] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35845.729752] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35855.809741] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35865.889736] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35875.969733] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35886.049726] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1
[35896.129717] unregister_netdevice: waiting for lo (netns ffff88cd026e0000) to become free. Usage count = 1

I am happy to continue testing with a new kernel.
Cheers
Jirka H.

Dan Streetman (ddstreet) wrote :

Hi Jirka,

I have a new debug kernel building (will take a few hours as usual); please test when you have a chance. New kernel is still for Artful and version is linux-4.13.0-32.35+hf1711407v20180202b1.

https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

Jiri Horky (jiri-horky) wrote :

Hi Dan,

here is the output:

Feb 2 18:50:14 prg41-004 kernel: [ 482.151773] unregister_netdevice: waiting for lo (netns ffff8de11803e000) to become free. Usage count = 1
Feb 2 18:50:14 prg41-004 kernel: [ 482.151876] (netns ffff8de11803e000): dst ffff8dd905360300 expires 0 error 0 obsolete 2 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator rt_dst_alloc+0x54/0xe0

It keeps being output every 10s, for over 500s now.

I hope it helps.

Jirka Horky

Dan Streetman (ddstreet) wrote :

And we're on our way down the rabbit hole...

netns exit is blocked by reference from interface...
interface exit is blocked by reference from dst object...
dst object free is blocked by reference from ???...

I'll set up more dbg and have a new test kernel, probably by Monday.

Thanks!

Dan Streetman (ddstreet) wrote :

Here is another Artful debug kernel, with just slightly more debug info; it may help a little while I work on adding more significant debug to track where the dst refcnt is getting leaked. Same PPA:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

New kernel version is 4.13.0-32.35+hf1711407v20180205b1.

Thanks again for testing these debug kernels!

Jiri Horky (jiri-horky) wrote :

Hi Dan,

Glad to help, I thank you for digging into that! Here is the latest output:

[ 912.489822] unregister_netdevice: waiting for lo (netns ffff8de5c10f3000) to become free. Usage count = 1
[ 912.489927] (netns ffff8de5c10f3000): dst ffff8de5ba0ced00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0x9c0

Waiting for another image ;-)

Regards
Jiri Horky

Dan Streetman (ddstreet) wrote :

Ok, here is a new debug kernel:
http://people.canonical.com/~ddstreet/lp1711407/linux-image-4.13.13-00940-gf16e2bbbddee_4.13.13-00940-gf16e2bbbddee-24_amd64.deb

I didn't use the PPA as this one will dump out quite a lot of debug in the logs all the time. When you reproduce it (which hopefully you can still do, even with the added debug msgs), please check the output to find the address of the leaked dst, e.g.:

[ 2833.421135] (netns ffff9f4d77278000): dst ffff9f4d743e9a00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_input_rcu+0x6b0/0xd80

would be dst address "ffff9f4d743e9a00". Then, please grep dmesg for that, so you get output similar to:

$ dmesg | grep ffff9f4d743e9a00
[ 2823.048309] dst_init: dst ffff9f4d743e9a00 refcnt 1 from __dst_alloc+0x7e/0x100
[ 2823.048312] dst_hold: dst ffff9f4d743e9a00 refcnt 2 from rt_cache_route+0x54/0xe0
[ 2823.048327] dst_hold_safe: dst ffff9f4d743e9a00 refcnt 3 from inet_sk_rx_dst_set+0x7f/0xc0
[ 2823.048453] dst_release: dst ffff9f4d743e9a00 refcnt 2 from skb_release_head_state+0x96/0xa0
[ 2823.048572] dst_hold_safe: dst ffff9f4d743e9a00 refcnt 3 from tcp_add_backlog+0x12f/0x170
[ 2823.048580] dst_release: dst ffff9f4d743e9a00 refcnt 2 from skb_release_head_state+0x96/0xa0
[ 2823.231547] dst_release_immediate: dst ffff9f4d743e9a00 refcnt 1 from free_fib_info_rcu+0x173/0x1c0
[ 2833.421088] (netns ffff9f4d77278000): open sk ffff9f4d766aa100 family 2 state 4 dst_cache (null) rx_dst ffff9f4d743e9a00 creator generic_ip_connect+0x24b/0x4f0 [cifs]
[ 2833.421135] (netns ffff9f4d77278000): dst ffff9f4d743e9a00 expires 0 error 0 obsolete 2 flags 0 refcnt 1 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_input_rcu+0x6b0/0xd80

Paste that output here, so I can see the history of the dst refcnt. If you don't see the initial dst_init in the trace, try using /var/log/kern.log instead of dmesg, as dmesg may have rolled over the older logs if you have a lot of output going on.

Dan Streetman (ddstreet) wrote :

Can you also test with this kernel please:
http://people.canonical.com/~ddstreet/lp1711407/linux-image-4.13.13-00941-gf382397cf315_4.13.13-00941-gf382397cf315-26_amd64.deb

It's the same as the last one, with a ipsec/xfrm dst leak patch.

Jiri Horky (jiri-horky) wrote :

Hi Dan,

so for the first kernel (linux-image-4.13.13-00940-gf16e2bbbddee_4.13.13-00940-gf16e2bbbddee-24_amd64.deb), it failed after some 10 minutes. The outputs are:

Feb 13 22:35:54 prg41-004 kernel: [ 736.399342] unregister_netdevice: waiting for lo (netns ffff9adbfbebb000) to become free. Usage count = 1
Feb 13 22:35:54 prg41-004 kernel: [ 736.399376] (netns ffff9adbfbebb000): dst ffff9ae3fee51f00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20

And then:
root@prg41-004:~# grep ffff9ae3fee51f00 /var/log/syslog
Feb 13 22:33:33 prg41-004 kernel: [ 595.493003] dst_init: dst ffff9ae3fee51f00 refcnt 1 from __dst_alloc+0x7e/0x100
Feb 13 22:33:33 prg41-004 kernel: [ 595.493009] dst_hold: dst ffff9ae3fee51f00 refcnt 2 from rt_cache_route+0x54/0xe0
Feb 13 22:33:33 prg41-004 kernel: [ 595.648996] dst_hold_safe: dst ffff9ae3fee51f00 refcnt 3 from ip_route_output_key_hash_rcu+0x8a1/0xa20
Feb 13 22:33:33 prg41-004 kernel: [ 595.689067] dst_release_immediate: dst ffff9ae3fee51f00 refcnt 2 from free_fib_info_rcu+0x146/0x1c0
Feb 13 22:35:54 prg41-004 kernel: [ 736.399376] (netns ffff9adbfbebb000): dst ffff9ae3fee51f00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20
Feb 13 22:36:04 prg41-004 kernel: [ 746.447322] (netns ffff9adbfbebb000): dst ffff9ae3fee51f00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20
Feb 13 22:36:14 prg41-004 kernel: [ 756.527237] (netns ffff9adbfbebb000): dst ffff9ae3fee51f00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20
Feb 13 22:36:24 prg41-004 kernel: [ 766.639159] (netns ffff9adbfbebb000): dst ffff9ae3fee51f00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20
Feb 13 22:36:34 prg41-004 kernel: [ 776.815056] (netns ffff9adbfbebb000): dst ffff9ae3fee51f00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20
Feb 13 22:36:45 prg41-004 kernel: [ 786.958967] (netns ffff9adbfbebb000): dst ffff9ae3fee51f00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20
Feb 13 22:36:55 prg41-004 kernel: [ 797.070875] (netns ffff9adbfbebb000): dst ffff9ae3fee51f00 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20

(and obviously more of such messages every 10 seconds)

BTW, from the timing it seems that the docker image which produced this error might have been interrupted by a timeout (normally, the docker images we run on the machine live for a few seconds only).

Will post the second kernel soon.

Jiri Horky (jiri-horky) wrote :

Hi again,

so for the second kernel (), I was able to reproduce it in about the same time. The messages are:

Feb 13 23:04:23 prg41-004 kernel: [ 650.285711] unregister_netdevice: waiting for lo (netns ffff943cfe8ce000) to become free. Usage count = 1
Feb 13 23:04:23 prg41-004 kernel: [ 650.285736] (netns ffff943cfe8ce000): dst ffff943d18ac4200 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20

root@prg41-004:~# grep ffff943d18ac4200 /var/log/syslog
Feb 13 23:02:14 prg41-004 kernel: [ 521.458077] dst_init: dst ffff943d18ac4200 refcnt 1 from __dst_alloc+0x7e/0x100
Feb 13 23:02:14 prg41-004 kernel: [ 521.458085] dst_hold: dst ffff943d18ac4200 refcnt 2 from rt_cache_route+0x54/0xe0
Feb 13 23:02:14 prg41-004 kernel: [ 521.634088] dst_hold_safe: dst ffff943d18ac4200 refcnt 3 from ip_route_output_key_hash_rcu+0x8a1/0xa20
Feb 13 23:02:14 prg41-004 kernel: [ 521.778104] dst_release_immediate: dst ffff943d18ac4200 refcnt 2 from free_fib_info_rcu+0x146/0x1c0
Feb 13 23:04:23 prg41-004 kernel: [ 650.285736] (netns ffff943cfe8ce000): dst ffff943d18ac4200 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20
Feb 13 23:04:33 prg41-004 kernel: [ 660.525715] (netns ffff943cfe8ce000): dst ffff943d18ac4200 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20
Feb 13 23:04:43 prg41-004 kernel: [ 670.669641] (netns ffff943cfe8ce000): dst ffff943d18ac4200 expires 0 error 0 obsolete 2 flags 0 refcnt 2 ops ipv4_dst_ops+0x0/0xc0 creator ip_route_output_key_hash_rcu+0x355/0xa20

etc.

I hope it helps!

Regards
Jirka H.

Jiri Horky (jiri-horky) wrote :

Forgot to paste the second kernel version. The previous message was obtained using the 4.13.13-00941-gf382397cf315 kernel (the one with the ipsec/xfrm dst leak patch).
