unregister_netdevice: waiting for lo to become free. Usage count
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Linux | Unknown | Unknown | | |
| linux (Ubuntu) | Fix Released | Medium | Unassigned | |
| linux (Ubuntu Trusty) | Fix Released | Medium | Chris J Arges | |
| linux (Ubuntu Vivid) | Fix Released | Medium | Chris J Arges | |
| linux-lts-utopic (Ubuntu) | Confirmed | Medium | Unassigned | |
| linux-lts-utopic (Ubuntu Trusty) | Fix Released | Undecided | Unassigned | |
| linux-lts-xenial (Ubuntu) | Confirmed | Undecided | Unassigned | |
| linux-lts-xenial (Ubuntu Trusty) | Confirmed | Undecided | Unassigned | |
Bug Description
SRU Justification:
[Impact]
Users of kernels that utilize NFS may see the following messages when shutting down and starting containers:
unregister_netdevice: waiting for lo to become free. Usage count = 1
This can cause issues when trying to create a new network namespace and thus block a user from creating new containers.
[Test Case]
Set up multiple containers in parallel to mount an NFS share, create some traffic, and shut down. Eventually you will see the kernel message.
Dave's script here:
https:/
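A minimal sketch of that loop (the container name, mount point, and NFS share address below are placeholder assumptions; the linked script above is the authoritative reproducer):

#!/bin/bash
# Sketch of the test case: repeatedly start a container, write to an NFS
# share from inside it, then stop the container, until the kernel logs the
# unregister_netdevice message. NFS_SHARE and the container name "u1" are
# placeholders; the container needs nfs-common and an AppArmor profile that
# allows NFS mounts (see the comments further down in this bug).
NFS_SHARE=192.0.2.10:/export
for i in $(seq 1 50); do
    sudo lxc-start -n u1 -d
    sleep 10                                    # give the container time to boot
    sudo lxc-attach -n u1 -- mount -t nfs "$NFS_SHARE" /mnt
    sudo lxc-attach -n u1 -- dd if=/dev/zero of=/mnt/foo bs=1M count=100
    sudo lxc-attach -n u1 -- umount /mnt
    sudo lxc-stop -n u1
    dmesg | grep -q 'waiting for lo to become free' && break
done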
[Fix]
commit de84d89030fa4ef
--
I am currently running Trusty with the latest patches, and I get this on the following hardware and software:
Ubuntu 3.13.0-
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 77
model name : Intel(R) Atom(TM) CPU C2758 @ 2.40GHz
stepping : 8
microcode : 0x11d
cpu MHz : 2400.000
cache size : 1024 KB
physical id : 0
siblings : 8
core id : 7
cpu cores : 8
apicid : 14
initial apicid : 14
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch arat epb dtherm tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms
bogomips : 4799.48
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
The error in the subject is somewhat reproducible; LXC still works, but it is no longer manageable until a reboot.
"Not manageable" means every command hangs.
I saw there are a lot of similar bugs, but they seem to relate to older versions and are closed, so I decided to file a new one.
I run a lot of machines with Trusty and LXC containers, but only this kind of machine produces these errors; all
the others don't show this odd behavior.
thx in advance
meno
Related branches
| Changed in linux (Ubuntu): | |
| status: | New → Incomplete |
| Joseph Salisbury (jsalisbury) wrote : | #2 |
Did this issue occur in a previous version of Ubuntu, or is this a new issue?
Would it be possible for you to test the latest upstream kernel? Refer to https:/
If this bug is fixed in the mainline kernel, please add the following tag: 'kernel-fixed-upstream'.
If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.
If you are unable to test the mainline kernel, for example if it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".
Thanks in advance.
[0] http://
| Changed in linux (Ubuntu): | |
| importance: | Undecided → Medium |
| menoabels (meno-abels) wrote : | #3 |
I tried to make it reproducible and figured out that the problem is related to the use of this kind of
interface in an LXC container. The tunnels work in the running LXC container, but if you stop or reboot the
container, the kernel reports this:
unregister_netdevice: waiting for lo to become free. Usage count = 1
Here is the ip link output in the container:
1: lo: <LOOPBACK,
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ip6tnl0: <NOARP> mtu 1452 qdisc noop state DOWN mode DEFAULT group default
link/tunnel6 :: brd ::
3: ip6gre0: <NOARP> mtu 1448 qdisc noop state DOWN mode DEFAULT group default
link/gre6 00:00:00:
4: gt6nactr01: <POINTOPOINT,
link/gre6 2a:4:f:
Here are the commands that create this kind of tunnel:
ip -6 tunnel add gt6nactr01 mode ip6gre local 2a4:4483:
ip link set mtu 1500 dev gt6nactr01 up
ip addr add 2a04:4454:
ip addr add 169.254.193.1/30 dev gt6nactr01
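The commands above are truncated in this report; a sketch of the full sequence with documentation-range placeholder addresses (only the interface name, tunnel mode, MTU, and the 169.254.193.1/30 address come from the original) looks like this:

# IPv6 addresses below are placeholders; interface name, mode, MTU, and the
# link-local IPv4 address are taken from the report above.
ip -6 tunnel add gt6nactr01 mode ip6gre local 2001:db8::1 remote 2001:db8::2
ip link set mtu 1500 dev gt6nactr01 up
ip addr add 2001:db8:100::1/64 dev gt6nactr01
ip addr add 169.254.193.1/30 dev gt6nactr01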
I attach a dmesg output and hope that you can reproduce the problem.
thx
meno
| menoabels (meno-abels) wrote : | #4 |
I tested with
3.18.0-
and the problem is still there
meno
| Rodrigo Vaz (rodrigo-vaz) wrote : | #5 |
Hi,
We're hitting this bug on the latest Trusty kernel in a similar context to this Docker issue; we also had this problem on Lucid with a custom 3.8.11 kernel, which seems to be more aggressive than Trusty, but it still happens:
https:/
In this issue an upstream kernel bug and a redhat bug are mentioned:
https:/
https:/
On the Red Hat bug there is an upstream commit that supposedly fixes the problem:
Any chance to have this patch backported to the trusty kernel?
Thanks,
Rodrigo.
| Rodrigo Vaz (rodrigo-vaz) wrote : | #6 |
Kernel stack trace when the lxc-start process hangs:
[27211131.602770] INFO: task lxc-start:25977 blocked for more than 120 seconds.
[27211131.602785] Not tainted 3.13.0-40-generic #69-Ubuntu
[27211131.602789] "echo 0 > /proc/sys/
[27211131.602795] lxc-start D 0000000000000000 0 25977 1 0x00000080
[27211131.602800] ffff8806ee73dd40 0000000000000286 ffff880474bdb000 ffff8806ee73dfd8
[27211131.602803] 0000000000014480 0000000000014480 ffff880474bdb000 ffffffff81cdb760
[27211131.602806] ffffffff81cdb764 ffff880474bdb000 00000000ffffffff ffffffff81cdb768
[27211131.602809] Call Trace:
[27211131.602821] [<ffffffff81723
[27211131.602825] [<ffffffff81725
[27211131.602832] [<ffffffff811a2
[27211131.602835] [<ffffffff81725
[27211131.602840] [<ffffffff8161c
[27211131.602845] [<ffffffff8108f
[27211131.602848] [<ffffffff8108f
[27211131.602851] [<ffffffff81065
[27211131.602854] [<ffffffff81066
[27211131.602857] [<ffffffff810c8
[27211131.602859] [<ffffffff81066
[27211131.602863] [<ffffffff81730
[27211131.602865] [<ffffffff8172f
[27211131.602869] INFO: task lxc-start:26342 blocked for more than 120 seconds.
[27211131.602874] Not tainted 3.13.0-40-generic #69-Ubuntu
| tags: | added: kernel-da-key trusty |
| Changed in linux (Ubuntu): | |
| importance: | Medium → High |
| status: | Incomplete → Triaged |
| Joseph Salisbury (jsalisbury) wrote : | #7 |
The patch mentioned in comment #5 was added to the mainline kernel as of 3.13-rc1, so it should already be in Trusty.
git describe --contains dcdfdf5
v3.13-rc1~7^2~16
Can you again test the latest mainline, which is now 3.19-rc4:
http://
Also, is this a regression? Was there a prior kernel version that did not have this bug?
| Rodrigo Vaz (rodrigo-vaz) wrote : | #8 |
From the Docker issue it seems that someone couldn't reproduce the bug after downgrading to kernel 3.13.0-32-generic. I can't validate this statement because kernels prior to 3.13.0-35-generic have a regression that crashes my EC2 instance.
Also people testing kernel 3.14.0 couldn't reproduce the bug.
I will try to reproduce with the latest mainline kernel in next few hours.
Thanks,
Rodrigo.
| Rodrigo Vaz (rodrigo-vaz) wrote : | #9 |
Just got an instance with kernel 3.16.0-29-generic #39-Ubuntu (linux-lts-utopic) hitting this bug in production. We don't have a reliable reproducer, so the only way for me to validate is to boot this kernel in production and wait for the bug to happen.
Is there anything I can get from an instance showing the bug that may help?
| Chris J Arges (arges) wrote : | #10 |
At this point a proper reproducer would help the most. This way we could get crashdumps and other useful information that may not be possible in production environments.
Reading through the bug the best description I can see is:
1) Start LXC container
2) Download > 100MB of data
3) Stop LXC container
4) Repeat until you see the kernel message
For example, I can script this a bit:
#!/bin/bash
# setup
sudo lxc-create -t download -n u1 -- -d ubuntu -r trusty -a amd64
fallocate -l 1000000 file
IP=$(sudo lxc-attach -n u1 -- hostname -I)
# test
while true; do
sudo lxc-start --name u1 -d
ssh ubuntu@$IP rm file
scp file ubuntu@$IP:
sudo lxc-stop --name u1
done
Running this on my machine doesn't show a failure. So I suspect there are other variables involved. Please let us know other relevant info that would help isolate a reproducer for this.
Thanks,
| Changed in linux (Ubuntu): | |
| assignee: | nobody → Chris J Arges (arges) |
| Rodrigo Vaz (rodrigo-vaz) wrote : | #11 |
Some additional info:
- The stack trace is always the same as posted above, and the blocking point seems to be copy_net_ns every time.
- The process that hangs is always lxc-start in every occurrence that I was able to check.
Rodrigo.
| Rodrigo Vaz (rodrigo-vaz) wrote : | #12 |
I left a couple instances running with the mainline kernel (http://
Kernel version 3.19.0-
| menoabels (meno-abels) wrote : Re: [Bug 1403152] Re: unregister_netdevice: waiting for lo to become free. Usage count | #13 |
Hey,
here is a discussion about the reproducibility. I wrote very early in
this thread that
if I set the following network config in a container
ip -6 tunnel add gt6nactr01 mode ip6gre local 2a4:4483:
remote 2a4:4494:
ip link set mtu 1500 dev gt6nactr01 up
ip addr add 2a04:4454:
ip addr add 169.254.193.1/30 dev gt6nactr01
it will hang on reboot; I tried it again just now and it still hangs. There
is also another way to stop an LXC container with the same result, "waiting
for lo to become free".
That is if you mount an NFS share from within an LXC container: you have
to change something in AppArmor, but then it works perfectly until you try
to reboot the LXC container.
Is there any reason not to try to fix these reproducible problems first?
cheers
meno
| Rodrigo Vaz (rodrigo-vaz) wrote : | #14 |
Meno,
I just tried your test case where you described adding an ip6gre tunnel to the container and rebooting it, but I couldn't reproduce the netdev hang so far. Do you mind sharing specific details or even a script that will reproduce the problem?
Mounting an NFS share in our containers is not a common activity for our users, and still we see the problem happening on multiple instances.
Regards,
Rodrigo.
| Launchpad Janitor (janitor) wrote : | #15 |
Status changed to 'Confirmed' because the bug affects multiple users.
| Changed in linux (Ubuntu Trusty): | |
| status: | New → Confirmed |
| Changed in linux (Ubuntu Utopic): | |
| status: | New → Confirmed |
| Changed in linux (Ubuntu): | |
| assignee: | Chris J Arges (arges) → Stefan Bader (smb) |
| Jordan Curzon (curzonj) wrote : | #17 |
I'm working with Rodrigo Vaz and we've found some details about our occurrence of this issue using systemtap.
rt_cache_route places a dst_entry struct into a fib_nh struct as the nh_rth_input. Occasionally the reference counter on that dst is not decremented by the time free_fib_info_rcu is called on the fib during container teardown. In that case free_fib_info_rcu doesn't call dst_destroy and dev_put is not called on the lo interface of the container. The only situation where we've seen this is when A) the fib_nh->nh_dev points to the eth0 interface of the container, B) the dst is part of an rtable struct where rt->rt_is_input == 1, and C) the dst points to the lo interface of the container. The dst is cached in the fib only once and never replaced and then thousands of dst_hold/
We have only seen this so far on containers making lots of outbound connections. It doesn't appear to depend on the lifetime of the container, some are only alive for 30min and others are alive for 24hrs. The issue occurs when you try to destroy the container because that is when the fib is freed. We don't know when or where the dst ref_cnt becomes incorrect.
We don't know how to reproduce the issue.
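For anyone without SystemTap at hand, a rough equivalent of this kind of instrumentation can be done with a kprobe on dst_release via ftrace. This is only a sketch under the assumption that CONFIG_KPROBE_EVENTS is enabled and tracefs is mounted in the usual place; it is not the probe script used above:

#!/bin/bash
# Sketch: log every dst_release() call plus a stack trace while the suspect
# container is torn down. Run as root; the probe name "dstrel" is arbitrary
# and dst=%di assumes the x86_64 calling convention.
T=/sys/kernel/debug/tracing
echo 'p:dstrel dst_release dst=%di' > $T/kprobe_events
echo 1 > $T/options/stacktrace            # capture the caller chain per event
echo 1 > $T/events/kprobes/dstrel/enable
cat $T/trace_pipe > /tmp/dst_release.log &
LOGGER=$!
# ... stop/destroy the container under test here ...
echo 0 > $T/events/kprobes/dstrel/enable
kill $LOGGER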
| Jordan Curzon (curzonj) wrote : | #18 |
To add to my comment just above, we have found a workload that is not under our own control (which limits what we can do with it, including sharing it) but which exacerbates the issue on our systems. This workload causes the issue less than 5% of the time, and that gives us the chance to look at which callers are using dst_release on the dst in question (the one that meets conditions A, B, and C; the "ABC dst"). To clarify, failure at this point for us is when dst->ref_cnt=1 during free_fib_info_rcu; success is when dst->ref_cnt=0 and free_fib_info_rcu calls dst_destroy on its nh_rth_input member.
In both failure and non-failure scenarios the only two callers we see on are a single call to ipv4_pktinfo_
| Stefan Bader (smb) wrote : | #19 |
Jordan, which kernel version are you using to make your observations (thanks a lot for those, btw).
| Steve Conklin (sconklin) wrote : | #20 |
Answering for Jordan
# uname -a
Linux ip-10-30-153-210 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
# dpkg -l | grep linux
ii libselinux1:amd64 2.2.2-1ubuntu0.1 amd64 SELinux runtime shared libraries
ii linux-headers-
ii linux-headers-
ii linux-headers-
ii linux-headers-
ii linux-image-
ii linux-image-virtual 3.13.0.40.47 amd64 This package will always depend on the latest minimal generic kernel image.
ii linux-libc-
ii linux-virtual 3.13.0.40.47 amd64 Minimal Generic Linux kernel and headers
ii util-linux 2.20.1-
| Jordan Curzon (curzonj) wrote : | #21 |
An addition to my #18 comment above. I forgot that I had to remove logging on calls from skb_release_
| Dave Richardson (pudnik019) wrote : | #22 |
I am able to reliably recreate this on both Ubuntu 14.04 with kernel 3.13.0-37-generic and Ubuntu 14.10 with kernel 3.16.0-23-generic. I have test code that can hit the bug, but it's not cleaned up enough to share. The steps to repro are:
Start a container (COW clone using overlayfs from a base/parent Ubuntu 14.04 container). Start clone container, mount an NFS share (remote from machine hosting container) to /mnt in container, create a file on the NFS share in container (dd if=/dev/zero of=/mnt/foo bs=1024 count=100000), stop container, destroy container.
I have a script that performs the above sequence over and over again, running 5 such clone containers in parallel at any given time. It only takes about 30s for the kernel to get wedged with:
unregister_netdevice: waiting for lo to become free. Usage count = 1
This same script is fine if you do scp instead of writing over NFS. It seems like NFS inside the container is somehow involved here.
| Steve Conklin (sconklin) wrote : | #23 |
David, would you like help getting the reproducer code ready to share?
| Dave Richardson (pudnik019) wrote : | #24 |
Ok, I've got a minimal reproducer ready. It causes the bug when I run it on Ubuntu 14.04 with 3.13.0-24-generic and LXC version 1.0.7-0ubuntu0.1 0. It's python, and the script has some documentation describing its usage as well as the prerequisites. You should run it as root via sudo.
Prereqs require you to install a base ubuntu trusty container, install nfs-common in that container, and modify lxc-default to allow containers to perform NFS mounts. You will also need an NFS share that your containers can mount to scribble data to. The address of this share is passed as the third parameter to the script in the form of IP_ADDRESS:
My typical usage looks like:
sudo ./reproducer.py 5 10 IPADDRESS:
Which starts 5 threads that, in each of 10 iterations, create a container, mount the NFS share on /mnt, dd over some zeros, umount /mnt, then stop and destroy the container. I can reliably hit the unregister_netdevice problem this way.
~Dave
| Changed in linux (Ubuntu): | |
| assignee: | Stefan Bader (smb) → Chris J Arges (arges) |
| Changed in linux (Ubuntu): | |
| status: | Triaged → In Progress |
| Chris J Arges (arges) wrote : | #25 |
So far I haven't been able to reproduce this with 3.13.0-49, nor 3.13.0-37.
My setup is as follows:
Two vms, one with nfs-server and another for reproducer.
nfs-server setup with these commands:
apt-get install nfs-kernel-server
mkdir /export
chown 777 /export
echo "/export *(rw,no_
service nfs-kernel-server restart
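The exports line above is cut off; a typical complete form (the export options here are assumed, not taken from the comment) would be:

echo "/export *(rw,no_root_squash,no_subtree_check)" >> /etc/exports   # options assumed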
The reproducer VM has an LXC container that is bridged to the libvirt bridge.
apt-get update
apt-get install lxc
lxc-create -t download -n base -- -d ubuntu -r trusty -a amd64
Then edit /var/lib/
lxc.aa_profile = unconfined
lxc.network.link = br0
Then run:
sudo ./reproducer.py 5 10 192.168.
I tried running this many times with no luck. Any other variables I need to consider?
| Dave Richardson (pudnik019) wrote : | #26 |
Here's the setup I use: run the reproducer in a 14.04 server VM (Linux ubuntu 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux); the VM is bridged to the host machine's en0. I am using VMware Fusion on an OS X 10.10 host.
Containers inside the VM use NAT via lxcbr0 (lxc.network.link = lxcbr0), getting 10. addresses via dhcp. My NFS server is on a different physical machine than the one running the reproducer VM. I am also not using unconfined profile, but rather modifying the default to allow NFS mounting.
I'll see if I can get it repro'ing using a two VM setup that you describe. If so, worst case I could send you the VMs.
| Chris J Arges (arges) wrote : | #27 |
Ok, moved to two real machines and I can repro there:
[ 3797.901145] unregister_
Will continue to look at this.
| Dave Richardson (pudnik019) wrote : | #28 |
Great, thanks Chris! Let me know if I can help further.
| Chris J Arges (arges) wrote : | #29 |
This is fixed in v3.18 and broken in v3.17. Doing a reverse bisect on this to determine which patch(es) fix this issue.
| Dave Richardson (pudnik019) wrote : | #30 |
Ok. For what it's worth, we still see the problem on Ubuntu 15.04 running v3.19. However, the problem is much harder to trigger and the attached reproducer script does not trigger the problem there, so I don't have code to reliably repro it on 3.19.
| Chris J Arges (arges) wrote : | #31 |
Dave,
Good to know. Hopefully the bisect will implicate the right lines of code in order to find a v3.19+ fix.
| Dave Richardson (pudnik019) wrote : | #32 |
Agreed. Thanks for pushing this forward Chris.
| Federico Alves (h-sales) wrote : | #33 |
It happens to me on Centos 7, with kernel 4.0.1-1.
Every time I reboot a container, without any particular heavy data load, I get:
[630362.513630] hrtimer: interrupt took 5912836 ns
[727851.858312] device eth0 left promiscuous mode
[727853.622084] eth0: renamed from mcCDGW73
[727853.632989] device eth0 entered promiscuous mode
[728077.625688] unregister_
[728087.655832] unregister_
[728097.686066] unregister_
So this is not fixed in 4.X
| Chris J Arges (arges) wrote : | #34 |
The result of the reverse bisect between v3.17..v3.18 with the reproducer was:
# first bad commit: [34666d467cbf1e
If I backport this patch plus 7276ca3f on top of v3.17 I no longer get the hang with the simple reproducer (although I suspect a more elaborate reproducer would still trigger the issue).
This isn't a fix because we obviously have issues in later kernels. The 'unregister_
In addition I'm doing a bisect between earlier versions to see where the 'regression' occurred. 3.11 seems to pass without regression for 5 threads / 50 iterations.
| Dave Richardson (pudnik019) wrote : | #35 |
Great progress, thanks Chris. I did some initial instrumentation of dev_put and dev_hold calls earlier, dumping out the call stack and comparing a good run to a bad one. Nothing definitive yet, and it's a bit tricky to match up on an SMP machine, but one thing stood out: many of the bad traces had UDP in the mix (for example, calls to __udp4_
| Dave Richardson (pudnik019) wrote : | #36 |
Any updates?
| Chris J Arges (arges) wrote : | #37 |
Dave,
Working on this now. I tried a bisect between v3.12..v3.13 where it seemed that 3.12 worked better, but it wasn't conclusive. I tried looking at dev_put dev_hold via systemtap, but that seems to modify the timing sufficiently. Looking at comments from Jordan Curzon to see if I can start instrumenting just those sections of the kernel to track down where we're failing to decrement the reference. I'm tracing the NETDEV_UNREGISTER notifier calls now to see if that leads to something conclusive.
| Dave Richardson (pudnik019) wrote : | #38 |
Sounds good, thanks Chris.
| Chris J Arges (arges) wrote : | #39 |
When running the reproducer, if I mount the nfs drive using -o vers=3,proto=udp (udp instead of tcp), I cannot get any failures when running 10 threads 150 iterations multiple times.
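For anyone repeating that experiment, the mount option change amounts to something like the following (server address and mount point are placeholders):

# NFSv3 over UDP instead of the default TCP transport; the server address
# and mount point are placeholders.
mount -t nfs -o vers=3,proto=udp 192.0.2.10:/export /mnt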
| Chris J Arges (arges) wrote : | #40 |
Additional Experiments:
- No NAT, lxc bridges to VM eth0, which is bridged to host machine br0
Still fails
- Let timeout on unregister_notifier keep going
Still running after an hour, can't create new containers (original issue)
sudo conntrack -{L,F} hangs...
- Purge conntrack flowtable on host right before unmount
Still fails
- Look at conntrack table on host when its stuck
Had issues with this because of hang
Next steps:
- Track down TCP connection that's being held open
| Chris J Arges (arges) wrote : | #41 |
A few more results:
- I can track the dst_entry holding the dev causing the issue.
- I can reproduce on v3.19 kernels by first using 'modprobe br_netfilter'.
| Dave Richardson (pudnik019) wrote : | #42 |
Really great progress! Sounds like you're close to finding a fix. Thanks Chris.
| Chris J Arges (arges) wrote : | #43 |
Steve saw this patch: e53376bef2cd97d
Disabling sysctl_
To work around this issue, if you don't need CONFIG_
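The option name above is truncated; assuming it refers to the bridge-netfilter coupling that later comments point at (the 'modprobe br_netfilter' note in #41 and the net.bridge.* condition in #46), the runtime form of that workaround would be roughly:

# Assumption: the truncated option is bridge netfilter. These sysctls only
# exist while the br_netfilter (or bridge) module is loaded.
sysctl -w net.bridge.bridge-nf-call-iptables=0
sysctl -w net.bridge.bridge-nf-call-ip6tables=0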
| Joe Stringer (joestringer) wrote : | #44 |
Just chiming in here, I contacted Rodrigo off-list and was verging towards that same patch. More below.
I suspect there are two issues here with very similar symptoms; in
particular, post #8 mentions people reporting that 3.14 improves
the situation.
https:/
I've been chasing a bug in 3.13 with docker containers and connection
tracking which is fixed in 3.14, by this patch:
https:/
Note that the commit message for the above commit describes a different
issue, but I've been able to produce issues of the nature in this thread
(hung docker / ip netns add commands, as in post #6) before applying
this patch, and cannot reproduce them after.
https:/
In the issue that I face, I can find a kworker thread using up an entire
core, and when I cat /proc/$pid/stack I see this:
<ffffffffbe01e9b6>] ___preempt_
[<ffffffffc0222
[<ffffffffc0223
[nf_conntrack]
[<ffffffffc0224
[<ffffffffbe604
[<ffffffffbe604
[<ffffffffbe084
[<ffffffffbe085
[<ffffffffbe08b
[<ffffffffbe717
[<fffffffffffff
The kworker is looping forever and failing to clean up conntrack state.
All the while, it holds the global netns lock. Given that I've bisected
to the commit linked above which is to do with refcounting, I suspect
that borked refcounting on conntrack entries makes them impossible to
properly free/destroy, which prevents this worker from cleaning up the
namespace, which then goes on to prevent anything else from interacting
with namespaces (add/delete/etc).
| Chris J Arges (arges) wrote : | #45 |
Joe,
I've created bug 1466135 to handle updating our 3.13 to fix your issue. Please add any information to that bug that would be useful in verifying the fix once it goes through our stable updates process. Thanks!
| Chris J Arges (arges) wrote : | #46 |
A summary of the bug so far:
Occasionally starting and stopping many containers with network traffic may
result in new containers being unable to start due to the inability to create
new network namespaces.
The following message repeats in the kernel log until reboot.
unregister_netdevice: waiting for lo to become free. Usage count = 1
Eventually when creating a new container this hung task backtrace occurs:
schedule_
__mutex_
? __kmalloc+
mutex_
copy_
create_
copy_
copy_
do_fork+
? call_rcu_
SyS_clone+
stub_
? system_
These are the conditions I've been able to test:
- If CONFIG_
- If net.bridge.
- This problem can happen on single processor machines
- This problem can happen with IPv6 disabled
- This problem can happen with xt_conntrack enabled.
The unregister_
set to NETREG_
unregister_
where net_mutex is locked and thus prevents copy_net_ns from executing.
In addition when the unregister netdevice warning happens, a crashdump reveals
the dst_busy_list always contains a dst_entry that references the device above.
This dst_entry has already been through ___dst_free since it has already been
marked DST_OBSOLETE_DEAD. 'dst->ops' is always set to ipv4_dst_ops.
dst->callback_
We can trace where the dst_entry is trying to be freed. When free_fib_info_rcu
is called, if nh_rth_input is set, it eventually calls dst_free. Because there
is still a refcnt held, it does not get immediately destroyed and continues on
to __dst_free. This puts the dst into the dst_garbage list, which is then
examined periodically by the dst_gc_work worker thread. Each time it tries to
clean it up it fails because it still has a non-zero refcnt.
The faulty dst_entry is being allocated via ip_rcv. In
addition, this dst is most likely being held in response to a new packet via
ip_rcv.
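A quick way to check whether a host has already reached this state, based on the two symptoms above (the repeating kernel message and copy_net_ns blocking on net_mutex), is a sketch like:

# Probe for the stuck state without wedging the shell; note that a hung
# "ip netns add" may be left behind in D state, so this is only a check.
timeout 10 ip netns add lp1403152-probe \
  && ip netns del lp1403152-probe \
  || echo "namespace creation hung: host likely affected"
dmesg | grep -c 'waiting for lo to become free'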
| Dave Richardson (pudnik019) wrote : | #47 |
I can confirm that disabling CONFIG_
| Chris J Arges (arges) wrote : | #48 |
Email sent to netdev about what we've found so far with this bug:
http://
| Dave Richardson (pudnik019) wrote : | #49 |
Looks like no progress on the netdev side yet?
| Chris J Arges (arges) wrote : | #50 |
So it seems like disabling CONFIG_
| Changed in linux (Ubuntu Trusty): | |
| assignee: | nobody → Chris J Arges (arges) |
| Changed in linux (Ubuntu Utopic): | |
| assignee: | nobody → Chris J Arges (arges) |
| Changed in linux (Ubuntu Vivid): | |
| assignee: | nobody → Chris J Arges (arges) |
| Changed in linux (Ubuntu Trusty): | |
| importance: | Undecided → Critical |
| importance: | Critical → Medium |
| Changed in linux (Ubuntu Utopic): | |
| importance: | Undecided → Medium |
| Changed in linux (Ubuntu Vivid): | |
| importance: | Undecided → Medium |
| status: | New → In Progress |
| Changed in linux (Ubuntu Utopic): | |
| status: | Confirmed → In Progress |
| Changed in linux (Ubuntu Trusty): | |
| status: | Confirmed → In Progress |
| Changed in linux (Ubuntu): | |
| importance: | High → Medium |
| description: | updated |
| Chris J Arges (arges) wrote : | #51 |
Ok, so I have reports that NFS may also be a red herring; however, I've found a patch that seems to greatly mitigate the issue when testing with the reproducer I was provided. So I'm going to provide test builds with that for 3.13, 3.16, and 3.19, and users can give me feedback on whether that helps.
In addition, if you still can reproduce the issue with this fix, it would be helpful to know more about your test case. I'd like to handle any of these issues as a separate bug.
| Chris J Arges (arges) wrote : | #52 |
Build is here:
http://
Please test and give feedback. Thanks!
| Changed in linux (Ubuntu): | |
| assignee: | Chris J Arges (arges) → nobody |
| status: | In Progress → Fix Released |
| Changed in linux (Ubuntu Trusty): | |
| status: | In Progress → Fix Committed |
| Changed in linux (Ubuntu Utopic): | |
| status: | In Progress → Fix Committed |
| Changed in linux (Ubuntu Vivid): | |
| status: | In Progress → Fix Committed |
| Dave Richardson (pudnik019) wrote : | #53 |
Chris, I will start testing this build and let you know how it goes. Thanks.
| Dave Richardson (pudnik019) wrote : | #54 |
What is the patch?
| Chris J Arges (arges) wrote : | #55 |
Dave,
I added it to the description of the bug:
commit de84d89030fa4ef
--chris
| no longer affects: | linux (Ubuntu Utopic) |
| no longer affects: | linux-lts-utopic (Ubuntu Vivid) |
| Brad Figg (brad-figg) wrote : | #56 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
| tags: | added: verification-needed-trusty |
| Brad Figg (brad-figg) wrote : | #57 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
| tags: | added: verification-needed-vivid |
| Launchpad Janitor (janitor) wrote : | #58 |
This bug was fixed in the package linux-lts-utopic - 3.16.0-
---------------
linux-lts-utopic (3.16.0-
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1483790
* SAUCE: REBASE-FIXUP: debian/
[ Upstream Kernel Changes ]
* Revert "Bluetooth: ath3k: Add support of 04ca:300d AR3012 device"
linux-lts-utopic (3.16.0-
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1478986
[ Brad Figg ]
* SAUCE: REBASE-FIXUP: debian/
part of the version when looking for the baseCommit for printchanges
[ Upstream Kernel Changes ]
* Revert "crypto: talitos - convert to use be16_add_cpu()"
- LP: #1478852
* storvsc: use cmd_size to allocate per-command data
- LP: #1445195
* storvsc: fix a bug in storvsc limits
- LP: #1445195
* Drivers: hv: vmbus: Support a vmbus API for efficiently sending page
arrays
- LP: #1445195
* scsi: storvsc: Increase the ring buffer size
- LP: #1445195
* scsi: storvsc: Size the queue depth based on the ringbuffer size
- LP: #1445195
* scsi: storvsc: Always send on the selected outgoing channel
- LP: #1445195
* scsi: storvsc: Retrieve information about the capability of the target
- LP: #1445195
* scsi: storvsc: Don't assume that the scatterlist is not chained
- LP: #1445195
* scsi: storvsc: Set the tablesize based on the information given by the
host
- LP: #1445195
* SUNRPC: TCP/UDP always close the old socket before reconnecting
- LP: #1403152
* ALSA: hda - Fix noisy outputs on Dell XPS13 (2015 model)
- LP: #1468582
* Fix kmalloc slab creation sequence
- LP: #1475204
* ARM: clk-imx6q: refine sata's parent
- LP: #1478852
* KVM: nSVM: Check for NRIPS support before updating control field
- LP: #1478852
* nfs: take extra reference to fl->fl_file when running a setlk
- LP: #1478852
* bridge: fix multicast router rlist endless loop
- LP: #1478852
* net: don't wait for order-3 page allocation
- LP: #1478852
* sctp: fix ASCONF list handling
- LP: #1478852
* bridge: fix br_stp_
- LP: #1478852
* packet: read num_members once in packet_rcv_fanout()
- LP: #1478852
* packet: avoid out of bounds read in round robin fanout
- LP: #1478852
* neigh: do not modify unlinked entries
- LP: #1478852
* tcp: Do not call tcp_fastopen_
- LP: #1478852
* net: phy: fix phy link up when limiting speed via device tree
- LP: #1478852
* sctp: Fix race between OOTB responce and route removal
- LP: #1478852
* x86/mce: Fix MCE severity messages
- LP: #1478852
* s5h1420: fix a buffer overflow when checking userspace params
- LP: #1478852
* cx24116: fix a buffer overflow when checking userspace params
- LP: #1478852
* af9013: Don't accept invalid bandwidth
- LP: #1478852
* cx24117: fix a buffer overflow when checking userspace params
- LP: #1478852
* spi: fix race freeing dummy_tx/rx before it is unmapped
- LP: #...
| Changed in linux-lts-utopic (Ubuntu Trusty): | |
| status: | New → Fix Released |
| Launchpad Janitor (janitor) wrote : | #59 |
This bug was fixed in the package linux - 3.19.0-26.28
---------------
linux (3.19.0-26.28) vivid; urgency=low
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1483630
[ Upstream Kernel Changes ]
* Revert "Bluetooth: ath3k: Add support of 04ca:300d AR3012 device"
linux (3.19.0-26.27) vivid; urgency=low
[ Luis Henriques ]
* Release Tracking Bug
- LP: #1479055
* [Config] updateconfigs for 3.19.8-ckt4 stable update
[ Chris J Arges ]
* [Config] Add MTD_POWERNV_FLASH and OPAL_PRD
- LP: #1464560
[ Mika Kuoppala ]
* SAUCE: i915_bpo: drm/i915: Fix divide by zero on watermark update
- LP: #1473175
[ Tim Gardner ]
* [Config] ACORN_PARTITION=n
- LP: #1453117
* [Config] Add i40e[vf] to d-i
- LP: #1476393
[ Timo Aaltonen ]
* SAUCE: i915_bpo: Rebase to v4.2-rc3
- LP: #1473175
* SAUCE: i915_bpo: Revert "mm/fault, drm/i915: Use pagefault_
to check for disabled pagefaults"
- LP: #1473175
* SAUCE: i915_bpo: Revert "drm: i915: Port to new backlight interface
selection API"
- LP: #1473175
[ Upstream Kernel Changes ]
* Revert "tools/vm: fix page-flags build"
- LP: #1473547
* Revert "ALSA: hda - Add mute-LED mode control to Thinkpad"
- LP: #1473547
* Revert "drm/radeon: adjust pll when audio is not enabled"
- LP: #1473547
* Revert "crypto: talitos - convert to use be16_add_cpu()"
- LP: #1479048
* module: Call module notifier on failure after complete_
- LP: #1473547
* gpio: gpio-kempld: Fix get_direction return value
- LP: #1473547
* ARM: dts: imx27: only map 4 Kbyte for fec registers
- LP: #1473547
* ARM: 8356/1: mm: handle non-pmd-aligned end of RAM
- LP: #1473547
* x86/mce: Fix MCE severity messages
- LP: #1473547
* mac80211: don't use napi_gro_receive() outside NAPI context
- LP: #1473547
* iwlwifi: mvm: Free fw_status after use to avoid memory leak
- LP: #1473547
* iwlwifi: mvm: clean net-detect info if device was reset during suspend
- LP: #1473547
* drm/plane-helper: Adapt cursor hack to transitional helpers
- LP: #1473547
* ARM: dts: set display clock correctly for exynos4412-trats2
- LP: #1473547
* hwmon: (ntc_thermistor) Ensure iio channel is of type IIO_VOLTAGE
- LP: #1473547
* mfd: da9052: Fix broken regulator probe
- LP: #1473547
* ALSA: hda - Fix noise on AMD radeon 290x controller
- LP: #1473547
* lguest: fix out-by-one error in address checking.
- LP: #1473547
* xfs: xfs_attr_inactive leaves inconsistent attr fork state behind
- LP: #1473547
* xfs: xfs_iozero can return positive errno
- LP: #1473547
* fs, omfs: add NULL terminator in the end up the token list
- LP: #1473547
* omfs: fix sign confusion for bitmap loop counter
- LP: #1473547
* d_walk() might skip too much
- LP: #1473547
* dm: fix casting bug in dm_merge_bvec()
- LP: #1473547
* hwmon: (nct6775) Add missing sysfs attribute initialization
- LP: #1473547
* hwmon: (nct6683) Add missing sysfs attribute initialization
- LP: #1473547
* target/pscsi: Don't leak scsi_host if hba is VIRTUAL_HOST
- LP: #1473547
* net...
| Changed in linux (Ubuntu Vivid): | |
| status: | Fix Committed → Fix Released |
| Changed in linux (Ubuntu Trusty): | |
| status: | Fix Committed → Fix Released |
| Launchpad Janitor (janitor) wrote : | #60 |
Status changed to 'Confirmed' because the bug affects multiple users.
| Changed in linux-lts-utopic (Ubuntu): | |
| status: | New → Confirmed |
| Andrew Ruthven (andrew-etc) wrote : | #61 |
We're still seeing the same issue on Ubuntu Trusty running the linux-image-
Looking at this thread[0] on the Docker site, which is referenced in the kernel Bugzilla for this issue, there is a reference[1] to a patch[2] on the netdev mailing list from 2015-11-05 by Francesco Ruggeri; later that day David Miller accepted it and queued it for -stable. The patch is currently in the master branch of Linus's tree, but hasn't made it into the 4.2 branch yet.
The patch doesn't apply cleanly to the Canonical 3.19 branch, but it looks like it is mostly line skew. I haven't tried hacking it in yet.
[0] https:/
[1] https:/
[2] http://
| Rodrigo Vaz (rodrigo-vaz) wrote : | #62 |
FWIW here is an update of what I've tried in the last couple months trying to fix this problem (unsuccessfully):
- We tried to deny packets to the container's network before we destroy the namespace
- Backported the patch mentioned in the previous comment to ubuntu kernel 3.19 and 4.2 (linux-lts-vivid and linux-lts-wily)
- Applied patches listed in this thread: http://
- 4.4-rc3 net-next branch
All experiments above failed to fix the problem and the bug was triggered in production machines within 24h using these experimental kernels.
We still can't reliably reproduce this issue outside production but it is easy to validate any proposed solution with a few hours of production load.
Rodrigo.
| James Dempsey (jamespd) wrote : | #63 |
We also backported [1] to 4.2 (linux-lts-wily) and deployed it to our production OpenStack cloud. We just installed it yesterday and our MTBF is between two and twenty days, so we won't know if this has made any difference for a while now.
Some details about our configuration / failure mode:
Three OpenStack "Layer 3" hosts (running 3.19.0-30-generic #34~14.04.1-Ubuntu) providing virtual routers/
Our most recent failures occurred on hosts B and C (within 30 minutes of each other, after having been fine for weeks) while removing routers from A and re-creating them on B and C.
Our stack traces are slightly different from the ones posted above...
Dec 14 15:37:05 hostname kernel: [961050.119727] INFO: task ip:9865 blocked for more than 120 seconds.
Dec 14 15:37:05 hostname kernel: [961050.126707] Tainted: G C 3.19.0-30-generic #34~14.04.1-Ubuntu
Dec 14 15:37:05 hostname kernel: [961050.135073] "echo 0 > /proc/sys/
Dec 14 15:37:05 hostname kernel: [961050.144094] ip D ffff88097e3e3de8 0 9865 9864 0x00000000
Dec 14 15:37:05 hostname kernel: [961050.144098] ffff88097e3e3de8 ffff880e982693a0 0000000000013e80 ffff88097e3e3fd8
Dec 14 15:37:05 hostname kernel: [961050.144100] 0000000000013e80 ffff88101a8993a0 ffff880e982693a0 0000000000000000
Dec 14 15:37:05 hostname kernel: [961050.144102] ffffffff81cdb2a0 ffffffff81cdb2a4 ffff880e982693a0 00000000ffffffff
Dec 14 15:37:05 hostname kernel: [961050.144104] Call Trace:
Dec 14 15:37:05 hostname kernel: [961050.144109] [<ffffffff817b2
Dec 14 15:37:05 hostname kernel: [961050.144111] [<ffffffff817b4
Dec 14 15:37:05 hostname kernel: [961050.144115] [<ffffffff811cf
Dec 14 15:37:05 hostname kernel: [961050.144117] [<ffffffff816a1
Dec 14 15:37:05 hostname kernel: [961050.144120] [<ffffffff817b4
Dec 14 15:37:05 hostname kernel: [961050.144122] [<ffffffff816a1
Dec 14 15:37:05 hostname kernel: [961050.144125] [<ffffffff81094
Dec 14 15:37:05 hostname kernel: [961050.144127] [<ffffffff81094
Dec 14 15:37:05 hostname kernel: [961050.144130] [<ffffffff81074
Dec 14 15:37:05 hostname kernel: [961050.144133] [<ffffffff817b6
Dec 14 15:37:05 hostname kernel: [961050.144135] INFO: task ip:9896 blocked for more than 120 seconds.
Dec 14 15:37:05 hostname kernel: [961050.151109] Tainted: G C 3.19.0-30-generic #34~14.04.1-Ubuntu
Dec 14 15:37:05 hostname kernel: [961050.159558] "echo 0 > /proc/sys/
Dec 14 15:37:05 hostname kernel: [961050.168551] ip D ffff8804591cfde8 0 9896 9895 0x00000000
Dec 14 15:37:05 hostname kernel: [961050.168556] ffff8804591cfde8 ffff880814031d70 0000000000013e80 ffff8804591cffd8
Dec 14 15:37:05 hostname kernel: [961050.168558] 0000...
| Dan Streetman (ddstreet) wrote : | #64 |
Andrew, Rodrigo, James, do any of you have a crash dump from a system where the problem has happened? or any way to more easily reproduce it?
| Changed in linux-lts-utopic (Ubuntu): | |
| importance: | Undecided → Medium |
| Dan Streetman (ddstreet) wrote : | #65 |
I should note that I'm pretty sure this new problem is almost certainly a different bug (different cause) than the original bug fixed in comment 59 and earlier; just the symptoms appear the same.
| Dan Streetman (ddstreet) wrote : | #66 |
James Dempsey, the upstream commit you referenced is in trusty lts-vivid at Ubuntu-
| Dan Streetman (ddstreet) wrote : | #67 |
> James Dempsey, the upstream commit you referenced is in trusty lts-vivid at Ubuntu-
specifically, upstream it's commit 30f7ea1c2b5f5fb
| Dan Streetman (ddstreet) wrote : | #68 |
Andrew Ruthven that's your patch also ^
| Dan Streetman (ddstreet) wrote : | #69 |
Rodrigo, just in case the problem you're seeing is fixed by ^ commit, please try the latest trusty lts-vivid (at version I listed above, or later).
| Cristian Calin (cristi-calin) wrote : | #70 |
@ddstreet, the Ubuntu-
We just upgraded a Juno openstack environment to this kernel version and we are still seeing this on a network node.
[controller] root@node-28:~# uname -a
Linux node-28.domain.tld 3.19.0-49-generic #55~14.04.1-Ubuntu SMP Fri Jan 22 11:24:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[controller] root@node-28:~# dmesg -T | tail -n 3
[Thu Feb 11 07:14:15 2016] unregister_
[Thu Feb 11 07:14:26 2016] unregister_
[Thu Feb 11 07:14:36 2016] unregister_
[controller] root@node-28:~# dmesg -T | grep -c unregister_
6401
| Dan Streetman (ddstreet) wrote : | #71 |
Cristian ok thnx. I'll keep investigating.
| Dan Streetman (ddstreet) wrote : | #72 |
If anyone is able to reproduce this, I have a debug kernel at this ppa:
https:/
it prints a debug line when any loopback interface is dev_put/dev_hold, which may help show where the extra loopback reference is coming from when this problem happens. If anyone can reproduce the problem using the debug kernel, please attach the logs (and/or sosreport).
| Michael Richardson (mjrichardson) wrote : | #73 |
Credit where credit is due: we've slowly rolled the patch above (per #63) into production over the last six months, initially through custom compiled kernels and more recently via mainline LTS wily packages; and have seen no further issues with hung namespaces. Kudos to ddstreet.
| Dan Streetman (ddstreet) wrote : | #74 |
There are upstream commits that may fix this problem, I've updated the same ppa with a new test build (with only the commits, and without any debug). If anyone can still reproduce this problem, please try the latest kernel in the ppa and report if it fixes the problem or not:
https:/
the upstream commits are:
751eb6b6042a596
a5d0dc810abf3d6
| Alexandre (totalworlddomination) wrote : | #75 |
This kernel bug is also affecting all the container solutions on Xenial / 16.04.1, up to 4.4.0-38.
Lots of people are hitting this on many distributions / container solutions:
https:/
https:/
...
Apparently this was fixed in 12.04 and now reintroduced in 16.04.
Once you get the following dmesg message:
unregister_netdevice: waiting for lo to become free. Usage count = 1
... The only fix is to reboot the host.
From what I gathered from the Docker issue tracking this, it seems a fix was merged in net-next very recently:
http://
... and is in mainline now:
https:/
As I wasn't sure this thread was tracking Xenial (I just added it to the list), I'm trying to figure out whether Ubuntu is tracking this issue upstream and can backport the fix into 16.04's 4.4.0-? branch, as a lot of container use cases have started to break down because of this bug over the past weeks.
Thanks!
| Alexandre (totalworlddomination) wrote : | #76 |
Bug was reintroduced in Ubuntu Xenial's kernel.
| Dan Streetman (ddstreet) wrote : | #77 |
Alexandre, please see my last comment, that links to a test kernel PPA with EXACTLY that upstream commit, 751eb6b6042a596
If you can reproduce this, please test with the PPA and report if it fixes the problem.
| Dan Streetman (ddstreet) wrote : | #78 |
Ah, you need a patched test kernel on xenial...I'll upload one to the PPA for testing.
| Dan Streetman (ddstreet) wrote : | #79 |
PPA updated with patched test kernels for xenial as well as trusty lts-xenial.
https:/
| Paul van Schayck (paulvanschayck) wrote : | #80 |
Dan, thanks for your work. I've tried your PPA with the included patch (4.4.0-36-generic #55hf1403152v20
Would it help you test again with this patch, and with your debug patches included?
[1] https:/
| Launchpad Janitor (janitor) wrote : | #81 |
Status changed to 'Confirmed' because the bug affects multiple users.
| Changed in linux-lts-xenial (Ubuntu Trusty): | |
| status: | New → Confirmed |
| Changed in linux-lts-xenial (Ubuntu): | |
| status: | New → Confirmed |
| Dan Streetman (ddstreet) wrote : | #83 |
> I've tried your PPA with the included patch (4.4.0-36-generic #55hf1403152v20
> As I've also reported on the Github docker issue [1], this does not fix the issue for me.
that's unfortunate.
> Would it help you test again with this patch, and with your debug patches included?
I suspect adding the debug would change the timing enough to work around the bug, as it did before. I'll take a deeper look to see if there is a different way to add debug.
| tags: |
added: verification-done-vivid removed: verification-needed-vivid |
| tags: |
added: verification-done-trusty removed: verification-needed-trusty |
| Alexandre (totalworlddomination) wrote : | #84 |
Apparently the fix is in:
https:/
| Dan Streetman (ddstreet) wrote : | #85 |
> Apparently the fix is in:
>
> https:/
Hopefully the commit does fix things, although Paul's comment above indicated it may not, at least not for everyone.
| Alexandre (totalworlddomination) wrote : | #86 |
Oh oops, didn't realize this was the patch with the 751eb6b6042a596
Well, if that made it into Ubuntu's kernel before and including 4.4.0-42, it didn't fix it (I still get the bug on many machines & VMs).
| Dan Streetman (ddstreet) wrote : | #87 |
> Well, if that made it into Ubuntu's kernel before and including 4.4.0-42, it didn't fix it
> (I still get the bug on many machines & VMs).
No, it's not in the released xenial kernel yet, the upstream 4.4.22 stable commits are still in the xenial master-next branch kernel; it'll be a bit longer before xenial's normal kernel is updated with the patch.
| Xav Paice (xavpaice) wrote : | #88 |
From the logs it looks like the patch is now a part of https:/
| Dan Streetman (ddstreet) wrote : | #89 |
> From the logs it looks like the patch is now a part of
> https:/
> (proposed) on 22nd Oct?
yes, the patch is included in the 4.4.0-46.67 kernel (both linux-lts-xenial on trusty, and regular linux on xenial). anyone able to reproduce the problem with that kernel level?
| Alexandre (totalworlddomination) wrote : | #90 |
I was running 4.4.0-46 (not sure if it differs from the latest .67?) on October 27th and yesterday had to reboot for the "... count = 1" problem.
Were there previous 4.4.0-46 releases? If not, the bug is still present...
Otherwise, I'll confirm if it happens again, or if it doesn't within a few weeks.
| costinel (costinel) wrote : | #91 |
seeing this on xenial after this sequence of operations:
lxc stop run2
lxc delete run2
lxc copy run1 run2
lxc start run2 <- hangs here, with later dmesg
[337766.146479] unregister_
[337772.435786] INFO: task lxd:20665 blocked for more than 120 seconds.
[337772.435856] Tainted: P OE 4.4.0-47-generic #68-Ubuntu
[337772.435922] "echo 0 > /proc/sys/
[337772.436002] lxd D ffff88006b6cbcb8 0 20665 1 0x00000004
[337772.436006] ffff88006b6cbcb8 ffffffff821d0560 ffff880235aa8000 ffff8801b57d2580
[337772.436009] ffff88006b6cc000 ffffffff81ef5f24 ffff8801b57d2580 00000000ffffffff
[337772.436010] ffffffff81ef5f28 ffff88006b6cbcd0 ffffffff81830f15 ffffffff81ef5f20
[337772.436012] Call Trace:
[337772.436020] [<ffffffff81830
[337772.436022] [<ffffffff81831
[337772.436024] [<ffffffff81832
[337772.436026] [<ffffffff81832
[337772.436029] [<ffffffff8171f
[337772.436033] [<ffffffff810a1
[337772.436035] [<ffffffff810a1
[337772.436038] [<ffffffff8107f
[337772.436040] [<ffffffff81080
[337772.436044] [<ffffffff8120b
[337772.436046] [<ffffffff81080
[337772.436048] [<ffffffff81834
FWIW, at the time of stop, run2 had an additional IP address on dev lo, and at the time of copy, run1 was also running and had an additional IP address on dev lo.
| Paul van Schayck (paulvanschayck) wrote : | #92 |
Dear Dan, user reshen on the GitHub Docker issue thread [1] did some extensive testing using kernel 4.8.8. He has not been able to reproduce the issue using that version. I've also done testing using the Ubuntu mainline kernel builds and have also not been able to reproduce the issue anymore.
He has also pointed to two possible kernel patches responsible for the fix. Would it be possible for you to create a backport build of those again?
[1] https:/
| Alexandre (totalworlddomination) wrote : | #93 |
Update:
I've hit the issue with 4.4.0-47... now testing 4.4.0-49 for the next weeks.
| Alexandre (totalworlddomination) wrote : | #94 |
It seems like the fix isn't in, or isn't working for this problem, as I've just hit this issue again on 4.4.0-49, at least twice:
[434074.636075] unregister_
Since I see a lot of people talking about 4.8+ kernels fixing this issue, is there a generic package (or future plans for one, say when 4.9 LTS comes out in a week or two) that would allow anyone to upgrade but also keep up with updates without following versions by hand (like the linux-image-generic package)?
cheers!
| Alessandro Polverini (polve) wrote : | #95 |
I can reproduce the problem with LXC 2.0.6-1 and kernel 4.6.4.
| Dan Streetman (ddstreet) wrote : | #96 |
I added the 2 commits referenced in the docker thread:
https:/
which is building right now in the test ppa:
https:/
kernel version 4.4.0-53.
As far as testing with the 4.8 kernel on xenial, you can use the kernel team's build ppa:
https:/
obviously kernels there are completely unsupported and may break everything, etc., etc., and should be used for testing/debug ONLY.
to install the yakkety 4.8 kernel on a xenial system, after adding the ppa to your system:
# sudo apt install linux-generic-
| Timo Furrer (tuxtimo) wrote : | #97 |
I've hit the same issue today with 4.4.0-87-generic on a xenial.
Is there a confirmed fix for a particular released kernel for Ubuntu?
| Timo Furrer (tuxtimo) wrote : | #98 |
I've hit the same issue today with 4.4.0-87-generic on a xenial.
Is there a confirmed fix for a particular released kernel for Ubuntu?
And same for 4.10.0-28-generic
| Dan Streetman (ddstreet) wrote : | #99 |
Since this bug has been long-since marked 'fix released', I opened a new bug 1711407. Let's all please use that one for continued discussion of this problem.
This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:
apport-collect 1403152
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.