Just chiming in here, I contacted Rodrigo off-list and was verging towards that same patch. More below.
I suspect there are two issues here with very similar symptoms. See in
particular post #8, which mentions people reporting that 3.14 improves
the situation.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1403152/comments/8
I've been chasing a bug in 3.13 with docker containers and connection
tracking which is fixed in 3.14, by this patch:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
Note that the commit message for the above commit describes a different
issue, but I've been able to produce issues of the nature in this thread
(hung docker / ip netns add commands like in post #6) before applying
this patch, and cannot reproduce them after.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1403152/comments/6
In the issue that I face, I can find a kworker thread using up an entire
core, and when I cat /proc/$pid/stack I see this:
[<ffffffffbe01e9b6>] ___preempt_schedule+0x56/0xb0
[<ffffffffc02223e4>] nf_ct_iterate_cleanup+0x134/0x160 [nf_conntrack]
[<ffffffffc0223dae>] nf_conntrack_cleanup_net_list+0x4e/0x170 [nf_conntrack]
[<ffffffffc022436d>] nf_conntrack_pernet_exit+0x4d/0x60 [nf_conntrack]
[<ffffffffbe6040d3>] ops_exit_list.isra.1+0x53/0x60
[<ffffffffbe6048d0>] cleanup_net+0x100/0x1d0
[<ffffffffbe084991>] process_one_work+0x171/0x470
[<ffffffffbe08563b>] worker_thread+0x11b/0x3a0
[<ffffffffbe08bb82>] kthread+0xd2/0xf0
[<ffffffffbe71757c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
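For anyone wanting to check whether they are hitting the same thing, here is a rough Python sketch (not part of the original report) that parses the text of /proc/<pid>/stack and flags stacks passing through both cleanup_net and nf_ct_iterate_cleanup. The names parse_stack, looks_like_stuck_netns_cleanup and read_stack are made up for illustration, and the two-symbol check is only a heuristic modelled on the trace above, not a definitive diagnosis.

```python
import re

# Matches one line of /proc/<pid>/stack output, e.g.
# "[<ffffffffc02223e4>] nf_ct_iterate_cleanup+0x134/0x160 [nf_conntrack]"
FRAME_RE = re.compile(
    r"^\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>[^\s+]+)"
    r"(?:\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+))?"
    r"(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)


def parse_stack(text):
    """Return a list of (symbol, module) tuples, one per recognised frame.

    module is None for frames that do not name a kernel module.
    """
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("symbol"), match.group("module")))
    return frames


def looks_like_stuck_netns_cleanup(text):
    """Heuristic (an assumption, not a kernel-defined signature): the stack
    contains both cleanup_net and nf_ct_iterate_cleanup."""
    symbols = {symbol for symbol, _ in parse_stack(text)}
    return {"cleanup_net", "nf_ct_iterate_cleanup"} <= symbols


def read_stack(pid):
    """Read /proc/<pid>/stack; this normally requires root."""
    with open(f"/proc/{pid}/stack") as handle:
        return handle.read()
```

To use it, pass the pid of the busy kworker (found with top or similar) to read_stack as root and feed the result to looks_like_stuck_netns_cleanup. A single True only says the worker was in conntrack cleanup at that instant; sampling it several times over a few seconds gives a better hint that it is actually looping.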
The kworker is looping forever and failing to clean up conntrack state.
All the while, it holds the global netns lock. Given that I've bisected
to the commit linked above which is to do with refcounting, I suspect
that borked refcounting on conntrack entries makes them impossible to
properly free/destroy, which prevents this worker from cleaning up the
namespace, which then goes on to prevent anything else from interacting
with namespaces (add/delete/etc).