Comment 27 for bug 1711407

Dan Streetman (ddstreet) wrote :

My analysis so far of the problem:

1. Container A has an outstanding TCP connection, and therefore a socket and a dst entry that hold a reference on the container's "lo" interface. When the container is stopped, the TCP connection takes ~2 minutes to time out (with default settings).

2. When container A is being removed, its net namespace is removed via the net/core/net_namespace.c function cleanup_net(). This takes two locks - first the net_mutex, and then the rtnl mutex (via rtnl_lock()). It then cleans up the net ns and calls rtnl_unlock(). However, rtnl_unlock() internally waits for all of the namespace's interfaces to be freed, which requires all their references to reach 0 - and that cannot happen until the TCP connection from #1 times out and releases its reference. So at this point the container teardown is hung inside rtnl_unlock(), and it still holds the net_mutex. It does not release the net_mutex until its lo interface is finally destroyed, after the TCP connection times out.
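To make the lock ordering in #2 concrete, the teardown path looks roughly like this (a simplified pseudocode sketch, not actual kernel source - the real cleanup_net() walks a list of namespaces queued for cleanup, and the wait happens in netdev_run_todo(), which rtnl_unlock() calls):

```
/* simplified sketch of the current cleanup_net() locking order */
static void cleanup_net(struct work_struct *work)
{
        mutex_lock(&net_mutex);         /* blocks copy_net_ns() callers */

        rtnl_lock();
        /* ... run pernet exit methods, unregister devices ... */
        rtnl_unlock();                  /* runs netdev_run_todo(), which
                                         * waits for every device ref to
                                         * drop - up to ~2 minutes here
                                         * while the TCP connection lives */

        mutex_unlock(&net_mutex);       /* only reached after that wait */
}
```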

3. When a new container is started, part of its startup calls the net/core/net_namespace.c function copy_net_ns() to copy the caller's net namespace into the new container's net namespace. This function also locks the net_mutex. Since the previous container is still holding the net_mutex as explained in #2 above, the new container's creation blocks until #2 releases the mutex.

There are a few ways to possibly address this:

a) when a container is being removed, all its TCP connections should abort themselves. Currently, TCP connections don't directly register for interface-unregister events - they explicitly want to stick around, so that if an interface is taken down and brought back up, the TCP connection remains and the communication riding on top of it isn't interrupted. The way TCP achieves this is to move all of its dst references off the interface being unregistered and onto the loopback interface. This works for the initial network namespace, where the loopback interface is always present and never removed. However, it doesn't work for non-default network namespaces - like containers - where the loopback interface is itself unregistered when the container is removed. So this aspect of TCP may need to change to correctly handle containers. This also may not cover all causes of this hang, since sockets handle more than just TCP connections.
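The dst hand-off described above happens in the dst teardown path; roughly (a simplified pseudocode sketch, not actual kernel source - the real dst_ifdown() in net/core/dst.c handles more cases):

```
/* sketch: when a device unregisters, live dst entries are not freed
 * but re-pointed at loopback - which only helps while loopback exists */
static void dst_ifdown(struct dst_entry *dst, struct net_device *dev)
{
        if (dst->dev == dev) {
                dst->dev = dev_net(dev)->loopback_dev;  /* this netns's lo */
                dev_hold(dst->dev);
                dev_put(dev);
        }
}
```

In the init netns, loopback_dev is permanent, so this hand-off is safe; in a container's netns, loopback is itself being unregistered, so the reference simply lands on a device we are also waiting to free.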

b) when a container is being removed, instead of holding the net_mutex throughout the cleanup (including the call to rtnl_unlock), cleanup_net() could release the net_mutex first (after removing all netns marked for cleanup from the pernet list), and only then call rtnl_unlock. This needs examination to make sure it would not introduce any races or other problems.
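A rough sketch of option (b) - pseudocode only, not a tested patch:

```
static void cleanup_net(struct work_struct *work)
{
        mutex_lock(&net_mutex);
        /* ... remove the netns entries marked for cleanup from the
         *     pernet list ... */
        mutex_unlock(&net_mutex);       /* drop before the slow part */

        rtnl_lock();
        /* ... run exit methods, unregister the namespace's devices ... */
        rtnl_unlock();                  /* may still wait ~2 minutes for
                                         * device refs, but copy_net_ns()
                                         * in a new container is no longer
                                         * blocked on net_mutex */
}
```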

c) rtnl_unlock could be simplified. Currently it has significant side effects, including the long delay while waiting for all references to the namespace's interfaces (including loopback) to actually drop. Instead of blocking inside rtnl_unlock() to do all this cleanup, the cleanup could be deferred. This also would need investigation to make sure no caller expects to be able to free resources that the deferred cleanup might later access (which I believe is the case).
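A sketch of option (c) - pseudocode only; today rtnl_unlock() calls netdev_run_todo(), which blocks until every unregistered device's refcount reaches zero, and the hypothetical netdev_todo_work below is a name invented for this illustration:

```
void rtnl_unlock(void)
{
        mutex_unlock(&rtnl_mutex);
        /* instead of calling netdev_run_todo() here and blocking until
         * every unregistered device's refcount hits zero ... */
        schedule_work(&netdev_todo_work);   /* ... defer the wait-and-free
                                             * to a background worker */
}
```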

As this is a complex problem, there are likely other ways to fix it as well. As far as I can tell, this issue has existed ever since network namespaces were introduced to the kernel, but it isn't commonly seen because in most cases socket connections are shut down before the container is stopped. The socket causing the problem here is a kernel socket, which is different from a normal userspace-created socket; that may make a difference, though I haven't investigated that angle yet.