Hi everyone, so this is an interesting problem. There are 2 parts to this:

1. In the container being destroyed, some socket(s) remain open for a period of time, which prevents the container from fully exiting until all its sockets have closed. While this happens you will see the 'waiting for lo to become free' message repeated.

2. While the previous container is waiting for its sockets to close so it can exit, any new containers are blocked from starting.

For issue #2, I'm not yet clear on what is blocking the new containers from starting. I thought it could be the rtnl_lock, since the exiting container does loop in the rtnl_unlock path; however, while it's looping it also calls __rtnl_unlock/rtnl_lock during its periods of waiting, so the mutex should not be held the entire time. I'm still investigating that part of this issue to find what exactly is blocking the new containers.

For issue #1, the sockets are lingering because - at least for the reproducer case from comment 16 - there is a TCP connection that is never closed, so the kernel continues to probe the other end until it times out, which takes around 2 minutes. For this specific reproducer, there are 2 easy ways to work around this.

First, some background: the reproducer uses docker to create a CIFS mount, which uses a TCP connection. This is the script run on the 'client' side, where the hang happens:

date ; \
mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
date ; \
ls -la /mnt ; \
umount /mnt ; \
echo "umount ok"

This is repeated 50 times in the script, but the 2nd (and later) calls hang (for 2 minutes each, if you wait that long).

A. The TCP connection to the CIFS server lingers because it is never closed; the reason it isn't closed is that the script exits the container immediately after unmounting. The kernel cifs driver, during fs unmount, closes the TCP socket to the CIFS server, which queues a TCP FIN to the server. Normally this FIN will reach the server, which will respond, and the TCP connection will close. However, since the container starts exiting immediately, that FIN never reaches the CIFS server - the interface is taken down too quickly. So the TCP connection remains. To avoid this, simply add a short sleep between unmounting and exiting the container, e.g. (new line prefixed with +):

date ; \
mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
date ; \
ls -la /mnt ; \
umount /mnt ; \
+ sleep 1 ; \
echo "umount ok"

That avoids the problem, and allows the container to exit immediately and the next container to start immediately.

B. Instead of delaying long enough for the TCP connection to close on its own, the kernel's TCP configuration can be changed to time out more quickly. The specific setting to change is tcp_orphan_retries, which controls how many times the kernel keeps trying to talk to a remote host over a closed socket. Normally this defaults to 0, which causes the kernel to actually use a value of 8 retries. So change it from 0 to 1 to actually reduce it from 8 to 1 (confusing, I know). Like this (new lines prefixed with +):

date ; \
+ echo 1 > /proc/sys/net/ipv4/tcp_orphan_retries ; \
+ grep -H . /proc/sys/net/ipv4/tcp_orphan_retries ; \
mount.cifs //$SAMBA_PORT_445_TCP_ADDR/public /mnt/ -o vers=3.0,user=nobody,password= ; \
date ; \
ls -la /mnt ; \
umount /mnt ; \
echo "umount ok"

The 'grep' added there is just so you can verify the value was correctly changed.
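If editing the script itself isn't convenient, the same tcp_orphan_retries change can probably be applied when the container is created, using docker's --sysctl option (newer docker releases allow namespaced net.* sysctls to be set per container, as long as the container isn't using the host network namespace). I haven't verified this against the reproducer, and it only helps on kernels where tcp_orphan_retries is per network namespace, so treat it as a sketch; '<client-image>' below is just a placeholder for whatever image the reproducer's client runs:

# untested sketch: set the sysctl at container creation time instead of inside the script
docker run --rm \
  --sysctl net.ipv4.tcp_orphan_retries=1 \
  <client-image> \
  sh -c 'grep -H . /proc/sys/net/ipv4/tcp_orphan_retries ; <client script from above>'

The grep is again only there to confirm the value took effect inside the container.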
Note this method will take slightly longer than method A, since it does perform 1 retry, while method A closes the TCP connection correctly and does no retries.

To extrapolate this to situations beyond this specific reproducer, here are some suggestions:

1. Close all your connections before stopping the container, especially kernel connections (like the CIFS TCP connection).

2. If you can't close the connections, add a delay as the last action the container takes, after bringing down all its interfaces (except lo) but before exiting - e.g. sleep for up to 2 minutes. The container will take up a bit more resources during that time, but it should allow enough time for all TCP connections to close.

3. If you can't do either of those workarounds, reduce the number of TCP retries that will be done after the container starts exiting by changing the tcp_orphan_retries param (from inside the container - don't change this value in your host system). Change the parameter right before you stop the container, if possible.

Those are only workarounds - I'm still investigating what is blocking new containers from starting, as well as looking at ways to forcibly destroy a container's sockets when the container is exiting (e.g. maybe sock_diag_destroy; see the sketch at the end of this comment).

Please let me know if any of those, or similar, workarounds helps anyone who is seeing this problem.
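For anyone who wants to experiment with the forcible-destroy idea in the meantime: on kernels built with CONFIG_INET_DIAG_DESTROY and a reasonably recent iproute2, the sock_diag destroy operation is exposed through 'ss -K'. I have not tested this against the reproducer, and the address/port filter below is only an illustration, so consider it a sketch rather than a recommendation:

# untested sketch: from inside the container (or via nsenter into its network
# namespace), forcibly close any TCP connection still open to the CIFS server
# before the container exits; requires CONFIG_INET_DIAG_DESTROY and a recent ss
ss -K dst "$SAMBA_PORT_445_TCP_ADDR" dport = :445

If that works, it would remove the lingering socket immediately instead of waiting for the orphan retries to expire - but again, untested here.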