Comment 72 for bug 1711407

Dan Streetman (ddstreet) wrote :

Just to re-summarize this:

The problem here is that when a net namespace is being cleaned up, the thread cleaning it up hangs inside the netdev_wait_allrefs() function in net/core/dev.c. This is called every time anything calls rtnl_unlock(), which happens a lot, including during net namespace cleanup. This function waits for all net device references to be released.

The thing(s) holding dev references here are dst objects from sockets (at least, that's what is failing to release the dev refs, in my debugging). When a net namespace is cleaned up, all its sockets call dst_dev_put() for their dsts, which moves the netdev refs from the actual net namespace interfaces to the netns loopback device (this is why the message is typically 'waiting for lo to become free', but sometimes it waits for non-lo to become free if dst_dev_put() hasn't been called for a dst).

The dsts are then put, which should reduce their ref count to 0 and free them, which then also reduces the lo device's ref count. In correct operation, this reduces the lo device's ref count to 0, which allows netdev_wait_allrefs() to complete.
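To make the reference movement concrete, here is a toy model in Python. It is only a sketch of the bookkeeping described above, not kernel code: the NetDev and Dst classes and their methods are made up for illustration, and only the idea (the dst's device ref moves to lo, then dropping the last dst ref releases lo) comes from the description.

```python
class NetDev:
    """Hypothetical stand-in for a net device with a reference count."""

    def __init__(self, name):
        self.name = name
        self.refcnt = 0

    def hold(self):
        self.refcnt += 1

    def put(self):
        if self.refcnt <= 0:
            raise RuntimeError("refcount underflow on " + self.name)
        self.refcnt -= 1


class Dst:
    """Hypothetical stand-in for a dst entry that pins one device."""

    def __init__(self, dev):
        self.dev = dev
        self.refcnt = 1      # the owning socket's reference
        dev.hold()           # the dst pins its device

    def dev_put(self, loopback):
        # Models the effect attributed to dst_dev_put() above:
        # the device reference moves from the real interface to lo.
        loopback.hold()
        self.dev.put()
        self.dev = loopback

    def release(self):
        # Dropping the last dst reference frees the dst, which in turn
        # drops its reference on whatever device it currently pins.
        self.refcnt -= 1
        if self.refcnt == 0:
            self.dev.put()
            self.dev = None


lo = NetDev("lo")
eth0 = NetDev("eth0")
trace = []

dst = Dst(eth0)
trace.append((eth0.refcnt, lo.refcnt))   # dst pins eth0

dst.dev_put(lo)
trace.append((eth0.refcnt, lo.refcnt))   # ref moved to lo

dst.release()
trace.append((eth0.refcnt, lo.refcnt))   # lo is free, the wait can finish
```

In this model the wait for lo can only complete once the last tuple in trace shows a zero count for lo, which is the "correct operation" case.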

This is where the problem arises:

1. The first possibility is a kernel socket that hasn't closed and still has dst(s). This is possible because, while normal (user) sockets take a reference to their net namespace, and the net namespace does not begin cleanup until all its references are released (meaning no user sockets are still open), kernel sockets do not take a reference to their net namespace. So net namespace cleanup begins while kernel sockets are still open. While the net namespace is cleaning up, it ifdown's all its interfaces, which causes the kernel sockets to close and free their dsts; that releases all netdev interface refs and allows the net namespace to close. If, however, one of the kernel sockets doesn't pay attention to the net namespace cleaning up, it may hang around, failing to free its dsts and failing to release the netdev interface reference(s). That causes this bug. In some cases (maybe most or all, I'm not sure) the socket eventually times out and does release its dsts, allowing the net namespace to finish cleaning up and exit.

My recent patch 4ee806d51176ba7b8ff1efd81f271d7252e03a1d fixes *one* case of this happening - a TCP kernel socket remaining open, trying in vain to complete its TCP FIN sequence (which will never complete, since all net namespace interfaces are down and will never come back up while the net namespace is cleaning up). The TCP socket would eventually give up on the FIN sequence and close, allowing the net namespace to finish cleaning up, but that takes ~2 minutes or so. There very well could be more cases of this happening with other kernel sockets.

2. Alternatively, even if all kernel sockets behave correctly and clean up immediately during net namespace cleanup, it's possible for a dst object to leak a reference. This has happened before, there are likely existing dst leaks, and there will likely be dst leaks accidentally introduced in the future. When a dst leaks, its refcount never reaches 0, so it is never freed and never releases its reference on its interface (which, remember, is moved to the net namespace loopback interface when its socket is closing). Thus, netdev_wait_allrefs() will hang forever.
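The difference between the two causes can be sketched with a small polling model in Python. Everything here is hypothetical: wait_allrefs() is not the kernel function, the tick at which the slow socket gives up (120) is an arbitrary number, and the max_polls bound exists only so the sketch terminates. As described above, the real wait has no such bound.

```python
def wait_allrefs(lo_refcnt_at, max_polls):
    """Poll a refcount until it reaches 0.

    Returns the tick at which it became 0, or None if it never did
    within max_polls. The bound is an artifact of this sketch.
    """
    for tick in range(max_polls):
        if lo_refcnt_at(tick) == 0:
            return tick
    return None


def well_behaved(tick):
    # All sockets closed and freed their dsts right away.
    return 0


def slow_kernel_socket(tick):
    # Cause 1: a kernel socket keeps its dst until it times out.
    # 120 is an arbitrary tick chosen for illustration.
    return 0 if tick >= 120 else 1


def leaked_dst(tick):
    # Cause 2: a leaked dst never drops its reference on lo.
    return 1


results = {
    "well_behaved": wait_allrefs(well_behaved, 1000),
    "slow_kernel_socket": wait_allrefs(slow_kernel_socket, 1000),
    "leaked_dst": wait_allrefs(leaked_dst, 1000),
}
```

In this model the slow socket only delays the wait, while the leaked dst never lets it finish, no matter how large max_polls is made.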

Actually fixing either cause above is the correct thing to do, of course, but that's a long-term thing, as even if all causes currently in the kernel are fixed, more leaks will probably be accidentally introduced in the future. And while memory leaks are not good, the more serious problem caused by this is that the netdev_wait_allrefs() hang happens while the thread is holding the global net mutex - that prevents the creation of any new net namespaces, and prevents the cleanup of any existing net namespaces (it also blocks registering/unregistering any pernet subsystem or pernet device). This can render the entire system unusable, and requires a reboot to correct.

Starting from the assumption that it's unsafe to ever ignore the interface's refcount, and just free it anyway after some time period, I have to allow that when a dst leaks, its net namespace will also have to leak, since the dst will always hold a reference to the net namespace loopback (or other) interface. Given that, what we need to do is move the hanging call to netdev_wait_allrefs() outside of the section that holds the net mutex. I'm currently working on a patch to do just that.
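As a rough userspace analogy of why moving the wait matters, the Python sketch below uses a threading.Lock as a stand-in for the global net mutex and a threading.Event as a stand-in for "the device refcount reached 0". The function and variable names are made up for this illustration, and it says nothing about how the actual patch is structured.

```python
import threading


def run_cleanup(wait_inside_lock):
    """Return True if another task could take the mutex while cleanup waits."""
    net_mutex = threading.Lock()         # stand-in for the global net mutex
    refs_released = threading.Event()    # set when the refcount "reaches 0"
    reached_wait = threading.Event()     # set once cleanup is about to wait

    def cleanup():
        if wait_inside_lock:
            # Current behaviour: the wait happens with the mutex held.
            with net_mutex:
                reached_wait.set()
                refs_released.wait()
        else:
            # Proposed behaviour: do the locked work, drop the mutex,
            # and only then wait for the references to go away.
            with net_mutex:
                pass  # teardown work that really needs the mutex
            reached_wait.set()
            refs_released.wait()

    worker = threading.Thread(target=cleanup)
    worker.start()
    reached_wait.wait()

    # Another task wants to create a namespace, which needs the mutex.
    can_create = net_mutex.acquire(timeout=0.2)
    if can_create:
        net_mutex.release()

    # Let the stuck cleanup finish so the sketch terminates.
    refs_released.set()
    worker.join()
    return can_create


blocked_case = run_cleanup(wait_inside_lock=True)
moved_case = run_cleanup(wait_inside_lock=False)
```

With the wait inside the locked section the second task cannot get the mutex until the references are released; with the wait moved out, it can proceed even though the cleanup thread is still stuck waiting.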