> The LXC images failed to start under linux-image-4.2.0-28-generic, with a kernel oops.

This bug isn't about kernel oopses.

> Setting /proc/sys/net/ipv4/xfrm4_gc_thresh to 5 causes the failure almost immediately.
> 
> I would like to confirm my procedure however. I've been changing /proc/sys/net/ipv4/xfrm4_gc_thresh inside the containers,
> not the host. Is this correct?

No, that's not correct, and unfortunately the "reproducer" in this bug description is completely invalid (it was copied from a private bug for this issue).  ipsec will NEVER work with gc_thresh set to 5, because of how the xfrm gc_thresh interacts with the hardcoded per-cpu flowcache hashtable size limit.  Upstream has changed xfrm4_gc_thresh to INT_MAX, because a gc_thresh doesn't make any sense for xfrm dst entries; the maximum possible number of them depends entirely on the number of cpus in the system.

There is 1 xfrm dst entry per flowcache entry, and the flowcache entries are kept in a per-cpu hashtable that is strictly limited to 4096 entries (per cpu).  So the total number of xfrm dst entries is at most 4096 * num_active_cpus().  The dst code stops allowing new dst allocations once the dst entry count is >= 2 * gc_thresh, so with the current default gc_thresh of 32k, dst allocation failures can begin to be seen at (32k * 2) / 4096 = 16 cpus.  On systems with fewer than 16 cpus, at the default gc_thresh of 32k, there will never be any dst allocation failures (except due to the real bug this addresses).  On systems with 16 or more cpus, with the default gc_thresh of 32k, there can be dst alloc failures, given a high enough rate of new ipsec connections; the flowcache clears all its entries every 10 minutes, so a lightly loaded ipsec system with 16 or more cpus could be fine.
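The arithmetic above can be sketched as a few shell functions. This is only an illustration: the constants are the ones quoted in this comment (4096 flowcache entries per cpu, a 32k default gc_thresh), not values read from a running kernel, and the function names are made up for this example.

```shell
#!/bin/sh
# Constants taken from the explanation above, not read from the kernel.
FLOWCACHE_PER_CPU=4096   # per-cpu flowcache hashtable limit
GC_THRESH=32768          # default xfrm4_gc_thresh (32k)

# dst allocation is refused once the entry count reaches 2 * gc_thresh.
dst_limit() { echo $(( 2 * $1 )); }

# Upper bound on xfrm dst entries for a given number of cpus.
max_dst_entries() { echo $(( FLOWCACHE_PER_CPU * $1 )); }

# Smallest cpu count at which that upper bound reaches the limit.
cpu_threshold() { echo $(( $(dst_limit "$1") / FLOWCACHE_PER_CPU )); }

echo "dst limit at gc_thresh=$GC_THRESH: $(dst_limit "$GC_THRESH")"
echo "max dst entries on 8 cpus: $(max_dst_entries 8)"
echo "cpu threshold: $(cpu_threshold "$GC_THRESH")"
```

With the defaults, 8 cpus can hold at most 32768 entries, well under the 65536 limit, while 16 cpus can reach it.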

Setting xfrm4_gc_thresh to anything less than (4k * 2 * CPUS) will result in failures.  In fact, there is no point in setting xfrm4_gc_thresh to ANYTHING other than INT_MAX, because it doesn't actually remove any dst entries.


All that, unfortunately, is tangential to the real bug here.  The problem is that with multiple net namespaces (i.e. multiple containers) all running ipsec, the dst entry counter changes for one container can incorrectly be applied to a different container.  So one container could have its dst entry count steadily go down (incorrectly) while another container's dst entry count keeps going up (incorrectly).  The latter container would eventually reach its 2 * gc_thresh limit and encounter dst allocation failures, making its ipsec network unusable.  Unfortunately, that error looks identical to the error seen when the dst entry counter is correct but the system has 16 or more cpus, as described above.


To test this fix, multiple containers must be started (just 2 is fine).  On each container, new ipsec connections should be created as fast as possible; e.g. something like:

while true ; do ping -c 1 OTHER_CONTAINER ; done

so that the containers are all pinging each other.  It's important to use -c 1 so that each ping creates a new ipsec dst entry; a plain ping without it will just re-use the existing dst entry.
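A variant of that loop which stops at the first failed ping, and reports how many succeeded before it, may make it easier to notice when a container's ipsec network stops working. This is only a sketch: the function name and the overridable PING variable are made up for this example, and a failed ping can of course have causes other than dst allocation failures.

```shell
#!/bin/sh
# PING is overridable so the loop can be dry-run without a network.
PING=${PING:-ping}

# ping_until_failure PEER [MAX]
# Sends single-packet pings to PEER until one fails, or until MAX pings
# have succeeded (MAX=0, the default, means no limit).
ping_until_failure() {
    peer=$1
    max=${2:-0}
    count=0
    while [ "$max" -eq 0 ] || [ "$count" -lt "$max" ]; do
        # -c 1 so that every iteration is a fresh ping invocation.
        if ! $PING -c 1 "$peer" >/dev/null 2>&1; then
            echo "ping failed after $count successful pings"
            return 1
        fi
        count=$((count + 1))
    done
    echo "$count pings ok"
}
```

In each container it would be run as: ping_until_failure OTHER_CONTAINER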

After a sufficiently long period (depending on the number of containers and number of cpus, and the luck of randomly incorrectly assigning dst entry count changes to different containers) one or more containers should get dst allocation failures and their ipsec networks should not be usable anymore.  To speed up reproduction of this bug, lower the xfrm4_gc_thresh to a value ABOVE (2 * 4096 * CPUS), but close to it - e.g. something like 10k * CPUS.  With that gc_thresh value set, this bug should be reproducible fairly quickly (order of days or less) without the patch, but not reproducible with the patch (i.e. with the -proposed repo kernel).
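The suggested gc_thresh value could be worked out like this. This is a sketch under assumptions: "10k" is read as 10 * 1024, the cpu count comes from getconf, and the helper names are made up for this example. It only prints the command that would write the value; it does not change anything itself.

```shell
#!/bin/sh
# Hypothetical helper: the suggested reproduction value, 10k * CPUS.
repro_gc_thresh() { echo $(( 10 * 1024 * $1 )); }

# The floor named above: the value must stay ABOVE 2 * 4096 * CPUS.
min_gc_thresh() { echo $(( 2 * 4096 * $1 )); }

CPUS=$(getconf _NPROCESSORS_ONLN)
VALUE=$(repro_gc_thresh "$CPUS")

echo "cpus=$CPUS floor=$(min_gc_thresh "$CPUS") gc_thresh=$VALUE"
# Printed rather than executed; writing the sysctl needs root.
echo "echo $VALUE > /proc/sys/net/ipv4/xfrm4_gc_thresh"
```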