Comment 2 for bug 1486670

Dan Streetman (ddstreet) wrote :

Short summary:

ipsec uses a struct dst_ops object per net-namespace (e.g. per container), but does not correctly initialize each dst_ops object's percpu counter. This results in incorrect values for each net namespace's dst_ops counter.

Full details:

ipsec uses xfrm objects, which contain dst objects, which are tracked via the percpu counter in the xfrm dst_ops struct. However, ipsec creates a dst_ops object for every net namespace, not just the main net namespace. A dst_ops template is created, and its contents are copied to the dst_ops object for each new net namespace. The problem is that ipsec only initializes the percpu counter in the dst_ops object once - for the template - so every copy ends up using the template's percpu counter variables.

A percpu counter object has a main counter variable and a pointer to the percpu counter variables. Each percpu variable only goes up to a small "batch" size (32 or so), at which point the percpu variable's count is moved to the main counter variable. Since the ipsec net namespaces all use different main counter variables but the same percpu counter variables, a count accumulated in a percpu variable by one net namespace can be moved to the main counter of a different net namespace.

The result is that one net namespace (i.e. container) may have its xfrm count decrease to below 0, while another net namespace may have its xfrm count increase forever, eventually causing complete ipsec failure once the xfrm4_gc_thresh limit is exceeded.
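
To illustrate the mechanism described above, here is a small userspace sketch. It is not the kernel code: struct toy_counter, toy_counter_add(), run_cycle() and simulate() are hypothetical names made up for this example, it models a single CPU only, and it assumes a batch size of exactly 32. Two "namespaces" each create and free the same number of entries, so both counts should stay at 0.

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH 32  /* assumed batch size ("32 or so" in the description) */

/* Simplified stand-in for a percpu counter, modelling one CPU only. */
struct toy_counter {
    long count;    /* main counter variable */
    long *percpu;  /* pointer to the per-CPU variable (a single slot here) */
};

/* Give a counter its own per-CPU slot. */
static void toy_counter_init(struct toy_counter *c)
{
    c->count = 0;
    c->percpu = calloc(1, sizeof(*c->percpu));
    assert(c->percpu != NULL);
}

/* Add to the per-CPU slot; at the batch size, move it to the main counter. */
static void toy_counter_add(struct toy_counter *c, long amount)
{
    long delta = *c->percpu + amount;

    if (labs(delta) >= BATCH) {
        c->count += delta;
        *c->percpu = 0;
    } else {
        *c->percpu = delta;
    }
}

/* Namespace A creates 31 entries, B creates 1, then both free everything. */
static void run_cycle(struct toy_counter *a, struct toy_counter *b)
{
    int i;

    for (i = 0; i < 31; i++)
        toy_counter_add(a, 1);
    toy_counter_add(b, 1);   /* shared slot reaches 32: B gets A's count */
    toy_counter_add(b, -1);
    for (i = 0; i < 31; i++)
        toy_counter_add(a, -1);  /* shared slot reaches -32: A goes negative */
}

/* shared != 0 copies the template only (the bug); 0 also inits each copy. */
static void simulate(int shared, int cycles, long *a_count, long *b_count)
{
    struct toy_counter tmpl, a, b;
    int i;

    toy_counter_init(&tmpl);
    a = tmpl;  /* struct copy: a.percpu and b.percpu both alias tmpl.percpu */
    b = tmpl;
    if (!shared) {
        toy_counter_init(&a);
        toy_counter_init(&b);
    }

    for (i = 0; i < cycles; i++)
        run_cycle(&a, &b);

    *a_count = a.count;
    *b_count = b.count;

    if (!shared) {
        free(a.percpu);
        free(b.percpu);
    }
    free(tmpl.percpu);
}
```

Calling simulate(1, 3, &a, &b) models the copied-template case: A's main counter drifts negative and B's drifts positive by the same amount on every cycle, even though neither namespace has any live entries left. Calling simulate(0, 3, &a, &b) gives each namespace its own per-CPU slot, and both main counters stay at 0.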