Comment 18 for bug 1403152

Jordan Curzon (curzonj) wrote :

To add to my comment just above, we have found a workload that is not under our own control (which limits what we can do with it, including sharing it) but which exacerbates the issue on our systems. This workload triggers the issue less than 5% of the time, which gives us the chance to compare which callers use dst_release on the dst in question (the one that meets conditions A, B, and C; "ABC dst"). To clarify, failure at this point for us is when dst->ref_cnt=1 during free_fib_info_rcu; success is when dst->ref_cnt=0 and free_fib_info_rcu calls dst_destroy on its nh_rth_input member.

In both failure and non-failure scenarios, the only callers of dst_release we see on the ABC dst are a single call from ipv4_pktinfo_prepare and many calls from inet_sock_destruct and tcp_data_queue. That likely rules out the possibility that some unique function's call to dst_release can be identified as missing in failure scenarios. My interpretation is that a single call from inet_sock_destruct or tcp_data_queue is failing to occur in the failure scenario.
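
To make that interpretation concrete, here is a toy user-space sketch, not kernel code. The struct and every toy_* name are hypothetical and only mimic the shape of the hold/release accounting described above; the real dst refcounting is more involved. It assumes that each reference taken must be matched by exactly one release, and that the dst is destroyed only if the count is zero when the fib info is freed.

```c
/* Hypothetical stand-in for a dst entry; NOT the kernel's struct dst_entry. */
struct toy_dst {
    int ref_cnt;    /* outstanding references */
    int destroyed;  /* set once the toy destroy path has run */
};

/* Take one reference (stands in for whatever holds the dst). */
static void toy_hold(struct toy_dst *d)
{
    d->ref_cnt++;
}

/* Drop one reference (stands in for a dst_release call). */
static void toy_release(struct toy_dst *d)
{
    if (d->ref_cnt > 0)
        d->ref_cnt--;
}

/* Mimics the outcome described above: destroy only if no references remain. */
static void toy_free_fib_info(struct toy_dst *d)
{
    if (d->ref_cnt == 0)
        d->destroyed = 1;
}

/* Returns 1 if the toy dst was destroyed (success), 0 if it leaked (failure). */
static int toy_run_scenario(int holds, int releases)
{
    struct toy_dst d = { 0, 0 };
    int i;

    for (i = 0; i < holds; i++)
        toy_hold(&d);
    for (i = 0; i < releases; i++)
        toy_release(&d);

    toy_free_fib_info(&d);
    return d.destroyed;
}
```

With matched counts, for example toy_run_scenario(5, 5), the toy dst is destroyed. With one release missing, toy_run_scenario(5, 4), the count is still 1 at free time and the dst is never destroyed, which is the same shape as the failure we observe. Because the missing release would come from a caller that also appears many times in the success case, the set of callers looks identical in both scenarios.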