Comment 12 for bug 1837055

Matt Peters (mpeters-wrs) wrote :

Further details of the investigation into why changing the TCP keepalive parameters of the container was not sufficient.

When cluster networking is used and a swact occurs, the TCP connection to the kube-apiserver remains established because the NATed connection cannot route packets to the new kube-apiserver endpoint on the other controller. The socket therefore stays open but can no longer reach the far end, so it must time out before it is cleaned up. With host networking, the packets are routed to the new destination, which triggers an immediate TCP reset and closes the connection right away.

The TCP keepalive timer for the Tiller connection to the kube-apiserver socket is set to 30s with 5 probes, for a total of roughly 2.5 minutes. Because these options are set directly on the socket, the system-wide keepalive settings are not used, and the client side will not close the connection any faster than the socket options allow. However, it is still taking 15 minutes to close the connection, which corresponds to the failed-retransmit timeout configured by the system setting net.ipv4.tcp_retries2=15. This retransmit timer is activated as soon as a request is made on the socket, and it overrides the keepalive timer.
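
As a side note, the per-socket override behaviour can be illustrated with a minimal sketch. The example below is Python for illustration only (Tiller itself is Go), and the exact split between idle time and probe interval is an assumption; the 30s value and 5 probes are the settings described above.

import socket

# Per-socket keepalive options (Linux) override the sysctl defaults
# (net.ipv4.tcp_keepalive_time/intvl/probes) for this socket only.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # enable keepalive
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # idle time before the first probe (s)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # interval between probes (s)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before the connection is dropped
# On an idle socket, a dead peer is detected after roughly
# TCP_KEEPIDLE + TCP_KEEPCNT * TCP_KEEPINTVL seconds (~2.5-3 min with these values).
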
Therefore, if no request is made within the 2.5-minute keepalive window, a subsequent request will complete successfully because the stale connection will already have been cleaned up by the keepalive timeout. If a request is made within that window, the connection remains active until the tcp_retries2 timeout expires.
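
For reference, the ~15 minute figure can be sanity-checked with a rough calculation (Python; the 200ms minimum RTO and 120s RTO cap are assumed typical Linux values, and the real timeout varies with the measured RTT):

# Rough estimate of how long a stalled connection lingers with
# net.ipv4.tcp_retries2=15: each retransmission backs off exponentially
# from the initial RTO (assumed 200ms) up to a cap of 120s (TCP_RTO_MAX).
rto_initial, rto_max, retries = 0.2, 120.0, 15
total = sum(min(rto_initial * 2 ** i, rto_max) for i in range(retries + 1))
print(f"~{total:.1f}s (~{total / 60:.1f} min)")  # ~924.6s (~15.4 min)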

The kube-apiserver TCP keepalive timer is set to 300s with 5 probes, so it will not be able to detect the failed connection for 15 minutes.

Based on the above, it is recommended to keep the hostNetwork configuration option for the Tiller Pod, since it provides the fastest socket cleanup time and therefore restores connectivity the soonest.