Comment 2 for bug 1818041

Tim Penhey (thumper) wrote :

In discussions, Joel in IS strongly suggested not using UDP for heartbeats if we can help it.

Discussed TCP keep alives further with Joel and read more about them too. Keep alives are only sent after a period of inactivity, and the connection isn't killed until a number of consecutive probes have failed. The default in Linux is 9 probes.

Checked the Go source: when you set a keep alive time, it sets the socket options for both the interval and the time. The time is how long the connection needs to be idle before keep alive packets are sent, and the interval is the duration between packets. The Linux defaults (7200 seconds idle, then 75 seconds between probes) are too long to be useful.

Note that keep alives are only sent while the connection is idle for data. If there is a pending message then it will be sent, and if it isn't acked it is retransmitted after about 100 seconds. The current hypothesis is that the socket is considered idle during this waiting time, so a short keep alive timeout should be able to close the socket while awaiting an ack.

The API clients, both for normal API connections and for the streaming log connections, should use a 15 second keep alive time.

For the controller streaming websockets we should set a low timeout for the TCP keep alive. With a one second keep alive on the pubsub stream connections, if the other side goes away we should see a socket disconnection after approximately ten seconds (one second idle, then nine failed probes one second apart). Having a very low TCP keep alive for just the pubsub connections is unlikely to have much impact, as we are only talking about six connections in total for a three node HA cluster.

TCP details checked here: http://man7.org/linux/man-pages/man7/tcp.7.html