It seems that TCP keepalive settings alone are not enough to provide good failure tolerance. For example, in my OpenStack cluster, nova-conductor and the neutron agents always get stuck with some unacknowledged TCP traffic outstanding, so the TCP keepalive timer is never started. After 900 seconds the services began to work again.
This problem is explained on Stack Overflow: http://stackoverflow.com/questions/16320039/getting-disconnection-notification-using-tcp-keep-alive-on-write-blocked-socket
Currently I use a hacky workaround: I set TCP_USER_TIMEOUT to a hardcoded value on the socket in the amqp library (the patch is attached). Is there a more elegant way to solve this problem? Thank you!
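
For illustration, here is a minimal sketch of the idea behind the workaround. It is not the attached patch itself. It assumes Linux, where TCP_USER_TIMEOUT bounds how long transmitted data may stay unacknowledged before the kernel drops the connection. The helper name set_user_timeout and the 30000 ms value are hypothetical, chosen only for the example.

```python
import socket

# Linux-only option. Fall back to the value from <linux/tcp.h> (18) when the
# socket module does not expose the constant (an assumption for older Pythons).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock, timeout_ms):
    """Enable keepalive and cap the time data may remain unacknowledged.

    Hypothetical helper: returns the timeout (in milliseconds) that the
    kernel reports back after the option has been set.
    """
    # Keepalive covers the idle case; it does not fire while unacknowledged
    # data is still being retransmitted.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_USER_TIMEOUT covers the case where data is stuck unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # 30000 ms is an arbitrary example value, not a recommendation.
        print(set_user_timeout(s, 30000))
    finally:
        s.close()
```

A configurable value would presumably be preferable to a hardcoded one, which is part of what the question above is asking about.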