Comment 21 for bug 856764

Sergey Pimkov (sergey-pimkov) wrote :

It seems that TCP keepalive settings alone are not enough to provide good failure tolerance. For example, in my OpenStack cluster, nova-conductor and the neutron agents always get stuck with some unacknowledged TCP traffic outstanding, so the TCP keepalive timer is never started. The services only began to work again after 900 seconds.

This problem is explained on Stack Overflow: http://stackoverflow.com/questions/16320039/getting-disconnection-notification-using-tcp-keep-alive-on-write-blocked-socket

Currently I use a hacky workaround: setting TCP_USER_TIMEOUT with a hardcoded value on the socket in the amqp library (the patch is attached). Is there a more elegant way to solve this problem? Thank you!
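
For reference, here is a minimal sketch of the idea behind the workaround. It is not the attached patch itself. The helper name set_tcp_user_timeout and the 30000 ms value are only assumptions for illustration, and the option is Linux-specific (the value is in milliseconds).

```python
import socket
import sys

# Option number from <linux/tcp.h>. Python exposes socket.TCP_USER_TIMEOUT
# only on Linux (3.6+), so fall back to the raw value when it is missing.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_tcp_user_timeout(sock, timeout_ms=30000):
    """Hypothetical helper: limit how long sent data may stay unacknowledged.

    Returns the value the kernel reports back, or None when the option
    is not available on this platform.
    """
    if not sys.platform.startswith("linux"):
        return None
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
        return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)
    except OSError:
        return None
```

With this option set, the kernel should abort a connection whose sent data stays unacknowledged for longer than the timeout, instead of retransmitting until the default limit is reached. The value would ideally come from configuration rather than being hardcoded.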