Mirantis OpenStack

Comment 21 for bug 856764

Sergey Pimkov (sergey-pimkov) wrote on 2013-12-27:

Attachment: transport.py.patch (458 bytes, text/plain)

It seems that TCP keepalive settings alone are not enough to provide good failure tolerance. For example, in my OpenStack cluster, nova-conductor and the neutron agents always get stuck with some unacknowledged TCP traffic in flight, so the TCP keepalive timer is never started. The services began working again only after 900 seconds.

This problem was explained on Stack Overflow: http://stackoverflow.com/questions/16320039/getting-disconnection-notification-using-tcp-keep-alive-on-write-blocked-socket

Currently I use a hacky workaround: setting TCP_USER_TIMEOUT to a hardcoded value on the socket in the amqp library (the patch is attached). Is there a more elegant way to solve this problem? Thank you!
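
For illustration, here is a minimal sketch of the kind of workaround described above. It is not the attached patch. It assumes Linux, where TCP_USER_TIMEOUT limits how long transmitted data may stay unacknowledged before the kernel closes the connection. The helper name set_user_timeout and the 30000 ms value are hypothetical, and the fallback option number 18 is the Linux value for Python builds that do not export the constant.

```python
import socket

# Assumption: Linux only. Newer Python versions export socket.TCP_USER_TIMEOUT;
# otherwise fall back to the Linux option number 18 from <linux/tcp.h>.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock, timeout_ms=30000):
    """Set TCP_USER_TIMEOUT (in milliseconds) on a TCP socket.

    Hypothetical helper: the 30000 ms default is an illustrative value,
    not one taken from the attached patch.
    Returns the value the kernel reports back for the option.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The option can be set before connect(); it applies to the
        # connection once it is established.
        print(set_user_timeout(s))
    finally:
        s.close()
```

With the option set, a write-blocked connection to a dead peer should fail after roughly the configured timeout instead of waiting for the kernel's retransmission limit. A less hacky variant would read the timeout from configuration rather than hardcoding it.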