When an IO error is encountered, the server state is reset to MEMCACHED_SERVER_STATE_NEW even if it is currently MEMCACHED_SERVER_STATE_IN_TIMEOUT. The subsequent call to memcached_mark_server_for_timeout then incorrectly pushes the next connection retry time further back and increments the server failure counter again. This throws out the connection back-off handling: it looks as though another failure has occurred when in fact we're still dealing with the same in-progress failure.
This may only manifest as a problem when using consistent distribution, because of the point at which the continuum is recalculated - I haven't tested with any of the other distribution options. It's probably also a more obvious problem when making use of the dead-server retry behaviour included in 1.0.3+. In a nutshell, after a server in the pool is taken offline it should be possible to observe that retries do not occur at the expected intervals and that failure counts are inaccurate.
This may well be fixing the symptom rather than the cause, but I have had the change running in production for quite some time now with no apparent side-effects. I do understand that these changes cause at least some of the tests to fail, though, which certainly warrants further investigation.
I've been meaning to find the time to put together a proper example test case and results for this, but that has been proving impossible of late. I still wanted to get the issue logged, though - please let me know if I haven't been clear enough here or can provide any more useful information.
I patched io.cc and quit.cc to work around this as part of the following commit to my branch: http://bazaar.launchpad.net/~trevor/libmemcached/dead-retry/revision/978