When an IO error is encountered, the server state is reset to MEMCACHED_SERVER_STATE_NEW even if it is currently MEMCACHED_SERVER_STATE_IN_TIMEOUT. The subsequent call to memcached_mark_server_for_timeout then incorrectly pushes the next connection retry time further back and increments the server failure counter again. This throws out the connection back-off handling: it looks as though another failure has occurred when in fact we're still dealing with the same in-progress failure.
This may only manifest as a problem when using consistent distribution, because of the point at which the continuum is recalculated - I haven't tested with any of the other distribution options. It's probably also a more obvious problem when making use of the dead-server retry behaviour included in 1.0.3+. In a nutshell, after a server in the pool is taken offline it should be possible to observe that retries do not occur at the expected intervals and that failure counts are inaccurate.
This may well be fixing the symptom rather than the cause, but I have had the change running in production for quite some time now with no apparent side-effects. I do understand that these changes cause at least some of the tests to fail, though, which certainly warrants further investigation.
I've been meaning to find the time to put together a proper example test case and results for this, but that has been proving impossible of late. I still wanted to get the issue logged, though - please let me know if I haven't been clear enough here or can provide any more useful information.
I patched io.cc and quit.cc to work around this as part of the following commit to my branch: http://bazaar.launchpad.net/~trevor/libmemcached/dead-retry/revision/978