Comment 4 for bug 777672

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :

Worked on this some more. Now it does what I expect on all the situations I've been able to test. It's also no longer dependent on last_disconnected_host. It changes the program flow a bit as explained in the new comment in memcached_connect()

  /* In case server_failure_limit is set, the retry_timeout/failure_limit works by

     1) client loops until failure_limit has been reached
     2) server gets removed from hash continuum
     3) memcached_servers_reconnect() tries again after retry timeout
     4) Reconnecting to the server resets failure_limit and adds the server to the continuum

     In case timeout is set but failure_limit is not:
     1) We fail once
     2) Until timeout is reached again, we immediately hit MEMCACHED_SERVER_MARKED_DEAD
     3) Try reconnecting */

The regression tests pass except the test on line 5462 testing failure server limit. This is coming from the fact that the tests directly move the failure limit to 2 and then call memcached_connect again with the same server. The new codepath would already remove the server in question from the hash continuum at the point when failure limit rises from 1 to 2, and return memcached_MEMCACHED_SERVER_MARKED_DEAD. The reason this doesn't happen now is that it can connect to the server (it isn't really dead) in the call.. which is never made in the new codepath since the server in question has been removed from the hash continuum.

With the patch it always reconnects and calls update_distribution correctly afaict. Anyway, we can fix the tests up but what do you think of the general approach?