Comment 9 for bug 881983

Brian Aker (brianaker) wrote : Re: [Bug 881983] libmemcached resets continuum with dead server

Thanks, I will take a look at it.

Making it a new behavior makes it much more likely that I can accept it.

Cheers,
 -Brian

On Dec 12, 2011, at 1:36 PM, Trevor North wrote:

> I have a branch which more or less achieves what is described above by
> way of a new dead server retry timeout behaviour.
>
> As per the current behaviour with consistent distribution and auto
> ejection, keys on the dead server are moved after we hit the initial
> failure limit by taking that host out of the continuum. We then reset
> the continuum to force a retry every time we hit the dead server retry
> timeout in the same manner as is done for standard connection retries.
> Each dead server retry will result in a miss if the host is still
> unavailable, which isn't ideal, but I wanted to achieve this whilst
> maintaining compatibility with current behaviour, so I have kept the
> changes to a bare minimum.
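
A minimal sketch of the retry behaviour described above, assuming a simple per-server record. The class name `ServerState`, its methods, and the `failure_limit` and `dead_retry_timeout` parameters are hypothetical and chosen for illustration only; this is not libmemcached's actual data structure or API. The server is ejected once the failure limit is reached, and is put back for a single probe each time the dead retry timeout elapses.

```python
import time


class ServerState(object):
    """Hypothetical per-server failure tracking (not libmemcached's real structures)."""

    def __init__(self, failure_limit=3, dead_retry_timeout=30.0, clock=time.time):
        self.failure_limit = failure_limit
        self.dead_retry_timeout = dead_retry_timeout
        self.clock = clock          # injectable so the behaviour can be tested
        self.failures = 0
        self.dead_since = None      # None while the server is considered alive

    def in_continuum(self):
        """Return True if the server should currently receive keys."""
        if self.dead_since is None:
            return True
        # Once the dead retry timeout has elapsed, re-admit the server for a probe.
        return self.clock() - self.dead_since >= self.dead_retry_timeout

    def record_failure(self):
        """Count a failure; eject (or re-eject after a failed probe) at the limit."""
        self.failures += 1
        if self.failures >= self.failure_limit:
            self.dead_since = self.clock()  # restart the dead retry timer

    def record_success(self):
        """A successful request brings the server fully back to life."""
        self.failures = 0
        self.dead_since = None
```

In this sketch a failed probe costs exactly one miss before the server is ejected again, which matches the trade-off noted above.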
>
> It's worth noting here that there are a couple of instances where an IO
> failure would incorrectly reset a server state to new even if it was
> already in timeout. I've corrected this when setting the state, although
> I suspect that in some cases the IO in question shouldn't be attempted
> in the first place.
>
> I've made no attempt to leave keys on their newly allocated servers once
> the dead server is brought back to life and I don't believe it would be
> sensible to do so. With multiple clients running, network flapping
> would result in effectively random distribution if we attempted to do
> this, negating the point of using consistent distribution.
>
> Bar the correction to the server state reset on IO failure when in
> timeout, the changes introduced do not alter the behaviour currently seen
> if the new dead retry timeout is not used, so they should be completely
> backwards compatible.
>
> The branch is available at
> https://code.launchpad.net/~trevor/libmemcached/dead-retry and I've
> attached a patch which will apply to 1.0.2.
>
> Feedback would be welcome as ideally this isn't something I want to have
> to maintain separately.
>
> ** Attachment added: "Add dead server retry"
> https://bugs.launchpad.net/libmemcached/+bug/881983/+attachment/2629812/+files/backoff-dead-reconnect
>
> Title:
> libmemcached resets continuum with dead server
>
> Status in libmemcached - A C and C++ client library for memcached:
> New
> Status in “libmemcached” package in Ubuntu:
> New
>
> Bug description:
> Testing with pylibmc, using libmemcached 0.53:
>
> import pylibmc
> hosts = ["IP_of_host_1_here", "IP_of_host_2_here", "IP_of_host_3_here"]
> mc = pylibmc.Client(hosts, binary=False)
> mc.behaviors['remove_failed'] = 3
> mc.behaviors['hash'] = 'md5'
> mc.behaviors['distribution'] = 'consistent ketama'
>
> last_exception = None
> # Loop forever, printing either the stored value or the error returned.
> while True:
>     try:
>         mc.set("key", "value")
>         print(mc.get("key"))
>     except Exception as e:
>         print(e)
>
> With 3 servers running, this works fine.
> Take down the server handling the key, and libmemcached returns
> error 47 from memcached_set: SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
> until the number of retries has been reached, at which point the server is removed from the pool and the continuum is recalculated.
>
> A different server starts handling the key, until... the retry timeout
> expires again, at which point the continuum is recalculated with the
> dead server back in, and now all calls fail with
>
> error 35 from memcached_set: SERVER IS MARKED DEAD
>
> What should happen is that after the timeouts there should be a single
> return of
>
> error 35 from memcached_set: SERVER IS MARKED DEAD
>
> After which the continuum is recalculated and values go to the new
> server.
>
> The fix is to mark the server as dead, and exclude dead servers whenever
> recalculating the continuum (this only works for consistent
> distributions, but that's what I'm using).
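
The proposed fix can be illustrated with a small consistent-hash ring that skips dead servers whenever it is rebuilt. This is only a sketch under stated assumptions: `build_continuum` and `lookup` are hypothetical helpers, the point count and the way points are hashed are arbitrary, and none of this reproduces libmemcached's actual ketama implementation.

```python
import bisect
import hashlib


def _hash32(text):
    """Hash a string to a 32-bit integer using the first 8 hex digits of its MD5."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest()[:8], 16)


def build_continuum(servers, dead=(), points_per_server=100):
    """Return a sorted list of (hash, server) points, skipping dead servers."""
    ring = []
    for server in servers:
        if server in dead:
            continue  # dead servers contribute no points to the continuum
        for i in range(points_per_server):
            ring.append((_hash32("%s-%d" % (server, i)), server))
    ring.sort()
    return ring


def lookup(ring, key):
    """Map a key to the first point at or after its hash, wrapping around."""
    if not ring:
        raise LookupError("no live servers in the continuum")
    idx = bisect.bisect_left(ring, (_hash32(key), ""))
    return ring[idx % len(ring)][1]


# Usage: rebuild without the dead server so its keys move to a live one.
servers = ["host1", "host2", "host3"]
ring = build_continuum(servers)
owner = lookup(ring, "key")
ring_without_owner = build_continuum(servers, dead={owner})
new_owner = lookup(ring_without_owner, "key")
```

Because only the dead server's points are dropped, keys owned by the remaining servers keep their placement, and only the dead server's keys are redistributed.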