Comment 23 for bug 777672

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 777672] libmemcached autoeject/conn retry problem

Thanks, let me take a look at this. From a first glance it looks good. I need to see if there is any logical reason as to why next should not be reset like this.

On Sep 7, 2011, at 7:39 AM, David Matthew Bond wrote:

> The retry is failing in the 0.51 libmemcached Release.
> The retry is also failing with Branched 953 revision(s) downloaded with bzr branch lp:libmemcached. (Built on SUSE Linux 11 SP1).
> The problem is reproducable.
>
> Scenario:
> Single Memcached Server (running on localhost).
> Application gets a memcached connection (we use
> memcached_set : SUCCESS
>>> Stop the memecached Server.
> memcached_set: UNKNOWN READ FAILURE
> memcached_set: CONNECTION FAILURE
> memcached_set: SERVER IS MARKED DEAD
> memcached_set: SERVER IS MARKED DEAD
> memcached_set: SERVER IS MARKED DEAD
>>> Next Retry is due.
> memcached_set: CONNECTION FAILURE
> memcached_set: SERVER IS MARKED DEAD
> memcached_set: SERVER IS MARKED DEAD
>>> memecached Server is started.
> memcached_set: SERVER IS MARKED DEAD
>>> Next Retry is due.
> memcached_set: CONNECTION FAILURE
> memcached_set: SERVER IS MARKED DEAD
> memcached_set: SERVER IS MARKED DEAD
> etc.
>
> Reason: in connect.cc method network_connect there is a loop over all
> the address_info objects. However ptr->address_info_next is always NULL
> since the reconnect attempt which resulted in the CONNECTION FAILED
> error as this iterated throuch all the available addres_info objects and
> advanced address_info_next to NULL which never gets reset.
>
> /* Create the socket */
> while (ptr->address_info_next && ptr->fd == INVALID_SOCKET)
>
> To solve the problem the ptr->address_info_next needs to be reset to the first address_info
> ptr->address_info_next= ptr->address_info;
> ptr->state= MEMCACHED_SERVER_STATE_ADDRINFO;
>
> Then it works fine again.
>
> Attached is a diff as a suggested patch - however only first looked at the libmemcached code yesterday so there may be a much better way to fix this by someone with more experience.
> Attached two test programs to reproduce this. One using a memcached pool and one without.
>
> Matt
>
>
> ** Patch added: "libmemcached_Branched953revision.diff"
> https://bugs.launchpad.net/libmemcached/+bug/777672/+attachment/2367449/+files/libmemcached_Branched953revision.diff
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/777672
>
> Title:
> libmemcached autoeject/conn retry problem
>
> Status in libmemcached - A C and C++ client library for memcached:
> Fix Committed
>
> Bug description:
> libmemcached's auto eject + retry support seems broken at the moment.
> I found other reports of the same issue:
> https://github.com/lericson/pylibmc/issues/32 and the a patch from
> someone that does almost the same thing: http://code.google.com/p
> /python-libmemcached/source/browse/trunk/patches/fail_over.patch.
>
> My test setup for this was 3 memcached services running on the same box on different ports. The client was using ketama
> remove_failed_servers, failure_limit of 2, and retry_timeout of 10.
>
> Basically if I shut down one of the memcached services, the client
> would always keep on trying to set to it and return
> MEMCACHED_SERVER_MARKED_DEAD; What I mean is it didn't try
> reconnecting to it, it just tried setting the key to a server it
> thought was live but was actually marked dead internally.
>
> Traced the problem a bit and turns out the retry timeout in this case
> is never set. Also the code repeatedly simply calls
> set_last_disconnected_host() and never gets to the switch gate at the
> end of memcached_connect() so it'd try reconnecting.
>
> Attached patch got both issues solved for me and now in case I turn
> off a server it drops it from the list for the retry timeout wait
> period after first having returned MEMCACHED_SERVER_MARKED_DEAD once.
> It also cleanly reconnects after the timeout if I start the memcached
> service up again.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/libmemcached/+bug/777672/+subscriptions