Thanks, let me take a look at this. From a first glance it looks good. I need to see if there is any logical reason as to why next should not be reset like this. On Sep 7, 2011, at 7:39 AM, David Matthew Bond wrote: > The retry is failing in the 0.51 libmemcached Release. > The retry is also failing with Branched 953 revision(s) downloaded with bzr branch lp:libmemcached. (Built on SUSE Linux 11 SP1). > The problem is reproducable. > > Scenario: > Single Memcached Server (running on localhost). > Application gets a memcached connection (we use > memcached_set : SUCCESS >>> Stop the memecached Server. > memcached_set: UNKNOWN READ FAILURE > memcached_set: CONNECTION FAILURE > memcached_set: SERVER IS MARKED DEAD > memcached_set: SERVER IS MARKED DEAD > memcached_set: SERVER IS MARKED DEAD >>> Next Retry is due. > memcached_set: CONNECTION FAILURE > memcached_set: SERVER IS MARKED DEAD > memcached_set: SERVER IS MARKED DEAD >>> memecached Server is started. > memcached_set: SERVER IS MARKED DEAD >>> Next Retry is due. > memcached_set: CONNECTION FAILURE > memcached_set: SERVER IS MARKED DEAD > memcached_set: SERVER IS MARKED DEAD > etc. > > Reason: in connect.cc method network_connect there is a loop over all > the address_info objects. However ptr->address_info_next is always NULL > since the reconnect attempt which resulted in the CONNECTION FAILED > error as this iterated throuch all the available addres_info objects and > advanced address_info_next to NULL which never gets reset. > > /* Create the socket */ > while (ptr->address_info_next && ptr->fd == INVALID_SOCKET) > > To solve the problem the ptr->address_info_next needs to be reset to the first address_info > ptr->address_info_next= ptr->address_info; > ptr->state= MEMCACHED_SERVER_STATE_ADDRINFO; > > Then it works fine again. > > Attached is a diff as a suggested patch - however only first looked at the libmemcached code yesterday so there may be a much better way to fix this by someone with more experience. > Attached two test programs to reproduce this. One using a memcached pool and one without. > > Matt > > > ** Patch added: "libmemcached_Branched953revision.diff" > https://bugs.launchpad.net/libmemcached/+bug/777672/+attachment/2367449/+files/libmemcached_Branched953revision.diff > > -- > You received this bug notification because you are a bug assignee. > https://bugs.launchpad.net/bugs/777672 > > Title: > libmemcached autoeject/conn retry problem > > Status in libmemcached - A C and C++ client library for memcached: > Fix Committed > > Bug description: > libmemcached's auto eject + retry support seems broken at the moment. > I found other reports of the same issue: > https://github.com/lericson/pylibmc/issues/32 and the a patch from > someone that does almost the same thing: http://code.google.com/p > /python-libmemcached/source/browse/trunk/patches/fail_over.patch. > > My test setup for this was 3 memcached services running on the same box on different ports. The client was using ketama > remove_failed_servers, failure_limit of 2, and retry_timeout of 10. > > Basically if I shut down one of the memcached services, the client > would always keep on trying to set to it and return > MEMCACHED_SERVER_MARKED_DEAD; What I mean is it didn't try > reconnecting to it, it just tried setting the key to a server it > thought was live but was actually marked dead internally. > > Traced the problem a bit and turns out the retry timeout in this case > is never set. Also the code repeatedly simply calls > set_last_disconnected_host() and never gets to the switch gate at the > end of memcached_connect() so it'd try reconnecting. > > Attached patch got both issues solved for me and now in case I turn > off a server it drops it from the list for the retry timeout wait > period after first having returned MEMCACHED_SERVER_MARKED_DEAD once. > It also cleanly reconnects after the timeout if I start the memcached > service up again. > > To manage notifications about this bug go to: > https://bugs.launchpad.net/libmemcached/+bug/777672/+subscriptions