libmemcached autoeject/conn retry problem

Bug #777672 reported by Hannu Valtonen
This bug affects 7 people

Affects: libmemcached
Status: Fix Released
Importance: Medium
Assigned to: Brian Aker

Bug Description

libmemcached's auto eject + retry support seems broken at the moment. I found other reports of the same issue: https://github.com/lericson/pylibmc/issues/32, and a patch from someone that does almost the same thing: http://code.google.com/p/python-libmemcached/source/browse/trunk/patches/fail_over.patch.

My test setup for this was 3 memcached services running on the same box on different ports. The client was using ketama, remove_failed_servers, a failure_limit of 2, and a retry_timeout of 10.
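For context, a minimal sketch of that setup using the old behavior API (the ports and the exact eject constant, MEMCACHED_BEHAVIOR_AUTO_EJECT_HOSTS, are assumptions; the report only names the settings):

  #include <libmemcached/memcached.h>

  memcached_st *memc= memcached_create(NULL);

  /* Three memcached daemons on one box, on different ports (ports assumed). */
  memcached_server_add(memc, "localhost", 11211);
  memcached_server_add(memc, "localhost", 11212);
  memcached_server_add(memc, "localhost", 11213);

  /* Ketama hashing plus the failover knobs from the report. */
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_KETAMA, 1);
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_AUTO_EJECT_HOSTS, 1);   /* "remove failed servers" */
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_SERVER_FAILURE_LIMIT, 2);
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT, 10);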

Basically, if I shut down one of the memcached services, the client would always keep trying to set to it and return MEMCACHED_SERVER_MARKED_DEAD. What I mean is that it didn't try reconnecting to it; it just tried setting the key to a server it thought was live but that was actually marked dead internally.

I traced the problem a bit, and it turns out the retry timeout in this case is never set. Also, the code repeatedly just calls set_last_disconnected_host() and never reaches the switch gate at the end of memcached_connect(), where it would try reconnecting.

The attached patch solved both issues for me: now, if I turn off a server, the client drops it from the list for the retry timeout wait period after first having returned MEMCACHED_SERVER_MARKED_DEAD once. It also cleanly reconnects after the timeout if I start the memcached service up again.

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :
Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 777672] Re: libmemcached autoeject/conn retry problem

Hi!

Have you tried running the test suite once you have applied your patch?

The auto-eject support was a patch from Twitter, and the behavior is not very well defined, but there are test cases in place to make sure that, to some degree, it continues to work. In the latest version of libmemcached we have memcached_create("--REMOVE-FAILED-SERVERS=3"), which has better-defined behavior (if you are using the older behavior-set system, this would be MEMCACHED_BEHAVIOR_REMOVE_FAILED_SERVERS).
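For reference, a hedged sketch of the newer knob Brian names (whether the behavior takes a failure count, as the option string suggests, may vary by libmemcached version):

  memcached_st *memc= memcached_create(NULL);
  memcached_server_add(memc, "localhost", 11211);
  /* Assumed, per the --REMOVE-FAILED-SERVERS=3 option string, to take the
     failure count; check the documentation of your libmemcached version. */
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_REMOVE_FAILED_SERVERS, 3);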

I am on vacation this week, so my responses will be a bit slow. I'll toss the patch into the regression system later on and see what it shows. From a glance, I don't think you want to have the additional if on the last_disconnected_host, since that would set up a situation where one host gets rejected, but thereafter no other host can be rejected.

Cheers,
 -Brian

On May 4, 2011, at 11:03 PM, Hannu Valtonen wrote:

> ** Patch added: "libmemcached_ha.diff"
> https://bugs.launchpad.net/bugs/777672/+attachment/2114022/+files/libmemcached_ha.diff

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :

The test suite (./testapp) still passes (i.e. 0 failed, 4 skipped, 1367 succeeded). I'm guessing the test set isn't really that comprehensive.

And yes, I'm still using the old behavior system.

Anyway, the two problems I see are that a) the next_retry time is never set, and b) right now we never get to the switch statement at the end of memcached_connect(), i.e. we never reconnect. It basically always ends up in one of the "return MEMCACHED_SERVER_MARKED_DEAD;" code branches.
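To illustrate the control flow being described (a hedged sketch; the structure and the marked-dead check are assumptions, not the actual connect.cc code):

  /* Hypothetical shape of the broken flow: once a host is marked dead,
     the early return fires on every call, so the switch at the end of
     memcached_connect() (which would drive a reconnect) is unreachable
     and next_retry is never consulted. Names here are assumptions. */
  if (server_is_marked_dead(server))
  {
    set_last_disconnected_host(server);
    return MEMCACHED_SERVER_MARKED_DEAD;  /* taken forever; no retry path */
  }

  switch (server->type)  /* never reached for a dead host */
  {
    /* ... transport-specific connect and retry handling ... */
  }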

There are probably some issues with my patch too; it doesn't call update_continuum() and friends in all the right places either.

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :

Worked on this some more. Now it does what I expect in all the situations I've been able to test. It's also no longer dependent on last_disconnected_host. It changes the program flow a bit, as explained in the new comment in memcached_connect():

  /* When server_failure_limit is set, retry_timeout/failure_limit work as follows:

     1) The client loops until failure_limit has been reached.
     2) The server gets removed from the hash continuum.
     3) memcached_servers_reconnect() tries again after the retry timeout.
     4) Reconnecting to the server resets failure_limit and adds the server back to the continuum.

     In case retry_timeout is set but failure_limit is not:
     1) We fail once.
     2) Until the timeout is reached again, we immediately hit MEMCACHED_SERVER_MARKED_DEAD.
     3) Then we try reconnecting. */
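Under this flow, a simple driver loop (hypothetical; memc is assumed to be configured as in the original report) should return MEMCACHED_SERVER_MARKED_DEAD for roughly the retry_timeout period and then attempt a reconnect:

  /* Hypothetical driver: with retry_timeout=10, expect ~10 seconds of
     MARKED DEAD results, then a reconnect attempt on the next call. */
  for (int i= 0; i < 30; i++)
  {
    memcached_return_t rc= memcached_set(memc, "key", 3, "val", 3,
                                         (time_t)0, (uint32_t)0);
    fprintf(stderr, "memcached_set: %s\n", memcached_strerror(memc, rc));
    sleep(1);
  }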

The regression tests pass, except for the test on line 5462 that exercises the server failure limit. This comes from the fact that the test directly moves the failure limit to 2 and then calls memcached_connect() again with the same server. The new code path would already have removed the server in question from the hash continuum at the point when the failure limit rises from 1 to 2, and would return MEMCACHED_SERVER_MARKED_DEAD. The reason this doesn't happen now is that the current code can connect to the server (it isn't really dead) in that call, which is never made in the new code path since the server in question has been removed from the hash continuum.

With the patch it always reconnects and calls update_distribution() correctly, as far as I can tell. Anyway, we can fix the tests up, but what do you think of the general approach?

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :

Newer patch

Revision history for this message
Brian Aker (brianaker) wrote :

BTW, I have started to review this and look at the issue.

Changed in libmemcached:
assignee: nobody → Brian Aker (brianaker)
importance: Undecided → Medium
Revision history for this message
Florian Munz (theflow) wrote :

Hello,

I ran into the same problem. Auto eject has been broken since 0.39, and I also can't get MEMCACHED_BEHAVIOR_REMOVE_FAILED_SERVERS to work in 0.49; it always returns a SERVER IS MARKED DEAD error and never removes the server.

Would be great to get this sorted.

thanks,
Florian

Revision history for this message
Ilia Alshanetsky (ilia-1) wrote :

I've updated the patch to work with version 0.5.

Revision history for this message
Matt Reiferson (snakes) wrote :

Hi,

Just wanted to throw my hat into the ring here and see if I can lend a hand in helping test or shepherd this patch through into a release.

We're seeing similar issues around the use of this setting.

Thanks,

Matt

Revision history for this message
Christian Holm (cho) wrote :

Any chance of getting the patch integrated? This seems to be a very important bug in the library that has been left unfixed for quite some time.

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 777672] libmemcached autoeject/conn retry problem

Hi!

The big problem with fixing this is the following:

1) We don't have a defined behavior. What should the behavior be, and how should it fail?

2) We need some test cases to double check the behavior.

If we have the above, we can define the solution (which has always been the problem with auto-eject; the original patch from Twitter was not complete).

Cheers,
 -Brian


Revision history for this message
Brian Aker (brianaker) wrote :

In memcached_servers_reconnect(), errors on connection are not being handled at all. Do you have any thoughts on handling that?
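One possibility, sketched under assumptions (this is not the actual libmemcached code; the per-server accessor is a placeholder for illustration), would be to remember and return the first connect failure instead of discarding it:

  /* Hypothetical sketch: surface connect errors from a reconnect sweep. */
  memcached_return_t rc= MEMCACHED_SUCCESS;
  for (uint32_t x= 0; x < memcached_server_count(memc); x++)
  {
    /* server_instance() is a placeholder, not a real libmemcached call. */
    memcached_return_t server_rc= memcached_connect(server_instance(memc, x));
    if (server_rc != MEMCACHED_SUCCESS && rc == MEMCACHED_SUCCESS)
      rc= server_rc;  /* keep the first failure for the caller */
  }
  return rc;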

Revision history for this message
Brian Aker (brianaker) wrote :

We tried the latest version of this patch, and it hangs the regression tests. There is also an issue with it not creating an error saying which host has failed (when a failure happens).

Revision history for this message
Brian Aker (brianaker) wrote :

Additional note: user_supplied_bug10() fails after this patch is added. The patch removes the failure logic around timeouts (connect() will just hang once the patch is applied).

Revision history for this message
Brian Aker (brianaker) wrote :
Brian Aker (brianaker)
Changed in libmemcached:
status: New → In Progress
status: In Progress → Fix Committed
Revision history for this message
Matt Reiferson (snakes) wrote :

Brian,

I wanted to mention another failure scenario that seems relevant to this issue.

We've been experimenting against this test daemon, https://gist.github.com/1158479, to simulate a situation where connections are accepted but no clients ever receive a response.

From my perusal of the code, it seems like the problem is that server_failure_limit is zeroed upon a successful connection; my feeling is that it should only be cleared upon a successful response (i.e. at the end of io_read()).
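A minimal sketch of what is being proposed (the field and function names are assumptions based on this comment, not the real libmemcached internals):

  /* Hypothetical: clear the failure counter only after a complete,
     well-formed response has been read, not when connect() succeeds.
     A daemon can accept connections yet never reply, so a successful
     connect alone proves nothing. */
  static void io_read_completed(memcached_server_st *server)
  {
    server->server_failure_counter= 0;  /* reset here, at the end of io_read() */
  }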

Thoughts?

-Matt

Revision history for this message
David Matthew Bond (mattbond) wrote :

The retry is failing in the 0.51 libmemcached release.
The retry is also failing with branched revision 953, downloaded with bzr branch lp:libmemcached (built on SUSE Linux 11 SP1).
The problem is reproducible.

Scenario:
Single Memcached Server (running on localhost).
Application gets a memcached connection (we use
memcached_set : SUCCESS
>> Stop the memcached server.
memcached_set: UNKNOWN READ FAILURE
memcached_set: CONNECTION FAILURE
memcached_set: SERVER IS MARKED DEAD
memcached_set: SERVER IS MARKED DEAD
memcached_set: SERVER IS MARKED DEAD
>> Next Retry is due.
memcached_set: CONNECTION FAILURE
memcached_set: SERVER IS MARKED DEAD
memcached_set: SERVER IS MARKED DEAD
>> memcached server is started.
memcached_set: SERVER IS MARKED DEAD
>> Next Retry is due.
memcached_set: CONNECTION FAILURE
memcached_set: SERVER IS MARKED DEAD
memcached_set: SERVER IS MARKED DEAD
etc.

Reason: in connect.cc, the network_connect() method has a loop over all the address_info objects. However, ptr->address_info_next is always NULL after the reconnect attempt that resulted in the CONNECTION FAILURE error, because that attempt iterated through all the available address_info objects and advanced address_info_next to NULL, and it never gets reset.

  /* Create the socket */
  while (ptr->address_info_next && ptr->fd == INVALID_SOCKET)

To solve the problem, ptr->address_info_next needs to be reset to the first address_info (and the server state rewound):
  ptr->address_info_next= ptr->address_info;
  ptr->state= MEMCACHED_SERVER_STATE_ADDRINFO;

Then it works fine again.
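A sketch of where that reset could sit in network_connect() (the surrounding structure is paraphrased from this report, not the actual connect.cc source):

  /* Sketch only: rewind the resolved-address list before the socket loop
     so a reconnect after a CONNECTION FAILURE can iterate over it again. */
  if (ptr->address_info_next == NULL && ptr->address_info)
  {
    ptr->address_info_next= ptr->address_info;
    ptr->state= MEMCACHED_SERVER_STATE_ADDRINFO;
  }

  /* Create the socket */
  while (ptr->address_info_next && ptr->fd == INVALID_SOCKET)
  {
    /* ... existing per-address connect attempt ... */
  }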

Attached is a diff as a suggested patch; however, I only first looked at the libmemcached code yesterday, so there may be a much better way to fix this for someone with more experience.
Also attached are two test programs to reproduce this: one using a memcached pool and one without.

Matt

Revision history for this message
David Matthew Bond (mattbond) wrote :
Revision history for this message
David Matthew Bond (mattbond) wrote :
Revision history for this message
Brian Aker (brianaker) wrote :

It would be helpful if you could provide a test case based on what is currently in lp:libmemcached (which will likely become version 0.52 shortly).

Revision history for this message
Brian Aker (brianaker) wrote :

Thanks, let me take a look at this. At first glance it looks good. I need to see if there is any logical reason why next should not be reset like this.

On Sep 7, 2011, at 7:39 AM, David Matthew Bond wrote:

> ** Patch added: "libmemcached_Branched953revision.diff"
> https://bugs.launchpad.net/libmemcached/+bug/777672/+attachment/2367449/+files/libmemcached_Branched953revision.diff

Brian Aker (brianaker)
Changed in libmemcached:
status: Fix Committed → Fix Released
Revision history for this message
TONY (liuning-l) wrote :

Hi, Brian.
I downloaded the latest version (0.52), but it still has the same problem.

I tested it under Ubuntu 11.04 + PHP 5.3 + memcached 2.0 + libmemcached 0.52.

Revision history for this message
Ondrej Holecek (ondrej-holecek) wrote :

We had the same problem with 0.44. First, I made my own patch, which worked. Then I discovered the patch attached to this bug in comment #5, and it fixed the issue as well. It seems to be more comprehensive, so in the end we applied that patch.

I did not try 0.52, but IMO the final patch is not sufficient. We are still on a patched 0.44, happy with the older #5 patch.

Revision history for this message
TONY (liuning-l) wrote :

To Ondrej Holecek (ondrej-holecek):

Thank you for sharing. I will try it.
