libmemcached autoeject/conn retry problem

Bug #777672 reported by Hannu Valtonen
This bug affects 7 people

Affects: libmemcached
Status: Fix Released
Importance: Medium
Assigned to: Brian Aker

Bug Description

libmemcached's auto eject + retry support seems broken at the moment. I found other reports of the same issue: https://github.com/lericson/pylibmc/issues/32, and a patch from someone that does almost the same thing: http://code.google.com/p/python-libmemcached/source/browse/trunk/patches/fail_over.patch.

My test setup for this was 3 memcached services running on the same box on different ports. The client was using ketama, remove_failed_servers, a failure_limit of 2, and a retry_timeout of 10.
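For context, a minimal sketch of that setup using the old behavior API (the ports and the exact eject constant, MEMCACHED_BEHAVIOR_AUTO_EJECT_HOSTS, are assumptions; the report only names the settings):

  #include <libmemcached/memcached.h>

  memcached_st *memc= memcached_create(NULL);

  /* Three memcached daemons on one box, on different ports (ports assumed). */
  memcached_server_add(memc, "localhost", 11211);
  memcached_server_add(memc, "localhost", 11212);
  memcached_server_add(memc, "localhost", 11213);

  /* Ketama hashing plus the failover knobs from the report. */
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_KETAMA, 1);
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_AUTO_EJECT_HOSTS, 1);   /* "remove failed servers" */
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_SERVER_FAILURE_LIMIT, 2);
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT, 10);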

Basically, if I shut down one of the memcached services, the client would always keep trying to set to it and return MEMCACHED_SERVER_MARKED_DEAD. What I mean is that it didn't try reconnecting to it; it just tried setting the key to a server it thought was live but that was actually marked dead internally.

I traced the problem a bit, and it turns out the retry timeout in this case is never set. Also, the code repeatedly just calls set_last_disconnected_host() and never reaches the switch gate at the end of memcached_connect(), where it would try reconnecting.

The attached patch solved both issues for me: now, if I turn off a server, the client drops it from the list for the retry timeout wait period after first having returned MEMCACHED_SERVER_MARKED_DEAD once. It also cleanly reconnects after the timeout if I start the memcached service up again.

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :
Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 777672] Re: libmemcached autoeject/conn retry problem

Hi!

Have you tried running the test suite once you have applied your patch?

The auto-eject support was a patch from Twitter, and the behavior is not very well defined, but there are test cases in place to make sure that, to some degree, it continues to work. In the latest version of libmemcached we have memcached_create("--REMOVE-FAILED-SERVERS=3"), which has better-defined behavior (if you are using the older behavior-set system, this would be MEMCACHED_BEHAVIOR_REMOVE_FAILED_SERVERS).
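For reference, a hedged sketch of the newer knob Brian names (whether the behavior takes a failure count, as the option string suggests, may vary by libmemcached version):

  memcached_st *memc= memcached_create(NULL);
  memcached_server_add(memc, "localhost", 11211);
  /* Assumed, per the --REMOVE-FAILED-SERVERS=3 option string, to take the
     failure count; check the documentation of your libmemcached version. */
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_REMOVE_FAILED_SERVERS, 3);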

I am on vacation this week, so my responses will be a bit slow. I'll toss the patch into the regression system later on and see what it shows. From a glance, I don't think you want to have the additional if on the last_disconnected_host, since that would set up a situation where one host gets rejected, but thereafter no other host can be rejected.

Cheers,
 -Brian

On May 4, 2011, at 11:03 PM, Hannu Valtonen wrote:

> ** Patch added: "libmemcached_ha.diff"
> https://bugs.launchpad.net/bugs/777672/+attachment/2114022/+files/libmemcached_ha.diff

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :

The test suite (./testapp) still passes (i.e. 0 failed, 4 skipped, 1367 succeeded). I'm guessing the test set isn't really that comprehensive.

And yes, I'm still using the old behavior system.

Anyway, the two problems I see are that a) the next_retry time is never set, and b) right now we never get to the switch statement at the end of memcached_connect(), i.e. we never reconnect. It basically always ends up in one of the "return MEMCACHED_SERVER_MARKED_DEAD;" code branches.
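To illustrate the control flow being described (a hedged sketch; the structure and the marked-dead check are assumptions, not the actual connect.cc code):

  /* Hypothetical shape of the broken flow: once a host is marked dead,
     the early return fires on every call, so the switch at the end of
     memcached_connect() (which would drive a reconnect) is unreachable
     and next_retry is never consulted. Names here are assumptions. */
  if (server_is_marked_dead(server))
  {
    set_last_disconnected_host(server);
    return MEMCACHED_SERVER_MARKED_DEAD;  /* taken forever; no retry path */
  }

  switch (server->type)  /* never reached for a dead host */
  {
    /* ... transport-specific connect and retry handling ... */
  }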

There are probably some issues with my patch too; it doesn't call update_continuum() and friends in all the right places either.

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :

Worked on this some more. Now it does what I expect in all the situations I've been able to test. It's also no longer dependent on last_disconnected_host. It changes the program flow a bit, as explained in the new comment in memcached_connect():

  /* When server_failure_limit is set, retry_timeout/failure_limit work as follows:

     1) The client loops until failure_limit has been reached.
     2) The server gets removed from the hash continuum.
     3) memcached_servers_reconnect() tries again after the retry timeout.
     4) Reconnecting to the server resets failure_limit and adds the server back to the continuum.

     In case retry_timeout is set but failure_limit is not:
     1) We fail once.
     2) Until the timeout is reached again, we immediately hit MEMCACHED_SERVER_MARKED_DEAD.
     3) Then we try reconnecting. */
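Under this flow, a simple driver loop (hypothetical; memc is assumed to be configured as in the original report) should return MEMCACHED_SERVER_MARKED_DEAD for roughly the retry_timeout period and then attempt a reconnect:

  /* Hypothetical driver: with retry_timeout=10, expect ~10 seconds of
     MARKED DEAD results, then a reconnect attempt on the next call. */
  for (int i= 0; i < 30; i++)
  {
    memcached_return_t rc= memcached_set(memc, "key", 3, "val", 3,
                                         (time_t)0, (uint32_t)0);
    fprintf(stderr, "memcached_set: %s\n", memcached_strerror(memc, rc));
    sleep(1);
  }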

The regression tests pass, except for the test on line 5462 that exercises the server failure limit. This comes from the fact that the test directly moves the failure limit to 2 and then calls memcached_connect() again with the same server. The new code path would already have removed the server in question from the hash continuum at the point when the failure limit rises from 1 to 2, and would return MEMCACHED_SERVER_MARKED_DEAD. The reason this doesn't happen now is that the current code can connect to the server (it isn't really dead) in that call, which is never made in the new code path since the server in question has been removed from the hash continuum.

With the patch it always reconnects and calls update_distribution() correctly, as far as I can tell. Anyway, we can fix the tests up, but what do you think of the general approach?

Revision history for this message
Hannu Valtonen (hannu-valtonen) wrote :

Newer patch

Revision history for this message
Brian Aker (brianaker) wrote :

BTW, I have started to review this and look at the issue.

Changed in libmemcached:
assignee: nobody → Brian Aker (brianaker)
importance: Undecided → Medium
Revision history for this message
Florian Munz (theflow) wrote :

Hello,

I ran into the same problem. Auto eject has been broken since 0.39, and I also can't get MEMCACHED_BEHAVIOR_REMOVE_FAILED_SERVERS to work in 0.49; it always returns a SERVER IS MARKED DEAD error and never removes the server.

Would be great to get this sorted.

thanks,
Florian

Revision history for this message
Ilia Alshanetsky (ilia-1) wrote :

I've updated the patch to work with version 0.5.

Revision history for this message
Matt Reiferson (snakes) wrote :

Hi,

Just wanted to throw my hat into the ring here and see if I can lend a hand in helping test or shepherd this patch through into a release.

We're seeing similar issues around the use of this setting.

Thanks,

Matt

Revision history for this message
Christian Holm (cho) wrote :

Any chance of getting the patch integrated? This seems to be a very important bug in the library that has been left unfixed for quite some time.

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 777672] libmemcached autoeject/conn retry problem

Hi!

The big problem with fixing this is the following:

1) We don't have a defined behavior. What should the behavior be, and how should it fail?

2) We need some test cases to double check the behavior.

If we have the above, we can define the solution (which has always been the problem with auto-eject; the original patch from Twitter was not complete).

Cheers,
 -Brian


Revision history for this message
Brian Aker (brianaker) wrote :

In memcached_servers_reconnect(), errors on connection are not being handled at all. Do you have any thoughts on handling that?
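One possibility, sketched under assumptions (this is not the actual libmemcached code; the per-server accessor is a placeholder for illustration), would be to remember and return the first connect failure instead of discarding it:

  /* Hypothetical sketch: surface connect errors from a reconnect sweep. */
  memcached_return_t rc= MEMCACHED_SUCCESS;
  for (uint32_t x= 0; x < memcached_server_count(memc); x++)
  {
    /* server_instance() is a placeholder, not a real libmemcached call. */
    memcached_return_t server_rc= memcached_connect(server_instance(memc, x));
    if (server_rc != MEMCACHED_SUCCESS && rc == MEMCACHED_SUCCESS)
      rc= server_rc;  /* keep the first failure for the caller */
  }
  return rc;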

Revision history for this message
Brian Aker (brianaker) wrote :

We tried the latest version of this patch, and it hangs the regression tests. There is also an issue with it not creating an error saying which host has failed (when a failure happens).

Revision history for this message
Brian Aker (brianaker) wrote :

Additional note: user_supplied_bug10() fails after this patch is added. The patch removes the failure logic around timeouts (connect() will just hang once the patch is applied).

Revision history for this message
Brian Aker (brianaker) wrote :
Brian Aker (brianaker)
Changed in libmemcached:
status: New → In Progress
status: In Progress → Fix Committed
Revision history for this message
Matt Reiferson (snakes) wrote :

Brian,

I wanted to mention another failure scenario that seems relevant to this issue.

We've been experimenting against this test daemon, https://gist.github.com/1158479, to simulate a situation where connections are accepted but no clients ever receive a response.

From my perusal of the code, it seems like the problem is that server_failure_limit is zeroed upon a successful connection; my feeling is that it should only be cleared upon a successful response (i.e. at the end of io_read()).
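A minimal sketch of what is being proposed (the field and function names are assumptions based on this comment, not the real libmemcached internals):

  /* Hypothetical: clear the failure counter only after a complete,
     well-formed response has been read, not when connect() succeeds.
     A daemon can accept connections yet never reply, so a successful
     connect alone proves nothing. */
  static void io_read_completed(memcached_server_st *server)
  {
    server->server_failure_counter= 0;  /* reset here, at the end of io_read() */
  }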

Thoughts?

-Matt

Revision history for this message
David Matthew Bond (mattbond) wrote :

The retry is failing in the 0.51 libmemcached release.
The retry is also failing with branched revision 953, downloaded with bzr branch lp:libmemcached (built on SUSE Linux 11 SP1).
The problem is reproducible.

Scenario:
Single Memcached Server (running on localhost).
Application gets a memcached connection (we use
memcached_set : SUCCESS
>> Stop the memcached server.
memcached_set: UNKNOWN READ FAILURE
memcached_set: CONNECTION FAILURE
memcached_set: SERVER IS MARKED DEAD
memcached_set: SERVER IS MARKED DEAD
memcached_set: SERVER IS MARKED DEAD
>> Next Retry is due.
memcached_set: CONNECTION FAILURE
memcached_set: SERVER IS MARKED DEAD
memcached_set: SERVER IS MARKED DEAD
>> memcached server is started.
memcached_set: SERVER IS MARKED DEAD
>> Next Retry is due.
memcached_set: CONNECTION FAILURE
memcached_set: SERVER IS MARKED DEAD
memcached_set: SERVER IS MARKED DEAD
etc.

Reason: in connect.cc, the network_connect() method has a loop over all the address_info objects. However, ptr->address_info_next is always NULL after the reconnect attempt that resulted in the CONNECTION FAILURE error, because that attempt iterated through all the available address_info objects and advanced address_info_next to NULL, and it never gets reset.

  /* Create the socket */
  while (ptr->address_info_next && ptr->fd == INVALID_SOCKET)

To solve the problem, ptr->address_info_next needs to be reset to the first address_info (and the server state rewound):
  ptr->address_info_next= ptr->address_info;
  ptr->state= MEMCACHED_SERVER_STATE_ADDRINFO;

Then it works fine again.
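A sketch of where that reset could sit in network_connect() (the surrounding structure is paraphrased from this report, not the actual connect.cc source):

  /* Sketch only: rewind the resolved-address list before the socket loop
     so a reconnect after a CONNECTION FAILURE can iterate over it again. */
  if (ptr->address_info_next == NULL && ptr->address_info)
  {
    ptr->address_info_next= ptr->address_info;
    ptr->state= MEMCACHED_SERVER_STATE_ADDRINFO;
  }

  /* Create the socket */
  while (ptr->address_info_next && ptr->fd == INVALID_SOCKET)
  {
    /* ... existing per-address connect attempt ... */
  }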

Attached is a diff as a suggested patch; however, I only first looked at the libmemcached code yesterday, so there may be a much better way to fix this for someone with more experience.
Also attached are two test programs to reproduce this: one using a memcached pool and one without.

Matt

Revision history for this message
David Matthew Bond (mattbond) wrote :
Revision history for this message
David Matthew Bond (mattbond) wrote :
Revision history for this message
Brian Aker (brianaker) wrote :

It would be helpful if you could provide a test case based on what is currently in lp:libmemcached (which will likely become version 0.52 shortly).

Revision history for this message
Brian Aker (brianaker) wrote :

Thanks, let me take a look at this. At first glance it looks good. I need to see if there is any logical reason why next should not be reset like this.

On Sep 7, 2011, at 7:39 AM, David Matthew Bond wrote:

> ** Patch added: "libmemcached_Branched953revision.diff"
> https://bugs.launchpad.net/libmemcached/+bug/777672/+attachment/2367449/+files/libmemcached_Branched953revision.diff

Brian Aker (brianaker)
Changed in libmemcached:
status: Fix Committed → Fix Released
Revision history for this message
TONY (liuning-l) wrote :

Hi, Brian.
I downloaded the latest version (0.52), but it still has the same problem.

I tested it under Ubuntu 11.04 + PHP 5.3 + memcached 2.0 + libmemcached 0.52.

Revision history for this message
Ondrej Holecek (ondrej-holecek) wrote :

We had the same problem with 0.44. First, I made my own patch, which worked. Then I discovered the patch attached to this bug in comment #5, and it fixed the issue as well. It seems to be more comprehensive, so in the end we applied that patch.

I did not try 0.52, but IMO the final patch is not sufficient. We are still on a patched 0.44, happy with the older #5 patch.

Revision history for this message
TONY (liuning-l) wrote :

To Ondrej Holecek (ondrej-holecek):

Thank you for sharing. I will try it.
