Comment 11 for bug 928696

Brian Aker (brianaker) wrote : Re: [Bug 928696] Re: incorrect handling of server restart

Thanks, I will look at it.

On Jul 12, 2013, at 13:01, Don MacAskill <email address hidden> wrote:

> As noted, this is *not* fixed. We're testing nathanael-foy's fix right
> now, with promising results.
>
> See: https://github.com/onethumb/libmemcached/pull/1
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/928696
>
> Title:
> incorrect handling of server restart
>
> Status in libmemcached - A C and C++ client library for memcached:
> Fix Released
>
> Bug description:
> libmemcached version: 1.0.3, 1.0.4
>
> We have the following setup:
> - backend: membase/couchbase with several SASL buckets
> - libmemcached-based client: C++, multi-threaded, uses memcached_pool, a single server record, and the binary protocol with SASL authentication over TCP sockets
>
> The client dispatches every single get/set/del request to a
> distinct thread (a thread pool is used). Upon receiving a task, the
> thread obtains a connection from the memcached_pool for the backend
> bucket specified in the request, performs the needed memcached_* calls
> to process the request, returns the backend connection to the
> memcached_pool, and waits for the next request.
>
> Everything works until the server is restarted.
> After the server restarts, the next request gets a CONNECTION_FAILURE result, which is expected. Then comes the problem: every following memcached_set call issued less than memcached_st::retry_timeout after the previous one gets a WRITE_FAILURE result.
>
> Hours of gdb'ing revealed that every such request "renews" the memcached_server_write_instance_st::next_retry field. This does not cause any problems if requests come less frequently than memcached_st::retry_timeout: once memcached_server_write_instance_st::next_retry "expires", all functionality gets back to normal.
>
> some info on failing requests:
> memcached_set() returns "WRITE FAILURE" (5)
> error stack(root): {
> "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/storage.cc:180",
> "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/connect.cc:614"
> }
> error stack(server-0): {
> "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/storage.cc:180",
> "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/connect.cc:614"
> }
>
>
> We made a naive attempt (naive in terms of overall awareness of the libmemcached code) to add the following check at the very beginning of <void memcached_quit_server(memcached_server_st *ptr, bool io_death)>:
> --------------------------------------------
> if (ptr->state == MEMCACHED_SERVER_STATE_IN_TIMEOUT)
>     return;
> --------------------------------------------
> This DID help with the described problem, but at the same time it broke other functionality.
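
For anyone trying to follow the retry-timer behaviour described in the report, here is a minimal toy model of it. This is only a sketch under assumptions and is not libmemcached code: ServerModel, try_request, mark_failed and count_rejected are hypothetical names made up for illustration, time is a plain integer number of seconds, and the server is assumed to be reachable again right after the failure at t=0. It only shows why renewing next_retry on every rejected request keeps a busy client locked out, while a client that sends less often than retry_timeout recovers.

```cpp
#include <cassert>

// Hypothetical stand-in for per-server retry state; NOT libmemcached's real struct.
struct ServerModel {
    long next_retry = 0;     // requests before this time are rejected
    long retry_timeout = 2;  // seconds to keep the server disabled after a failure
    bool renew_on_reject;    // true models the behaviour described in the report

    explicit ServerModel(bool renew) : renew_on_reject(renew) {}

    // Record a genuine connection failure at time `now`.
    void mark_failed(long now) { next_retry = now + retry_timeout; }

    // Return true if a request issued at time `now` is allowed through.
    bool try_request(long now) {
        if (now < next_retry) {
            // Reported behaviour: each rejected request pushes the deadline out again.
            if (renew_on_reject) next_retry = now + retry_timeout;
            return false;
        }
        return true;
    }
};

// Fail the server at t=0, then send `count` requests, one every `interval` seconds.
// Returns how many of them were rejected.
int count_rejected(bool renew_on_reject, long interval, int count) {
    ServerModel server(renew_on_reject);
    server.mark_failed(0);
    int rejected = 0;
    for (int i = 1; i <= count; ++i) {
        if (!server.try_request(i * interval)) ++rejected;
    }
    return rejected;
}

// Small usage example for the three cases discussed in the report.
void self_check() {
    assert(count_rejected(true, 1, 10) == 10);  // renewing timer, frequent requests: never recovers
    assert(count_rejected(false, 1, 10) == 1);  // fixed deadline: recovers once it passes
    assert(count_rejected(true, 3, 10) == 0);   // renewing timer, but requests slower than the timeout
}
```

With a 2 second timeout and one request per second, the renewing variant rejects all ten requests, while the variant that leaves the deadline alone rejects only the first. Whether the real fix should take this shape is a separate question; the model only reproduces the symptom.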