incorrect handling of server restart

Bug #928696 reported by sinny
30
This bug affects 4 people
Affects Status Importance Assigned to Milestone
libmemcached
Fix Released
Undecided
Brian Aker

Bug Description

libmemcached version: 1.0.3, 1.0.4

we have the following setup:
 - backend (membase/couchbase: several sasl buckets)
 - libmemcached-based client (c++, multi-threaded, uses memcached_pool, single server record, binary protocol with sasl authentication, tcp sockets)

client is organized to dispatch every single get/set/del request to distinct thread (thread pool is used). upon receiving task, thread obtains connection from memcached_pool related to backend bucket specified in request, performs needed memcached_* calls to process request, returns backend connection to memcached_pool, and waits for next request.

everything goes ok until server is restarted.
after server restarts, next request gets CONNECTION_FAILURE result (which is ok). and then comes the problem: all following memcached_set calls (performed with in-between time interval less than memcached_st::retry_timeout) get WRITE_FAILURE result. hours of gdb'ing revealed that every such request "renews" memcached_server_write_instance_st::next_retry field. this does not cause any problems if requests come less frequently than memcached_st::retry_timeout - once memcached_server_write_instance_st::next_retry "expires", all functionality gets back to normal.

some info on failing requests:
memcached_set() returns "WRITE FAILURE" (5)
error stack(root): {
    "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/storage.cc:180",
    "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/connect.cc:614"
}
error stack(server-0): {
    "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/storage.cc:180",
    "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/connect.cc:614"
}

there was dumb (in terms of overall libmemcached code awareness) attempt to add the following check at the very beginning of <void memcached_quit_server(memcached_server_st *ptr, bool io_death)>:
--------------------------------------------
  if ( ptr->state == MEMCACHED_SERVER_STATE_IN_TIMEOUT )
    return;
--------------------------------------------
this DID help with described problem, but at the same time it broke other functionality.

Revision history for this message
Brian Aker (brianaker) wrote :

That for using the word "code awareness", it greatly increases the odds that one of us will spend more time on this :)

Let me see if we can simulate this behavior in a test case.

Changed in libmemcached:
assignee: nobody → Brian Aker (brianaker)
Revision history for this message
sinny (sinnydono) wrote :

well, if there is any info that is missing but could help you in any way - please name it.

the basic scenario is:
1. start server (in our case it was membase/couchbase)
2. start client
3. perform some set requests (success)
4. restart server
5. continuously perform some set requests (in our case frequency was 1-2 requests per second)

expected : getting errors for time interval (while server is considered dead), return to normal operation mode after that

actual result : single CONNECTION FAILURE followed by unlimited WRITE FAILUREs (for the time long exceeding memcached_st::retry_timeout)

Revision history for this message
Trevor North (trevor) wrote :

Bug #931696 describes the same problem (and should probably really have been a comment here had I noticed this report at the time). Anyway, the information there may be of use/help in resolving the issue.

Revision history for this message
Chen Gang (gangchen) wrote :

I have met the similar issue.

Revision history for this message
Chen Gang (gangchen) wrote :

I have met the similar issue , and I just revise one line of libmemcached 1.0.5 (connect.cc:585) :
  from :
     server->server_failure_counter= server->root->server_failure_limit -1;
  to :
     server->server_failure_counter= server->root->server_failure_limit -2;
Then, the libmemcached seems normal, and the restarted server can appear in the list again.
can any body give me some comments or review.

Revision history for this message
Trevor North (trevor) wrote :

The release announcement for 1.0.6 mentions "Fix for restart issue that happens under testing". I've had a look through the commit logs, but I'm not sure whether the changes relate directly to this issue or not. Could someone elaborate?

Revision history for this message
Brian Aker (brianaker) wrote :

Can anyone confirm that this is still an issue?

Changed in libmemcached:
status: New → Incomplete
Brian Aker (brianaker)
Changed in libmemcached:
status: Incomplete → Fix Released
Revision history for this message
Nicolas Motte (nicolas-motte) wrote :

This bug is still not fixed in libmemcached 1.0.16

When memcached_quit_server is called, libmemcached calls close_socket which will set the state to NEW. Then mark_server_for_timeout will reset the next_retry.

As a quick fix, I copied the code of mark_server_for_timeout in memcached_quit_server without the part that reset the next_retry.

My tests are fine now, but maybe I broke something.

-------------------------------------------
void memcached_quit_server(org::libmemcached::Instance* instance, bool io_death)
{
  if (instance->valid())
  {
    if (io_death == false and memcached_is_udp(instance->root) == false and instance->is_shutting_down() == false)
    {
      send_quit_message(instance);

      instance->start_close_socket();
      drain_instance(instance);
    }
  }

  instance->close_socket();

  if (io_death)
  {
      instance->state= MEMCACHED_SERVER_STATE_IN_TIMEOUT;
      if (instance->server_failure_counter_query_id != instance->root->query_id)
      {
        instance->server_failure_counter++;
        instance->server_failure_counter_query_id= instance->root->query_id;
      }
      set_last_disconnected_host(instance);
  }
}
----------------------------------------

Revision history for this message
nathanael foy (nathanael-foy) wrote :
Download full text (3.3 KiB)

More information about this bug, reported in version 1.0.3, still not fixed in version 1.0.16:

How to reproduce:
1. start server and client with binary protocol and retry timeout = 2
3. perform some set request (success)
4. restart server
5. continuously perform some set requests at frequency about 1 per second

Results, as per version 1.0.8:
set TSTKey1, result is SUCCESS
/*** Restart Memcached Node ***/
set TSTKey1, result is CONNECTION FAILURE
set TSTKey1, result is WRITE FAILURE
set TSTKey1, result is WRITE FAILURE
/*** End of Retry Timeout ***/
set TSTKey1, result is WRITE FAILURE
set TSTKey1, result is WRITE FAILURE
set TSTKey1, result is WRITE FAILURE
set TSTKey1, result is WRITE FAILURE

First of all, the bug doesn't occur in ascii protocol. In this case, it behaves as expected:

set TSTKey1, result is SUCCESS
/*** Restart Memcached Node ***/
set TSTKey1, result is CONNECTION FAILURE
set TSTKey1, result is SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
set TSTKey1, result is SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
/*** End of Retry Timeout ***/
set TSTKey1, result is SUCCESS
set TSTKey1, result is SUCCESS
set TSTKey1, result is SUCCESS

So, the bug comes from a difference in error management between memcached_send_binary and memcached_send_ascii.

Both functions call memcached_vdo function. After the first failure, this functions returns CONNECTION_FAILURE or WRITE_FAILURE depending on the error. memcached_quit_server is called and the server is marked in timeout: next_retry is increased by RETRY_TIMEOUT.

On subsequent 'set' requests, memcached_vdo directly returns MEMCACHED_SERVER_TEMPORARILY_DISABLED without trying to connect or write anything because of the RETRY_TIMEOUT.
So far, everything is as expected.

Here is the difference between memcached_send_ascii and memcached_send_binary:

In memcached_send_binary, the 'rc' returned by memcached_vdo is immediately interpreted as a WRITE_FAILURE as soon as it is different from MEMCACHED_SUCCESS, and calls memcached_io_reset, that calls memcached_quit_server, that marks the server for timeout again...
In memcached_send_ascii, we first test if rc==MEMCACHED_WRITE_FAILURE, and if it is false we don't call memcached_io_reset.

/******************* memcached_send_ascii ****************/
  memcached_return_t rc= memcached_vdo(instance, vector, 12, flush);
  [...]
  if (rc == MEMCACHED_SUCCESS)
  {
      [...]
      return MEMCACHED_SUCCESS;
   }

  if (rc == MEMCACHED_WRITE_FAILURE)
  {
    memcached_io_reset(instance);
  }
  return rc;
/************************************************************/
/******************* memcached_send_binary v. 1.0.8 *********/
  if ((rc= memcached_vdo(server, vector, 5, flush)) != MEMCACHED_SUCCESS)
  {
    memcached_io_reset(server);

    return MEMCACHED_WRITE_FAILURE;
  }
  [...]
/*************************************************************/
/******************* memcached_send_binary v. 1.0.16 *********/
  if ((rc= memcached_vdo(server, vector, 5, flush)) != MEMCACHED_SUCCESS)
  {
    memcached_io_reset(server);
    [...]
    return memcached_last_error(server->root);
  }
  [...]
/***************************************************...

Read more...

Revision history for this message
Don MacAskill (don-smugmug) wrote :

As noted, this is *not* fixed. We're testing nathanael-foy's fix right now, with promising results.

See: https://github.com/onethumb/libmemcached/pull/1

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 928696] Re: incorrect handling of server restart

Thanks, I will look at it.

On Jul 12, 2013, at 13:01, Don MacAskill <email address hidden> wrote:

> As noted, this is *not* fixed. We're testing nathanael-foy's fix right
> now, with promising results.
>
> See: https://github.com/onethumb/libmemcached/pull/1
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/928696
>
> Title:
> incorrect handling of server restart
>
> Status in libmemcached - A C and C++ client library for memcached:
> Fix Released
>
> Bug description:
> libmemcached version: 1.0.3, 1.0.4
>
> we have the following setup:
> - backend (membase/couchbase: several sasl buckets)
> - libmemcached-based client (c++, multi-threaded, uses memcached_pool, single server record, binary protocol with sasl authentication, tcp sockets)
>
> client is organized to dispatch every single get/set/del request to
> distinct thread (thread pool is used). upon receiving task, thread
> obtains connection from memcached_pool related to backend bucket
> specified in request, performs needed memcached_* calls to process
> request, returns backend connection to memcached_pool, and waits for
> next request.
>
> everything goes ok until server is restarted.
> after server restarts, next request gets CONNECTION_FAILURE result (which is ok). and then comes the problem: all following memcached_set calls (performed with in-between time interval less than memcached_st::retry_timeout) get WRITE_FAILURE result. hours of gdb'ing revealed that every such request "renews" memcached_server_write_instance_st::next_retry field. this does not cause any problems if requests come less frequently than memcached_st::retry_timeout - once memcached_server_write_instance_st::next_retry "expires", all functionality gets back to normal.
>
> some info on failing requests:
> memcached_set() returns "WRITE FAILURE" (5)
> error stack(root): {
> "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/storage.cc:180",
> "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/connect.cc:614"
> }
> error stack(server-0): {
> "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/storage.cc:180",
> "(166484608) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: 192.168.65.3:11211 -> libmemcached/connect.cc:614"
> }
>
>
> there was dumb (in terms of overall libmemcached code awareness) attempt to add the following check at the very beginning of <void memcached_quit_server(memcached_server_st *ptr, bool io_death)>:
> --------------------------------------------
> if ( ptr->state == MEMCACHED_SERVER_STATE_IN_TIMEOUT )
> return;
> --------------------------------------------
> this DID help with described problem, but at the same time it broke other functionality.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/libmemcached/+bug/928696/+subscriptions

Revision history for this message
Don MacAskill (don-smugmug) wrote :

For us, this fixed the problem entirely for sets, but not for deletes. For deletes, the problem still exists - it never detects that the server has returned. I haven't dug into why, yet.

Using the ASCII protocol, this problem doesn't occur for us.

Revision history for this message
Peter Walsh (peter-walsh) wrote :

Has this actually been fixed for ascii mode? I'm using 1.0.18 and after I restart a server it never reconnects.

Revision history for this message
Don MacAskill (don-smugmug) wrote :

Brian, did this ever get really fixed? Should I/we open a new bug for it so it doesn't get lost?

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.