_is_auto_eject_host() not causing run_distribution()

Bug #810888 reported by Ondrej Holecek
This bug affects 1 person
Affects        Status   Importance   Assigned to   Milestone
libmemcached   New      Medium       Brian Aker

Bug Description

We are using libmemcached-0.44, but I observed the same problem on the
latest release, 0.50, as well.

We have a cluster of, let's say, 10 memcached servers, with
MEMCACHED_DISTRIBUTION_CONSISTENT enabled.

One of the memcached servers went down and memcached_get() returned
MEMCACHED_SERVER_MARKED_DEAD, since the key would have been stored on
the dead server. So we called memcached_autoeject() to redistribute
the keys to the other servers. (Let me note that nothing is written
about memcached_autoeject() in the docs.) memcached_get() still
returned MEMCACHED_SERVER_MARKED_DEAD.

I discovered that run_distribution(ptr) is never called because of the
"if (_is_auto_eject_host(ptr) && ptr->next_distribution_rebuild)"
condition in _regen_for_auto_eject().

The only place where ptr->next_distribution_rebuild is set is inside
run_distribution(), which cannot be reached because of that condition:
the field stays zero, so the condition stays false. So, my first
proposal is:

void memcached_autoeject(memcached_st *ptr)
{
    if (_is_auto_eject_host(ptr))
        run_distribution(ptr);
}
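
The circular dependency described above can be reproduced outside the library. The following is only a minimal stand-alone sketch, not libmemcached code: struct model_st, the model_* functions, and the 30-second rebuild interval are all hypothetical, and it assumes, as stated in this report, that the rebuild timestamp is written only by the rebuild itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the two memcached_st properties involved. */
struct model_st {
    bool auto_eject_hosts;            /* models _is_auto_eject_host(ptr) */
    time_t next_distribution_rebuild; /* stays 0 until a rebuild has run */
    int rebuild_count;                /* how many times the rebuild ran */
};

/* Models run_distribution(): per the report, the only writer of the
 * timestamp. The 30-second interval is an arbitrary assumption. */
static void model_run_distribution(struct model_st *m, time_t now)
{
    assert(m != NULL);
    m->next_distribution_rebuild = now + 30;
    m->rebuild_count++;
}

/* Guard as quoted in the report: it needs a non-zero timestamp, but
 * the timestamp only becomes non-zero after a rebuild has happened. */
static void model_autoeject_guarded(struct model_st *m, time_t now)
{
    assert(m != NULL);
    if (m->auto_eject_hosts && m->next_distribution_rebuild)
        model_run_distribution(m, now);
}

/* Guard as in the first proposal: the timestamp test is dropped. */
static void model_autoeject_proposed(struct model_st *m, time_t now)
{
    assert(m != NULL);
    if (m->auto_eject_hosts)
        model_run_distribution(m, now);
}
```

Starting from a zeroed timestamp, model_autoeject_guarded() never triggers a rebuild no matter how often it is called, whereas a single call to model_autoeject_proposed() does, after which the guarded variant starts working too.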

I'm not sure whether this change to memcached_autoeject() is correct.
In any case, it did not fix all my problems: memcached_get() was still
returning MEMCACHED_SERVER_MARKED_DEAD. So, my second proposal is:

@@ -134,6 +134,7 @@ static memcached_return_t update_continuum(memcached_st *ptr)
    {
      if (list[host_index].next_retry <= now.tv_sec)
      {
+       list[host_index].server_failure_counter = 0;
        live_servers++;
      }

Consider the case where the server has come back up and the next retry
timeout has elapsed. Because of this code in memcached_connect():

  if (ptr->root->server_failure_limit &&
      ptr->server_failure_counter >= ptr->root->server_failure_limit)
  {
    set_last_disconnected_host(ptr);

    // @todo fix this by fixing behavior to no longer make use of
    // memcached_st
    if (_is_auto_eject_host(ptr->root))
    {
      run_distribution((memcached_st *)ptr->root);
    }

    return MEMCACHED_SERVER_MARKED_DEAD;
  }

run_distribution() is called, but it does not reset
server_failure_counter to zero! The server should be reconnected at
this point, but the condition "ptr->server_failure_counter >=
ptr->root->server_failure_limit" remains true, so
MEMCACHED_SERVER_MARKED_DEAD is returned again and again.

So the server never comes back into rotation.
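
To make the second problem concrete, here is a small stand-alone model of the interaction between the failure-limit check and the rebuild loop. It is a sketch under assumptions, not the library's code: struct model_server, enum model_rc, and both model_* functions are hypothetical names, and the reset_counter flag exists only to switch the proposed one-line reset on and off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical return codes standing in for memcached_return_t values. */
enum model_rc { MODEL_SUCCESS, MODEL_SERVER_MARKED_DEAD };

/* Hypothetical stand-in for the per-server state used in the report. */
struct model_server {
    unsigned failure_counter; /* models server_failure_counter */
    time_t next_retry;        /* models next_retry */
};

/* Models the failure-limit check quoted from memcached_connect(). */
static enum model_rc model_connect(const struct model_server *s,
                                   unsigned failure_limit)
{
    assert(s != NULL);
    if (failure_limit && s->failure_counter >= failure_limit)
        return MODEL_SERVER_MARKED_DEAD;
    return MODEL_SUCCESS;
}

/* Models the live-server loop in update_continuum(). With reset_counter
 * set, it applies the proposed reset. Returns the number of servers
 * whose retry time has passed. */
static unsigned model_update_continuum(struct model_server *list, size_t n,
                                       time_t now, bool reset_counter)
{
    unsigned live_servers = 0;
    assert(list != NULL);
    for (size_t i = 0; i < n; i++) {
        if (list[i].next_retry <= now) {
            if (reset_counter)
                list[i].failure_counter = 0; /* the proposed fix */
            live_servers++;
        }
    }
    return live_servers;
}
```

In this model, a server whose counter has reached the limit is counted as live once its retry time passes, yet model_connect() keeps reporting it as dead until the counter is cleared, which mirrors the behavior reported here.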

Brian Aker (brianaker)
summary: - consistent hashing broken
+ _is_auto_eject_host() not causing run_distribution()
Changed in libmemcached:
assignee: nobody → Brian Aker (brianaker)
importance: Undecided → Medium