consistent hash picks a different server for the same key after upgrade of libmemcached

Bug #996813 reported by Jason Toffaletti
This bug affects 1 person
Affects: libmemcached
Status: Fix Released
Importance: Medium
Assigned to: Brian Aker
Milestone: (none)

Bug Description

I recently tried to upgrade from libmemcached 0.48 to 1.0.7, and it looks like there was an undocumented change to the way consistent hashing works: 1.0.7 chooses a different server for the same key, given the same list of 17 servers.

libmemcached-0.48:
connect(5, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("10.5.8.8")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=5, events=POLLOUT}], 1, 100) = 1 ([{fd=5, revents=POLLOUT}])
getsockopt(5, SOL_SOCKET, SO_ERROR, [17179869184], [4]) = 0
sendto(5, "get S9NqygXmkamkok1tyvQn \r\n", 27, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 27
recvfrom(5, 0x7f0900f5e080, 8196, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=5, events=POLLIN}], 1, 50) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5, "VALUE S9NqygXmkamkok1tyvQn 2 39\r\n\37\213\0\0\0\0\0\0\0\377\313())\260\322\327/)\317,)I-\322K\316\317\325O\314.(J\314,N\5\0\r\nEND\r\n", 8196, 0, NULL, NULL) = 79

libmemcached-1.0.7 chooses a different server:
connect(5, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("10.5.8.12")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=5, events=POLLOUT}], 1, 100) = 1 ([{fd=5, revents=POLLOUT}])
getsockopt(5, SOL_SOCKET, SO_ERROR, [17179869184], [4]) = 0
sendto(5, "get S9NqygXmkamkok1tyvQn \r\n", 27, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 27
recvfrom(5, 0x7fed79cd23e0, 8196, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=5, events=POLLIN}], 1, 50) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5, "END\r\n", 8196, MSG_DONTWAIT, NULL, NULL) = 5

This is the configuration:

        _mc->setBehavior(MEMCACHED_BEHAVIOR_DISTRIBUTION, MEMCACHED_DISTRIBUTION_CONSISTENT_KETAMA);
        _mc->setBehavior(MEMCACHED_BEHAVIOR_NO_BLOCK, 1);
        _mc->setBehavior(MEMCACHED_BEHAVIOR_TCP_NODELAY, 1);
        _mc->setBehavior(MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 0);
        _mc->setBehavior(MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 50);
        _mc->setBehavior(MEMCACHED_BEHAVIOR_CONNECT_TIMEOUT, 100);
        _mc->setBehavior(MEMCACHED_BEHAVIOR_RETRY_TIMEOUT, 30);

Brian Aker (brianaker) wrote :

Hi,

A bug fix between the 0.x trunk and the 1.0 trunk changed the consistent hashing (everything is still consistent, but fixing the bug altered the behavior). This was one of the reasons for the major release.

Do you have particular keys that exhibit this? I want to make sure that the reason above explains the difference before closing this (i.e., make sure there isn't some other bug).

Thanks,
   -Brian
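
To illustrate the point above, that a fix can leave hashing internally consistent while still sending existing keys to different servers, here is a minimal C++ sketch of a hash ring. It is not libmemcached's implementation: FNV-1a is only a stand-in hash, the `Ring` class, the `Labeling` schemes, and the `agreement` helper are hypothetical names made up for this illustration, and the two labeling schemes are not claimed to be what actually changed between 0.48 and 1.0.7.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// FNV-1a, used only as a simple stand-in hash for this sketch.
static std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Hypothetical ways of naming a server's virtual points on the ring.
enum class Labeling { HostOnly, HostAndPort };

class Ring {
public:
    Ring(const std::vector<std::string>& hosts, int port, Labeling scheme,
         int points_per_server = 100) {
        for (std::size_t idx = 0; idx < hosts.size(); ++idx) {
            for (int i = 0; i < points_per_server; ++i) {
                std::string label = hosts[idx];
                if (scheme == Labeling::HostAndPort) {
                    label += ":" + std::to_string(port);
                }
                label += "-" + std::to_string(i);
                ring_[fnv1a(label)] = idx;  // place one virtual point
            }
        }
    }

    // Walk clockwise from the key's hash to the first server point.
    std::size_t server_for(const std::string& key) const {
        assert(!ring_.empty());
        auto it = ring_.lower_bound(fnv1a(key));
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }

private:
    std::map<std::uint32_t, std::size_t> ring_;  // ring position -> server index
};

// Fraction of keys that both rings send to the same server index.
double agreement(const Ring& a, const Ring& b,
                 const std::vector<std::string>& keys) {
    if (keys.empty()) return 1.0;
    std::size_t same = 0;
    for (const auto& key : keys) {
        if (a.server_for(key) == b.server_for(key)) ++same;
    }
    return static_cast<double>(same) / static_cast<double>(keys.size());
}
```

Building both rings over the same host list and calling `agreement` over a set of keys shows how far two placement schemes diverge. Each ring on its own still maps a given key to the same server every time, which is the sense in which both remain "consistent".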

Jason Toffaletti (jason) wrote :

All of the keys I've tried are off (the cache hit rate drops to 0% with the 1.0.7 lib). Our keys are produced by a specialized hashing function based on SHA1 and encoded with base64. Here are some examples:

SZ6hu0SHweFmpwpc0w2R
SQCK9eiCf53YxHWnYA.o
SUSDkGXuuZC9t9VhMwa.
SnnqnJARfaCNT679iAF_

Brian Aker (brianaker)
Changed in libmemcached:
assignee: nobody → Brian Aker (brianaker)
Brian Aker (brianaker) wrote :

Can you try this test case and see how it works for you? (it is in trunk)

Changed in libmemcached:
importance: Undecided → Medium
status: New → Incomplete
Jason Toffaletti (jason) wrote :

Sorry this took so long; I went on vacation and just got back today. I took your test code and made a CMake project that builds both 0.48 and 1.0.7 and links two executables that print the results of the memcached_generate_hash calls.

$ ./test107/test_107
16
1
10
8
$ ./test048/test_048
6
1
9
0

I'll attach a zip of what I've done. In the process of doing this, I ran across two other minor build bugs, #998227 and #1000417.

Jason Toffaletti (jason) wrote :

To build:

$ cd libmemcached-testing
$ mkdir build
$ cd build/
$ cmake ..
$ make

Brian Aker (brianaker)
Changed in libmemcached:
status: Incomplete → In Progress
Brian Aker (brianaker) wrote :

Update on this: another bug report found an error in the code that adjusts the hash distribution, so I may have a fix for this.

Jason Toffaletti (jason) wrote :

Bump: was this fixed? I ran a test today with libmemcached 1.0.11, and it appears to be using the same hash distribution as the 0.4x series.

Brian Aker (brianaker) wrote :

I've been waiting to hear back from a couple of folks to see if they could confirm the fix that went in.

So it is working for you?

Jason Toffaletti (jason) wrote :

It is working for me. I have an automated test that loads a 2-node memcached cluster with one version of libmemcached and then reads the same keys with another version of libmemcached. It was failing with 1.0.8 and 0.44, but it is now passing with 1.0.11 and 0.44.
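
The shape of such a cross-version test can be sketched in C++ without a real cluster. This is only an illustration under stated assumptions: the "cluster" is an in-memory vector of maps, and `pick_by_hash`, `pick_shifted`, and `cross_read_hits` are hypothetical stand-ins for the server selection of two library versions, not libmemcached calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// A picker maps a key to a node index, standing in for one library version.
using Picker = std::function<std::size_t(const std::string&, std::size_t)>;
using Node = std::unordered_map<std::string, std::string>;

// FNV-1a, used only as a simple stand-in hash for this sketch.
static std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Stand-in for the "old" version's server selection.
std::size_t pick_by_hash(const std::string& key, std::size_t node_count) {
    return fnv1a(key) % node_count;
}

// Stand-in for an incompatible version: always one node further along.
std::size_t pick_shifted(const std::string& key, std::size_t node_count) {
    return (pick_by_hash(key, node_count) + 1) % node_count;
}

// Store every key using `writer`, then look each one up using `reader`.
// Returns the number of lookups that find the key (cache hits).
std::size_t cross_read_hits(const std::vector<std::string>& keys,
                            std::size_t node_count,
                            const Picker& writer, const Picker& reader) {
    assert(node_count > 0);
    std::vector<Node> cluster(node_count);
    for (const auto& key : keys) {
        cluster[writer(key, node_count)][key] = "value";
    }
    std::size_t hits = 0;
    for (const auto& key : keys) {
        const Node& node = cluster[reader(key, node_count)];
        if (node.find(key) != node.end()) ++hits;
    }
    return hits;
}
```

With two nodes, writing and reading through the same picker hits on every key, while writing with `pick_by_hash` and reading with `pick_shifted` misses on every key, which mirrors a hit rate falling to 0% after an upgrade.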

Brian Aker (brianaker)
Changed in libmemcached:
status: In Progress → Fix Released