Connect in a blocking mode hangs

Bug #583031 reported by Egor Egorov on 2010-05-19
This bug affects 4 people
Affects: libmemcached | Importance: Undecided | Assigned to: Unassigned

Bug Description

If the connect is performed in blocking mode, there is no timeout on the connection. Therefore, trying to connect to an unreachable host times out after a few minutes instead of after the configured timeout values.

If I remove the following lines around connect.c:333, the connection works well:

          if (ptr->root->flags.no_block == false)
            timeout= -1;
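
For illustration, here is a minimal C sketch (not libmemcached code) of why that -1 matters. poll() treats a negative timeout as "wait with no time limit", while a non-negative value is a limit in milliseconds, and poll() returns 0 when that limit expires. The helper name wait_readable is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper for illustration only; not part of libmemcached.
 * Waits until fd becomes readable.
 * Returns poll()'s result: 1 if fd is ready, 0 on timeout, -1 on error.
 * Passing timeout_ms == -1 makes poll() wait with no time limit, which is
 * what the two lines quoted above select in blocking mode. */
int wait_readable(int fd, int timeout_ms)
{
  assert(fd >= 0);

  struct pollfd fds[1];
  fds[0].fd = fd;
  fds[0].events = POLLIN;
  fds[0].revents = 0;

  return poll(fds, 1, timeout_ms);
}
```

Called on the read end of an empty pipe with a 50 ms timeout, this returns 0 once the 50 ms have passed. With -1 the same call would not return until data arrives.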

How to reproduce:
1. Pick an unreachable IP, e.g. 10.2.3.4 routed via 127.0.0.1.
2. Use the following simple (PHP, unfortunately) code:

  $_memcache = new Memcached();
  $_memcache->addServer("10.2.3.4", "11211");

  $_memcache->setOption(Memcached::OPT_CONNECT_TIMEOUT, 1000);
  $_memcache->setOption(Memcached::OPT_RETRY_TIMEOUT, 1000);
  $_memcache->setOption(Memcached::OPT_SEND_TIMEOUT, 10000);
  $_memcache->setOption(Memcached::OPT_RECV_TIMEOUT, 10000);
  $_memcache->setOption(Memcached::OPT_POLL_TIMEOUT, 1000);
  $_memcache->setOption(Memcached::OPT_SERVER_FAILURE_LIMIT, 3);

  printf("here1\n");
  $result = $_memcache->get("dsf");
  printf("here2\n");

Expected result: timeout.

Actual result: timeout after roughly 6 minutes.

Actual result with the lines mentioned above removed: timeout.

Tested on libmemcached 0.40, memcached PHP extension 1.0.2, PHP 5.3.2, Mac OS X 10.6.3.

Egor Egorov (me-egorfine) wrote :

It turns out this is not the best way to fix the problem. Instead, I had to set no_block to true, restore the two lines above, and add the following to connect.c:

...
            int error= poll(fds, 1, timeout);

            switch (error)
            {
...
            case 0:
                if (loop_max==1) {
                    return MEMCACHED_TIMEOUT;
                }
              continue;
              // A real error occurred and we need to completely bail
...
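
As a rough, standalone illustration of the general technique (a non-blocking connect() followed by a poll() with a finite timeout), here is a C sketch. It is not the libmemcached patch: the function name connect_with_timeout and the CONNECT_* result codes are made up for this example, it handles IPv4 literals only, and it treats an interrupted poll() as a failure to keep the code short.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not libmemcached's. */
enum { CONNECT_OK = 0, CONNECT_TIMEOUT = 1, CONNECT_FAILURE = 2 };

/* Connect to an IPv4 literal, giving up after timeout_ms milliseconds.
 * On CONNECT_OK the connected socket (still non-blocking) is stored in *out_fd. */
int connect_with_timeout(const char *ip, unsigned short port,
                         int timeout_ms, int *out_fd)
{
  assert(ip != NULL && out_fd != NULL && timeout_ms >= 0);

  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
    return CONNECT_FAILURE;            /* not a valid IPv4 literal */

  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return CONNECT_FAILURE;

  /* Non-blocking mode: connect() returns at once instead of waiting
   * for the kernel's own, much longer, connection timeout. */
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
  {
    close(fd);
    return CONNECT_FAILURE;
  }

  int result = CONNECT_FAILURE;
  if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
  {
    result = CONNECT_OK;               /* connected immediately */
  }
  else if (errno == EINPROGRESS)
  {
    struct pollfd fds[1];
    fds[0].fd = fd;
    fds[0].events = POLLOUT;           /* writable means the connect finished */
    fds[0].revents = 0;

    int ready = poll(fds, 1, timeout_ms);
    if (ready == 0)
    {
      result = CONNECT_TIMEOUT;        /* nothing happened within timeout_ms */
    }
    else if (ready > 0)
    {
      /* The connect finished; SO_ERROR says whether it succeeded. */
      int so_error = 0;
      socklen_t len = sizeof(so_error);
      if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) == 0 && so_error == 0)
        result = CONNECT_OK;
    }
  }

  if (result != CONNECT_OK)
  {
    close(fd);
    return result;
  }
  *out_fd = fd;
  return CONNECT_OK;
}
```

A caller would pass the configured connect timeout in milliseconds, e.g. connect_with_timeout("10.2.3.4", 11211, 1000, &fd), and treat CONNECT_TIMEOUT as the equivalent of the MEMCACHED_TIMEOUT return in the snippet above.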

Steve Corona (steve-twitpic) wrote :

We ran into the same problem at Twitpic: when using libmemcached with PECL-Memcached, the process would hang indefinitely when a memcached server was unreachable. I attached a patch file containing the fix described in Egor Egorov's comment above.

Steve Corona
Head Engineer @ Twitpic

Egor Egorov (me-egorfine) wrote :

Steve, how long have you been using this fix in production? We at http://pushme.to/ have been using it for about ten days already and it seems to work perfectly.

Steve Corona (steve-twitpic) wrote :

Hi Egor,

I just patched all of our web servers this morning, so it has only been running in production for about 6 hours. So far it works great: it fixed the timeout issue and I have not noticed any odd behavior. I'll follow up if I see anything weird.

Thanks for the fix!

Egor Egorov (me-egorfine) wrote :

Steve,

I would also suggest logging failed connection attempts. We see about 20-30 failures per day within a single EC2 availability zone with no downtime of nodes! While this is a relatively small number, it is still not okay. See whether your experience differs.

Please let me know. Thanks!

Brian Aker (brianaker) wrote :

Hi!

Looking at this, I have created a test case. While the patch does modify the error returned, I am not finding that it adjusts the timeout (i.e., I am still getting a constant timeout).

Let me tinker with this.

The logic is supposed to be that a connection which errors out immediately returns an error, but otherwise we allow the connection to proceed while we move on (i.e., we check it when needed).

Cheers,
 -Brian

On Jun 7, 2010, at 12:19 PM, Egor Egorov wrote:

> Steve,
>
> I would also suggest you to log failed connection attempts in the logs.
> We experience about 20-30 failures/day within a single EC2 availability
> zone with no downtime of nodes! While this itself is a relatively small
> number, it is still not okay. See if you will have a different
> experience with that.
>
> Please let me know. Thanks!

Steve Corona (steve-twitpic) wrote :

We have been running this patch in production for a couple of weeks now and it works great when the memcache daemon goes down. However, one problem I noticed is that if the memcache server is physically disconnected or shut down, the connection timeout is ignored and the client hangs again.

Egor Egorov (me-egorfine) wrote :

Brian, we definitely need your attention here.

Brian Aker (brianaker) wrote :

I believe this is already in trunk, so 0.41 will have it.

Cheers,
 -Brian

On Jun 22, 2010, at 2:10 PM, Egor Egorov wrote:

> Brian, we definitely need your attention here.

peter mcarthur (ptrmcrthr) wrote :

I think the fix is incorrect for non-blocking connections. There is now no timeout setting that lets a non-blocking connection give up on a down server.

OPT_CONNECT_TIMEOUT is ignored, because of this section on line 22:

  int timeout= ptr->root->connect_timeout;
  if (ptr->root->flags.no_block == true)
    timeout= -1;
...
    error= poll(fds, 1, timeout);

I believe that makes the connect call blocking.

$options = array(
  Memcached::OPT_CONNECT_TIMEOUT => 1,    // milliseconds
  Memcached::OPT_POLL_TIMEOUT    => 1,    // milliseconds
  Memcached::OPT_RECV_TIMEOUT    => 1000, // usec
  Memcached::OPT_SEND_TIMEOUT    => 1000, // usec
);

$m = new memcached();
$m->setOptions($options);
$m->addServer("192.168.7.9", "11211");

$start = microtime(true);
$m->get('test');
echo $m->getResultMessage() . " after " . (microtime(true) - $start) . "s \n";
$m->get('test');
echo $m->getResultMessage() . " after " . (microtime(true) - $start) . "s \n";
$m->get('test');
echo $m->getResultMessage() . " after " . (microtime(true) - $start) . "s \n";

$end = microtime(true);
echo $end - $start . "\n";

# With OPT_NO_BLOCK:
$ php ~/test-nonblocking.php
SYSTEM ERROR after 189.00634503365s
SYSTEM ERROR after 378.03397202492s
SYSTEM ERROR after 567.05481410027s
567.05880594254

# Without OPT_NO_BLOCK:
$ php ~/test-blocking.php
A TIMEOUT OCCURRED after 0.042192935943604s
A TIMEOUT OCCURRED after 0.074706792831421s
A TIMEOUT OCCURRED after 0.089533805847168s
0.089677810668945

Andrey Sibiryov (kobolog) wrote :

This bug is related to https://bugs.launchpad.net/libmemcached/+bug/681778, where I've outlined the cause and a way to fix it.

Brian Aker (brianaker) wrote :

Just an update on this: my plan is to remove the "non-block" mode for all situations except socket shutdown. I've not decided what to do about that.

If you look in trunk you can see the current solution.

Brian Aker (brianaker) on 2011-02-13
Changed in libmemcached:
status: New → Fix Committed
Brian Aker (brianaker) on 2011-02-15
Changed in libmemcached:
status: Fix Committed → Fix Released