Connect syscall hangs in non-blocking mode

Bug #681778 reported by Javier Uruen Val
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
libmemcached
Fix Released
Undecided
Unassigned

Bug Description

In 0.44, I can confirm that if the connection is set to non-blocking, the connect() call will hang. This seems to be related to the fix that was introduced to solve this bug:

https://bugs.launchpad.net/libmemcached/+bug/583031 (Peter seems to report the same issue in that thread)

It seems to be that that patch fixes the blocking mode but breaks the non-blocking mode.

The relevant lines in connect.c

 23 if (ptr->root->flags.no_block == true)
 24 timeout= -1;

If non-blocking is true there will be no timeout and as a result the poll() call after connect() will hang until the socket descriptor used in connect() generates an event. This is a snippet of the strace output that captures that moment:

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
setsockopt(3, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
fcntl64(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS
poll([{fd=3, events=POLLOUT}], 1, -1

Please, correct me If I am wrong but we should be able to set a timeout for the connect() call in non-blocking mode as well.

As a possible fix I wonder if it would make sense to set the connect_timeout for both blocking and non-blocking?

Thanks

Revision history for this message
Andrey Sibiryov (kobolog) wrote :

Related bug: https://bugs.launchpad.net/libmemcached/+bug/583031

There are two places in the library where this code appears:

1) in connect.c:23-24 (connect_poll()), where it tries to connect to the remote socket;
2) in io.c:60-61 (io_wait()), where it waits for a blocked socket to become ready while transmitting data.

I'd like to point out that the library _always_ sets O_NONBLOCK on the socket, regardless of the behavior set by the user. The NO_BLOCK behavior only sets a linger structure for the socket for graceful socket shutdown. With this considered, the beforementioned code fragments are there to *simulate* blocking behavior by setting infinite poll timeouts on socket operations in non-blocking mode.

But in the current version (0.44), those two code fragments actually perform _opposite_ things - in the connect() part it sets infinite poll timeout for nonblocking mode, and in transmitting part it sets infinite poll timeout for blocking mode, which is weird.

To get the intended behavior, as I see it, both code fragments should set infinite timeouts for blocking mode to simulate completely blocking socket calls with nonblocking socket setup. But this approach is also flawed, because in that case we would get insane timeouts in this 'pseudo-blocking' mode.

With all that considered, I recommend removing blocking mode completely and use reasonably large user-set timeouts to achieve the same results.

Revision history for this message
Brian Aker (brianaker) wrote :

Thanks, let me think about your proposal. If I were to write this again today I wouldn't bother with a blocking mode. At the time, which was several years ago, it was something requested.

Let me think about what this would mean for all users.

Revision history for this message
Brian Aker (brianaker) wrote :

Just an update on this, my plan is to just remove the "non-block" mode for all situations except for socket shutdown. I've not decided on what to do about that.

If you look in trunk you can see the current solution.

BTW when I went back through and looked at the code I realize that years ago when I did this I wasn't really thinking of "block" and "non-block", I was thinking more about how buffering worked internally in the library.

Either way, the next version will have just a single timeout that can be set, and the block/no-block won't influence it.

Brian Aker (brianaker)
Changed in libmemcached:
status: New → Fix Committed
Brian Aker (brianaker)
Changed in libmemcached:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.