sock.recv fails to return in some cases (infinite wait)

Bug #865859 reported by Dillon Sadofsky on 2011-10-04
This bug affects 1 person
Affects: MySQL Connector/Python
Status: Triaged
Importance: Critical
Assigned to: Geert JM Vanderkelen

Bug Description

Python version: 3.2
Connector version: 0.3.2 (devel rev. 292)
MySQL version: 5.0.51a-12-log
I've been debugging a very strange issue in an application that uses this connector. It manifests in only some queries. Strangely, any modification to the query (adding or removing whitespace) changes whether it happens.

Using liberal traces, I drilled from fetchall all the way down to connection.py line 162:
            buf.extend(self.sock.recv(self.recvsize))

On the affected queries, fetchall starts in the cursor and several packets are successfully read from the socket, but then recv_plain reaches line 162 and self.sock.recv never returns. By default the socket has no timeout value and is blocking.

I have been using this connector without issue up to this point, so this behavior is baffling and strange. I will include an example query that caused it for me. I doubt it will manifest for anyone else, as I'm guessing the data in question may affect it:

SELECT fileid, filechunkid, `sequence`, `data`, file.name, file.created, file.updated, file.hash, file.type FROM `filechunk` JOIN `file` USING (fileid) WHERE fileid IN(16945, 16946, 16947, 16948, 16949, 16950, 16951, 16952, 16953, 16954, 16955, 16956, 16957, 16958, 16959, 16960, 16961, 16962, 16963, 16964, 16965, 16966, 16967, 16968, 16969)

Modifying any whitespace or backticks makes the issue disappear. This suggests to me that the return packets may include the original query in some way, and that even barely changing the size of the result causes some offset to change?

It is worth noting that the app in question is retrieving blobs to replicate. The above query attempts to retrieve 25 blobs of roughly 100 KB each in the data column. I would guess it might be a size issue, but I've also retrieved 500 at once with no trouble (though again, some queries trigger the error). These queries run and return successfully in SQLyog in a reasonable amount of time. The instances in which it happens seem random but are deterministic (at least reproducible).

I have worked around this by adding a 60-second timeout in my application. I do not like this workaround, and due to the nature of this application it is not an acceptable solution. Any help, suggestions, or workarounds would be much appreciated.

Maybe making the socket non-blocking and doing a timed wait on it using select() would be a better solution? Possibly the content of my returned bytes is being misinterpreted by some layer, causing an issue in the Python socket?
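The select()-based approach suggested above could be sketched as follows. This is a hypothetical wrapper, not code from the connector; the helper name and the 60-second default are illustrative, and a local socket pair stands in for the MySQL connection so the demo is self-contained:

```python
import select
import socket

def recv_with_timeout(sock, nbytes, timeout=60.0):
    """Wait up to `timeout` seconds for readable data instead of
    blocking forever, then perform a single recv()."""
    ready, _, _ = select.select([sock], [], [], timeout)
    if not ready:
        raise TimeoutError("no data received within {0}s".format(timeout))
    return sock.recv(nbytes)

# Self-contained demo with a local socket pair (no MySQL server needed)
a, b = socket.socketpair()
b.sendall(b"packet")
print(recv_with_timeout(a, 4096, timeout=1.0))  # b'packet'
a.close()
b.close()
```

The advantage over a plain blocking recv() is that a hung server thread surfaces as a TimeoutError the application can handle, rather than an infinite wait.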


As an addendum, I've tested this on both Windows and Unix with the same result. Additionally, the faulty query returns as expected when I run it through a C++ MySQL wrapper that I use.

I was able to reproduce this issue in IDLE using the vanilla 0.3.2 connector (I have a subclass/wrapper with custom behavior so I wanted to test without it).

The fact that the exact text affects it is starting to make me think that it might be caching-related. I added SQL_NO_CACHE to my selects to see if that fixes it. If so, it must be a bug in that version of MySQL, which fails to properly deal with large, improperly cached result sets. The PROCESSLIST on the server shows the query as complete and the connection as sleeping while the client is hung. I'm thinking that maybe the MySQL thread for the client has some kind of hiccup related to the cache and fails to send the goodbye handshake? I'm not sure why it would happen in the middle of result retrieval (adding traces shows that some packets are received during fetchall).

Trying to reproduce this problem, I've come up with the code below. Please check whether it mimics your operations.
There are probably some more factors which trigger the problem. Has it been seen again lately?

import mysql.connector

def bug865859():
    cnx = mysql.connector.connect(user='root', database='test')
    cur = cnx.cursor()
    cur.execute("DROP TABLE IF EXISTS t1")
    cur.execute("CREATE TABLE t1 (id TINYINT UNSIGNED AUTO_INCREMENT, "
                "c1 LONGBLOB, PRIMARY KEY (id))")

    # Populate: 25 blobs of 200 KB each
    query = "INSERT INTO t1 (c1) VALUES (REPEAT('a', {size}))".format(
        size=200 * 1024)
    for _ in range(25):
        cur.execute(query)

    # Get data
    cur.execute("SELECT id, c1 FROM t1 WHERE id IN (1,2,3,4,5,6,7,8,9,10,"
                "11,12,13,14,15,16,17,18,19,20)")
    for row in cur:
        print("{0} has size {1}".format(row[0], len(row[1])))

    cur.close()
    cnx.close()

Changed in myconnpy:
status: New → Incomplete
assignee: nobody → Geert JM Vanderkelen (geertjmvdk)

Confirming the bug. Happened while running test case for bug lp:691836.
Seems Python 3.2 specific for now.

Changed in myconnpy:
status: Incomplete → Confirmed
importance: Undecided → High

Correction: the problem also occurs with Python 3.1. It seems to work OK on MacOS.

I kept working on this, and it went away permanently when I added SELECT SQL_NO_CACHE ... to my query. This means that it's not this application but some bug in MySQL that is causing it. Maybe this wrapper should detect such hangs in some way and return an error.
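For reference, the workaround amounts to prefixing the select list with the SQL_NO_CACHE hint, which tells the server to bypass the query cache for that statement. The query below is abbreviated from the one in the original report:

```python
# SQL_NO_CACHE bypasses the server-side query cache for this statement,
# which made the hang disappear in the reporter's case.
query = ("SELECT SQL_NO_CACHE fileid, filechunkid, `sequence`, `data` "
         "FROM `filechunk` JOIN `file` USING (fileid) "
         "WHERE fileid IN (16945, 16946, 16947)")
```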

My application looped, scanning for changes in large blobs, reading them in, then writing them out to another connection (changes detected using hashes). I found that everything would work until I killed the application in the middle of a query. It seems that if you run a query that returns a bunch of blobs, then kill it after it has executed but before it has finished returning results, running that EXACT SAME QUERY again produces the above behavior. MySQL's result-set caching is based on the exact query text, so my guess is that if a blob-returning query gets killed, the cached data becomes 'bad' in some way.

The really baffling thing is that, in that case, the MySQL process list shows the query come in and get handled; the server then considers the operation complete while the client is still waiting for data. I'm guessing there is a very rare case here in which the server returns a non-traditional result (maybe not the normal error packet we expect), and the wrapper interprets it as a normal "OK, the query was run, I'll send you the data now." If the sockets had timeout values, this would at least allow a workaround.
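The socket-timeout workaround described here can be sketched with plain sockets. The 0.1-second value is only to keep the demo fast (the reporter used 60 seconds), and a local socket pair stands in for the MySQL connection:

```python
import socket

a, b = socket.socketpair()
a.settimeout(0.1)  # demo value; the reporter used 60 seconds
try:
    a.recv(4096)   # nothing was sent, so this would otherwise block forever
except socket.timeout:
    print("recv timed out instead of hanging")
a.close()
b.close()
```

With a timeout set, the hung recv() surfaces as a socket.timeout exception the application can catch, instead of an infinite wait.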

Changed in myconnpy:
importance: High → Critical

@Dillon

I have a similar problem with one of the test cases (mentioned in note #4) and found a fix. It seems that for Python 3, it's good to have __del__ call close(). I'm not sure it will help you, but can you please apply the patch and see if it works for you?
The change goes in MySQLConnection:

class MySQLConnection(object):

    def __del__(self):
        self.close()

The test case involves assigning a new connection over a connection still in use which, for example, has not committed its changes:

    cnx = mysql.connector.connection.MySQLConnection(**config)
    cur = cnx.cursor()
    cur.execute("DROP TABLE IF EXISTS t1")
    cur.execute("CREATE TABLE t1 (c1 INT)")
    cur.execute("INSERT INTO t1 (c1) VALUES (1)")
    cnx = mysql.connector.connection.MySQLConnection(**config)
    cur = cnx.cursor()
    cur.execute("DROP TABLE IF EXISTS t1")

I'm adding an explicit __del__ destructor to the Python 2 code as well.

I used to have a similar destructor in my wrapper class, but I ended up commenting it out; I'm not sure why. I'll re-enable it, but since I have already worked around the issue in my case, it will be difficult to tell whether it helps. Again, the issue seems to go away magically until you kill a query selecting blobs and then run that query again later; I had a lot of false-positive cures until I did that again.

Changed in myconnpy:
milestone: none → 0.4

The code for reading from and writing to the network has been refactored. It's a bit odd that SQL_NO_CACHE (used for the query cache) would have an effect.
Let's see if the next release (no date yet) fixes the above.

Changed in myconnpy:
status: Confirmed → Triaged