Comment 2 for bug 882833

Revision history for this message
fimbulvetr (fimbulvetr) wrote :

I had the wrong query on my initial report. Anonymizing the columns messed with me a bit. Here is the correct one:

select a from t1 where a in ( select a from t1 where e = 5 and d = 0 ) and e = 5 and d = 0 and ( g is null or g in (1) ) order by b limit 11;

(The change is that now we use d instead of f).

I have inserted some data from my prod table into t1, and ran the query - the explain I paste below exactly matches the explain from the production, failing dataset (With the exception of # rows for the subquery - in the one pasted they both list 222k rows - in the production data set, the subquery rows is 1/3 of the primary's rows, if that matters).

explain select a from t1 where a in ( select a from t1 where e = 5 and d = 0 ) and e = 5 and d = 0 and ( g is null or g in (1) ) order by b limit 11;

+----+-------------+-------+-------------+-------------------------+------+---------+-------------------+--------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+-------------------------+------+---------+-------------------+--------+---------------------------------------+
| 1 | PRIMARY | t1 | ref_or_null | i,l,m,n | n | 16 | const,const,const | 221063 | Using index condition; Using filesort |
| 2 | SUBQUERY | t1 | ref | PRIMARY,i,j,m,n,o,a,a_2 | n | 7 | const,const | 221062 | Using index |
+----+-------------+-------+-------------+-------------------------+------+---------+-------------------+--------+---------------------------------------+

Note there is something else peculiar that I did not report initially but now see as a correlation: When the query executes correctly, I get results back in 1.61 or so seconds. When it executes incorrectly, the results come in about .20 seconds.

Whether the query works or not appears to still be random at this point, but the explain plan never changes regardless of it being immediately before/after a a correct/incorrect result. I've executed the explain immediately after & before the query countless times.

On a cold start, this query when incorrect takes 1.4s. On a cold start this query when correct takes 1m21s (15GB Table).

This production data table is constantly being inserted/updated (Never deleted) by way of replication. It follows a diurnal pattern, with peaks about 40 inserts a second and updates about 20 per second. The incorrect result has been spotted regardless of the #inserts/updates a second.