Query with (a is null or a in (1)) sometimes returns only rows with a = 1

Bug #882833 reported by fimbulvetr
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MariaDB
New
Undecided
Unassigned

Bug Description

I've tried very hard to reproduce this on a test set, or reliably, but I cannot. I spoke with Monty in IRC yesterday, he advised I use a debug build since I could create a test case. That option failed as well - with the debug build the result is reliable.

I have three mariadb servers all of the same build (3249) all running repl and built from snapshots of 5.5.5-m3-log data. The problem persists after a optimize table t1 on all three servers.

The table is similar to:

CREATE TABLE `t1` (
  `a` bigint(20) NOT NULL DEFAULT '0',
  `b` varchar(512) DEFAULT NULL,
  `c` int(10) DEFAULT NULL,
  `d` tinyint(1) DEFAULT NULL,
  `e` int(10) DEFAULT NULL,
  `f` tinyint(1) DEFAULT NULL,
  `g` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`a`),
  UNIQUE KEY (`a`,`d`),
  UNIQUE KEY (`a`,`d`,`e`),
  KEY `h` (`c`),
  KEY `i` (`e`,`b`),
  KEY `j` (`b`,`d`,`e`),
  KEY `k` (`b`),
  KEY `l` (`g`),
  KEY `m` (`e`,`d`,`c`,`b`) USING BTREE,
  KEY `n` (`e`,`d`,`g`,`c`) USING BTREE,
  KEY `o` (`f`,`d`,`e`,`b`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The query is similar to:

select a from t1 where a in ( select a from t1 where e = 5 and f = 0 ) and e = 5 and f = 0 and ( g is null or g in (1) ) order by b limit 10;

There are 35m records in the table. There should be 150,000 records returned by this query. There are usually 0-4 of these records where g = 1 when I see the problem manifest. The rest are null.

The problem appears about 80% of the time, and when it does it _only_ returns the rows where g = 1, it does not honor the g is null portion.

The problem disappears if I write

( g is null or g in (1, null))

I am fairly certain order by related. The problem has never occurred with no order by.

I am confident limit is not related. Monty asked me to test this. If I take limit off, it will only return those rows where g = 1.

I am also confident this is not related to optimizer_switch. I shut all of them off (and alternated with in_to_exists and materialization, where appropriate) and was able to reliably reproduce regardless of any optimizer_switch setting.

I am also confident query_cache_type is not related. The problem manifests regardless of this setting.

I have worked around it by using:

( g is null or g in (1, null))

for now. But I did want to open a bug for future reference. I am willing to assist in many ways, but I cannot share my schema or data.

Revision history for this message
Philip Stoev (pstoev-askmonty) wrote :

Hello,

Thank you very much for your bug report. Can you please provide the EXPLAIN of the problematic query? If possible, please run EXPLAIN on the same query several dozen times and report all the different EXPLAINs that you see.

Thank you.

Revision history for this message
fimbulvetr (fimbulvetr) wrote :

I had the wrong query on my initial report. Anonymizing the columns messed with me a bit. Here is the correct one:

select a from t1 where a in ( select a from t1 where e = 5 and d = 0 ) and e = 5 and d = 0 and ( g is null or g in (1) ) order by b limit 11;

(The change is that now we use d instead of f).

I have inserted some data from my prod table into t1, and ran the query - the explain I paste below exactly matches the explain from the production, failing dataset (With the exception of # rows for the subquery - in the one pasted they both list 222k rows - in the production data set, the subquery rows is 1/3 of the primary's rows, if that matters).

explain select a from t1 where a in ( select a from t1 where e = 5 and d = 0 ) and e = 5 and d = 0 and ( g is null or g in (1) ) order by b limit 11;

+----+-------------+-------+-------------+-------------------------+------+---------+-------------------+--------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+-------------------------+------+---------+-------------------+--------+---------------------------------------+
| 1 | PRIMARY | t1 | ref_or_null | i,l,m,n | n | 16 | const,const,const | 221063 | Using index condition; Using filesort |
| 2 | SUBQUERY | t1 | ref | PRIMARY,i,j,m,n,o,a,a_2 | n | 7 | const,const | 221062 | Using index |
+----+-------------+-------+-------------+-------------------------+------+---------+-------------------+--------+---------------------------------------+

Note there is something else peculiar that I did not report initially but now see as a correlation: When the query executes correctly, I get results back in 1.61 or so seconds. When it executes incorrectly, the results come in about .20 seconds.

Whether the query works or not appears to still be random at this point, but the explain plan never changes regardless of it being immediately before/after a a correct/incorrect result. I've executed the explain immediately after & before the query countless times.

On a cold start, this query when incorrect takes 1.4s. On a cold start this query when correct takes 1m21s (15GB Table).

This production data table is constantly being inserted/updated (Never deleted) by way of replication. It follows a diurnal pattern, with peaks about 40 inserts a second and updates about 20 per second. The incorrect result has been spotted regardless of the #inserts/updates a second.

Revision history for this message
Philip Stoev (pstoev-askmonty) wrote :

Hello,

Thank you very much for the additional information. Can you please try disabling the index condition pushdown using the following option:

 SET GLOBAL optimizer_switch='index_condition_pushdown=off';
 SET SESSION optimizer_switch='index_condition_pushdown=off';

And check if the issue is still there?

Revision history for this message
Philip Stoev (pstoev-askmonty) wrote :

I have tried to reproduce your bug using your schema and synthetic data, but I could not. I get the exact same EXPLAIN as you do, however the result is consistent for me.

So, it seems to me that the bug is caused by something in your exact data and the distribution of its values. Since you have already obfuscated the names of the columns and tables, if you also drop the varchar column (and the bug remains repeatable), the table will contain integers only.

If this alleviates your privacy concerns, would you be willing to provide the obfuscated data set? We can arrange a secure transfer method and we will only use it to debug the problem and will not share it with anyone else.

Revision history for this message
fimbulvetr (fimbulvetr) wrote :

Hello,

The problem exists with or without index condition pushdown. I actually have another wrong result bug related to index_condition_pushdown=on on the same table but am in the same situation - unable to synthesize failing data.

As far as the varchar - I cannot remove this column as it is the order by. The problem disappears when I remove the order by or order by an alternate column. (This may be very valuable information - to repeat, the problem only appears if I sort by this varchar). The problem remains whichever direction I sort by. I do not believe I will have time today or tomorrow to continue working on this but middle of this week I will attempt again to synthesize data. I hope you can help me come up with ideas on how to do this.

Revision history for this message
Elena Stepanova (elenst) wrote :

Hi,

Have you observed the problem ever since, or is it gone?

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.