Query with impossible or constant subquery in WHERE or HAVING is not precomputed and thus not part of optimization

Bug #944706 reported by Elena Stepanova
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MariaDB
Fix Released
Medium
Timour Katchaounov

Bug Description

The following query

SELECT MAX( alias2.a ) AS field
FROM t1 AS alias1, t1 AS alias2, t1 AS alias3
WHERE alias1.a = alias2.a OR alias1.a = 'y'
HAVING field>'B' AND ( 'Moscow' ) IN ( SELECT a FROM t1 );

works almost instantly on MariaDB 5.2, but takes quite long, depending on the amount of data in t1, on MariaDB 5.3.

bzr version-info
revision-id: <email address hidden>
date: 2012-02-29 23:28:16 -0800
build-date: 2012-03-02 14:57:35 +0400
revno: 3451

bzr version-info
revision-id: <email address hidden>
date: 2012-02-28 13:50:30 +0200
build-date: 2012-02-29 03:39:46 +0400
revno: 3116
branch-nick: maria-5.2

EXPLAIN in 5.3:

id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY alias3 index NULL a 19 NULL 133 100.00 Using index
1 PRIMARY alias2 index a a 19 NULL 133 100.00 Using index; Using join buffer (flat, BNL join)
1 PRIMARY alias1 index a a 19 NULL 133 100.00 Using where; Using index; Using join buffer (incremental, BNL join)
2 MATERIALIZED t1 index a a 19 NULL 133 100.00 Using index
Warnings:
Note 1003 select max(`test`.`alias2`.`a`) AS `field` from `test`.`t1` `alias1` join `test`.`t1` `alias2` join `test`.`t1` `alias3` where ((`test`.`alias1`.`a` = `test`.`alias2`.`a`) or (`test`.`alias1`.`a` = 'y')) having ((`field` > 'B') and <expr_cache><'Moscow'>(<in_optimizer>('Moscow','Moscow' in ( <materialize> (select `test`.`t1`.`a` from `test`.`t1` ), <primary_index_lookup>('Moscow' in <temporary table> on distinct_key where (('Moscow' = `<subquery2>`.`a`)))))))

optimizer_switch in 5.3 (default):
index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on
join_cache_level=2 (default)

EXPLAIN in 5.2:

id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY NULL NULL NULL NULL NULL NULL NULL NULL Impossible HAVING
2 SUBQUERY t1 index_subquery a a 19 const 1 100.00 Using index; Using where
Warnings:
Note 1003 select max(`test`.`alias2`.`a`) AS `field` from `test`.`t1` `alias1` join `test`.`t1` `alias2` join `test`.`t1` `alias3` where (multiple equal(`test`.`alias1`.`a`, `test`.`alias2`.`a`) or multiple equal('y', `test`.`alias1`.`a`)) having 0

optimizer_switch in 5.2 (default):
index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,table_elimination=on

Test case:

CREATE TABLE t1 ( a VARCHAR(16), KEY (a) );
INSERT INTO t1 VALUES
('Abilene'),('Akron'),('Albany'),('Albuquerque'),
('Alexandria'),('Allentown'),('Amarillo'),('Anaheim'),
('Anchorage'),('Ann Arbor'),('Arden-Arcade'),
('Arlington'),('Arlington'),('Arvada'),
('Athens-Clarke County'),('Atlanta'),
('Augusta-Richmond County'),('Aurora'),('Aurora'),
('Austin'),('Bakersfield'),('Baltimore'),
('Baton Rouge'),('Beaumont'),('Bellevue'),
('Berkeley'),('Billings'),('Birmingham'),
('Boise City'),('Boston'),('Boulder'),('Bridgeport'),
('Brockton'),('Brownsville'),('Buffalo'),('Burbank'),
('Cambridge'),('Cape Coral'),('Carrollton'),
('Carson'),('Cary'),('Cedar Rapids'),('Chandler'),
('Charleston'),('Charlotte'),('Chattanooga'),
('Chesapeake'),('Chicago'),('Chula Vista'),
('Cincinnati'),('Citrus Heights'),('Clarksville'),
('Clearwater'),('Cleveland'),('Colorado Springs'),
('Columbia'),('Columbus'),('Columbus'),('Compton'),
('Concord'),('Coral Springs'),('Corona'),
('Corpus Christi'),('Costa Mesa'),('Dallas'),
('Daly City'),('Davenport'),('Dayton'),('Denver'),
('Des Moines'),('Detroit'),('Downey'),('Durham'),
('East Los Angeles'),('El Cajon'),('El Monte'),
('El Paso'),('Elgin'),('Elizabeth'),('Erie'),
('Escondido'),('Eugene'),('Evansville'),('Fairfield'),
('Fall River'),('Fayetteville'),('Flint'),('Fontana'),
('Fort Collins'),('Fort Lauderdale'),('Fort Wayne'),
('Fort Worth'),('Fremont'),('Fresno'),('Fullerton'),
('Gainesville'),('Garden Grove'),('Garland'),('Gary'),
('Gilbert'),('Glendale'),('Glendale'),
('Grand Prairie'),('Grand Rapids'),('Green Bay'),
('Greensboro'),('Hampton'),('Hartford'),('Hayward'),
('Henderson'),('Hialeah'),('Hollywood'),('Honolulu'),
('Houston'),('Huntington Beach'),('Huntsville'),
('Independence'),('Indianapolis'),('Inglewood'),
('Irvine'),('Irving'),('Jackson'),('Jacksonville'),
('Jersey City'),('Joliet'),('Kansas City'),
('Kansas City'),('Kenosha'),('Knoxville'),
('Lafayette'),('Lakewood'),('Lancaster'),('Lansing')
;
SELECT MAX( alias2.a ) AS field
FROM t1 AS alias1, t1 AS alias2, t1 AS alias3
WHERE alias1.a = alias2.a OR alias1.a = 'y'
HAVING field>'B' AND ( 'Moscow' ) IN ( SELECT a FROM t1 );

# End of test case

Tags: optimizer
Revision history for this message
Sergey Petrunia (sergefp) wrote :

(comment based on the original, non-simplified testcase. The testcase posted here looks ok but I did not do a real check with it)

This bug demonstrates a problem with the optimizer, in particular with the choice between IN->EXISTS and Materialization strategies.

We have a query:

  const1 IN (SELECT inner_expr FROM ... )

Why would the optimizer pick Materialization, when we have "const1" on the left side, and so will make only one lookup in the materialized table? It is obvious that IN->EXISTS will always be better than Materialization for such cases.

Changed in maria:
status: New → Confirmed
Changed in maria:
importance: Undecided → High
Changed in maria:
assignee: Sergey Petrunia (sergefp) → Timour Katchaounov (timour)
Changed in maria:
status: Confirmed → In Progress
Revision history for this message
Timour Katchaounov (timour) wrote :

Unfortunately the analysis in comment #1 is incorrect as shown below.

The real reason for the difference in execution times between 5.2 and 5.3 is a result of re-engineering subquery optimization to be done during the optimization phase. One consequence of this change is that subqueries can no longer be executed during optimization of the outer query. The reason for this is two-fold:

(1) A subquery may be arbitrarily expensive, thus it may make optimization and explain arbitrarily slow.
(2) Execution of subqueries during the optimization phase has side-effects that permanently change some data structures that represent a subquery. These side-effects later may lead to crashes.

In 5.2 a subquery of the type "const1 IN (SELECT inner_expr FROM ... )" is evaluated during the optimize_cond phase in JOIN::optimize, the result is substituted in the containing condition, and the condition is simplified. In the case of this example, the optimizer finds that the HAVING clause is FALSE, and produces an empty result without even executing the query.

In 5.3 subquery predicates are not evaluated during optimization, therefore such subsitution cannot be performed, and the query is executed as a normal JOIN.

Further investigation of the query plan in 5.3 shows that if we use the default join_cache_level=2, execution is very slow ~1.8 sec, while with join_cache_level=0, execution takes ~0.07 sec. This problem can be shown without a subquery, thus it is not related to this bug.

The attachment below contains detailed EXPLAINs, and execution times for 5.2/5.3 with various combinaitons of materialization=on/off, and join_cache_level=0/2.

Revision history for this message
Timour Katchaounov (timour) wrote :

EXPLAINs, and execution times for 5.2, 5.3 with different join_cache_level, and materialization.

Revision history for this message
Timour Katchaounov (timour) wrote :

Possible solutions:

1) Delayed constant optimization.

There are various optimizations that rely on constant evaluation.
One way to address this problem would be to:
- Decide on which constant optimizations are the most important ones, and
- add an extra phase in the beginning of query execution that repeats
  all important constant optimizations for expensive conditions.
As a result of this phase we may detect at the very start of query execution that
a query result is empty, e.g. because of FALSE WHERE or HAVING clauses.

The disadvantages of this solution are:
- There may be dependencies on other optimizations, which in turn may make the
approach unfeasible for some (or most) of the constant optimizations.
- If not all optimizations are repeated, there will still be cases of performance
regressions compared to other server versions that use subqueries during
constant optimization, and it is hard to know which cases are the most common ones.

2) Constant optimization for "cheap" subqueries

The second approach is to:
- Pre-optimize early those queries that may potentially be used for constant optimization.
- Estimate the cost of subquery evaluation for the above subqueries.
- Modify the Item::is_expensive() method to use the cost estimate to decide if a subquery is "cheap".
- Evaluate "cheap" subqueries whenever necessary. This would follow automatically from the changed
implementation of is_expensive().

Problems/disadvantages:
- There are few cases when EXPLAIN needs data structures (TABLE_LIST) that were changed destructively
while a subquery was executed *before* the actual explain code.
- The most time-consuming part seems to be that enabling constant optimization for subqueries results in
several wrong results. Some of them are known bugs in MySQL, others possibly specific to MariaDB.

Revision history for this message
Timour Katchaounov (timour) wrote :
Download full text (4.2 KiB)

If subqueries are not expensive (5.2, or changed 5.3) the query is optimized as follows:

- JOIN::optimize for the outer query calls:
  having= optimize_cond(this, having, join_list, &having_value, &having_equal);
- optimize_cond() calls remove_eq_conds()
- remove_eq_conds() calls eval_const_cond() as follows:
  else if (cond->const_item() && !cond->is_expensive())
  {
    *cond_value= eval_const_cond(cond) ? Item::COND_TRUE : Item::COND_FALSE;
    return (COND*) 0;
  }
- eval_const_cond() calls Item_in_optimizer::val_int(), which in turn
  optimizes and executes the subquery, and returns FALSE to remove_eq_conds().
- remove_eq_conds() returns a NULL item, and sets JOIN::having_value = Item::COND_FALSE.
- JOIN::optimize checks that having_value is Item::COND_FALSE, and sets
  zero_result_cause= "Impossible HAVING"
- JOIN::exec checks for zero_result_cause and returns empty result without executing.
  EXPLAIN shows "Impossible HAVING".

If subqueries are expensive (5.3), the query is processed as follows:
- optimize_cond for the outer JOIN doesn't remove the HAVING clause.
- The optimizer optimizes both the query and the subquery.
- Execution proceeds normally by executing the three-way join in the
  query until end_send_group evaluates the HAVING clause once, which
  results in a single subquery execution.

Since the subquery is evaluated only once, one may wonder why this query
takes so much time. The reason is an inefficient JOIN plan that is chosen
with the default join_cache_level=2. The plan is:
+----+--------------------+--------+----------------+---------------+------+---------+-------+------+----------+--------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+----------------+---------------+------+---------+-------+------+----------+--------------------------------------------------------------+
| 1 | PRIMARY | alias3 | index | NULL | a | 19 | NULL | 133 | 100.00 | Using index |
| 1 | PRIMARY | alias2 | index | a | a | 19 | NULL | 133 | 100.00 | Using index; Using join buffer (flat, BNL join) |
| 1 | PRIMARY | alias1 | index | a | a | 19 | NULL | 133 | 100.00 | Using where; Using index; Using join buffer (flat, BNL join) |
| 2 | DEPENDENT SUBQUERY | t1 | index_subquery | a | a | 19 | const | 10 | 100.00 | Using index; Using where |
+----+--------------------+--------+----------------+---------------+------+---------+-------+------+----------+--------------------------------------------------------------+

This plan takes 1.8 sec.

With join_cache_level=0 there is a much better plan:
+----+--------------------+--------+----------------+---------------+------+---------+-------+------+----------+--------------------------+
| id | select_type | table | type ...

Read more...

Revision history for this message
Timour Katchaounov (timour) wrote :
Download full text (4.7 KiB)

Let's analyze a simpler query:

SELECT MAX(alias2.a)
FROM t1 AS alias1, t1 AS alias2, t1 AS alias3
WHERE alias1.a = alias2.a OR ('Moscow') IN ( SELECT a FROM t1 );

If subqueries are not expensive (5.2, or changed 5.3) the query is optimized as the
original test case, with the only difference, that optimize_cond is run for the WHERE
clause, and the WHERE clause is transformed into: "alias1.a = alias2.a".

This allows the JOIN optimizer to find a good plan that takes 0.02 sec:
+----+--------------------+--------+----------------+---------------+------+---------+--------------------+------+----------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+----------------+---------------+------+---------+--------------------+------+----------+-------------------------------------------------+
| 1 | PRIMARY | alias1 | index | a | a | 19 | NULL | 133 | 100.00 | Using where; Using index |
| 1 | PRIMARY | alias2 | ref | a | a | 19 | lpb944706.alias1.a | 1 | 100.00 | Using index |
| 1 | PRIMARY | alias3 | index | NULL | a | 19 | NULL | 133 | 100.00 | Using index; Using join buffer (flat, BNL join) |
| 2 | DEPENDENT SUBQUERY | t1 | index_subquery | a | a | 19 | const | 10 | 100.00 | Using index; Using where |
+----+--------------------+--------+----------------+---------------+------+---------+--------------------+------+----------+-------------------------------------------------+

In 5.3 where subqueries are expensive, the IN cannot be optimized away, and the OR condition doesn't allow a plan with ref access. The resulting plan takes ~ 3 sec:

+----+--------------------+--------+----------------+---------------+------+---------+-------+------+----------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+----------------+---------------+------+---------+-------+------+----------+---------------------------------------------------------------------+
| 1 | PRIMARY | alias3 | index | NULL | a | 19 | NULL | 133 | 100.00 | Using index |
| 1 | PRIMARY | alias2 | index | a | a | 19 | NULL | 133 | 100.00 | Using index; Using join buffer (flat, BNL join) |
| 1 | PRIMARY | alias1 | index | a | a | 19 | NULL | 133 | 100.00 | Using where; Using index; Using join buffer (incremental, BNL join) |
| 2 | DEPENDENT SUBQUERY | t1 | index_sub...

Read more...

Michael Widenius (monty)
Changed in maria:
importance: High → Medium
milestone: 5.3 → 5.5
Michael Widenius (monty)
summary: - Query with impossible HAVING takes 1 millisec on 5.2 and 8 sec on 5.3
+ Query with impossible or constant subquery in WHERE or HAVING is not
+ precomputed and thus not part of optimization
Revision history for this message
Elena Stepanova (elenst) wrote :

Also filed in JIRA as MDEV-193

Elena Stepanova (elenst)
tags: added: optimizer
Revision history for this message
Timour Katchaounov (timour) wrote :

Pushed to MariaDB 5.5.25.

Changed in maria:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.