[29 Dec 6:39] Laurynas Biveinis
Description:
If something changes the index tree between the two dives in btr_estimate_n_rows_in_range, this goes undetected, making it return bogus row count for the range, resulting in bogus query plans.
create table t3
(
key1 int not null,
key2 int not null,
INDEX i1(key1),
INDEX i2(key2)
) engine=InnoDB;
--disable_query_log
insert into t3 values (1,1),(2,2);
let $1=9;
set @d=2;
while ($1)
{
eval insert into t3 select key1+@d, key2+@d from t3;
eval set @d=@d*2;
dec $1;
}
--enable_query_log
analyze table t3;
SET @@GLOBAL.innodb_purge_stop_now=TRUE;
alter table t3 add keyC int not null, add index iC(keyC);
update t3 set keyC=key1;
analyze table t3;
--connect(con1,localhost,root,,)
explain select * from t3 where
key1=1 or key2=2 or keyC=12;
SET DEBUG_SYNC = "btr_estimate_n_rows_in_range_between_dives SIGNAL estimate_ready WAIT_FOR estimate_finish";
send explain select * from t3 where
key1=1 or key2=2 or keyC=12;
connection default;
SET DEBUG_SYNC = "now WAIT_FOR estimate_ready";
SET @@GLOBAL.innodb_purge_run_now=TRUE;
--source include/wait_innodb_all_purged.inc
SET DEBUG_SYNC = "now SIGNAL estimate_finish";
connection con1;
reap;
explain select * from t3 where
key1=1 or key2=2 or keyC=12;
disconnect con1;
connection default;
drop table t3;
The testcase above executes the same EXPLAIN SELECT three times: with the iC index tree not purged at all, with it being purged between the two dives, and with the fully clean tree. The first and the last execution use index_merge, and the middle one is a table scan instead.
This is reduced from an intermittent main.index_merge_innodb failure:
main.index_merge_innodb [ fail ]
Test ended at 2016-12-09 07:08:51
CURRENT_TEST: main.index_merge_innodb
--- /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/release/Host/centos6-64/mysql-test/r/index_merge_innodb.result 2016-12-09 14:08:06.818000030 +0300
+++ /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/release/Host/centos6-64/build/mysql-test/var/log/index_merge_innodb.reject 2016-12-09 15:08:51.296000042 +0300
@@ -299,7 +299,7 @@
key5=5 or key6=6 or key7=7 or key8=8 or
key9=9 or keyA=10 or keyB=11 or keyC=12;
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE t3 index_merge i1,i2,i3,i4,i5,i6,i7,i8,i9,iA,iB,iC i1,i2,i3,i4,i5,i6,i7,i8,i9,iA,iB,iC 4,4,4,4,4,4,4,4,4,4,4,4 NULL 12 Using union(i1,i2,i3,i4,i5,i6,i7,i8,i9,iA,iB,iC); Using where
+1 SIMPLE t3 ALL i1,i2,i3,i4,i5,i6,i7,i8,i9,iA,iB,iC NULL NULL NULL 1024 Using where
select * from t3 where
key1=1 or key2=2 or key3=3 or key4=4 or
key5=5 or key6=6 or key7=7 or key8=8 or
mysqltest: Result content mismatch
Suggested fix:
The bug results in index dive paths with path1[0].nth_rec > path2[0].nth_rec, returning hundreds of rows in the range instead of expected one. Even without concurrent modifications aside, such path crossing is detected in btr_estimate_n_rows_in_range:
/* It is possible that
slot1->nth_rec >= slot2->nth_rec
if, for example, we have a single page
tree which contains (inf, 5, 6, supr)
and we select where x > 20 and x < 30;
in this case slot1->nth_rec will point
to the supr record and slot2->nth_rec
will point to 6 */
n_rows = 0;
But instead of returning zero right there, path tracing continues. Shouldn't it stop and return zero here instead?
Then, 5.7 actually catches some of the tree modifications in this function by [1]. It does not catch this particular tree change because both paths still point to the same page. A possible fix is extending [1] with a condition that the same page id node in the two paths should contain the same number of records too - but this might be too aggressive for the trees with heavy write volume, resulting in needless dive restarts and eventual returning arbitrary 10 rows in range fallback constant. Perhaps adding a check that the same page in the two paths should have the same number of records - but only if the paths have crossed - would be reasonable.
In any case such fix would be partial only as it's possible to modify the tree and still have the same number of records before and after.
Relax a too strict assert. If the tree is changed between both dives for
the left boundary and the right boundary, then our markers (slot1 and
slot2) could end up on different levels in the tree.
If we detect that this has happened - then retry a few times and if
still unsuccessful then return an arbitrary number for an estimate.
Reviewed-by: Marko M<C3><A4>kel<C3><A4> <email address hidden>
Reviewed-by: Jimmy Yang <email address hidden>
RB: 8570
[29 Dec 6:39] Laurynas Biveinis n_rows_ in_range, this goes undetected, making it return bogus row count for the range, resulting in bogus query plans.
Description:
If something changes the index tree between the two dives in btr_estimate_
How to repeat:
Apply
--- storage/ innobase/ btr/btr0cur. cc.orig 2016-12-29 07:16:06.000000000 +0200 innobase/ btr/btr0cur. cc 2016-12-29 07:16:08.000000000 +0200
+++ storage/
@@ -3708,6 +3708,9 @@
mtr_commit(&mtr);
+ if (!strcmp( index-> name, "iC")) C("btr_ estimate_ n_rows_ in_range_ between_ dives") ;
+ DEBUG_SYNC_
+
mtr_start(&mtr);
cursor.path_arr = path2;
Then use the following MTR testcase. It uses purge as the concurrent index tree writer.
--source include/ have_debug. inc have_debug_ sync.inc have_innodb. inc
--source include/
--source include/
create table t3
(
key1 int not null,
key2 int not null,
INDEX i1(key1),
INDEX i2(key2)
) engine=InnoDB;
--disable_query_log
insert into t3 values (1,1),(2,2);
let $1=9;
set @d=2;
while ($1)
{
eval insert into t3 select key1+@d, key2+@d from t3;
eval set @d=@d*2;
dec $1;
}
--enable_query_log
analyze table t3;
SET @@GLOBAL. innodb_ purge_stop_ now=TRUE;
alter table t3 add keyC int not null, add index iC(keyC);
update t3 set keyC=key1;
analyze table t3;
--connect( con1,localhost, root,,)
explain select * from t3 where
key1=1 or key2=2 or keyC=12;
SET DEBUG_SYNC = "btr_estimate_ n_rows_ in_range_ between_ dives SIGNAL estimate_ready WAIT_FOR estimate_finish";
send explain select * from t3 where
key1=1 or key2=2 or keyC=12;
connection default;
SET DEBUG_SYNC = "now WAIT_FOR estimate_ready";
SET @@GLOBAL. innodb_ purge_run_ now=TRUE; wait_innodb_ all_purged. inc
--source include/
SET DEBUG_SYNC = "now SIGNAL estimate_finish";
connection con1;
reap;
explain select * from t3 where
key1=1 or key2=2 or keyC=12;
disconnect con1;
connection default;
drop table t3;
The testcase above executes the same EXPLAIN SELECT three times: with the iC index tree not purged at all, with it being purged between the two dives, and with the fully clean tree. The first and the last execution use index_merge, and the middle one is a table scan instead.
This is reduced from an intermittent main.index_ merge_innodb failure:
main.index_ merge_innodb [ fail ]
Test ended at 2016-12-09 07:08:51
CURRENT_TEST: main.index_ merge_innodb percona- server- 5.6-param/ BUILD_TYPE/ release/ Host/centos6- 64/mysql- test/r/ index_merge_ innodb. result 2016-12-09 14:08:06.818000030 +0300 percona- server- 5.6-param/ BUILD_TYPE/ release/ Host/centos6- 64/build/ mysql-test/ var/log/ index_merge_ innodb. reject 2016-12-09 15:08:51.296000042 +0300 i4,i5,i6, i7,i8,i9, iA,iB,iC i1,i2,i3, i4,i5,i6, i7,i8,i9, iA,iB,iC 4,4,4,4, 4,4,4,4, 4,4,4,4 NULL 12 Using union(i1, i2,i3,i4, i5,i6,i7, i8,i9,iA, iB,iC); Using where i4,i5,i6, i7,i8,i9, iA,iB,iC NULL NULL NULL 1024 Using where
--- /mnt/workspace/
+++ /mnt/workspace/
@@ -299,7 +299,7 @@
key5=5 or key6=6 or key7=7 or key8=8 or
key9=9 or keyA=10 or keyB=11 or keyC=12;
id select_type table type possible_keys key key_len ref rows Extra
-1 SIMPLE t3 index_merge i1,i2,i3,
+1 SIMPLE t3 ALL i1,i2,i3,
select * from t3 where
key1=1 or key2=2 or key3=3 or key4=4 or
key5=5 or key6=6 or key7=7 or key8=8 or
mysqltest: Result content mismatch
Suggested fix: n_rows_ in_range:
The bug results in index dive paths with path1[0].nth_rec > path2[0].nth_rec, returning hundreds of rows in the range instead of expected one. Even without concurrent modifications aside, such path crossing is detected in btr_estimate_
/* It is possible that
slot1->nth_rec >= slot2->nth_rec
if, for example, we have a single page
tree which contains (inf, 5, 6, supr)
and we select where x > 20 and x < 30;
in this case slot1->nth_rec will point
to the supr record and slot2->nth_rec
will point to 6 */
n_rows = 0;
But instead of returning zero right there, path tracing continues. Shouldn't it stop and return zero here instead?
Then, 5.7 actually catches some of the tree modifications in this function by [1]. It does not catch this particular tree change because both paths still point to the same page. A possible fix is extending [1] with a condition that the same page id node in the two paths should contain the same number of records too - but this might be too aggressive for the trees with heavy write volume, resulting in needless dive restarts and eventual returning arbitrary 10 rows in range fallback constant. Perhaps adding a check that the same page in the two paths should have the same number of records - but only if the paths have crossed - would be reasonable.
In any case such fix would be partial only as it's possible to modify the tree and still have the same number of records before and after.
[1]:
commit 9ac1425053312b9 8fc9e6a999087ef b09e60daec
Author: Vasil Dimov <email address hidden>
Date: Fri Apr 17 15:04:58 2015 +0300
Fix Bug#20618309 ASSERT SLOT1->PAGE_LEVEL == SLOT2->PAGE_LEVEL, BTR_ESTIMATE_ N_ROWS_ IN_RANGE( )
Relax a too strict assert. If the tree is changed between both dives for
the left boundary and the right boundary, then our markers (slot1 and
slot2) could end up on different levels in the tree.
If we detect that this has happened - then retry a few times and if
still unsuccessful then return an arbitrary number for an estimate.
Reviewed-by: Marko M<C3><A4> kel<C3> <A4> <email address hidden>
Reviewed-by: Jimmy Yang <email address hidden>
RB: 8570