Failing the test subselect_notembedded

Bug #532481 reported by benjaminlin on 2010-03-05
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Drizzle
Fix Released
Medium
David Shrewsbury
Cherry
Won't Fix
Medium
Patrick Crews
Dexter
Fix Released
Medium
David Shrewsbury

Bug Description

make test failed in testing subselect_notembedded.

Related branches

Lee Bieber (kalebral) wrote :

What OS version are you seeing this failure on?

Mac OS X

Ben

On Mon, Mar 22, 2010 at 2:31 PM, Lee Bieber <email address hidden> wrote:
> What OS version are you seeing this failure on?
>
> --
> Failing the test subselect_notembedded
> https://bugs.launchpad.net/bugs/532481
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in A Lightweight SQL Database for Cloud and Web: New
>
> Bug description:
> make test failed in testing subselect_notembedded.
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/drizzle/+bug/532481/+subscribe
>

Lee Bieber (kalebral) wrote :

I'm not able to reproduce this on any of our Mac machines with the latest from trunk.

Changed in drizzle:
status: New → Incomplete
benjaminlin (benjaminlin1019) wrote :

Hi Beiber:

  Thanks, FWIW, I was running snow leopard. I can verify it again but
my macbook recently just died,
so I need to fix it within some days...:(

Ben

On Wed, Mar 24, 2010 at 9:15 AM, Lee Bieber <email address hidden> wrote:
> I'm not able to reproduce this on any of our Mac machines with the
> latest from trunk.
>
> ** Changed in: drizzle
>       Status: New => Incomplete
>
> ** Also affects: drizzle/cherry
>   Importance: Undecided
>       Status: Incomplete
>
> ** Changed in: drizzle/cherry
>    Milestone: None => 2010-03-29
>
> --
> Failing the test subselect_notembedded
> https://bugs.launchpad.net/bugs/532481
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in A Lightweight SQL Database for Cloud and Web: Incomplete
> Status in Drizzle cherry series: Incomplete
>
> Bug description:
> make test failed in testing subselect_notembedded.
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/drizzle/+bug/532481/+subscribe
>

David Shrewsbury (dshrews) wrote :

Just reproduced this on my OS X 10.6 machine with revno 1509:

main.subselect_notembedded [ fail ]
drizzletest: At line 42: query '$start $end' failed with wrong errno 20: 'drizzle_state_read:lost connection to server (EOF)', instead of 0...

The result from queries just before the failure was:
< snip >
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3

More results from queries before failure can be found in /Users/david/Devel/ignore/repos/drizzle/my_copy/tests/var/log/subselect_notembedded.log

Aborting: main.subselect_notembedded failed in default mode.
To continue, re-run with '--force'.
Stopping All Servers

Looks like the server may be crashing within InnoDB code:

Davids-iMac:tests david$ ./test-run --gdb subselect_notembedded

Version: '2010.04.1509' Source distribution (my_copy)
[Switching to process 71716]

Breakpoint 1, drizzled::mysql_parse (session=0x10202b400, inBuf=0x10af7ba68 "create table t1(a int,b int,key(a),key(b))", length=42) at drizzled/sql_parse.cc:718
718 lex_start(session);
(gdb) c
Continuing.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000000010cb65ff8
0x0000000100353844 in page_cur_search_with_match (block=0x102b102c0, index=0x10af82e90, tuple=0xcf61012d00000000, mode=10955007193116212510, iup_matched_fields=0x18969010001d4, iup_matched_bytes=0xd2f3d70001cf2001, ilow_matched_fields=0x10cb667c8, ilow_matched_bytes=0x10cb667c0, cursor=0x10cb687b8) at plugin/innobase/page/page0cur.c:269
269 {
(gdb)

Lee Bieber (kalebral) wrote :

Which version of gcc are you using?

David Shrewsbury (dshrews) wrote :

$ gcc --version
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5659)

David Shrewsbury (dshrews) wrote :

Configured with the following tools:
  * libtoolize (GNU libtool) 2.2.6b
  * aclocal (GNU automake) 1.11.1
  * autoheader (GNU Autoconf) 2.65
  * automake (GNU automake) 1.11.1
  * autoconf (GNU Autoconf) 2.65

David Shrewsbury (dshrews) wrote :

Just verified with revno 1518 that this occurs only when configured with --with-debug option.

This passes all tests: ./configure --with-lib-prefix=/opt/local

This will fail on subselect_notembedded: ./configure --with-lib-prefix=/opt/local --with-debug

Brian Aker (brianaker) wrote :

Hi!

Is this on assert on field? Known issue, I am not sure if there is a bug already open for it.

Cheers,
 -Brian

On May 4, 2010, at 6:52 AM, David Shrewsbury wrote:

> Just verified with revno 1518 that this occurs only when configured with
> --with-debug option.
>
> This passes all tests: ./configure --with-lib-prefix=/opt/local
>
> This will fail on subselect_notembedded: ./configure --with-lib-
> prefix=/opt/local --with-debug
>
> --
> Failing the test subselect_notembedded
> https://bugs.launchpad.net/bugs/532481
> You received this bug notification because you are a member of Drizzle-
> developers, which is subscribed to Drizzle.
>
> Status in A Lightweight SQL Database for Cloud and Web: Incomplete
> Status in Drizzle cherry series: Incomplete
>
> Bug description:
> make test failed in testing subselect_notembedded.
>
>

David Shrewsbury (dshrews) wrote :

Brian,

It doesn't appear to be caused by any assert, but rather a memory access violation.
See the debug output from note #5 in the bug report.

Lee Bieber (kalebral) on 2010-05-05
Changed in drizzle:
status: Incomplete → Confirmed
importance: Undecided → Medium
milestone: 2010-03-29 → none
David Shrewsbury (dshrews) wrote :

Using GuardMalloc ("man Guard_Malloc" for how to enable on OS X), it seems as if the problem may be with the "tuple" parameter (note the "Cannot access memory" message at the first stack frame):

Breakpoint 1, drizzled::mysql_parse (session=0x12bca1e20, inBuf=0x12bcd6fc8 "create table t1(a int,b int,key(a),key(b))", length=42) at drizzled/sql_parse.cc:718
718 lex_start(session);
(gdb) c
Continuing.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x0000000128eb2ff8
0x0000000100347360 in page_cur_search_with_match (block=0x1141022c0, index=0x12be3ee50, tuple=Cannot access memory at address 0x128eb2ff8
) at plugin/innobase/page/page0cur.c:269
269 {
(gdb) where 5
#0 0x0000000100347360 in page_cur_search_with_match (block=0x1141022c0, index=0x12be3ee50, tuple=Cannot access memory at address 0x128eb2ff8
) at plugin/innobase/page/page0cur.c:269
#1 0x00000001002d9e5f in btr_cur_search_to_nth_level (index=0x12be3ee50, level=0, tuple=0x13f894f70, mode=1, latch_mode=1, cursor=0x128eb57b0, has_search_latch=0, mtr=0x128eb5860) at plugin/innobase/btr/btr0cur.c:601
#2 0x00000001002e0b17 in btr_estimate_n_rows_in_range (index=0x12be3ee50, tuple1=0x13f894f70, mode1=1, tuple2=0x13f894fb8, mode2=2) at plugin/innobase/btr/btr0cur.c:3101
#3 0x00000001002c1004 in ha_innobase::records_in_range (this=0x12bf7ec38, keynr=0, min_key=0x128eb5e60, max_key=0x0) at plugin/innobase/handler/ha_innodb.cc:6294
#4 0x000000010006d381 in drizzled::Cursor::multi_range_read_info_const (this=0x12bf7ec38, keyno=0, seq=0x128eb61e0, seq_init_param=0x128eb5f40, bufsz=0x128eb62d8, flags=0x128eb62dc, cost=0x128eb6270) at drizzled/cursor.cc:963
(More stack frames follow...)
Current language: auto; currently c
(gdb)

Brian Aker (brianaker) wrote :

What version of OSX are you using?

Thanks,
   -Brian

David Shrewsbury (dshrews) wrote :

Brian,

The most recent OS... Snow Leopard (10.6) with all updates applied.

David Shrewsbury (dshrews) wrote :

The query that kills it has 28 subselects (see below). One less and it works. To reproduce this without going through the test-run nonsense...

create database test;
use test;
create table t1(a int, b int, key(a), key(b));
insert into t1(a,b) values (1,2),(2,1),(2,3),(3,4),(5,4),(5,5),(6,7),(7,4),(5,3);
select sum(a),a from t1 where a> (
 select sum(a) from t1 where a> (
  select sum(a) from t1 where a> (
   select sum(a) from t1 where a> (
    select sum(a) from t1 where a> (
     select sum(a) from t1 where a> (
      select sum(a) from t1 where a> (
       select sum(a) from t1 where a> (
        select sum(a) from t1 where a> (
         select sum(a) from t1 where a> (
          select sum(a) from t1 where a> (
           select sum(a) from t1 where a> (
            select sum(a) from t1 where a> (
             select sum(a) from t1 where a> (
              select sum(a) from t1 where a> (
               select sum(a) from t1 where a> (
                select sum(a) from t1 where a> (
                 select sum(a) from t1 where a> (
                  select sum(a) from t1 where a> (
                   select sum(a) from t1 where a> (
                    select sum(a) from t1 where a> (
                     select sum(a) from t1 where a> (
                      select sum(a) from t1 where a> (
                       select sum(a) from t1 where a> (
                        select sum(a) from t1 where a> (
                         select sum(a) from t1 where a> (
                          select sum(a) from t1 where a> (
                           select sum(a) from t1 where a> (
                            select sum(a) from t1 where a> ( select sum(a) from t1 ) group by b limit 1 )
                           group by b limit 1 )
                          group by b limit 1 )
                         group by b limit 1 )
                        group by b limit 1 )
                       group by b limit 1 )
                      group by b limit 1 )
                     group by b limit 1 )
                    group by b limit 1 )
                   group by b limit 1 )
                  group by b limit 1 )
                 group by b limit 1 )
                group by b limit 1 )
               group by b limit 1 )
              group by b limit 1 )
             group by b limit 1 )
            group by b limit 1 )
           group by b limit 1 )
          group by b limit 1 )
         group by b limit 1 )
        group by b limit 1 )
       group by b limit 1 )
      group by b limit 1 )
     group by b limit 1 )
    group by b limit 1 )
   group by b limit 1 )
  group by b limit 1 )
 group by b limit 1 )
group by a;

Patrick Crews (patrick-crews) wrote :

This is the crash that is causing us to fail on hades:
http://hudson.drizzle.org/job/drizzle-build-hades/

I've only seen it pass with one revision=1663. The revisions I saw that preceded and followed this one failed in the same manner.

Patrick Crews (patrick-crews) wrote :

UPDATE: I just retested revision #1663 and it failed. It seems that we maybe got lucky on that one? It was the last time the test passed in hudson.

Patrick Crews (patrick-crews) wrote :

Pinpointed the failure to revision#= 1660.1.4 via bzr bisect runs against subselect_notembedded

Running make clean
On revision 1660.1.4 (brian@gaz-20100716155319-yseq7s87lg2sf193):
Merge in fixes for PBMS

It should be noted that revison#= 1660.1.3 failed to build and this *might* have been introduced then.
On revision 1660.1.3 (brian@gaz-20100716054008-cfcvyu0akpwqnwqd):
Encapsulate Table in field

Bug was *not* present in 1660.1.2.

Patrick Crews (patrick-crews) wrote :

I would say that 1660.1.3 introduced this bug as 1660.1.4 was only changes to the plugin/pbms files.
The crash wasn't detected by bisect as the build failed for this build.
However, look to the changes in this revision id to find the culprit.

bzr diff -c 1660.1.3

suspect:

=== modified file 'drizzled/sql_base.cc'
--- drizzled/sql_base.cc 2010-06-29 22:50:01 +0000
+++ drizzled/sql_base.cc 2010-07-16 05:40:08 +0000
@@ -2500,7 +2500,7 @@ find_field_in_natural_join(Session *sess
     return NULL;
   {
     /* This is a base table. */
- assert(nj_col->table_ref->table == nj_col->table_field->table);
+ assert(nj_col->table_ref->table == nj_col->table_field->getTable()); <--shouldn't it be assert(nj_col->table_ref->getTable()?
     found_field= nj_col->table_field;
     update_field_dependencies(session, found_field, nj_col->table_ref->table);
   }

Patrick Crews (patrick-crews) wrote :

I was wrong on the suspect code, changing it broke build : ) However, it was worth a guess.

Brian Aker (brianaker) wrote :

Hi!

On Jul 29, 2010, at 3:17 PM, Patrick Crews wrote:

> - assert(nj_col->table_ref->table == nj_col->table_field->table);
> + assert(nj_col->table_ref->table == nj_col->table_field->getTable()); <--shouldn't it be assert(nj_col->table_ref->getTable()?

That would be it :)

I just love pointers.

Cheers,
 -Brian

Brian Aker (brianaker) wrote :

Suck :)

Oh well, I was wondering how it had ever worked if that change was bad.

On Jul 29, 2010, at 3:21 PM, Patrick Crews wrote:

> I was wrong on the suspect code, changing it broke build : ) However,
> it was worth a guess.
>
> --
> Failing the test subselect_notembedded
> https://bugs.launchpad.net/bugs/532481
> You received this bug notification because you are a member of Drizzle-
> developers, which is subscribed to Drizzle.
>
> Status in A Lightweight SQL Database for Cloud and Web: Confirmed
> Status in Drizzle cherry series: Won't Fix
> Status in Drizzle dexter series: Confirmed
>
> Bug description:
> make test failed in testing subselect_notembedded.
>
>

Patrick Crews (patrick-crews) wrote :

From what I've seen in the changes for 1660.1.3, it doesn't seem like it is an obvious change (but my C++ leaves something to be desired).

The bulk of the changes are moving from the use of table to getTable().

My thinking is that this somehow introduced a more subtle change. It should also be noted that I'm not seeing this crash at all on my Mac (10.5.8), using macports' boost

I use: LDFLAGS=-L/opt/local/lib ./configure --with-libprotobuf-prefix=/opt/local \
--with-libpcre-prefix=/opt/local --with-libevent-prefix=/opt/local

Configured with the following tools:
  * libtoolize (GNU libtool) 2.2.10
  * aclocal (GNU automake) 1.11.1
  * autoheader (GNU Autoconf) 2.65
  * automake (GNU automake) 1.11.1
  * autoconf (GNU Autoconf) 2.65

Brian Aker (brianaker) wrote :

Hi!

On Jul 30, 2010, at 10:23 AM, Patrick Crews wrote:

> My thinking is that this somehow introduced a more subtle change. It
> should also be noted that I'm not seeing this crash at all on my Mac
> (10.5.8), using macports' boost

You mean you don't see the crash using macports with the latest tree?

Cheers,
 -Brian

Patrick Crews (patrick-crews) wrote :

Yes, that is correct. Or at least it was the last time I ran on my machine
(about a week ago). I'm re-building / testing now and will update if things
have changed.

On Fri, Jul 30, 2010 at 2:04 PM, Brian Aker <email address hidden> wrote:

> Hi!
>
> On Jul 30, 2010, at 10:23 AM, Patrick Crews wrote:
>
> > My thinking is that this somehow introduced a more subtle change. It
> > should also be noted that I'm not seeing this crash at all on my Mac
> > (10.5.8), using macports' boost
>
> You mean you don't see the crash using macports with the latest tree?
>
> Cheers,
> -Brian
>
> --
> Failing the test subselect_notembedded
> https://bugs.launchpad.net/bugs/532481
> You received this bug notification because you are a bug assignee.
>
> Status in A Lightweight SQL Database for Cloud and Web: Confirmed
> Status in Drizzle cherry series: Won't Fix
> Status in Drizzle dexter series: Confirmed
>
> Bug description:
> make test failed in testing subselect_notembedded.
>
>
>

David Shrewsbury (dshrews) wrote :

Ok, so this is odd... and not quite sure what it means yet. Maybe a stack frame size issue since I think some recursion is going on here for handling the subselects.

The server is crashing in the prolog for the page_cur_search_with_match() function.
The access violation occurs when attempting to preserve the value of the %rdi register
by pushing it onto the stack, but dereferencing the frame pointer causes the access
violation (%rip register contains the address of the instruction that caused it... see below).

Caused by: -0x3b8(%rbp) which is (0x10ee8c380 - 0x3b8 = 0x10ee8bfc8)

Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Codes: KERN_PROTECTION_FAILURE at 0x000000010ee8bfc8

(gdb) frame
#0 0x000000010038b056 in page_cur_search_with_match (block=0x0, index=0x0, tuple=0x0, mode=0, iup_matched_fields=0x0, iup_matched_bytes=0x0, ilow_matched_fields=0x10ee8c788, ilow_matched_bytes=0x10ee8c780, cursor=0x10ee8e778) at plugin/innobase/page/page0cur.c:269
269 {

(gdb) info registers
rax 0x10ee8c788 4545103752
rbx 0x3f 63
rcx 0x1 1
rdx 0x10b055f90 4479868816
rsi 0x10b040bc0 4479781824
rdi 0x102a01870 4339013744
rbp 0x10ee8c380 0x10ee8c380
rsp 0x10ee8bf90 0x10ee8bf90
r8 0x10ee8c798 4545103768
r9 0x10ee8c790 4545103760
r10 0x10b040bc0 4479781824
r11 0x102a01870 4339013744
r12 0x1018fe9f8 4321176056
r13 0x1018fdc00 4321172480
r14 0x102128950 4329736528
r15 0x1018fdc00 4321172480
rip 0x10038b056 0x10038b056 <page_cur_search_with_match+11>
eflags 0x10206 66054
cs 0x27 39
ss 0x0 0
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

(gdb) disass
Dump of assembler code for function page_cur_search_with_match:
0x000000010038b04b <page_cur_search_with_match+0>: push %rbp
0x000000010038b04c <page_cur_search_with_match+1>: mov %rsp,%rbp
0x000000010038b04f <page_cur_search_with_match+4>: sub $0x3f0,%rsp
0x000000010038b056 <page_cur_search_with_match+11>: mov %rdi,-0x3b8(%rbp) <-- PROBLEM INSTRUCTION
0x000000010038b05d <page_cur_search_with_match+18>: mov %rsi,-0x3c0(%rbp)
0x000000010038b064 <page_cur_search_with_match+25>: mov %rdx,-0x3c8(%rbp)
0x000000010038b06b <page_cur_search_with_match+32>: mov %rcx,-0x3d0(%rbp)
0x000000010038b072 <page_cur_search_with_match+39>: mov %r8,-0x3d8(%rbp)
0x000000010038b079 <page_cur_search_with_match+46>: mov %r9,-0x3e0(%rbp)
0x000000010038b080 <page_cur_search_with_match+53>: movq $0x0,-0x80(%rbp)
0x000000010038b088 <page_cur_search_with_match+61>: lea -0x3b0(%rbp),%rax
0x000000010038b08f <page_cur_search_with_match+68>: mov %rax,-0x88(%rbp)
0x000000010038b096 <page_cur_search_with_match+75>: lea -0x3b0(%rbp),%rdi
0x000000010038b09d <page_cur_search_with_match+82>: mov $0x64,%esi
0x000000010038b0a2 <page_cur_search_with_match+87>: callq 0x10038b526 <rec_offs_set_n_alloc>

David Shrewsbury (dshrews) wrote :

An additional run and using vmmap seems to confirm that the thread is running out of writable stack and attempting to store into the non-writable stack area:

Exception Codes: KERN_PROTECTION_FAILURE at 0x000000010e53ffc8

frame pointer (%rbp) = 0x10e540380
-0x3b8(%rbp) = 0x10e53ffc8

Virtual Memory Map of process 16452 (drizzled)
Output report format: 2.2 -- 64-bit process

==== Non-writable regions for process 16452
Stack 000000010e53f000-000000010e540000 [ 4K] r--/rwx SM=PRV thread 15

==== Writable regions for process 16452
Stack 000000010e540000-000000010e5c2000 [ 520K] rw-/rwx SM=PRV

David Shrewsbury (dshrews) wrote :

So, the OS X default thread stack size is 512K. Linux default thread stack size appears to be 8M.

Confirmed that increasing thread stack size at pthread_create() time fixes this on OS X. I just increased it to 8M for the purpose of this test.

However, our scheduler plugin(s) does not handle the --thread_stack option, which is a bug.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers