Delete table/re-add table can cause corruption, was: Testing TPCC-MYSQL gets odd db/index corruption

Bug #389495 reported by MadMatt
This bug affects 2 people
Affects: WaffleGrid
Status: In Progress
Importance: Critical
Assigned to: MadMatt

Bug Description

Everything loads fine...

Memcached Blocks missed 96096 : Total Time: 45469 (ms) : Avg Time : 47 (us)
Memcached Blocks set norm 1 1001 : Total Time: 508614 (us): Avg Time : 508 (us)
Memcached Blocks set norm 1 96096 : Total Time: 500398 (ms): Avg Time : 520 (us)
090619 9:18:41 InnoDB: Error: space id and page n:o stored in the page
InnoDB: read in are 27:0, should be 27:12618!
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 12618.
InnoDB: You may have to recover from a backup.
090619 9:18:41 InnoDB: Page dump in ascii and hex (16384 bytes):
...

Not sure if this is the first get or something else.

MadMatt (yonkovim)
Changed in wafflegrid:
assignee: nobody → MadMatt (yonkovim)
importance: Undecided → Critical
status: New → Confirmed
Revision history for this message
MadMatt (yonkovim) wrote :

This error occurred after I ran a full waffle test without memcached...

I then dropped the TPCC database

I then reimported the tpcc database

root@frankenmatt:/benchmarks/tpcc-mysql# mysql -e"set @@global.innodb_memcached_print_stats=1"
root@frankenmatt:/benchmarks/tpcc-mysql# mysql -e"set @@global.innodb_memcached_servers='localhost'"
root@frankenmatt:/benchmarks/tpcc-mysql# mysql -e"set @@global.innodb_memcached_enable=1"

I then ran the count.sql to help warm the cache

I then started the tpcc benchmark.

tpcc crashed mysqld right away

This was cream revision 7.

Revision history for this message
MadMatt (yonkovim) wrote :

Verified after a restart that following these steps does not produce the error:

mysql -e"set @@global.innodb_memcached_print_stats=1"
mysql -e"set @@global.innodb_memcached_servers='localhost'"
mysql -e"set @@global.innodb_memcached_enable=1"
./warmcache.sh
mysql tpcc < count.sql
mysql
./warmcache.sh

---------
MEMCACHED
---------
Memcached puts 170217
Memcached hits 73936
Memcached misses 170228
Memcached Prefix: 27049
Memcached Get Total Lat 85 (us)
Memcached Get Recent Lat 76 (us)
Memcached Miss Total Lat 45 (us)
Memcached Miss Recent Lat 35 (us)
Memcached Set Total Lat 528 (us)
Memcached Set Recent Lat 484 (us)
--------------

So memcached pages are going back and forth...

But as soon as a tpcc run starts:
nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 900 > tpcc_waffle.out 2>&1 &
[2] 27659
root@frankenmatt:/benchmarks/tpcc-mysql# 090619 09:38:47 mysqld_safe Number of processes running now: 0
090619 09:38:47 mysqld_safe mysqld restarted

[2]+ Segmentation fault nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 900 > tpcc_waffle.out 2>&1
root@frankenmatt:/benchmarks/tpcc-mysql#

error:
Memcached Blocks read 71071 : Total Time: 60609 (ms) : Avg Time : 85 (us)
Memcached Blocks read 1001 : Total Time: 82484 (us) : Avg Time : 82 (us)
Memcached Blocks read 72072 : Total Time: 61434 (ms) : Avg Time : 85 (us)
Memcached Blocks read 1001 : Total Time: 76644 (us) : Avg Time : 76 (us)
Memcached Blocks read 73073 : Total Time: 62200 (ms) : Avg Time : 85 (us)
[0x7a5025]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7fefbb673716]
[0x403a06b0]
090619 9:38:46 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=16
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x7fef7805beb0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went

Revision history for this message
MadMatt (yonkovim) wrote :

Strange...

I wiped the tpcc database, shutdown mysql, started mysql, reloaded the tpcc db and ran through the same steps... and it works. No error so far.

Revision history for this message
MadMatt (yonkovim) wrote :

<TpmC>
                 9427.333 TpmC

hmmmmm...

Just to be sure this is not something to do with dropping tables:

Turn waffle on...
Run TPCC
Drop Database
load dump
Run tpcc...

root@frankenmatt:/benchmarks/tpcc-mysql# nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 300 > tpcc_waffle2.out 2>&1 &
[2] 29865
root@frankenmatt:/benchmarks/tpcc-mysql# 090619 10:41:59 mysqld_safe Number of processes running now: 0
090619 10:41:59 mysqld_safe mysqld restarted

[2]+ Segmentation fault nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 300 > tpcc_waffle2.out 2>&1

crashed it with:

Memcached (write) set: Block: 6e06:1b:0 : Thread : 1089898832 time:129
Memcached (write) set: Block: 6e06:1b:1 : Thread : 1089898832 time:132
Memcached (write) set: Block: 6e06:1b:2 : Thread : 1089898832 time:126
Memcached (write) set: Block: 6e06:1b:3 : Thread : 1089898832 time:128
[0x7a5025]
[0x4ce335]
[0x656ac0]
[0x42af7d]
[0x50dc21]
[0x4d8b6d]
[0x557966]
[0x55956a]
[0x4e4aae]
[0x4e5e2e]
[0x4d84bd]
[0x65311d]
[0x8796d9]
Memcached get: Block: 6e06:17:3 : Thread : 1080105296 time:394
Memcached get: Block: 6e06:17:1 : Thread : 1080105296 time:128
Memcached get: Block: 6e06:17:13e : Thread : 1080105296 time:125
Memcached get: Block: 6e06:18:3 : Thread : 1080105296 time:141
Memcached get: Block: 6e06:18:1 : Thread : 1080105296 time:118
Memcached get: Block: 6e06:13:3 : Thread : 1080105296 time:147
Memcached get: Block: 6e06:13:1 : Thread : 1080105296 time:120
Memcached get: Block: 6e06:13:3e : Thread : 1080105296 time:118
Memcached get: Block: 6e06:13:5558 : Thread : 1080105296 time:121
Memcached get: Block: 6e06:13:4001 : Thread : 1080105296 time:159
090619 10:41:59 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=17
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x7f4da80bc450
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x4102af70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x7f4da827bec0 is an invalid pointer
thd->thread_id=46
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

MadMatt (yonkovim)
summary: - Testing TPCC-MYSQL get odd db/index coruption
+ Delete table/readd table can cause corruption, was: Testing TPCC-MYSQL
+ get odd db/index coruption
Revision history for this message
MadMatt (yonkovim) wrote :

Ahhhh, this is a problem with a delete and re-add.

To repeat the steps:

Create two databases whose combined size fits into memcached, but of which only one can fit into the internal buffer pool at any one time. Make sure all the data is in memcached. Drop one of the databases. Then load the other database into the internal buffer pool. Then re-add the dropped database. This will work fine... I can even select data out of the database perfectly fine... But when I start up a dbt2/tpcc run, it crashes.
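As I read it, the failure mode is that the L2 cache keys carry only (space id, page no), so a DROP TABLE leaves stale pages in memcached that a re-imported tablespace reusing the same space id can then hit. A minimal model of that, with entirely hypothetical names (this is not WaffleGrid code):

```c
/* Illustrative model of the suspected failure mode: a page cache keyed
 * only by (space_id, page_no) survives a DROP TABLE, so a re-imported
 * tablespace that reuses the same space id can read stale pages.
 * All names here are hypothetical. */
#include <assert.h>
#include <string.h>

#define CACHE_SLOTS 64

typedef struct {
    int           used;
    unsigned long space_id;
    unsigned long page_no;
    char          data[16];   /* stand-in for a 16 KB InnoDB page */
} cache_entry_t;

static cache_entry_t cache[CACHE_SLOTS];

/* trivial direct-mapped placement, enough for the demonstration */
static cache_entry_t *find_slot(unsigned long space, unsigned long page)
{
    return &cache[(space * 31 + page) % CACHE_SLOTS];
}

static void cache_set(unsigned long space, unsigned long page,
                      const char *data)
{
    cache_entry_t *e = find_slot(space, page);
    e->used = 1;
    e->space_id = space;
    e->page_no = page;
    strncpy(e->data, data, sizeof(e->data) - 1);
    e->data[sizeof(e->data) - 1] = '\0';
}

/* returns NULL on a miss */
static const char *cache_get(unsigned long space, unsigned long page)
{
    cache_entry_t *e = find_slot(space, page);
    if (e->used && e->space_id == space && e->page_no == page)
        return e->data;
    return NULL;
}
```

A get after the "drop" still returns the old table's page, which would line up with the InnoDB complaint at the top of this report (page read in says 27:0, should be 27:12618).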

[0x7a5025]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7fbd59158716]
[0x442ced30]
090619 11:28:42 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=32
max_threads=151
threads_connected=32
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x3023540
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x442d7f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x302e950 = SELECT c_id
FROM customer
WHERE c_w_id = 1
  AND c_d_id = 5
  AND c_last = 'BARABLEABLE'
ORDER BY c_first ASC
thd->thread_id=121
thd->killed=NOT_KILLED

Revision history for this message
MadMatt (yonkovim) wrote :

Both the LRU and NONLRU code exhibit this issue.

Here is the error from the LRU version:

Memcached get: Block: 26b3:38:32 : Thread : 1150749008 time:325
Memcached set: Block: 26b3:a:6a30 : Thread : 1147554128 time:283
Memcached get: Block: 26b3:3c:1 : Thread : 1147554128 time:198
090619 18:20:45 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=18
max_threads=151
threads_connected=17
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
[0x7a5025]
[0x4ce335]
[0x656ac0]
[0x401214]
[0x557c5d]
[0x55956a]
[0x4e4aae]
[0x4e5e2e]
[0x4d84bd]
[0x65311d]
[0x8796d9]

thd: 0x7ffa1452f1b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x447e9f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x7ffa103df4d0 is an invalid pointer
thd->thread_id=196
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Memcached set: Block: 26b3:a:6a31 : Thread : 1147554128 time:144

**************
The quick solution here is actually to force file_per_table and then, at every space init, increment a prefix to be used with memcached.

This is of course a complete hack.

The real solution is to clear memcached on a table drop... the question is how to effectively do this?
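One way to sketch the prefix idea above: fold a per-space generation counter into the memcached key, and bump it on every drop/space init. Old keys become unreachable and simply age out of memcached, so nothing has to be cleared explicitly. Names and sizes here are my assumptions, not WaffleGrid code:

```c
/* Sketch of the prefix idea: fold a per-space generation number into the
 * memcached key. Bumping the generation on DROP/space init makes every
 * old key unreachable without touching memcached itself (stale entries
 * just age out under LRU). Names and sizes are hypothetical. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SPACES 50000   /* mirrors the 50K hash size with file_per_table */

static unsigned long space_generation[MAX_SPACES];

/* call on every DROP / space re-init to invalidate cached pages */
static void space_bump_generation(unsigned long space_id)
{
    space_generation[space_id % MAX_SPACES]++;
}

/* build the memcached key: generation:space:page */
static void make_key(char *buf, size_t len,
                     unsigned long space_id, unsigned long page_no)
{
    snprintf(buf, len, "%lu:%lu:%lu",
             space_generation[space_id % MAX_SPACES], space_id, page_no);
}
```

The trade-off is memcached capacity: stale entries still occupy memory until the LRU evicts them.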

Revision history for this message
MadMatt (yonkovim) wrote :

Looking for an easy place to include or track deleted/reused space ids... Got all the way down to the file level here:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
/* The tablespace memory cache; also the totality of logs = the log data space,
is stored here; below we talk about tablespaces, but also the ib_logfiles
form a 'space' and it is handled here */

typedef struct fil_system_struct fil_system_t;
struct fil_system_struct {
 mutex_t mutex; /* The mutex protecting the cache */
 hash_table_t* spaces; /* The hash table of spaces in the
     system; they are hashed on the space
     id */
 hash_table_t* name_hash; /* hash table based on the space
     name */
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Being an un-programmer, I was curious whether the hash table implementation in InnoDB was dynamic (built on a linked list)...
So I found where it gets created.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 system->spaces = hash_create(hash_size);
 system->name_hash = hash_create(hash_size);
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So they are not. But looking at this made me scratch my head a bit. If I am reading this 100% correctly, InnoDB pre-allocates a fixed number of hash cells to track tablespaces, and the memory footprint for tracking them is allocated up front. So we could create a hash or an array to track tables then, right? So let's look at the size:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fil_init(
 /*=====*/
 ulint max_n_open) /* in: max number of open files */
 {
 ulint hash_size;

 ut_a(fil_system == NULL);

 if (srv_file_per_table) {
  hash_size = 50000;
 } else {
  hash_size = 5000;
 }

 fil_system = fil_system_create(hash_size, max_n_open);
 }
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So with file_per_table turned on, the hash table is allocated 50K cells. Is an array more efficient than a hash table? I would think so, as the array is more primitive, but I am not 100% sure.

I thought about adding the prefix hash/array to fil_system_struct; however, it is not accessed during the normal LRU/get process, so we would have to add more steps to get it to work. But it seems like the place it should go. Thoughts? Should we just make this a hash of 50K and keep it consistent, or should we make this an array outside the struct?

NOTE: both the LRU and NONLRU code crash after table deletes. It's less likely in LRU mode, but not altogether eliminated (33%).
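For reference, here is roughly how a fixed-cell chained hash in the style of hash_create() behaves, as far as I can tell: the cell array is pre-allocated (50K cells with file_per_table), but entries chain within a cell, so the cell count bounds the bucket array, not the number of spaces tracked. A hypothetical sketch (no freeing, for brevity), not the InnoDB implementation:

```c
/* Minimal fixed-bucket chained hash, patterned after InnoDB's
 * hash_create(): the bucket array is preallocated, but entries chain
 * within a bucket, so the number of tracked spaces is not capped at
 * the bucket count. Hypothetical sketch only. */
#include <assert.h>
#include <stdlib.h>

typedef struct space_node {
    unsigned long      space_id;
    struct space_node *next;     /* collision chain within a bucket */
} space_node_t;

typedef struct {
    unsigned long  n_cells;
    space_node_t **cells;        /* preallocated bucket array */
} space_hash_t;

static space_hash_t *space_hash_create(unsigned long n_cells)
{
    space_hash_t *h = malloc(sizeof *h);
    h->n_cells = n_cells;
    h->cells = calloc(n_cells, sizeof *h->cells);
    return h;
}

static void space_hash_insert(space_hash_t *h, unsigned long space_id)
{
    space_node_t *n = malloc(sizeof *n);
    n->space_id = space_id;
    n->next = h->cells[space_id % h->n_cells];  /* push onto the chain */
    h->cells[space_id % h->n_cells] = n;
}

static int space_hash_contains(const space_hash_t *h, unsigned long space_id)
{
    const space_node_t *n = h->cells[space_id % h->n_cells];
    for (; n != NULL; n = n->next)
        if (n->space_id == space_id)
            return 1;
    return 0;
}
```

So a 50K hash for the prefixes would stay consistent with what fil0fil.c already does; a flat array would only be simpler if space ids are dense and bounded.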

Revision history for this message
MadMatt (yonkovim) wrote :

I think ALTER TABLE is also going to be an issue here as well...
This works via:
ha_innobase::discard_or_import_tablespace()

I am trying to trace back from the table create to where the space id is assigned, but I cannot find it :(

Revision history for this message
MadMatt (yonkovim) wrote :

Space ids do not appear to be reused! I am verifying now.

Revision history for this message
MadMatt (yonkovim) wrote :

Actually, I think the problem has to do with this error:

[0x7a5135]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7f4b72cac836]
[0x44d49cf0]

Revision history for this message
MadMatt (yonkovim) wrote :

ok reproduced the error differently...
imported tpcc...

enabled memcached...

kicked off 1 tpcc run:
./tpcc_start localhost tpcc tpcc tpcc 20 64 10 300

Server: localhost (11211)
  pid: 21526
  uptime: 2846
  time: 1246225919
  version: 1.4.0-rc1
  pointer_size: 64
  rusage_user: 6.24376
  rusage_system: 19.433214
  curr_items: 246988
  total_items: 247112
  bytes: 4068732716
  curr_connections: 70
  total_connections: 112
  connection_structures: 71
  cmd_get: 290855
  cmd_set: 247112
  get_hits: 147393
  get_misses: 143462
  evictions: 0
  bytes_read: 4068732716
  bytes_written: 4068732716
  limit_maxbytes: 4294967296
  threads: 5

Kill that process

restarted it :

./tpcc_start localhost tpcc tpcc tpcc 20 64 10 300

in new single space create!!! #1 passed in id :1095197368
in new single space create!!! #2 passed in id :1095197368
in new single space create!!! #3 passed in id :1095197368
in space create!!! passed in id :18
in space create!!! Middle in the create id now :18
in space create!!! Lower in the create id now :18
in new single space create!!! #4 end passed in id :1095197368
[0x7a5135]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7f5c01667836]
[0x446516b0]
090628 17:52:10 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=64
max_threads=151
threads_connected=64
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x4391490
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x44654f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x4a577a0 = UPDATE customer SET c_balance = c_balance + ? , c_delivery_cnt = c_delivery_cnt + 1 WHERE c_id = ? AND c_d_id = ? AND c_w_id = ?
thd->thread_id=186
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
090628 17:52:10 mysqld_safe Number of processes running now: 0
090628 17:52:10 mysqld_safe mysqld restarted
InnoDB: The InnoDB memory heap has been disabled.
InnoDB: Mutex and rw_lock use atomics.
InnoDB: ios_per_array 256 read threads 1 write threads 1
in space create!!! passed in id :0
in space create!!! Middle in the create id now :0
in space create!!! Lower in the create id now :0
in space create!!! passed in id :4294967280
in space create!!! Middle in the create id now :4294967280
in space create!!! Lower in the create id now :4294967280

Need to upgrade t...


Revision history for this message
MadMatt (yonkovim) wrote :

Adding the general log from the crash...

Revision history for this message
MadMatt (yonkovim) wrote :

I have been trying to reproduce with dbt2, to no avail. Also, tpcc works without any crash if waffle is disabled. This has me nervous.

Revision history for this message
MadMatt (yonkovim) wrote :

I can confirm sysbench fails now.

in space create!!! Lower in the create id now :24
in new single space create!!! #4 end passed in id :1079176184
[0x7a5135]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7f958853a836]
[0x4021edc0]
090628 20:45:29 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=65
max_threads=151
threads_connected=17
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_t

Revision history for this message
MadMatt (yonkovim) wrote :

Something is really really amiss here....

Could this be a stack problem? Using libmemcached 0.30 + memcached 1.2.8 or 1.4.0-rc... MySQL 5.4...

for a single threaded oltp test connecting to localhost:

---------
MEMCACHED
---------
Memcached puts 86595
Memcached hits 19595
Memcached misses 86579
Memcached Prefix: 18126
Memcached Get Total Lat 303 (us)
Memcached Get Recent Lat 202 (us)
Memcached Miss Total Lat 115 (us)
Memcached Miss Recent Lat 58 (us)
Memcached Set Total Lat 38043 (us)
Memcached Set Recent Lat 4917 (us)
--------------

5 second set latency... something is terribly terribly wrong here!!!

Revision history for this message
MadMatt (yonkovim) wrote :

Never mind... this was my bug... the set time included the retrieval from disk ... doooh....

Revision history for this message
MadMatt (yonkovim) wrote :

Attaching an error.log with extra debugging... I was trying to see if one of the open connections to memcached caused the issue... does not look like it... here is the code used to generate this:

buf_read_page_low(
/*==============*/
   /* out: 1 if a read request was queued, 0 if the page
   already resided in buf_pool, or if the page is in
   the doublewrite buffer blocks in which case it is never
   read into the pool, or if the tablespace does not
   exist or is being dropped */
 ulint* err, /* out: DB_SUCCESS or DB_TABLESPACE_DELETED if we are
   trying to read from a non-existent tablespace, or a
   tablespace which is just now being dropped */
 ibool sync, /* in: TRUE if synchronous aio is desired */
 ulint mode, /* in: BUF_READ_IBUF_PAGES_ONLY, ...,
   ORed to OS_AIO_SIMULATED_WAKE_LATER (see below
   at read-ahead functions) */
 ulint space, /* in: space id */
 ib_longlong tablespace_version, /* in: if the space memory object has
   this timestamp different from what we are giving here,
   treat the tablespace as dropped; this is a timestamp we
   use to stop dangling page reads from a tablespace
   which we have DISCARDed + IMPORTed back */
 ulint offset) /* in: page number */
{
 buf_block_t* block;
 ulint wake_later;
 char tmpkey[40];
        /* Variables for WaffleGrid, the innodb_buffer_pool layer 2 cache */
        MEMC_CONN_T *memct;
        ulint sec,fsec,fmsec,msec;
        memcached_server_st *mcservers;
 *err = DB_SUCCESS;

 wake_later = mode & OS_AIO_SIMULATED_WAKE_LATER;
 mode = mode & ~OS_AIO_SIMULATED_WAKE_LATER;

 if (trx_doublewrite && space == TRX_SYS_SPACE
     && ( (offset >= trx_doublewrite->block1
      && offset < trx_doublewrite->block1
      + TRX_SYS_DOUBLEWRITE_BLOCK_SIZE)
     || (offset >= trx_doublewrite->block2
         && offset < trx_doublewrite->block2
         + TRX_SYS_DOUBLEWRITE_BLOCK_SIZE))) {
  ut_print_timestamp(stderr);
  fprintf(stderr,
   " InnoDB: Warning: trying to read"
   " doublewrite buffer page %lu\n",
   (ulong) offset);

  return(0);
 }

 if (ibuf_bitmap_page(offset) || trx_sys_hdr_page(space, offset)) {

  /* Trx sys header is so low in the latching order that we play
  safe and do not leave the i/o-completion to an asynchronous
  i/o-thread. Ibuf bitmap pages must always be read with
  syncronous i/o, to make sure they do not get involved in
  thread deadlocks. */

  sync = TRUE;
 }

 /* The following call will also check if the tablespace does not exist
 or is being dropped; if we succeed in initing the page in the buffer
 pool for read, then DISCARD cannot proceed until the read has
 completed */
 block = buf_page_init_for_read(err, mode, space, tablespace_version,
           offset);
 if (block == NULL) {

  return(0);
 }

#ifdef UNIV_DEBUG
 if (buf_debug_prints) {
  fprintf(stderr,
   "Posting read request for page %lu, sync %lu\n",
   (ulong) offset,
   (ulong) sync);
 }
#endif

 ut_a(block->state == BUF_BLOCK_FILE_PAGE);

 /* set the block in memcached */
 memct = (MEMC_CONN_T*) thr_local_get_memc(os_thread_get_curr_id());

 if (srv_memcached_enable>0) {
            fprintf(stderr,"Memcached miss Enabled ! \n");
                            if (!memct->mc_...


Revision history for this message
Yves Trudeau (y-trudeau) wrote :

The bug is always related to page offset 1. At some point, there is a "set" for offset 1, followed a few operations later by a "get" that causes the crash.

**Memcached set: Block: 2763:4:1 : Thread : 2973268880 time:326
Memcached set: Block: 2763:4:29e : Thread : 2973268880 time:213
Memcached set: Block: 2763:4:2a0 : Thread : 2973268880 time:226
Memcached get: Block: 2763:4:385 : Thread : 2973268880 time:254
**Memcached get: Block: 2763:4:1 : Thread : 2973268880 time:223
[0x83f4821]
[0x810b62f]
[0xb7ff8420]
...

from buf_page_print, the page seems to be related to the insert buffer.

InnoDB: Page may be an insert buffer bitmap page
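If insert buffer bitmap pages are really involved, a filter in the set/get path could skip them. In (pre-plugin) InnoDB, ibuf bitmap pages sit at a fixed offset within every group of pages covered by one extent-descriptor page, which works out to page_no % 16384 == 1 at the default 16 KB page size, if I am reading ibuf_bitmap_page() correctly; treat the constants below as assumptions:

```c
/* Hedged sketch of skipping insert buffer bitmap pages in the L2 cache.
 * The constants assume the default 16 KB page size and the classic
 * ibuf_bitmap_page() layout; verify against ibuf0ibuf.c before use. */
#include <assert.h>

#define PAGES_PER_XDES_PAGE    16384UL /* pages per extent-descriptor page */
#define FSP_IBUF_BITMAP_OFFSET 1UL     /* bitmap page offset in each group */

static int is_ibuf_bitmap_page(unsigned long page_no)
{
    return page_no % PAGES_PER_XDES_PAGE == FSP_IBUF_BITMAP_OFFSET;
}

/* only push ordinary pages into memcached */
static int waffle_should_cache(unsigned long page_no)
{
    return !is_ibuf_bitmap_page(page_no);
}
```

(A later comment notes that excluding bitmap pages alone did not fix the bug, so a filter like this would be hardening, not the root cause.)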

Revision history for this message
Yves Trudeau (y-trudeau) wrote :

Easy way to reproduce... With an 8 MB buffer_pool (hacked 5.4 to allow a smaller BP), load 100000 rows for the sysbench oltp test. Shut down and restart the database and try a mysqldump > /dev/null.

Revision history for this message
Yves Trudeau (y-trudeau) wrote :

Excluding insert buffer bitmap pages does not solve the bug. I found a faulty if statement in buf0flu.c:

if (srv_memcached_enable) {

it has been replaced by:

if (srv_memcached_enable == 1 && block->space > 0) {

in function buf_flush_write_block_low(). Eventually, values 1 and 2 for srv_memcached_enable should get proper #defines.
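The corrected guard plus the suggested #defines might look like the sketch below. The meaning of values 1 and 2 (LRU vs. non-LRU mode, going by earlier comments) is my assumption, and the helper name is hypothetical:

```c
/* What #defines for srv_memcached_enable might look like, together with
 * the corrected write-path guard quoted above. The meaning of values
 * 1 and 2 is assumed (LRU vs. non-LRU mode). */
#include <assert.h>

#define SRV_MEMC_OFF     0
#define SRV_MEMC_LRU     1   /* assumed meaning of value 1 */
#define SRV_MEMC_NON_LRU 2   /* assumed meaning of value 2 */

/* mirrors the fixed if statement in buf_flush_write_block_low: never
 * push pages of the system tablespace (space 0, which holds the
 * doublewrite buffer and insert buffer) into memcached */
static int waffle_flush_should_set(unsigned long srv_memcached_enable,
                                   unsigned long space)
{
    return srv_memcached_enable == SRV_MEMC_LRU && space > 0;
}
```

Keeping the space > 0 test in one helper makes the rule (never cache system-tablespace pages) explicit in a single place.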

Revision history for this message
Yves Trudeau (y-trudeau) wrote :

Since the following procedure causes corruption....

1. start mysql
2. sysbench prepare
3. restart mysql
4. mysqldump

That points to the fact that the ibdata file itself is corrupted, since no reads are done from memcached. The recent hack added code in buf0flu, which handles writes to ibdata. I will focus on buf0flu for now.

Revision history for this message
Yves Trudeau (y-trudeau) wrote :

The LRU code is working from version 2864 of branch lp:~y-trudeau/wafflegrid/cream-5.4-tree

Revision history for this message
Yves Trudeau (y-trudeau) wrote :

The non-LRU code is working with 2864 of branch lp:~y-trudeau/wafflegrid/cream-5.4-tree if mysqld is started with MALLOC_CHECK_=0. Something, somewhere is freed twice; I need to find out what.

Revision history for this message
Yves Trudeau (y-trudeau) wrote :

The bug is resolved in lp:~y-trudeau/wafflegrid/cream-5.4-tree revision 2865.

Revision history for this message
Yves Trudeau (y-trudeau) wrote :
Changed in wafflegrid:
status: Confirmed → Fix Released
Revision history for this message
MadMatt (yonkovim) wrote :

THIS is still not fixed. Verified with your most recent revision...

-rwxr-xr-x 1 root root 7.9M 2009-07-05 18:46 /mysql_home/mysql540-waffle//libexec/mysqld
090705 18:48:37 [Note] /mysql_home/mysql540-waffle//libexec/mysqld: ready for connections.
Version: '5.4.2-beta' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Source distribution
[0x7a4d55]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6[0x7f06e508bbb3]
/lib/libc.so.6(cfree+0x76)[0x7f06e508c0a6]
[0x406d9950]
090705 18:49:10 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=16
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x43d9940
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x406d8f70 thread_stack 0x40000

Changed in wafflegrid:
status: Fix Released → In Progress
Revision history for this message
Yves Trudeau (y-trudeau) wrote :

On Matt's server, I cleaned and recreated the datadir and sysbench now runs fine. Maybe it was a corruption of the ibdata file.

Revision history for this message
MadMatt (yonkovim) wrote :

Yves, still getting crashes...

InnoDB master thread running with io_capacity 800
090707 21:29:43 [Note] Event Scheduler: Loaded 0 events
090707 21:29:43 [Note] /mysql_home/mysql540-waffle//libexec/mysqld: ready for connections.
Version: '5.4.2-beta' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Source distribution
[0x7a4d55]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7f263e517836]
[0x41697cc0]
090707 21:31:23 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=16
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x442cae0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x416a0f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x4910660 = SELECT count(c_id) FROM customer WHERE c_w_id = ? AND c_d_id = ? AND c_last = ?
thd->thread_id=7
thd->killed=NOT_KILLED

Revision history for this message
MadMatt (yonkovim) wrote :

Crash happened after this:

Clean datadir...
Set Waffle off
Load tpcc data...
Run tpcc with no waffle
Waffle Enable ...
Run tpcc with waffle...
   Waffle times unacceptable, 30 sec+ set times!
Kill tpcc
  stop mysql, start mysql with waffle enabled
Run tpcc: crash!!!! Core dump.

[2]+ Segmentation fault (core dumped) ./tpcc_start localhost tpcc tpcc tpcc 20 16 10 600 > waffle_test.out

Revision history for this message
MadMatt (yonkovim) wrote :

Tried this:

Clean datadir...
Set waffle = 0
restart mysql
Load tpcc
shutdown mysql
set waffle = 1
startup mysql
Run tpcc
Crash! same core dump

Revision history for this message
MadMatt (yonkovim) wrote :

Sysbench fails still as well:

shutdown mysql
clean data dir
startup
load sysbench tables
root@frankenmatt:/development/sysbench-0.4.12# sysbench --test=oltp --mysql-table-engine=innodb --mysql-socket=/tmp/mysql.sock --oltp-read-only=on --num-threads=16 --max-requests=100 run --mysql-user=root --mysql-db=sbtest --oltp-table-size=10000000
sysbench 0.4.12: multi-threaded system evaluation benchmark

No DB drivers specified, using mysql
Running the test with following options:
Number of threads: 16

Doing OLTP test.
Running mixed OLTP test
Doing read-only test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Maximum number of requests for OLTP test is limited to 100
Threads started!
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 2 times)
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
(last message repeated 1 times)
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
(last message repeated 1 times)
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute MySQL query: `BEGIN`:
ALERT: Error 2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
Done.

seg fault....
