Bug #389495 “Delete table/readd table can cause corruption, was:...” : Bugs : WaffleGrid

MadMatt (yonkovim) on 2009-06-19

Changed in wafflegrid:
assignee:	nobody → MadMatt (yonkovim)
importance:	Undecided → Critical
status:	New → Confirmed

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-19:

#1

This error did occur after I ran a full waffle test without memcached...

I then dropped the TPCC database

I then reimported the tpcc database

root@frankenmatt:/benchmarks/tpcc-mysql# mysql -e"set @@global.innodb_memcached_print_stats=1"
root@frankenmatt:/benchmarks/tpcc-mysql# mysql -e"set @@global.innodb_memcached_servers='localhost'"
root@frankenmatt:/benchmarks/tpcc-mysql# mysql -e"set @@global.innodb_memcached_enable=1"

I then ran the count.sql to help warm the cache

I then started the tpcc benchmark.

Tpcc crashed mysql right away

This was cream revision 7.

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-19:

#2

Verified after a restart that following these steps does not produce the error:

mysql -e"set @@global.innodb_memcached_print_stats=1"
mysql -e"set @@global.innodb_memcached_servers='localhost'"
mysql -e"set @@global.innodb_memcached_enable=1"
./warmcache.sh
mysql tpcc < count.sql
mysql
./warmcache.sh

---------
MEMCACHED
---------
Memcached puts 170217
Memcached hits 73936
Memcached misses 170228
Memcached Prefix: 27049
Memcached Get Total Lat 85 (us)
Memcached Get Recent Lat 76 (us)
Memcached Miss Total Lat 45 (us)
Memcached Miss Recent Lat 35 (us)
Memcached Set Total Lat 528 (us)
Memcached Set Recent Lat 484 (us)
--------------

So memcached pages are going back and forth...

But as soon as a tpcc:
nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 900 > tpcc_waffle.out 2>&1 &
[2] 27659
root@frankenmatt:/benchmarks/tpcc-mysql# 090619 09:38:47 mysqld_safe Number of processes running now: 0
090619 09:38:47 mysqld_safe mysqld restarted

[2]+ Segmentation fault nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 900 > tpcc_waffle.out 2>&1
root@frankenmatt:/benchmarks/tpcc-mysql#

error:
Memcached Blocks read 71071 : Total Time: 60609 (ms) : Avg Time : 85 (us)
Memcached Blocks read 1001 : Total Time: 82484 (us) : Avg Time : 82 (us)
Memcached Blocks read 72072 : Total Time: 61434 (ms) : Avg Time : 85 (us)
Memcached Blocks read 1001 : Total Time: 76644 (us) : Avg Time : 76 (us)
Memcached Blocks read 73073 : Total Time: 62200 (ms) : Avg Time : 85 (us)
[0x7a5025]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7fefbb673716]
[0x403a06b0]
090619 9:38:46 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=16
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x7fef7805beb0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went

Verified after a restart that following these steps does not produce the error:

mysql -e"set @@global.innodb_memcached_print_stats=1"
mysql -e"set @@global.innodb_memcached_servers='localhost'"
mysql -e"set @@global.innodb_memcached_enable=1"
./warmcache.sh 
mysql tpcc < count.sql 
mysql
./warmcache.sh

---------
MEMCACHED
---------
Memcached puts    170217
Memcached hits    73936
Memcached misses  170228
Memcached Prefix:  27049
Memcached Get Total Lat 85 (us)
Memcached Get Recent Lat 76 (us)
Memcached Miss Total Lat 45 (us)
Memcached Miss Recent Lat 35 (us)
Memcached Set Total Lat 528 (us)
Memcached Set Recent Lat 484 (us)
--------------

So memcached pages are going back and forth...

But as soon as a tpcc:
nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 900 > tpcc_waffle.out 2>&1 &
[2] 27659
root@frankenmatt:/benchmarks/tpcc-mysql# 090619 09:38:47 mysqld_safe Number of processes running now: 0
090619 09:38:47 mysqld_safe mysqld restarted

[2]+  Segmentation fault      nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 900 > tpcc_waffle.out 2>&1
root@frankenmatt:/benchmarks/tpcc-mysql#

error:
Memcached Blocks read 71071 :  Total Time: 60609 (ms) :  Avg Time : 85 (us)
Memcached Blocks read 1001 :  Total Time: 82484 (us) :  Avg Time : 82 (us)
Memcached Blocks read 72072 :  Total Time: 61434 (ms) :  Avg Time : 85 (us)
Memcached Blocks read 1001 :  Total Time: 76644 (us) :  Avg Time : 76 (us)
Memcached Blocks read 73073 :  Total Time: 62200 (ms) :  Avg Time : 85 (us)
[0x7a5025]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7fefbb673716]
[0x403a06b0]
090619  9:38:46 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=16
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x7fef7805beb0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-19:

#3

Strange...

I wiped the tpcc database, shutdown mysql, started mysql, reloaded the tpcc db and ran through the same steps... and it works. No error so far.

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-19:

#4

<TpmC>
9427.333 TpmC

hmmmmm...

Just to be sure this is not something to do with dropping tables:

Turn waffle on...
Run TPCC
Drop Database
load dump
Run tpcc...

root@frankenmatt:/benchmarks/tpcc-mysql# nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 300 > tpcc_waffle2.out 2>&1 &
[2] 29865
root@frankenmatt:/benchmarks/tpcc-mysql# 090619 10:41:59 mysqld_safe Number of processes running now: 0
090619 10:41:59 mysqld_safe mysqld restarted

[2]+ Segmentation fault nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 300 > tpcc_waffle2.out 2>&1

crashed it with:

Memcached (write) set: Block: 6e06:1b:0 : Thread : 1089898832 time:129
Memcached (write) set: Block: 6e06:1b:1 : Thread : 1089898832 time:132
Memcached (write) set: Block: 6e06:1b:2 : Thread : 1089898832 time:126
Memcached (write) set: Block: 6e06:1b:3 : Thread : 1089898832 time:128
[0x7a5025]
[0x4ce335]
[0x656ac0]
[0x42af7d]
[0x50dc21]
[0x4d8b6d]
[0x557966]
[0x55956a]
[0x4e4aae]
[0x4e5e2e]
[0x4d84bd]
[0x65311d]
[0x8796d9]
Memcached get: Block: 6e06:17:3 : Thread : 1080105296 time:394
Memcached get: Block: 6e06:17:1 : Thread : 1080105296 time:128
Memcached get: Block: 6e06:17:13e : Thread : 1080105296 time:125
Memcached get: Block: 6e06:18:3 : Thread : 1080105296 time:141
Memcached get: Block: 6e06:18:1 : Thread : 1080105296 time:118
Memcached get: Block: 6e06:13:3 : Thread : 1080105296 time:147
Memcached get: Block: 6e06:13:1 : Thread : 1080105296 time:120
Memcached get: Block: 6e06:13:3e : Thread : 1080105296 time:118
Memcached get: Block: 6e06:13:5558 : Thread : 1080105296 time:121
Memcached get: Block: 6e06:13:4001 : Thread : 1080105296 time:159
090619 10:41:59 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=17
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x7f4da80bc450
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x4102af70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x7f4da827bec0 is an invalid pointer
thd->thread_id=46
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

<TpmC>
                 9427.333 TpmC

hmmmmm...

Just to be sure this is not something to do with dropping tables:

Turn waffle on...
Run TPCC
Drop Database
load dump
Run tpcc...

root@frankenmatt:/benchmarks/tpcc-mysql# nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 300 > tpcc_waffle2.out 2>&1 &
[2] 29865
root@frankenmatt:/benchmarks/tpcc-mysql# 090619 10:41:59 mysqld_safe Number of processes running now: 0
090619 10:41:59 mysqld_safe mysqld restarted

[2]+  Segmentation fault      nohup ./tpcc_start localhost tpcc tpcc tpcc 20 16 20 300 > tpcc_waffle2.out 2>&1

crashed it with:

Memcached (write) set: Block: 6e06:1b:0 : Thread : 1089898832 time:129
Memcached (write) set: Block: 6e06:1b:1 : Thread : 1089898832 time:132
Memcached (write) set: Block: 6e06:1b:2 : Thread : 1089898832 time:126
Memcached (write) set: Block: 6e06:1b:3 : Thread : 1089898832 time:128
[0x7a5025]
[0x4ce335]
[0x656ac0]
[0x42af7d]
[0x50dc21]
[0x4d8b6d]
[0x557966]
[0x55956a]
[0x4e4aae]
[0x4e5e2e]
[0x4d84bd]
[0x65311d]
[0x8796d9]
Memcached get: Block: 6e06:17:3 : Thread : 1080105296 time:394
Memcached get: Block: 6e06:17:1 : Thread : 1080105296 time:128
Memcached get: Block: 6e06:17:13e : Thread : 1080105296 time:125
Memcached get: Block: 6e06:18:3 : Thread : 1080105296 time:141
Memcached get: Block: 6e06:18:1 : Thread : 1080105296 time:118
Memcached get: Block: 6e06:13:3 : Thread : 1080105296 time:147
Memcached get: Block: 6e06:13:1 : Thread : 1080105296 time:120
Memcached get: Block: 6e06:13:3e : Thread : 1080105296 time:118
Memcached get: Block: 6e06:13:5558 : Thread : 1080105296 time:121
Memcached get: Block: 6e06:13:4001 : Thread : 1080105296 time:159
090619 10:41:59 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=17
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x7f4da80bc450
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x4102af70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x7f4da827bec0 is an invalid pointer
thd->thread_id=46
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

MadMatt (yonkovim) on 2009-06-19

summary:

- Testing TPCC-MYSQL get odd db/index coruption
+ Delete table/readd table can cause corruption, was: Testing TPCC-MYSQL
+ get odd db/index coruption

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-19:

#5

Ahhhh, this is a problem with a delete and re-add.

To repeat the steps,

Create 2 databases whose combined total fit into memcached, but which only 1 can fit into the internal BP at any one time. Make sure all the data is in memcached. Drop one of the databases. Then load the other database into the internal buffer pool. Then re-add the other database. This will work fine... I can even select data out of the database perfectly fine... But when I startup a dbt2/tpcc it crashes.

[0x7a5025]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7fbd59158716]
[0x442ced30]
090619 11:28:42 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=32
max_threads=151
threads_connected=32
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x3023540
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x442d7f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x302e950 = SELECT c_id
FROM customer
WHERE c_w_id = 1
AND c_d_id = 5
AND c_last = 'BARABLEABLE'
ORDER BY c_first ASC
thd->thread_id=121
thd->killed=NOT_KILLED

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-19:

#6

Both LRU and NONLRU Exhibit this issue.

Here is the error from the LRU version:

Memcached get: Block: 26b3:38:32 : Thread : 1150749008 time:325
Memcached set: Block: 26b3:a:6a30 : Thread : 1147554128 time:283
Memcached get: Block: 26b3:3c:1 : Thread : 1147554128 time:198
090619 18:20:45 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=18
max_threads=151
threads_connected=17
It is poss[0x7a5025]
[0x4ce335]
[0x656ac0]
[0x401214]
[0x557c5d]
[0x55956a]
[0x4e4aae]
[0x4e5e2e]
[0x4d84bd]
[0x65311d]
[0x8796d9]
ible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x7ffa1452f1b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x447e9f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x7ffa103df4d0 is an invalid pointer
thd->thread_id=196
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Memcached set: Block: 26b3:a:6a31 : Thread : 1147554128 time:144

**************
The quick solution here is actually to force file_per_table and then at ever space init increment a prefix to be used with memcached.

This is of course a complete hack.

The real solution is to clear memcached on a table drop... the question is how to effectively do this?

Both LRU and NONLRU Exhibit this issue.

Here is the error from the LRU version:

Memcached get: Block: 26b3:38:32 : Thread : 1150749008 time:325
Memcached set: Block: 26b3:a:6a30 : Thread : 1147554128 time:283
Memcached get: Block: 26b3:3c:1 : Thread : 1147554128 time:198
090619 18:20:45 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=18
max_threads=151
threads_connected=17
It is poss[0x7a5025]
[0x4ce335]
[0x656ac0]
[0x401214]
[0x557c5d]
[0x55956a]
[0x4e4aae]
[0x4e5e2e]
[0x4d84bd]
[0x65311d]
[0x8796d9]
ible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x7ffa1452f1b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x447e9f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x7ffa103df4d0 is an invalid pointer
thd->thread_id=196
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Memcached set: Block: 26b3:a:6a31 : Thread : 1147554128 time:144

**************
The quick solution here is actually to force file_per_table and then at ever space init increment a prefix to be used with memcached.

This is of course a complete hack.

The real solution is to clear memcached on a table drop...  the question is how to effectively do this?

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-20:

#7

Looking for an easy place to include or track deleted/reused space id's... Got all the way down to the file level here:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
/* The tablespace memory cache; also the totality of logs = the log data space,
is stored here; below we talk about tablespaces, but also the ib_logfiles
form a 'space' and it is handled here */

typedef struct fil_system_struct fil_system_t;
struct fil_system_struct {
mutex_t mutex; /* The mutex protecting the cache */
hash_table_t* spaces; /* The hash table of spaces in the
     system; they are hashed on the space
     id */
hash_table_t* name_hash; /* hash table based on the space
     name */
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Being a un-programer, I was curious as to if the hash table implementation in inno was dynamic ( built on a linked list )...
So I found where this was populated.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
system->spaces = hash_create(hash_size);
system->name_hash = hash_create(hash_size);
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So they are not. But looking at this made me scratch my head a bit. This means if I am reading this 100% correctly innodb has a fixed number of tables, and the memory footprint is getting pre-allocated to track these. So we could create a hash or an array to tracks tables then right? So lets look at the size

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fil_init(
/*=====*/
ulint max_n_open) /* in: max number of open files */
{
ulint hash_size;

ut_a(fil_system == NULL);

if (srv_file_per_table) {
hash_size = 50000;
} else {
hash_size = 5000;
}

fil_system = fil_system_create(hash_size, max_n_open);
}
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So with file per table turned on the hash table is allocated 50K nodes/records. Is an array more efficient then a hash table? I would think so as the array is more primitive, but I am not 100% sure.

I thought about adding the prefix hash/array to the fil_system_struct, however this is not accessed during the normal LRU/get process so we would have to add in more stpes to get it to work. But it seems like its the place it should go. Thoughts? Should we just make this a hash of 50K? and keep consistent? or should we make this an array outside the struct?

NOTE both the LRU and NONLRU code crash after table deletes. It's less likely in the LRU mode, but not all together eliminated ( 33% ).

Looking for an easy place to include or track deleted/reused space id's...  Got all the way down to the file level here:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
/* The tablespace memory cache; also the totality of logs = the log data space,
is stored here; below we talk about tablespaces, but also the ib_logfiles
form a 'space' and it is handled here */

typedef	struct fil_system_struct	fil_system_t;
struct fil_system_struct {
	mutex_t		mutex;		/* The mutex protecting the cache */
	hash_table_t*	spaces;		/* The hash table of spaces in the
					system; they are hashed on the space
					id */
	hash_table_t*	name_hash;	/* hash table based on the space
					name */
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Being a un-programer, I was curious as to if the hash table implementation in inno was dynamic ( built on a linked list )... 
So I found where this was populated.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
	system->spaces = hash_create(hash_size);
	system->name_hash = hash_create(hash_size);
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So they are not.  But looking at this made me scratch my head a bit.  This means if I am reading this 100% correctly innodb has a fixed number of tables, and the memory footprint is getting pre-allocated to track these.  So we could create a hash or an array to tracks tables then right?  So lets look at the size

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
	fil_init(
	/*=====*/
	ulint	max_n_open)	/* in: max number of open files */
	{
	ulint	hash_size;

ut_a(fil_system == NULL);

if (srv_file_per_table) {
		hash_size = 50000;
	} else {
		hash_size = 5000;
	}

fil_system = fil_system_create(hash_size, max_n_open);
	}
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So with file per table turned on the hash table is allocated 50K nodes/records.  Is an array more efficient then a hash table?  I would think so as the array is more primitive, but I am not 100% sure.

I thought about adding the prefix hash/array to the fil_system_struct, however this is not accessed during the normal LRU/get process so we would have to add in more stpes to get it to work.  But it seems like its the place it should go.  Thoughts? Should we just make this a hash of 50K?  and keep consistent?  or should we make this an array outside the struct?

NOTE both the LRU and NONLRU code crash after table deletes.  It's less likely in the LRU mode, but not all together eliminated ( 33% ).

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-28:

#8

I think alter table is also going to be an issue here as well...
This works via:
ha_innobase::discard_or_import_tablespace(

I am trying to trace back from the table create where the spaceid is assigned, but I can not find it:(

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-28:

#9

Spaceid's do not appear to be reused! I am verifying now.

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-28:

#10

Actually think the problem has to do with this error:

[0x7a5135]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7f4b72cac836]
[0x44d49cf0]

Revision history for this message

MadMatt (yonkovim) wrote on 2009-06-28:

#11

Download full text (3.2 KiB)

ok reproduced the error differently...
imported tpcc...

enabled memcached...

kicked off 1 tpcc run:
./tpcc_start localhost tpcc tpcc tpcc 20 64 10 300

Server: localhost (11211)
  pid: 21526
  uptime: 2846
  time: 1246225919
  version: 1.4.0-rc1
  pointer_size: 64
  rusage_user: 6.24376
  rusage_system: 19.433214
  curr_items: 246988
  total_items: 247112
  bytes: 4068732716
  curr_connections: 70
  total_connections: 112
  connection_structures: 71
  cmd_get: 290855
  cmd_set: 247112
  get_hits: 147393
  get_misses: 143462
  evictions: 0
  bytes_read: 4068732716
  bytes_written: 4068732716
  limit_maxbytes: 4294967296
  threads: 5

Kill that process

restarted it :

./tpcc_start localhost tpcc tpcc tpcc 20 64 10 300

in new single space create!!! #1 passed in id :1095197368
in new single space create!!! #2 passed in id :1095197368
in new single space create!!! #3 passed in id :1095197368
in space create!!! passed in id :18
in space create!!! Middle in the create id now :18
in space create!!! Lower in the create id now :18
in new single space create!!! #4 end passed in id :1095197368
[0x7a5135]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7f5c01667836]
[0x446516b0]
090628 17:52:10 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=64
max_threads=151
threads_connected=64
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x4391490
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x44654f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x4a577a0 = UPDATE customer SET c_balance = c_balance + ? , c_delivery_cnt = c_delivery_cnt + 1 WHERE c_id = ? AND c_d_id = ? AND c_w_id = ?
thd->thread_id=186
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
090628 17:52:10 mysqld_safe Number of processes running now: 0
090628 17:52:10 mysqld_safe mysqld restarted
InnoDB: The InnoDB memory heap has been disabled.
InnoDB: Mutex and rw_lock use atomics.
InnoDB: ios_per_array 256 read threads 1 write threads 1
in space create!!! passed in id :0
in space create!!! Middle in the create id now :0
in space create!!! Lower in the create id now :0
in space create!!! passed in id :4294967280
in space create!!! Middle in the create id now :4294967280
in space create!!! Lower in the create id now :4294967280

Need to upgrade t...

Attaching and error.log with extra debuging...  i was trying to see if one of the open connections to memcached caused the issue... does not look like it...  here is the code used to generate this:

buf_read_page_low(
/*==============*/
			/* out: 1 if a read request was queued, 0 if the page
			already resided in buf_pool, or if the page is in
			the doublewrite buffer blocks in which case it is never
			read into the pool, or if the tablespace does not
			exist or is being dropped */
	ulint*	err,	/* out: DB_SUCCESS or DB_TABLESPACE_DELETED if we are
			trying to read from a non-existent tablespace, or a
			tablespace which is just now being dropped */
	ibool	sync,	/* in: TRUE if synchronous aio is desired */
	ulint	mode,	/* in: BUF_READ_IBUF_PAGES_ONLY, ...,
			ORed to OS_AIO_SIMULATED_WAKE_LATER (see below
			at read-ahead functions) */
	ulint	space,	/* in: space id */
	ib_longlong tablespace_version, /* in: if the space memory object has
			this timestamp different from what we are giving here,
			treat the tablespace as dropped; this is a timestamp we
			use to stop dangling page reads from a tablespace
			which we have DISCARDed + IMPORTed back */
	ulint	offset)	/* in: page number */
{
	buf_block_t*	block;
	ulint		wake_later;
	char        tmpkey[40];
        /* Variables for WaffleGrid, the innodb_buffer_pool layer 2 cache */
        MEMC_CONN_T   *memct;
        ulint   sec,fsec,fmsec,msec;
        memcached_server_st *mcservers;
	*err = DB_SUCCESS;

wake_later = mode & OS_AIO_SIMULATED_WAKE_LATER;
	mode = mode & ~OS_AIO_SIMULATED_WAKE_LATER;

if (trx_doublewrite && space == TRX_SYS_SPACE
	    && (   (offset >= trx_doublewrite->block1
		    && offset < trx_doublewrite->block1
		    + TRX_SYS_DOUBLEWRITE_BLOCK_SIZE)
		   || (offset >= trx_doublewrite->block2
		       && offset < trx_doublewrite->block2
		       + TRX_SYS_DOUBLEWRITE_BLOCK_SIZE))) {
		ut_print_timestamp(stderr);
		fprintf(stderr,
			"  InnoDB: Warning: trying to read"
			" doublewrite buffer page %lu\n",
			(ulong) offset);

return(0);
	}

if (ibuf_bitmap_page(offset) || trx_sys_hdr_page(space, offset)) {

/* Trx sys header is so low in the latching order that we play
		safe and do not leave the i/o-completion to an asynchronous
		i/o-thread. Ibuf bitmap pages must always be read with
		syncronous i/o, to make sure they do not get involved in
		thread deadlocks. */

sync = TRUE;
	}

/* The following call will also check if the tablespace does not exist
	or is being dropped; if we succeed in initing the page in the buffer
	pool for read, then DISCARD cannot proceed until the read has
	completed */
	block = buf_page_init_for_read(err, mode, space, tablespace_version,
				       offset);
	if (block == NULL) {
		
		return(0);
	}

#ifdef UNIV_DEBUG
	if (buf_debug_prints) {
		fprintf(stderr,
			"Posting read request for page %lu, sync %lu\n",
			(ulong) offset,
			(ulong) sync);
	}
#endif

ut_a(block->state == BUF_BLOCK_FILE_PAGE);

/* set the block in memcached */
	memct = (MEMC_CONN_T*) thr_local_get_memc(os_thread_get_curr_id());

if (srv_memcached_enable>0) {
            fprintf(stderr,"Memcached miss Enabled ! \n");
                            if (!memct->mc_conn ||
                                (memct->mc_conn &&
                                memct->connect_prefix != srv_memcached_prefix)) {
                             fprintf(stderr,"Memcached if #1  ! \n");
                                if (memct->mc_conn) {
                                    memcached_free(memct->mc_conn);
                                    memct->mc_conn = NULL;
                                    thr_local_set_memc(os_thread_get_curr_id(),memct);
                                    fprintf(stderr,"Memcached connection found! ! \n");
                                }

/* need to initialize the memcached connection */
                            fprintf(stderr,"Memcached initialize connection \n");
                                memct->mc_conn = memcached_create(NULL);
                                mcservers = memcached_servers_parse(srv_memcached_servers);
                                memct->mcrc= memcached_server_push(memct->mc_conn, mcservers);
                                fprintf(stderr,"Memcached server set \n");
                                assert(memct->mcrc == MEMCACHED_SUCCESS);
                                memct->connect_prefix = srv_memcached_prefix;
                                fprintf(stderr,"Memcached server set \n");
                                thr_local_set_memc(os_thread_get_curr_id(),memct);
			  
			  sprintf(memct->mc_key,"%lx:%lx:%lx",srv_memcached_prefix,
                          (ulong) block->space,(ulong) block->offset);
                            ut_usectime(&sec,&msec);
                            fprintf(stderr,"Memcached all good \n");
                            }

if (memct->mc_conn) {
                if (srv_memcached_prefix != memct->connect_prefix) {
                    memct->mc_ret_value = NULL;
                } else {
                    /* set the memcache key */
                    sprintf(memct->mc_key,"%lx:%lx:%lx",srv_memcached_prefix,(ulong) block->space,
                            (ulong) block->offset);
                     fprintf(stderr,"Memcached setting key! \n");
                    ut_usectime(&sec,&msec);
                    if (block->space > 0) {
		    	   memct->mc_ret_value = memcached_get(memct->mc_conn, memct->mc_key, strlen(memct->mc_key),
                                            &memct->mc_ret_length, &memct->mc_flags, &memct->mcrc);
                           fprintf(stderr,"get key key! \n");
		    } else {
                            memct->mc_ret_value = NULL;
		    }
                    if (memct->mc_ret_value) {
                        if (memct->mc_flags != (uint32_t) crc32(0L, memct->mc_ret_value,memct->mc_ret_length)) {
                            fprintf(stderr,"Block corruption for key %s\n",memct->mc_key);
                            memct->mc_ret_value = NULL;
                        }
                    }

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-01:

#18

Bug is always related to page offset 1. At some point, there is a "set" for offset set followed by a get a few operations later that causes the crash.

**Memcached set: Block: 2763:4:1 : Thread : 2973268880 time:326
Memcached set: Block: 2763:4:29e : Thread : 2973268880 time:213
Memcached set: Block: 2763:4:2a0 : Thread : 2973268880 time:226
Memcached get: Block: 2763:4:385 : Thread : 2973268880 time:254
**Memcached get: Block: 2763:4:1 : Thread : 2973268880 time:223
[0x83f4821]
[0x810b62f]
[0xb7ff8420]
...

from buf_page_print, the page seems to be related to the insert buffer.

InnoDB: Page may be an insert buffer bitmap page

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-01:

#19

Easy way to reproduce... With a 8 MB buffer_pool (hacked 5.4 to allow smaller BP), load 100000 rows for the sysbench oltp test. Shutdown and restart the database and try a mysqldump > /dev/null.

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-01:

#20

Excluding insert buffer bitmap pages does not solve the bug. I found a faulty if statement in buf0flu.c:

if (srv_memcached_enable) {

it has been replaced by:

if (srv_memcached_enable == 1 && block->space > 0) {

in function buf_flush_write_block_low. Eventually, values 1 and 2 for srv_memcached_enable should have proper #define values.

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-01:

#21

Since the following procedure cause corruption....

1. start mysql
2. sysbench prepare
3. restart mysql
4. mysqldump

That point to the fact that the ibdata file is corrupted since so reads are done from memcached. The recent hack added code in buf0flu which handles writes to ibdata. I will focus on buf0flu for now.

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-01:

#22

The LRU code is working from version 2864 of branch lp:~y-trudeau/wafflegrid/cream-5.4-tree

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-01:

#23

The non-LRU code is working with 2864 of branch lp:~y-trudeau/wafflegrid/cream-5.4-tree if mysqld is started with MALLOC_CHECK_=0. Something, somewhere is freed twice, need to find out.

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-02:

#24

The bug is resolved in lp:~y-trudeau/wafflegrid/cream-5.4-tree revision 2865.

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-02:

#25

in lp:~y-trudeau/wafflegrid/cream-5.4-tree revision 2865

Changed in wafflegrid:
status:	Confirmed → Fix Released

Revision history for this message

MadMatt (yonkovim) wrote on 2009-07-05:

#26

THIS is still not fixed yet. Verified with your most recent revision....

-rwxr-xr-x 1 root root 7.9M 2009-07-05 18:46 /mysql_home/mysql540-waffle//libexec/mysqld
090705 18:48:37 [Note] /mysql_home/mysql540-waffle//libexec/mysqld: ready for connections.
Version: '5.4.2-beta' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Source distribution
[0x7a4d55]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6[0x7f06e508bbb3]
/lib/libc.so.6(cfree+0x76)[0x7f06e508c0a6]
[0x406d9950]
090705 18:49:10 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=16
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x43d9940
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x406d8f70 thread_stack 0x40000

Changed in wafflegrid:
status:	Fix Released → In Progress

Revision history for this message

Yves Trudeau (y-trudeau) wrote on 2009-07-06:

#27

On Matt server, I clean and recreated the datadir and sysbench now runs fine. Maybe a corruption of the ibdata file.

Revision history for this message

MadMatt (yonkovim) wrote on 2009-07-08:

#28

Yves, still getting crashes...

InnoDB master thread running with io_capacity 800
090707 21:29:43 [Note] Event Scheduler: Loaded 0 events
090707 21:29:43 [Note] /mysql_home/mysql540-waffle//libexec/mysqld: ready for connections.
Version: '5.4.2-beta' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Source distribution
[0x7a4d55]
[0x4ce335]
[0x656ac0]
/lib/libc.so.6(fgets_unlocked+0x26)[0x7f263e517836]
[0x41697cc0]
090707 21:31:23 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=262144
max_used_connections=16
max_threads=151
threads_connected=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 144381 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x442cae0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x416a0f70 thread_stack 0x40000
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x4910660 = SELECT count(c_id) FROM customer WHERE c_w_id = ? AND c_d_id = ? AND c_last = ?
thd->thread_id=7
thd->killed=NOT_KILLED

Revision history for this message

MadMatt (yonkovim) wrote on 2009-07-08:

#29

Crash happened after this:

Clean datadir...
Set Waffle off
Load tpcc data...
Run tpcc with no waffle
Waffle Enable ...
Run tpcc with waffle...
Waffle Times Unacceptable, 30sec + set times!
Kill tpcc
stop mysql, start mysql with waffle enabled
Runn Tpcc crash!!!! core dump.

[2]+ Segmentation fault (core dumped) ./tpcc_start localhost tpcc tpcc tpcc 20 16 10 600 > waffle_test.out

Revision history for this message

MadMatt (yonkovim) wrote on 2009-07-08:

#30

Tried this:

Clean datadir...
Set waffle = 0
restart mysql
Load tpcc
shutdown mysql
set waffle = 1
startup mysql
Run tpcc
Crash! same core dump

Revision history for this message

MadMatt (yonkovim) wrote on 2009-07-08:

#31

Sysbench fails still as well:

shutdown mysql
clean data dir
startup
load sysbench tables
root@frankenmatt:/development/sysbench-0.4.12# sysbench --test=oltp --mysql-table-engine=innodb --mysql-socket=/tmp/mysql.sock --oltp-read-only=on --num-threads=16 --max-requests=100 run --mysql-user=root --mysql-db=sbtest --oltp-table-size=10000000
sysbench 0.4.12: multi-threaded system evaluation benchmark

No DB drivers specified, using mysql
Running the test with following options:
Number of threads: 16

Doing OLTP test.
Running mixed OLTP test
Doing read-only test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Maximum number of requests for OLTP test is limited to 100
Threads started!
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 2 times)
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
(last message repeated 1 times)
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
(last message repeated 1 times)
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute MySQL query: `BEGIN`:
ALERT: Error 2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
Done.

seg fault....

Sysbench fails still as well:

shutdown mysql
clean data dir
startup
load sysbench tables
root@frankenmatt:/development/sysbench-0.4.12# sysbench --test=oltp --mysql-table-engine=innodb --mysql-socket=/tmp/mysql.sock --oltp-read-only=on --num-threads=16  --max-requests=100 run --mysql-user=root --mysql-db=sbtest --oltp-table-size=10000000
sysbench 0.4.12:  multi-threaded system evaluation benchmark

No DB drivers specified, using mysql
Running the test with following options:
Number of threads: 16

Doing OLTP test.
Running mixed OLTP test
Doing read-only test
Using Special distribution (12 iterations,  1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Maximum number of requests for OLTP test is limited to 100
Threads started!
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 2 times)
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
(last message repeated 1 times)
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
(last message repeated 1 times)
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute MySQL query: `BEGIN`:
ALERT: Error 2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
ALERT: failed to execute mysql_stmt_execute(): Err2013 Lost connection to MySQL server during query
FATAL: database error, exiting...
(last message repeated 1 times)
Done.

seg fault....

WaffleGrid

Delete table/readd table can cause corruption, was: Testing TPCC-MYSQL get odd db/index coruption

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches