WaffleGrid: InnoDB extensions to use memcached

Random Crashes and Index Curroption

Reported by MadMatt on 2009-05-22
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
WaffleGrid
Critical
MadMatt

Bug Description

I want to track all the related data we are collecting and all the things we try in this one bug.

Basically under heavy load, innodb will start to throw random errors in secondary indexes.

here is the error:

InnoDB: error in sec index entry update in
InnoDB: index `myidx1` of table `dbt2`.`new_order`
InnoDB: tuple DATA TUPLE: 3 fields;
 0: len 4; hex 80000001; asc ;;
 1: len 4; hex 80000bea; asc ;;
 2: len 4; hex 80000005; asc ;;

InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 32
 0: len 4; hex 80000001; asc ;;
 1: len 4; hex 80000bea; asc ;;
 2: len 4; hex 80000004; asc ;;

TRANSACTION 14469, ACTIVE 1 sec, process no 7982, OS thread id 2995481488 updating or deleting
mysql tables in use 1, locked 1
26 lock struct(s), heap size 2496, 65 row lock(s), undo log entries 60
MySQL thread id 31, query id 1246503 localhost root updating
DELETE FROM new_order
WHERE no_o_id = 3050
  AND no_w_id = 1
  AND no_d_id = 5

  Here is an example of the error with some extra logging turned on:

49332298-2994417918-Innodb inside btr_cur_search_to_nth_level index myidx1 thread: 1160882512
49332299-2994417994-... S1: space: 5 page_no 4 : thread : 1160882512
49332300-2994418044-... S1: space: 3 page_no 40949 : thread : 1100904784
49332301-2994418098-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1080850768
49332302-2994418175-... S1: space: 5 page_no 3 : thread : 1080850768
49332303-2994418225-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1100904784
49332304-2994418302-... S1: space: 3 page_no 3 : thread : 1100904784
49332305-2994418352-... S1: space: 5 page_no 1062 : thread : 1080850768
49332306-2994418405-... S1: space: 3 page_no 54 : thread : 1100904784
49332307-2994418456-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1100106064
49332308-2994418533-... S1: space: 3 page_no 40949 : thread : 1100904784
49332309-2994418587-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1100904784
49332310-2994418664-... S1: space: 3 page_no 3 : thread : 1100904784
49332311-2994418714-... S1: space: 3 page_no 54 : thread : 1100904784
49332312-2994418765-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1077369168
49332313-2994418842-... S1: space: 3 page_no 40949 : thread : 1100904784
49332314-2994418896-... S1: space: 7 page_no 3 : thread : 1077369168
49332315-2994418946-... S1: space: 7 page_no 329 : thread : 1077369168
49332316-2994418998-... S1: space: 7 page_no 2825 : thread : 1077369168
49332317-2994419051-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1100106064
49332318:2994419128:InnoDB: error in sec index entry update in
49332319-2994419171-InnoDB: index `myidx1` of table `dbt2`... S1: space: 3 page_no 57834 : thread : 1080318288
49332320-2994419263-Innodb inside btr_cur_search_to_nth_level index GEN_CLUST_INDEX thread: 1100372304
49332321-2994419348-... S1: space: 6 page_no 3 : thread : 1100372304
49332322-2994419398-... S1: space: 6 page_no 41 : thread : 1100372304
49332323-2994419449-... S1: space: 6 page_no 5974 : thread : 1100372304
49332324-2994419502-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1100106064
49332325-2994419579-... S1: space: 7 page_no 3 : thread : 1100106064
49332326-2994419629-.`new_order`
49332327-2994419642-InnoDB: tuple DATA TUPLE: 3 fields;
49332328-2994419678- 0: len 4; hex 80000006; asc ... S1: space: 7 page_no 358 : thread : 1100106064
49332329-2994419763-... S1: space: 7 page_no 17866 : thread : 1100106064
49332330-2994419817-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1100106064
49332331-2994419894-... S1: space: 7 page_no 3 : thread : 1100106064
49332332-2994419944-;; 1: len 4; hex 80000efd; asc ... S1: space: 1 page_no 57 : thread : 1081928016
49332333-2994420029-... S1: space: 1 page_no 17623 : thread : 1081928016
49332334-2994420083-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1100638544
49332335-2994420160-... S1: space: 5 page_no 3 : thread : 1100638544
49332336-2994420210-... S1: space: 5 page_no 849 : thread : 1100638544
49332337-2994420262-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1080318288
49332338-2994420339-Innodb inside btr_cur_search_to_nth_level index PRIMARY thread: 1100638544

This will eventually crash the machine.

Yves has managed to get this to fail with 1 single warehouse on DBT2. In order to do this you need to use a tmpfs ( fast disk ) and store everything on it.

I have gotten this to appear after about 20 minutes of running a 20W test with regular disk and a 768M BP....

We appear to be hitting some event. Possibly a flush or some process that only wakes up after so many changes.

Additionally we think we may have missed a call the invalidates, reads a page, moves a block/page, or does something outside of buf_LRU_search_and_free_block & buf_read_page_low ....

MadMatt (yonkovim) wrote :

tried walking through several different areas. Yves found fseg functions that can pruge pages outside the LRU process. This could mean that pages are left in memcached even after they have been "purged". He tried to fix this by calling a memcached get during the pruge process, but this still gave the error.

MadMatt (yonkovim) wrote :

One idea I had was to skip sending system tablespace pages to memcached. I did this by adding:

If ( block->space > 1)

and then calling the memcached functions. THis worked for several runs in a row, but lowered the transation per minute by about 50%. Additionally the number of memcached calls dropped by 2/3's that's a lot of space #1 calls. Odd things happened when I did this, on one run I did get a completely different error. I am enclosing the log.

"InnoDB: Page directory corruption: infimum not pointed to
090521 21:48:22 InnoDB: Page dump in ascii and hex (16384 bytes):"

This may or not be related. But this change should not have this type of blocks are changing outside the functions we hook into.

description: updated
MadMatt (yonkovim) wrote :

Had to change the code to where block > 0, 0 is the sys temp space.

This works, but still causes the random corruption.

Yves found another free page function in the fsp side of things in the fseg_free_page_low function. I want ahead and added this:

        /* Variables for WaffleGrid, the innodb_buffer_pool layer 2 cache */
        MEMC_CONN_T *memct;
        ulint sec,fsec,fmsec,msec;

        /* Drop search system page hash index if the page is found in
        the pool and is hashed */

memct = (MEMC_CONN_T*) thr_local_get_memc(os_thread_get_curr_id());

        if (srv_memcached_enable) {

            if (memct->mc_conn) {
                if (srv_memcached_prefix != memct->connect_prefix) {
                    memct->mc_ret_value = NULL;
                } else {
                    /* set the memcache key */
                    sprintf(memct->mc_key,"%lx:%lx:%lx",srv_memcached_prefix,(ulong) space,
                            (ulong) page);
                    ut_usectime(&sec,&msec);
                    memct->mc_ret_value = memcached_get(memct->mc_conn, memct->mc_key, strlen(memct->mc_key),
                                            &memct->mc_ret_length, &memct->mc_flags, &memct->mcrc);
                    if (memct->mc_ret_value) {
                        if (memct->mc_flags != (uint32_t) crc32(0L, memct->mc_ret_value,memct->mc_ret_length)) {
                            fprintf(stderr,"Block corruption for key %s\n",memct->mc_key);
                            memct->mc_ret_value = NULL;
                        }
                        fprintf(stderr, "found deleted page via fspc : %s \n", memct->mc_key);
                    }
                }
              }
          }

I got 427 "found deleted" messages. But still the random corruption.

Digging deeper into the fseg_free_page_low function the next call is to

btr_search_drop_page_hash_when_freed(space, page);

this function calls:

buf_page_get_gen which calls buf_read_page which calls read_page_low which is a memcached function....

This means the call I added was redundant. However I am still looking at that function, it does move the entire extend to the free lsit... so maybe this could have some negative impact.

MadMatt (yonkovim) wrote :

trying to add the fsp alloc page functions to memcached... its possible something is freeing the page we are missing... then the page gets reallocated, but the first get returns an old memcached page...

MadMatt (yonkovim) wrote :

in a 30 minute run over 945 "allocs" of a free page have a page in memcached, these may or may not have an empty page in memcached... corruption is still occuring however.

MadMatt (yonkovim) wrote :

Ok, now I am starting to move into the page/record moving/merging functions... the first on the list to look at :

/*****************************************************************
If page is the only on its level, this function moves its records to the
father page, thus reducing the tree height. */
static
void
btr_lift_page_up(

Which calls:

/*****************************************************************
Updates the lock table when a page is copied to another and the original page
is removed from the chain of leaf pages, except if page is the root! */

void
lock_update_copy_and_discard(
/*=========================*/

MadMatt (yonkovim) wrote :

I did not see anything of note in the lift_page_up function.

Looking into the file_io again...

root@frankenmatt:/development/mysql-5.4.0-beta/storage/innobase# grep "fil_io(OS_FILE_READ" */*
buf/buf0rea.c: *err = fil_io(OS_FILE_READ | wake_later,
fil/fil0fil.c: return(fil_io(OS_FILE_READ, sync, space_id, block_offset,
log/log0log.c: fil_io(OS_FILE_READ | OS_FILE_LOG, TRUE, group->space_id,
log/log0log.c: fil_io(OS_FILE_READ | OS_FILE_LOG, sync, group->space_id,
log/log0recv.c: error = fil_io(OS_FILE_READ, TRUE, recv_addr->space,
log/log0recv.c: fil_io(OS_FILE_READ | OS_FILE_LOG, TRUE, max_cp_group->space_id,
log/log0recv.c: fil_io(OS_FILE_READ | OS_FILE_LOG, TRUE, group->archive_space_id, 0, 0,
log/log0recv.c: fil_io(OS_FILE_READ | OS_FILE_LOG, TRUE,
trx/trx0sys.c: fil_io(OS_FILE_READ, TRUE, TRX_SYS_SPACE, TRX_SYS_PAGE_NO, 0,
trx/trx0sys.c: fil_io(OS_FILE_READ, TRUE, TRX_SYS_SPACE, block1, 0,
trx/trx0sys.c: fil_io(OS_FILE_READ, TRUE, TRX_SYS_SPACE, block2, 0,
trx/trx0sys.c: fil_io(OS_FILE_READ, TRUE, space_id, page_no, 0,

I don't see any read that would read in a file page outside of memcached.

MadMatt (yonkovim) wrote :

I am know leaning toward trying to hack file_io and pushing file_io writes to the tablespaces to memcached.... I think this will eliminate the secondary index issues. Additionally, I think we can then skip the "mru" memcached patch. Only an actual write will invalidate whats in memcached. I think we may need to push to memcached on read from disk ( maybe ) in order to get that to work.

MadMatt (yonkovim) wrote :

OK now I am ticked!!!

I added memcached set code to the buf_flush_buffered_writes which calls the file_io write function for all space related page operations... and I still get a secondary index error!!!!

This means the following:

A page is set into memcached via some mechanism
A page bypasses the normal memcached read process, to load a page into the BP
A page then is changed in the BP
Then that page is either re-read from memcached overwriting the change or the change is written to disk without going through the "normal" fil_io or lru process!!!! What the heck could it be!

MadMatt (yonkovim) wrote :
Download full text (4.1 KiB)

Here is the bug:
Memcached get: Block: 51f0:7:48d8 : Thread : 1095784784 time:226
Memcached get: Block: 51f0:1:6e8a : Thread : 1096051024 time:542
Memcached get: Block: 51f0:1:5c41 : Thread : 1095203152 time:463
Memcached get: Block: 51f0:7:2aff : Thread : 1095469392 time:614
Memcached get: Block: 51f0:1:8fb5 : Thread : 1094334800 time:473
Memcached get: Block: 51f0:4:2a8 : Thread : 1150839120 time:153
Memcached get: Block: 51f0:7:4e7 : Thread : 1150306640 time:921
InnoDB: error in sec index entry update in
InnoDB: index `myidx1` of table `dbt2`.`new_order`
InnoDB: tuple DATA TUPLE: 3 fields;
 0: len 4; hex 8000000f; asc ;; 1: len 4; hex 80000cec; asc ;; 2: len 4; hex 80000003; asc ;;

InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 32
 0: len 4; hex 8000000f; asc ;; 1: len 4; hex 80000cec; asc ;; 2: len 4; hex 80000002; asc ;;

TRANSACTION 0 1387899, ACTIVE 0 sec, process no 20975, OS thread id 1150572880 updating or deleting
mysql tables in use 1, locked 1
15 lock struct(s), heap size 3024, 24 row lock(s), undo log entries 22
MySQL thread id 28, query id 20822257 localhost root updating
DELETE FROM new_order
WHERE no_o_id = 3308
  AND no_w_id = 15
  AND no_d_id = 3

InnoDB: Submit a detailed bug report to http://bugs.mysql.com
Memcached get: Block: 51f0:7:6186 : Thread : 1096583504 time:536
Memcached get: Block: 51f0:7:2709 : Thread : 1094334800 time:773
Memcached get: Block: 51f0:7:270a : Thread : 1094334800 time:551
Memcached get: Block: 51f0:1:b37e : Thread : 1150040400 time:523

Memcached get: Block: 51f0:7:44d4 : Thread : 1094334800 time:270
Memcached get: Block: 51f0:1:d9e : Thread : 1095203152 time:216
Memcached get: Block: 51f0:7:46b5 : Thread : 1150306640 time:1036
Memcached get: Block: 51f0:1:2a65 : Thread : 1094867280 time:358
Memcached get: Block: 51f0:7:3495 : Thread : 1095469392 time:417
Memcached get: Block: 51f0:1:959c : Thread : 1097738576 time:323
Memcached get: Block: 51f0:3:a1c7 : Thread : 1094601040 time:450
Memcached get: Block: 51f0:1:9b18 : Thread : 1097738576 time:471
Memcached get: Block: 51f0:7:276 : Thread : 1150572880 time:270
Memcached get: Block: 51f0:7:4596 : Thread : 1094334800 time:277
Memcached get: Block: 51f0:7:f20 : Thread : 1150040400 time:273
Memcached get: Block: 51f0:7:28d1 : Thread : 1094334800 time:301
Memcached get: Block: 51f0:7:4b9c : Thread : 1094601040 time:655
Memcached get: Block: 51f0:1:6de2 : Thread : 1096317264 time:194
Memcached get: Block: 51f0:1:29ec : Thread : 1150839120 time:141
Memcached get: Block: 51f0:7:298 : Thread : 1150572880 time:521
Memcached get: Block: 51f0:1:3d0 : Thread : 1095203152 time:107
Memcached get: Block: 51f0:7:1e71 : Thread : 1094867280 time:2160
Memcached get: Block: 51f0:1:54c7 : Thread : 1094334800 time:317
Memcached get: Block: 51f0:1:a799 : Thread : 1094068560 time:450
InnoDB: error in sec index entry update in
InnoDB: index `myidx1` of table `dbt2`.`new_order`
InnoDB: tuple DATA TUPLE: 3 fields;
 0: len 4; hex 8000000f; asc ;; 1: len 4; hex 80000ced; asc ;; 2: len 4; hex 80000003; asc ;;

InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 32
 0: le...

Read more...

MadMatt (yonkovim) wrote :

Walking through the fil_io functions, the only function that reads directly from the datafiles without going through the normal process is fil_write_lsn_and_arch_no_to_file, I verified this is only called on startup in my failure scenario.

Another function I looked at is buf_flush_write_block_low which calls:

      if (!srv_use_doublewrite_buf || !trx_doublewrite) {
                 fprintf(stderr,"fil_io write without dblewrite \n");

                fil_io(OS_FILE_WRITE | OS_AIO_SIMULATED_WAKE_LATER,
                       FALSE, block->space, block->offset, 0, UNIV_PAGE_SIZE,
                       (void*)block->frame, (void*)block);
        } else {
                buf_flush_post_to_doublewrite_buf(block);
        }
}

This was called during startup, but not later on... so I think i can rule this out.

MadMatt (yonkovim) wrote :

For my experiment this morning I removed all the memcached code from the LRU. Instead If I read from disk I then push the read page to memcached, and then overwrite that page only on a call to fil_io write. The result ... it still fails.

1675830-113234822-Memcached get: Block: 35c:1:a905 : Thread : 1103825232 time:158
1675831-113234887-alloc_low found a page in memcached : 35c:0:4fb
1675832-113234937-Memcached get: Block: 35c:1:3b9 : Thread : 1101429072 time:346
1675833-113235001-Memcached get: Block: 35c:1:3b6 : Thread : 1101961552 time:541
1675834-113235065-Memcached get: Block: 35c:1:5ac2 : Thread : 1103292752 time:278
1675835-113235130-Memcached get: Block: 35c:7:1ed9 : Thread : 1083132240 time:437
1675836-113235195-alloc_low found a page in memcached : 35c:5:56
1675837-113235244-Memcached get: Block: 35c:7:6191 : Thread : 1102494032 time:629
1675838-113235309-Memcached get: Block: 35c:1:9ec7 : Thread : 1101162832 time:434
1675839-113235374-Memcached get: Block: 35c:1:9299 : Thread : 1103558992 time:459
1675840-113235439-Memcached get: Block: 35c:1:154 : Thread : 1101961552 time:400
1675841-113235503-Memcached get: Block: 35c:1:ae94 : Thread : 1100630352 time:1103
1675842-113235569-Memcached get: Block: 35c:1:4cf : Thread : 1101429072 time:354
1675843-113235633-Memcached get: Block: 35c:1:9b8d : Thread : 1105553744 time:195
1675844:113235698:InnoDB: error in sec index entry update in
1675845-113235741-InnoDB: index `myidx1` of table `dbt2`.`new_order`
1675846-113235792-InnoDB: tuple DATA TUPLE: 3 fields;
1675847-113235828- 0: len 4; hex 80000013; asc ;; 1: len 4; hex 80000bb9; asc ;; 2: len 4; hex 80000004; asc ;;
1675848-113235934-
1675849-113235935-InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 32
1675850-113236008- 0: len 4; hex 80000013; asc ;; 1: len 4; hex 80000bb9; asc ;; 2: len 4; hex 80000003; asc ;;
1675851-113236114-
1675852-113236115-TRANSACTION 0 1028740, ACTIVE 0 sec, process no 859, OS thread id 1102494032 updating or deleting
1675853-113236213-mysql tables in use 1, locked 1
1675854-113236245-20 lock struct(s), heap size 3024, 47 row lock(s), undo log entries 44
1675855-113236316-MySQL thread id 19, query id 15449238 localhost root updating
1675856-113236378-DELETE FROM new_order
1675857-113236400-WHERE no_o_id = 3001
1675858-113236421- AND no_w_id = 19
1675859-113236440- AND no_d_id = 4
1675860-113236458-
1675861-113236459-InnoDB: Submit a detailed bug report to http://bugs.mysql.com

MadMatt (yonkovim) wrote :

After thinking about this, I think this is an issue with the async nature of a flush. Since the page/block could change ( i think ) between the time when the write is requested and the time when the actual write occurs I think we need to look at move the push to memcached down all the way to os_file_write .... maybe.

MadMatt (yonkovim) wrote :

could this be the insert buffer?

MadMatt (yonkovim) wrote :

AHHHH

/*************************************************************************
When an index page is read from a disk to the buffer pool, this function
inserts to the page the possible index entries buffered in the insert buffer.
The entries are deleted from the insert buffer. If the page is not read, but
created in the buffer pool, this function deletes its buffered entries from
the insert buffer; there can exist entries for such a page if the page
belonged to an index which subsequently was dropped. */

void
ibuf_merge_or_delete_for_page(

This is called by complete_io which we do not call after copying from memcached! I am going to try this...

Also attaching the patch that removed the LRU push...

MadMatt (yonkovim) wrote :
Download full text (3.3 KiB)

complete_io does seem to fix some issues... however it raises others:

root@frankenmatt:/benchmarks/waffle_2.0# grep -n "7:1 " /mysql_home/mysql540-waffle/error.log
51:Memcached miss: Block: bf2:7:1 : Thread : 1144666448 time:51
52:Memcached set: Block: bf2:7:1 : Thread : 1144666448 time:468
53:Memcached set: Block: bf2:7:1 : Thread : 1144666448 time:8401
147431:Memcached get: Block: bf2:7:1 : Thread : 1144666448 time:125
147434:Memcached set: Block: bf2:7:1 : Thread : 1144666448 time:6028
185727:Memcached (write) set: Block: bf2:7:1 : Thread : 1144400208 time:82
454229:Memcached (write) set: Block: bf2:7:1 : Thread : 1144400208 time:81
743017:Memcached (write) set: Block: bf2:7:1 : Thread : 1144400208 time:137
1559529:Memcached (write) set: Block: bf2:7:1 : Thread : 1144400208 time:77
2209311:Memcached (write) set: Block: bf2:7:1 : Thread : 1144400208 time:85
root@frankenmatt:/benchmarks/waffle_2.0# grep -n "7:3 " /mysql_home/mysql540-waffle/error.log
50:Memcached miss: Block: bf2:7:3 : Thread : 1144666448 time:382
147430:Memcached miss: Block: bf2:7:3 : Thread : 1144666448 time:52
root@frankenmatt:/benchmarks/waffle_2.0# grep -i -n -b3 error /mysql_home/mysql540-waffle/error.log
147429-9567137-Memcached miss: Block: bf2:0:2 : Thread : 1144666448 time:0
147430-9567198-Memcached miss: Block: bf2:7:3 : Thread : 1144666448 time:52
147431-9567260-Memcached get: Block: bf2:7:1 : Thread : 1144666448 time:125
147432:9567322:090528 11:45:49 InnoDB: Error: space id and page n:o stored in the page
147433-9567395-InnoDB: read in are 7:3, should be 7:1!
147434-9567435-Memcached set: Block: bf2:7:1 : Thread : 1144666448 time:6028
147435-9567498-Memcached miss: Block: bf2:7:147 : Thread : 1144666448 time:43
--
147477-9570174-Memcached miss: Block: bf2:1:27 : Thread : 1144932688 time:51
147478-9570237-Memcached set: Block: bf2:2:116 : Thread : 1145198928 time:15078
147479-9570303-Memcached get: Block: bf2:1:1 : Thread : 1144666448 time:126
147480:9570365:090528 11:45:49 InnoDB: Error: space id and page n:o stored in the page
147481-9570438-InnoDB: read in are 1:3, should be 1:1!
147482-9570478-Memcached set: Block: bf2:1:1 : Thread : 1144666448 time:16129
147483-9570542-Memcached miss: Block: bf2:3:3 : Thread : 1144666448 time:53
--
149045-9670798-Memcached get: Block: bf2:1:e8d : Thread : 1147861328 time:153
149046-9670862-Memcached set: Block: bf2:3:b8f : Thread : 1147062608 time:24442
149047-9670928-Memcached get: Block: bf2:7:4001 : Thread : 1146263888 time:223
149048:9670993:090528 11:45:51 InnoDB: Error: space id and page n:o stored in the page
149049-9671066-InnoDB: read in are 7:22621, should be 7:16385!
149050-9671114-Memcached set: Block: bf2:7:4001 : Thread : 1146263888 time:3318
149051-9671180-Memcached miss: Block: bf2:7:432c : Thread : 1146263888 time:48
--
174720-11338188-Memcached set: Block: bf2:1:63ca : Thread : 1145731408 time:8270
174721-11338254-Memcached get: Block: bf2:1:85a2 : Thread : 1145731408 time:161
174722-11338319-Memcached get: Block: bf2:1:8001 : Thread : 1145731408 time:121
174723:11338384:090528 11:46:11 InnoDB: Error: space id and page n:o stored in the page
174724-11338457-InnoDB: read...

Read more...

MadMatt (yonkovim) wrote :
MadMatt (yonkovim) wrote :

path_working.out is a patch that completely rearchitects waffle to not store on LRU instead it stores on get. This does bypass the "sec index error" ... however the IO complete throws a different error the the space id & page do not match. Could this be a memcached bug?

MadMatt (yonkovim) wrote :

This makes no sense:

Memcached miss: Block: 3278:7:6b21 : Thread : 1150593360 time:50
Disk Read: Block: 3278:7:6b21 : Thread : 1150593360 time:304
Memcached set (via disk read): Block: 3278:7:6b21 : Thread : 1150593360 time:429 settime: 1243546013 : 706607
Memcached miss: Block: 3278:1:3 : Thread : 1150593360 time:249
Memcached miss: Block: 3278:1:1 : Thread : 1150593360 time:45
Disk Read: Block: 3278:1:1 : Thread : 1150593360 time:309
Memcached set (via disk read): Block: 3278:1:1 : Thread : 1150593360 time:416 settime: 1243546014 : 451986
Disk Read: Block: 3278:1:1 : Thread : 1150593360 time:4695
Memcached set (via disk read): Block: 3278:1:1 : Thread : 1150593360 time:4774 settime: 1243546014 : 452069
Memcached miss: Block: 3278:1:2 : Thread : 1150593360 time:25
Disk Read: Block: 3278:1:2 : Thread : 1150593360 time:306
Memcached set (via disk read): Block: 3278:1:2 : Thread : 1150593360 time:410 settime: 1243546014 : 452511

we issue a miss for 1:3 & 1:1 coming from the exact same thread... both misses come back and push into the same var....

Maybe a sync vs async issue?

MadMatt (yonkovim) wrote :

Ahhhhhh, its not thread safe...

Memcached set (via disk read): Block: 3a05:1:63ca : tmpkey: 63ca : Thread : 1077635408 time:433 settime: 1243554853 : 211071
Memcached miss: Block: 3a05:1:875d : Thread : 1077635408 time:39
Disk Read: Block: 3a05:1:875d : Thread : 1077635408 time:5905
Memcached set (via disk read): Block: 3a05:1:875d : tmpkey: 875d : Thread : 1077635408 time:6026 settime: 1243554853 : 217150
Memcached miss: Block: 3a05:9:3 : Thread : 1077635408 time:49
Memcached miss: Block: 3a05:9:1 : Thread : 1077635408 time:43
Disk Read: Block: 3a05:9:1 : Thread : 1077635408 time:11414
Memcached set (via disk read): Block: 3a05:9:1 : tmpkey: 1 : Thread : 1077635408 time:11541 settime: 1243554853 : 237219
Disk Read: Block: 3a05:9:1 : Thread : 1077635408 time:19482
Memcached set (via disk read): Block: 3a05:9:1 : tmpkey: 3 : Thread : 1077635408 time:19600 settime: 1243554853 : 237342
Memcached miss: Block: 3a05:9:2 : Thread : 1077635408 time:38
Disk Read: Block: 3a05:9:2 : Thread : 1077635408 time:316
Memcached set (via disk read): Block: 3a05:9:2 : tmpkey: 2 : Thread : 1077635408 time:437 settime: 1243554853 : 237828

Two calls to memcached happen at the same time, since the "key" is set per thread... one overwrite the other. THe tmpkey above is the actual block->offset that the page should be saved under! See its pushed to 1, but really its got 3's data.

MadMatt (yonkovim) wrote :

1 hour, no errors. TODO:

- Fix LRU version
- decide which version to promote
- Fix doublewrite issue ( open another bug )

MadMatt (yonkovim) wrote :

Fixed in Cream 0.5

Changed in wafflegrid:
assignee: nobody → MadMatt (yonkovim)
importance: Undecided → Critical
status: New → Fix Committed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers