Drizzle transaction log showing differences from server state in concurrent testing

Bug #673105 reported by Patrick Crews
Affects   Status         Importance   Assigned to
Drizzle   Fix Released   Critical     David Shrewsbury
7.0       Fix Released   Critical     David Shrewsbury

Bug Description

When testing Drizzle's transaction log in a concurrent environment with the randgen, we are seeing differences between the server's state and the SQL captured in the transaction log. The failures appear to be the results of INSERTs, UPDATEs, and DELETEs.

The test consists of executing several queries (in several threads) against the server. We then generate SQL from the transaction log and use the SQL to populate a validation server. We compare Drizzledump output between 'master' and 'slave' to validate.
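The validation step above boils down to diffing two dump files. A minimal Python sketch of that comparison (illustrative only; the actual harness shells out to drizzledump and `diff --unified`, and the file names here are made up):

```python
import difflib

def compare_dumps(master_dump: str, slave_dump: str) -> list:
    """Return unified-diff lines between master and slave dump text.

    An empty result means the slave (populated from the transaction
    log) matches the master, i.e. the log captured every change.
    """
    return list(difflib.unified_diff(
        master_dump.splitlines(),
        slave_dump.splitlines(),
        fromfile="master_dump.sql",
        tofile="slave_dump.sql",
        lineterm="",
    ))

# Example: a row present on the master but missing from the slave
master = "INSERT INTO `a` VALUES (1,'x');\nINSERT INTO `a` VALUES (2,'y');"
slave = "INSERT INTO `a` VALUES (1,'x');"
diff = compare_dumps(master, slave)
assert any(line.startswith("-INSERT INTO `a` VALUES (2") for line in diff)
```

Any non-empty diff is a replication failure: either the log recorded changes the master rolled back, or it missed changes the master kept.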

The failure is generally consistent, but appears to be subject to timing issues. The randgen fork()s off a process for each 'thread' used, and the exact timing and interaction between threads varies. However, we see failures at certain thread/query_count combinations:

2 threads, 2500 queries
5 threads, 70-75 queries
10 threads, 50 queries

This test grammar was designed to produce a higher percentage of valid queries, and the tables can easily grow quite large.

At present, I am not certain what is causing the issue, but it is an interaction between connections: these failures are not repeatable in single-thread scenarios.

TO REPEAT:
1) Start a drizzle server to test, enable the transaction log
./test-run --start-and-exit --drizzled=--transaction-log.enable --basedir=$PWD/.. (expected port 9306)
2) Start a second server to validate (expected port 9307)
3) Run the randgen and observe the results. Command line:
perl ./gentest.pl \
--dsn=dbi:drizzle:host=localhost:port=9306:user=root:password="":database=test \
--gendata=conf/drizzle/translog_drizzle.zz --grammar=conf/drizzle/translog_concurrent3.yy \
 --queries=70 --threads=5 --sqltrace --debug --Reporter=DrizzleConcurrentTransactionLog

Sample output:
# 2010-11-09T11:51:18 Executing diff --unified /tmp//translog_rpl_dump_4869_9306.sql /tmp//translog_rpl_dump_4869_9307.sql
--- /tmp//translog_rpl_dump_4869_9306.sql 2010-11-09 11:51:17.981424189 -0500
+++ /tmp//translog_rpl_dump_4869_9307.sql 2010-11-09 11:51:18.491212405 -0500
@@ -1014,21 +1014,46 @@
 INSERT INTO `a` VALUES (NULL,'a','u',6,'7073747639715495936','b','','GHGTD',8,'you','mean',1143,'b','RZARM','UUDDE','UKVQC','','','1147292005072633856','BSUXB',6,8,'there','',NULL);
 INSERT INTO `a` VALUES (NULL,'q','z',6,'7891995397013372928','e','','JFLJB',NULL,'okay','SALOT',1144,'t','ZPJCL','z','NMCDY','','','7','want',1,1,'TLLTD','',NULL);
 INSERT INTO `a` VALUES (NULL,'q','z',6,'7891995397013372928','e','','JFLJB',NULL,'okay','SALOT',1145,'t','ZPJCL','z','NMCDY','','','7','want',1,1,'TLLTD','',NULL);
+INSERT INTO `a` VALUES ('-7326793643778375680','h','m',1383792640,'6322209451898044416','v','','MZTZG',NULL,'will','he',1146,'u','MYQYS','I','that\'s','','','-7835700401671241728','k',6,3,'HMCNM','','6');
+INSERT INTO `a` VALUES ('-6968194523448999936','i','t',-1211891712,'-6355704974126612480','o','','j',NULL,'LQVCJ','we',1147,'c','could','MWBTI','I\'ll','','','3','that',0,-988217344,'CCQZH','','8');
 INSERT INTO `a` VALUES (NULL,'q','z',6,'7891995397013372928','e','','JFLJB',NULL,'okay','SALOT',1148,'t','ZPJCL','z','NMCDY','','','7','want',1,126,'TLLTD','',NULL);
+INSERT INTO `a` VALUES ('-6968194523448999936','i','t',-1211891712,'-6355704974126612480','o','','j',NULL,'LQVCJ','we',1149,'c','could','MWBTI','I\'ll','','','3','that',0,-988217344,'CCQZH','','8');
+INSERT INTO `a` VALUES ('-6882063180575539200','v','q',NULL,'6230167134513659904','e','','who',-351141888,'t','k',1150,'x','yes','l','s','','','8706302504637300736','y',-1453522944,5,'NTVJG','','287104476244869120');

Tags: replication


Changed in drizzle:
assignee: nobody → David Shrewsbury (dshrews)
status: New → Confirmed
tags: added: replication
Revision history for this message
Patrick Crews (patrick-crews) wrote :

Tarball with:
* transaction log after the randgen run
* dumpfile to create the initial test bed
* dumpfile from master post-run
* dumpfile from slave post-run (generated / populated from transaction log via transaction_reader)
* output from the randgen - this is the list of queries generated and sent to the server. Each query includes the thread_id in the comments.

This is from a 5 thread, 75 query run. The diff is simpler than in many previous runs (only one query appears to differ between master and slave):
--- /tmp//translog_rpl_dump_6637_9306.sql 2010-11-09 12:20:38.670957495 -0500
+++ /tmp//translog_rpl_dump_6637_9307.sql 2010-11-09 12:20:39.171210764 -0500
@@ -1930,6 +1930,22 @@
 INSERT INTO `aa` VALUES ('','e','d','',270270464,64290816,9,'EOADF','e','2','KQZTC','VWWRE','d','',990,'my','say','just','7','k','3','','y',5,'5');
 INSERT INTO `aa` VALUES ('','z','o','',5,6,NULL,'EPLEV','m','6','f','not','w','',991,'RHVXU','s','I\'m','9','f','-7584061772491915264','','j',1863581696,'9');
 INSERT INTO `aa` VALUES ('','o','x','',2,166133760,NULL,'RQWBU','g','3','your','QTFXV','h','',992,'a','hey','going','2','p','5483695496277000192','','e',751239168,'5769111122661605376');
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',993,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',994,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',995,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',996,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',997,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',998,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',999,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',1000,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',1001,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',1002,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',1003,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',1004,'NILAN','about','CVZCR','5','z','-4307411568603168768','','I',9,NULL);
+INSERT INTO `aa` VALUES ('','been','e','',3,5,NULL,'a','z',NULL,'OIILA','r','o','',1005,'NILA...


Changed in drizzle:
status: Confirmed → In Progress
importance: Undecided → Critical
Revision history for this message
Patrick Crews (patrick-crews) wrote :

Failing set of queries:
COMMIT /*Generated by THREAD_ID 2*/ ;
 SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED /*Generated by THREAD_ID 2*/ ;
 SET AUTOCOMMIT=ON /*Generated by THREAD_ID 2*/;
INSERT INTO `aa` ( `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` ) SELECT `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` FROM `a` ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 50 /*Generated by THREAD_ID 2*/;
# 2010-11-09T12:20:17 Query: INSERT INTO `aa` ( `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` ) SELECT `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` FROM `a` ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_n...

Revision history for this message
David Shrewsbury (dshrews) wrote :

This looks suspiciously related to the fix that I put in for bug 660779. An INSERT..SELECT failing mid-processing is supposed to undo GPB Statement message modifications. Probably something amiss there...
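The suspected mechanism can be sketched as a toy model (pure Python; the real code is the C++ TransactionServices/GPB message handling, and these class and method names are invented for illustration). A statement that fails mid-processing must truncate whatever row events it already appended to the in-progress Transaction message; if that undo step is skipped, phantom rows end up in the log:

```python
class TransactionMessage:
    """Toy stand-in for the GPB Transaction message."""
    def __init__(self):
        self.statements = []
        self._mark = 0

    def start_statement(self):
        # Remember how many entries existed, so a failed statement
        # can later be truncated away.
        self._mark = len(self.statements)

    def add_row_event(self, sql):
        self.statements.append(sql)

    def rollback_statement(self):
        # Undo everything appended since start_statement(); if this is
        # skipped on failure, phantom rows reach the slave (the bug).
        del self.statements[self._mark:]

msg = TransactionMessage()
msg.start_statement()
msg.add_row_event("INSERT INTO a VALUES (1)")
msg.start_statement()
msg.add_row_event("INSERT INTO aa SELECT ...")  # fails mid-processing
msg.rollback_statement()
assert msg.statements == ["INSERT INTO a VALUES (1)"]
```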

Revision history for this message
David Shrewsbury (dshrews) wrote :

Note to self:

Failing transaction_id in the supplied transaction log is 1393.

DELETE records should not have been a part of this same transaction.

Revision history for this message
Patrick Crews (patrick-crews) wrote :

The dumpfile will get the DB into a state just prior to the problem queries.

It should be noted that I don't see the INSERT failing consistently when repeating this run on my machine.
We should look at what the other threads were doing to the tables in question during the same time period, which could have caused the borked INSERT. Will pursue this angle next.

Revision history for this message
Patrick Crews (patrick-crews) wrote :

It looks like there were two other queries using table `a` in the same time frame as our failure here: an UPDATE to `a` and another INSERT...SELECT...FROM `a`. It is possible that we were *very* lucky in hitting this bug and the threads coordinated just right for us. I leave it to the experts to determine whether this might have contributed.

UPDATE `a` SET `col_bigint_key` = 7 WHERE `col_char_1024_not_null_key` >= 'b' ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 8 /*Generated by THREAD_ID 3*/;

 INSERT INTO `b` ( `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` ) SELECT `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` FROM `a` WHERE `col_char_10` >= 'c' ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 25 /*Generated by THREAD_ID 4*/ ;

# 2010-11-09T12:20:17 Query: INSERT INTO `aa` ( `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` ) SELECT `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_...


description: updated
Revision history for this message
Patrick Crews (patrick-crews) wrote :

I found a simpler diff on one of these runs. Here, it appears that the slave tables are missing two rows that exist in the master.

Looking deeper, it appears that two DELETE's conflicted.
In the transaction log in the tarball, there are two transactions:
transaction id: 2077 and 2081
Both of these are DELETE's against table cc and both reference pk 1236 and pk 1237.

Transaction 2081 deletes only these pks. It should be noted that transaction id 2081 is repeated 5x in the log file.
Transaction 2077 deletes 1236, 1237, 1238, and 1239.
I suspect that these two queries conflicted. There are other queries happening at the same time, but these are the most likely culprits:

DELETE FROM `cc` WHERE `col_text` < 'x' ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 2 /*Generated by THREAD_ID 4*/;

 DELETE FROM `cc` WHERE `col_text_not_null` < 'y' ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 2 /*Generated by THREAD_ID 5*/ ;

Revision history for this message
Patrick Crews (patrick-crews) wrote :

I should have added that pk's 1238 and 1239 are the two rows that are present in the master, but missing in the slave.

Revision history for this message
Joe Daly (skinny.moey) wrote :

It may be easier for debugging purposes to change the trans id to an atomic counter until bug 674588 is fixed. I did this to verify that the duplicate ids seen in this bug were not somehow related to the messages being incorrectly formed. They were not; the dup ids went away.
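What "make the trans id atomic" means in practice, as an illustrative Python sketch (Drizzle's actual fix is in C++; an atomic integer there plays the role of this lock-guarded counter):

```python
import itertools
import threading

class AtomicIdGenerator:
    """Hand out monotonic, duplicate-free transaction ids under concurrency."""
    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_id(self):
        # The lock makes read-and-increment a single step, so two
        # threads can never receive the same id.
        with self._lock:
            return next(self._counter)

gen = AtomicIdGenerator()
ids = []

def worker():
    for _ in range(1000):
        ids.append(gen.next_id())

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(ids) == len(set(ids))  # no duplicate ids across threads
```

With a non-atomic id source, the duplicated transaction ids seen in this bug's log would be expected under exactly this kind of concurrent load.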

Revision history for this message
David Shrewsbury (dshrews) wrote :

We noticed transaction ids were being duplicated in the transaction log. Ran the same tests against revno 1899 which does not have transaction ids being assigned through InnoDB. This problem still exists, but transaction ids were not duplicated (bug 674588). So these appear to be two separate problems.

Revision history for this message
Joe Daly (skinny.moey) wrote :

I'll be pushing up a partial fix for 674588 soon. Currently the only time I see dup ids with my fix is when I see this bug as well.

Although I have not been able to quantify what is going on in the database dump other than that it differs between master and slave.

Revision history for this message
David Shrewsbury (dshrews) wrote :

So far, the only thing I've been able to conclude is that the code I added for bug 660779 isn't even being called. So that theory is squashed.

Revision history for this message
David Shrewsbury (dshrews) wrote :

Found a case where a transaction has been broken up for some reason:

SET AUTOCOMMIT=OFF /*Generated by THREAD_ID 8*/ ;

 START TRANSACTION /*Generated by THREAD_ID 8*/ ;

 UPDATE `a` SET `col_int_not_null_key` = 6 WHERE `col_char_1024` IN ('a') ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 5 /*Generated by THREAD_ID 8*/ ;

 SELECT * FROM `c` AS X WHERE X . `col_enum_not_null_key` < 6 LIMIT 5 /*Generated by THREAD_ID 8*/ ;

 INSERT INTO `aa` ( `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` ) SELECT `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` , `col_char_1024_not_null_key` , `col_int` , `col_int_key` , `col_int_not_null` , `col_int_not_null_key` , `col_bigint` , `col_bigint_key` , `col_bigint_not_null` , `col_bigint_not_null_key` , `col_enum` , `col_enum_key` , `col_enum_not_null` , `col_enum_not_null_key` , `col_text` , `col_text_key` , `col_text_not_null` , `col_text_not_null_key` FROM `bb` WHERE `col_int_not_null_key` IN (25) ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 100 /*Generated by THREAD_ID 8*/ ;

 UPDATE `a` SET `col_int_not_null_key` = 5 WHERE `col_int` = 4 ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 5 /*Generated by THREAD_ID 8*/ ;

 INSERT INTO `bb` ( `col_char_10` , `col_char_10_key` , `col_char_10_not_null` , `col_char_10_not_null_key` , `col_char_1024` , `col_char_1024_key` , `col_char_1024_not_null` ,...


Revision history for this message
David Shrewsbury (dshrews) wrote :

Note also that this entire transaction should have been rolled back.

Revision history for this message
David Shrewsbury (dshrews) wrote :

I think we are dealing with more than one bug here.

It looks like there is an issue around the ROLLBACK TO SAVEPOINT code. If we remove all SAVEPOINT references from the test, failures occur MUCH less frequently.

Without SAVEPOINTs, there is still the rare failure. In this case, it looks like the master is having a failed statement rolled back, but the TransactionServices code fails to recognize it for some reason and does not roll back the GPB message appropriately. This one is more difficult to reproduce, but in the cases that I've seen, it has always been on a failing statement with AUTOCOMMIT=ON.

Revision history for this message
David Shrewsbury (dshrews) wrote :

I'm more and more convinced that this is related to transactions/statements being rolled back due to lock wait timeouts. For example, this failed statement made it into the transaction log:

SET AUTOCOMMIT=ON /*Generated by THREAD_ID 5*/;

UPDATE `AA` SET `col_int_not_null_key` = 50
WHERE `col_int_key` IN (8)
ORDER BY
   `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,
   `col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,
   `col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,
   `col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,
   `col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk`
LIMIT 5 /*Generated by THREAD_ID 5*/;

# 2010-11-29T10:17:54 Query: UPDATE `AA` SET `col_int_not_null_key` = 50 WHERE `col_int_key` IN (8) ORDER BY `col_bigint`,`col_bigint_key`,`col_bigint_not_null`,`col_bigint_not_null_key`,`col_char_10`,`col_char_1024`,`col_char_1024_key`,`col_char_1024_not_null`,`col_char_1024_not_null_key`,`col_char_10_key`,`col_char_10_not_null`,`col_char_10_not_null_key`,`col_enum`,`col_enum_key`,`col_enum_not_null`,`col_enum_not_null_key`,`col_int`,`col_int_key`,`col_int_not_null`,`col_int_not_null_key`,`col_text`,`col_text_key`,`col_text_not_null`,`col_text_not_null_key`,`pk` LIMIT 5 /*Generated by THREAD_ID 5*/ failed: 1213 Deadlock found when trying to get lock; try restarting transaction. Further errors of this kind will be suppressed.

The test cases I've created so far have all handled this error correctly. I do not know what the other mystery factor is that causes it to be handled incorrectly.

Revision history for this message
Patrick Crews (patrick-crews) wrote :

I agree with you here: there is a disconnect between what the engine is doing and how the log is interpreting things.

Would it be difficult for us to log:
1) When the log code starts a statement + the sql that caused it (I believe we have some of this already)
      Something like - New Transaction ID: NNNNN, SQL: UPDATE 'bb'...

2) When the log code ends / writes a message - "ID: NNNNN ended/written"

3) When the log code is rolling back something - rolling back ID: NNNNN

4) When a query is added to an existing message.

I'll be taking a look at the code myself to see about this, but figured I would ask the expert : )
I'm hoping that this can provide us with additional insight into how the log is managing / interpreting the queries the randgen is throwing at it. It's proving difficult to try to reproduce this any other way without more info on what the server is doing.
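A hypothetical sketch of the four proposed trace points (pure Python stand-in; the real hooks would live in the C++ TransactionServices code, and every name here is invented for illustration):

```python
class TransactionLogTrace:
    """Collects the four kinds of events proposed above, in order."""
    def __init__(self):
        self.events = []

    def on_start(self, trx_id, sql):
        # 1) log code starts a statement, plus the SQL that caused it
        self.events.append("New Transaction ID: %d, SQL: %s" % (trx_id, sql))

    def on_append(self, trx_id, sql):
        # 4) a query is added to an existing message
        self.events.append("ID: %d appended: %s" % (trx_id, sql))

    def on_write(self, trx_id):
        # 2) log code ends / writes a message
        self.events.append("ID: %d ended/written" % trx_id)

    def on_rollback(self, trx_id):
        # 3) log code rolls something back
        self.events.append("rolling back ID: %d" % trx_id)

trace = TransactionLogTrace()
trace.on_start(1393, "UPDATE `bb` ...")
trace.on_rollback(1393)
assert trace.events[-1] == "rolling back ID: 1393"
```

Interleaving a trace like this with the randgen's per-thread SQL trace would show exactly which thread's statements were grouped, written, or rolled back around a failure.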

Revision history for this message
Stewart Smith (stewart) wrote :

Going to attempt to explain what I'm doing here...

I have rigged the engine so that on every 2nd row insert, it returns LOCK_WAIT_TIMEOUT (and calls the appropriate mark_transaction_to_rollback)

So, in a single thread:

+INSERT INTO t1 (id,padding) values (40, "boo"),(41,"scary");
+Got one of the listed errors

which means what *should* happen is that we do the equiv of "begin; insert; insert (error); rollback;"

+begin;
+INSERT INTO t1 (id,padding) VALUES (1, "I love testing.");
+INSERT INTO t1 (id,padding) VALUES (2, "I hate testing.");
+Got one of the listed errors
+update t1 set padding="bar" where id=1;
+commit;
+INSERT INTO t1 (id,padding) VALUES (2, "I hate testing.");
+COMMIT;
+select * from t1;
+pk id padding
+2 1 bar
+3 2 I hate testing.

Note how we do *NOT* have the "boo" line from above.

+CREATE TABLE `test`.`t1` ( `pk` INT NOT NULL AUTO_INCREMENT, `id` INT NOT NULL, `padding` VARCHAR(200) COLLATE utf8_general_ci NOT NULL, PRIMARY KEY (`pk`) ) ENGINE=STORAGE_ENGINE_API_TESTER COLLATE = utf8_general_ci;
 COMMIT;
 START TRANSACTION;
+INSERT INTO `test`.`t1` (`pk`,`id`,`padding`) VALUES (1,40,'boo');
+INSERT INTO `test`.`t1` (`pk`,`id`,`padding`) VALUES (2,1,'I love testing.');
+UPDATE `test`.`t1` SET `padding`='bar' WHERE `pk`=2;
 COMMIT;

But notice how we *DO* have it here. We got the rollback to the engine okay, but *NOT* to the transaction log.

 START TRANSACTION;
+INSERT INTO `test`.`t1` (`pk`,`id`,`padding`) VALUES (3,2,'I hate testing.');
 COMMIT;
 START TRANSACTION;
 DROP TABLE `test`.`t1`;
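A toy Python model of what this rig exercises (names are illustrative, not the engine API): every 2nd row insert fails with a lock wait timeout, and on failure the statement's rows must be removed from both the engine and the in-progress log message. The bug is that the rollback reached the engine but not the log:

```python
class RiggedEngine:
    """Toy engine that, like the rig above, fails every 2nd row insert."""
    def __init__(self):
        self.rows = []
        self._insert_count = 0

    def insert(self, row):
        self._insert_count += 1
        if self._insert_count % 2 == 0:
            # Stands in for LOCK_WAIT_TIMEOUT marking the trx for rollback
            raise TimeoutError("LOCK_WAIT_TIMEOUT")
        self.rows.append(row)

def run_statement(engine, log_msg, rows):
    """Apply a multi-row INSERT; on failure, roll back engine AND log."""
    saved_rows = list(engine.rows)
    mark = len(log_msg)
    try:
        for row in rows:
            engine.insert(row)
            log_msg.append(row)
    except TimeoutError:
        engine.rows = saved_rows  # the engine rolls back...
        del log_msg[mark:]        # ...and so must the log (the missed step)

engine, log_msg = RiggedEngine(), []
run_statement(engine, log_msg, [(40, "boo"), (41, "scary")])  # 2nd row fails
assert engine.rows == [] and log_msg == []  # neither keeps the 'boo' row
```

Dropping the `del log_msg[mark:]` line reproduces the observed symptom: the 'boo' row vanishes from the engine but still appears in the replicated log.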

Revision history for this message
David Shrewsbury (dshrews) wrote :

Fixing bug 683147 (statement rollbacks) eliminated many of the errors. We've discovered that if we don't use SAVEPOINTs, the rest of the errors seem to go away.

I finally have a test case that proves that savepoints linger around longer than they should (i.e., they don't get deleted for some reason). The portion of the test that echoes "Definitely should have no savepoints" shows that we still have a savepoint, and the query "UPDATE t3 SET b = 'TRX1c'" will get rolled back after the "ROLLBACK TO SAVEPOINT A" by InnoDB, but the transaction log does not roll it back. If this were to operate correctly, there would be no lingering savepoint, the "ROLLBACK TO SAVEPOINT A" would produce an error, and the changes would show up in both InnoDB and the transaction log.

# Ignore startup/shutdown events
--disable_query_log
--source ../plugin/transaction_log/tests/t/truncate_log.inc
--enable_query_log

--disable_warnings
DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
--enable_warnings

CREATE TABLE t1(a INT NOT NULL AUTO_INCREMENT, b VARCHAR(10), c VARCHAR(10), PRIMARY KEY(a));
CREATE TABLE t2(a INT NOT NULL AUTO_INCREMENT, b VARCHAR(10), c VARCHAR(10), PRIMARY KEY(a));
CREATE TABLE t3(a INT NOT NULL AUTO_INCREMENT, b VARCHAR(10), c VARCHAR(10), PRIMARY KEY(a));

INSERT INTO t1 (b,c) VALUES ('1','ok'), ('3','ok');
INSERT INTO t2 (b,c) VALUES ('2','ok'), ('4','ok');
INSERT INTO t3 (b,c) VALUES ('1','ok'), ('2','ok'), ('3','ok'), ('4','ok'), ('5','ok'), ('6','ok'), ('7','ok'), ('8','ok'), ('9','ok'), ('10','ok');

--source ../plugin/transaction_log/tests/t/truncate_log.inc
--echo

--echo Test deadlock
--echo

# To create a deadlock (without a lock wait timeout), we need two
# connections dependent on each other:
# con1> lock t1;
# con2> lock t2;
# con2> lock t1; <-- a deadlock that would timeout
# con1> lock t2; <-- a deadlock causing InnoDB to choose a trx to rollback

connect (con1, localhost, root, , test);
connect (con2, localhost, root, , test);

connection con1;
SET AUTOCOMMIT=OFF;
START TRANSACTION;
UPDATE t1 SET c = 'trx1' WHERE a > 0;

SAVEPOINT A;
--echo
--echo Should have one savepoint: A
SELECT * FROM DATA_DICTIONARY.USER_SAVEPOINTS;
--echo

connection con2;
SET AUTOCOMMIT=OFF;
START TRANSACTION;
UPDATE t2 SET a = a*3;
send UPDATE t1 SET b = 'trx2' WHERE a > 0;
sleep 1;

connection con1;
--ERROR 1213
DELETE FROM t2 WHERE a > 0;

--echo
--echo Should have no savepoints
SELECT * FROM DATA_DICTIONARY.USER_SAVEPOINTS;
--echo
SAVEPOINT A;
UPDATE t3 SET c = 'TRX1a' WHERE a < 10 ORDER BY b LIMIT 4;
UPDATE t3 SET b = 'TRX1b' WHERE a < 10 ORDER BY c LIMIT 4;
ROLLBACK TO SAVEPOINT A;
SAVEPOINT A;
ROLLBACK;

START TRANSACTION;
--echo
--echo Definitely should have no savepoints
SELECT * FROM DATA_DICTIONARY.USER_SAVEPOINTS;
--echo
UPDATE t3 SET b = 'TRX1c';
ROLLBACK TO SAVEPOINT A;

COMMIT;

connection con2;
reap;
COMMIT;

disconnect con1;
disconnect con2;

connection default;

--echo
SELECT * FROM t1;
--echo
SELECT * FROM t2;
--echo
SELECT * FROM t3;

--echo
--replace_regex /start_timestamp: [0-9]+/START_TIMESTAMP/g /end_timestamp: [0-9]+/END_TIMESTAMP/g /creation_timestamp: [0-9]+/CREATE_TIMESTAMP/ /update_timestamp: [0-9]+/UPDATE_TIMESTAMP...


Revision history for this message
David Shrewsbury (dshrews) wrote :

Savepoints were lingering and/or not set up properly by the engines because the transaction was not started properly when a SAVEPOINT (or even ROLLBACK TO SAVEPOINT) implicitly begins a new transaction with AUTOCOMMIT=OFF.

If an implicit transaction is started by a statement that touches tables, the call to startStatement() from lockTables() would automatically start a transaction if one was needed. Since savepoints don't go through this code, we needed checks to look for this case.
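The essence of the fix, as a toy Python model (invented names; the real change is around startStatement()/lockTables() in the C++ code): savepoint statements bypass the path that implicitly starts a transaction, so they need their own check, and starting a fresh transaction must also clear any stale savepoints:

```python
class Session:
    """Toy model: SAVEPOINT must implicitly start a transaction."""
    def __init__(self):
        self.in_transaction = False
        self.savepoints = []

    def _ensure_transaction(self):
        # Statements that touch tables get this for free via the
        # lock-tables path; savepoints bypass it, hence this check
        # (the essence of the fix described above).
        if not self.in_transaction:
            self.in_transaction = True
            self.savepoints.clear()  # stale savepoints must not linger

    def savepoint(self, name):
        self._ensure_transaction()
        self.savepoints.append(name)

    def rollback(self):
        self.in_transaction = False
        self.savepoints.clear()

s = Session()
s.savepoint("A")
s.rollback()
s.savepoint("A")  # new implicit transaction; no stale 'A' from before
assert s.savepoints == ["A"]
```

Without the `_ensure_transaction` check, the second SAVEPOINT would join a transaction that was never properly started, which is exactly how the lingering savepoints in the test case above arose.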

Revision history for this message
Patrick Crews (patrick-crews) wrote : Re: [Bug 673105] Re: Drizzle transaction log showing differences from server state in concurrent testing

w00t! You are the man! 12 hour days and randgen crash courses paid off ; )

