Many kewpie replication test cases are failing
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL patches by Codership | Invalid | Undecided | Unassigned |
Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Invalid | Undecided | Unassigned |
Bug Description
When running kewpie test cases against codership-mysql 5.5 and percona-xtradb-cluster, many replication test cases fail. The failing cases are not the same for both versions of the server, but they are similar (indications of failed replication); output is below.
To repeat, from a fresh branch of lp:kewpie and the attached branch (which merges in a config file + test cases):
./kewpie.py --sys-config=
The qp directory includes a patch for the tests that alters the tearDown behavior to leave the test schema intact if only one test runs / on the first failure; this had caused some problems previously.
20120206-184952 cluster_
20120206-184952 test_replace (replaceMultiRo
20120206-184952 ERROR
20120206-184952
20120206-184952 =======
20120206-184952 ERROR: test_replace (replaceMultiRo
20120206-184952 -------
20120206-184952 Traceback (most recent call last):
20120206-184952 File "/home/
20120206-184952 self.assertEqua
20120206-184952 AssertionError: (<type 'exceptions.
20120206-184952
20120206-184952 =======
20120206-184952 FAIL: test_replace (replaceMultiRo
20120206-184952 -------
20120206-184952 Traceback (most recent call last):
20120206-184952 File "/home/
20120206-184952 self.assertEqua
20120206-184952 AssertionError: {'s2': ["t1: master_checksum= (('test.t1', 2836003186L),) | slave_checksum= Error 1047: Unknown command"]}
20120206-184952
20120206-184952 -------
Patrick,
I was able to reproduce similar results from some of the cases, and usually it seemed that although the slave nodes were started, they had not yet received the state snapshot. Looking inside galera.py, I see that the is_started() method just checks whether the server pid file has been created. That is not enough to make sure that a wsrep-enabled server is actually synchronized with the other nodes; the method should also check the value of the 'wsrep_ready' status variable. If it is 'ON', the node is synchronized with the group.
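To make that concrete, here is a minimal sketch of such a readiness check, assuming a plain MySQLdb connection to the node; kewpie's actual galera.py server object and its connection handling will differ:

import MySQLdb

def is_synced(host, port, user="root", passwd=""):
    """Return True only once the node reports wsrep_ready = 'ON',
       i.e. it has received its state snapshot and joined the group."""
    conn = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd)
    try:
        cursor = conn.cursor()
        cursor.execute("SHOW STATUS LIKE 'wsrep_ready'")
        row = cursor.fetchone()
        # row is (variable_name, value), or None if wsrep is not available
        return row is not None and row[1] == "ON"
    finally:
        conn.close()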
There were also other kinds of test failures, which were related to query causality. Although Galera ensures that all changes are received on all nodes before control is returned to the client, it does not guarantee by default that all changes are applied. For this reason there is the 'wsrep_causal_reads' session variable which, if set to '1', guarantees that all previously replicated changes are also applied before a query is actually executed. While this should be enough to guarantee strict consistency for autocommit DML, unfortunately it seems that even this is not enough for DDLs (for reasons I'm not completely sure about yet), but with the following hack to kewpie I was able to get rid of the causality-related failures even with DDLs.
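As an illustration of the mechanism (outside kewpie's helpers), a causal read boils down to enabling the session variable and then issuing any read, which blocks until the node has applied everything replicated before it; the connection handling here is hypothetical:

def causal_select(conn, query):
    """Run `query` with Galera causal reads enabled, so the node first
       applies all previously replicated write-sets. wsrep_causal_reads
       is session-scoped, so this only affects this connection."""
    cursor = conn.cursor()
    cursor.execute("SET SESSION wsrep_causal_reads = 1")
    cursor.execute(query)  # blocks until the node has caught up
    return cursor.fetchall()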
The following patch enforces one causal read on each slave in check_slaves_by_query() and check_slaves_by_checksum() before running the actual check query.
=== modified file 'lib/util/mysqlBaseTestCase.py'
--- lib/util/mysqlBaseTestCase.py	2012-02-04 23:03:30 +0000
+++ lib/util/mysqlBaseTestCase.py	2012-02-07 10:48:40 +0000
@@ -87,6 +87,16 @@
             results.append(table_name)
         return results

+    def causal_read(self, server):
+        """ Execute causal read on server to make sure that all
+            changes from master have been propagated and applied
+            (galera specific)
+
+        """
+        queries = ["SET wsrep_causal_reads=1", "SELECT 0"]
+        self.execute_queries(queries, server)
+        return None
+
     def check_slaves_by_query(self
                              , master_server
                              , other_servers
@@ -111,6 +121,7 @@
         # run against master for 'good' value
         retcode, expected_result = self.execute_query(query, master_server)
         for server in other_servers:
+            self.causal_read(server)
             retcode, slave_result = self.execute_query(query, server)
             #print "%s: expected_result= %s | slave_result= %s" % (server.name
             #                                                     , expected_result
@@ -149,6 +160,7 @@
         comp_results = {}
         logging = master_server.logging
         for server in other_servers:
+            self.causal_read(server)
             for schema in schemas:
                 for table in self.get_tables(master_server, schema):
                     query = "CHECKSUM TABLE %s.%s" % (schema, table)
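Presumably the "SELECT 0" in causal_read() is what actually triggers the wait: once wsrep_causal_reads=1 is set for the session, the next read blocks until the node has applied every write-set replicated before it, so even a trivial query is enough to act as a synchronization barrier.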