Comment 1 for bug 927996

Teemu Ollakka (teemu-ollakka) wrote :


I was able to reproduce similar results in some of the cases. Usually the cause seemed to be that although the slave nodes had been started, they had not yet received a state snapshot. Looking at the code, I see that the is_started() method only checks whether the server pid file has been created. This is not enough to ensure that a wsrep-enabled server is actually synchronized with the other nodes. The method should also check the value of the 'wsrep_ready' status variable: if it is 'ON', the node is synchronized with the group.
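
To illustrate the kind of check meant here, a minimal sketch follows. It is not kewpie code: run_query is a hypothetical callable that takes an SQL string and returns rows as (Variable_name, Value) tuples, and the function names, timeout and polling interval are assumptions made for the example.

```python
import time


def is_wsrep_ready(run_query):
    """Return True if the node reports wsrep_ready = ON.

    run_query is a hypothetical callable (not part of kewpie) that takes
    an SQL string and returns rows as (Variable_name, Value) tuples.
    """
    rows = run_query("SHOW STATUS LIKE 'wsrep_ready'")
    return any(
        name == "wsrep_ready" and str(value).upper() == "ON"
        for name, value in rows
    )


def wait_until_synced(run_query, timeout=30.0, interval=0.5, sleep=time.sleep):
    """Poll wsrep_ready until it is ON or the timeout expires.

    Returns True once the node is ready, False on timeout.
    The timeout and interval defaults are arbitrary example values.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_wsrep_ready(run_query):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

An is_started()-style method could then require both the pid file and a successful wait_until_synced() call before reporting the server as started.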

There were also other kinds of test failures, which were related to query causality. Although galera ensures that all changes have been received on all nodes before control is returned to the client, by default it does not guarantee that all changes have been applied. For this reason there is the 'wsrep_causal_reads' session variable, which, if set to '1', guarantees that all previously replicated changes are also applied before a query is actually executed. While this should be enough to guarantee strict consistency for autocommit DML, unfortunately it seems that even this is not enough for DDLs (for reasons I'm not completely sure about yet). However, with the following hack to kewpie I was able to get rid of the causality-related failures even with DDLs.

The following patch enforces one causal read on each slave in check_slaves_by_query() and check_slaves_by_checksum() before running the actual check query.

=== modified file 'lib/util/'
--- lib/util/ 2012-02-04 23:03:30 +0000
+++ lib/util/ 2012-02-07 10:48:40 +0000
@@ -87,6 +87,16 @@
         return results

+    def causal_read(self, server):
+        """ Execute causal read on server to make sure that all
+            changes from master have been propagated and applied
+            (galera specific)
+        """
+        queries = ["SET wsrep_causal_reads=1", "SELECT 0"]
+        self.execute_queries(queries, server)
+        return None
+
     def check_slaves_by_query( self
                              , master_server
                              , other_servers
@@ -111,6 +121,7 @@
             # run against master for 'good' value
             retcode, expected_result = self.execute_query(query, master_server)
         for server in other_servers:
+            self.causal_read(server)
             retcode, slave_result = self.execute_query(query, server)
             #print "%s: expected_result= %s | slave_result= %s" % (
             # , expected_result
@@ -149,6 +160,7 @@
         comp_results = {}
         logging = master_server.logging
         for server in other_servers:
+            self.causal_read(server)
             for schema in schemas:
                 for table in self.get_tables(master_server, schema):
                     query = "CHECKSUM TABLE %s.%s" %(schema, table)