Many kewpie replication test cases are failing

Bug #927996 reported by Patrick Crews
This bug affects 1 person
Affects: MySQL patches by Codership
    Status: Invalid, Importance: Undecided, Assigned to: Unassigned
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
    Status: Invalid, Importance: Undecided, Assigned to: Unassigned

Bug Description

When running kewpie test cases against codership-mysql 5.5 and percona-xtradb-cluster 5.5 with the latest galera-2.x library, I am seeing a number of failing test cases.

The failing cases are not the same for both versions of the server, but they are similar (indications of failed replication); output is below.

To repeat:
From a fresh branch of lp:kewpie and the attached branch (which merges in a config file + test cases), run:
./kewpie.py --sys-config=codership-mysql-basedir/qp/qp-config.py --wsrep-provider-path=/path (default=/usr/lib/galera/libgalera_smm.so)

The qp directory includes a patch for the tests that alters the tearDown behavior to leave the test schema intact if only one test runs or on the first failure; this had caused some problems previously.

20120206-184952 cluster_basic.replaceMultiRow_test [ fail ] 12258
20120206-184952 test_replace (replaceMultiRow_test.basicTest) ... FAIL
20120206-184952 ERROR
20120206-184952
20120206-184952 ======================================================================
20120206-184952 ERROR: test_replace (replaceMultiRow_test.basicTest)
20120206-184952 ----------------------------------------------------------------------
20120206-184952 Traceback (most recent call last):
20120206-184952 File "/home/pcrews/bzr/work/pxc-beta/codership-mysql/kewpie/lib/util/mysqlBaseTestCase.py", line 54, in tearDown
20120206-184952 self.assertEqual(retcode,0,result)
20120206-184952 AssertionError: (<type 'exceptions.Exception'>, OperationalError(1047, 'Unknown command'))
20120206-184952
20120206-184952 ======================================================================
20120206-184952 FAIL: test_replace (replaceMultiRow_test.basicTest)
20120206-184952 ----------------------------------------------------------------------
20120206-184952 Traceback (most recent call last):
20120206-184952 File "/home/pcrews/bzr/work/pxc-beta/codership-mysql/qp/cluster_basic/replaceMultiRow_test.py", line 62, in test_replace
20120206-184952 self.assertEqual(master_slave_diff, None, master_slave_diff)
20120206-184952 AssertionError: {'s2': ["t1: master_checksum= (('test.t1', 2836003186L),) | slave_checksum= Error 1047: Unknown command"]}
20120206-184952
20120206-184952 ----------------------------------------------------------------------

summary: - Many replication test cases are failing
+ Many kewpie replication test cases are failing
Revision history for this message
Teemu Ollakka (teemu-ollakka) wrote :

Patrick,

I was able to reproduce similar results with some of the cases, and usually it seemed that although the slave nodes were started, they hadn't received the state snapshot yet. Looking inside galera.py I see that the is_started() method just checks whether the server pid file has been created. This is not enough to make sure that a wsrep-enabled server is actually synchronized with the other nodes. The method should also check the value of the 'wsrep_ready' status variable; if it is 'ON', the node is synchronized with the group.
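
For illustration, a minimal sketch of such a check using MySQLdb (host/port/credentials are placeholders, and kewpie's own query helpers may differ):

    import MySQLdb

    def wsrep_is_ready(host, port, user='root', passwd=''):
        """ Return True if the node reports wsrep_ready = 'ON'
            (illustrative sketch only)
        """
        conn = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd)
        try:
            cursor = conn.cursor()
            cursor.execute("SHOW STATUS LIKE 'wsrep_ready'")
            row = cursor.fetchone()  # e.g. ('wsrep_ready', 'ON')
            return row is not None and row[1] == 'ON'
        finally:
            conn.close()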

There were also other kinds of test failures, related to query causality. Although galera ensures that all changes are received on all nodes before control is returned to the client, it does not guarantee by default that all changes are applied. For this reason there is the 'wsrep_causal_reads' session variable which, if set to '1', guarantees that all previously replicated changes are also applied before a query is actually executed. While this should be enough to guarantee strict consistency for autocommit DML, unfortunately it seems that even this is not enough for DDL (for reasons I'm not completely sure about yet), but with the following hack to kewpie I was able to get rid of the causality-related failures even with DDL.

The following patch enforces one causal read on each slave in check_slaves_by_query() and check_slaves_by_checksum() before running the actual check query.

=== modified file 'lib/util/mysqlBaseTestCase.py'
--- lib/util/mysqlBaseTestCase.py 2012-02-04 23:03:30 +0000
+++ lib/util/mysqlBaseTestCase.py 2012-02-07 10:48:40 +0000
@@ -87,6 +87,16 @@
             results.append(table_name)
         return results

+    def causal_read(self, server):
+        """ Execute causal read on server to make sure that all
+            changes from master have been propagated and applied
+            (galera specific)
+
+        """
+        queries = ["SET wsrep_causal_reads=1", "SELECT 0"]
+        self.execute_queries(queries, server)
+        return None
+
     def check_slaves_by_query( self
                              , master_server
                              , other_servers
@@ -111,6 +121,7 @@
             # run against master for 'good' value
             retcode, expected_result = self.execute_query(query, master_server)
         for server in other_servers:
+            self.causal_read(server)
             retcode, slave_result = self.execute_query(query, server)
             #print "%s: expected_result= %s | slave_result= %s" % ( server.name
             # , expected_result
@@ -149,6 +160,7 @@
         comp_results = {}
         logging = master_server.logging
         for server in other_servers:
+            self.causal_read(server)
             for schema in schemas:
                 for table in self.get_tables(master_server, schema):
                     query = "CHECKSUM TABLE %s.%s" %(schema, table)

Revision history for this message
Patrick Crews (patrick-crews) wrote : Re: [Bug 927996] Re: Many kewpie replication test cases are failing

Teemu,

Hi. I do need to add a routine to check the wsrep_ready variable.
For most of these tests, though, I have a 5-second delay from the
initialization of the node until I start issuing queries, which had been
sufficient on my test machine. However, I will make sure to update the
code to ensure we properly wait.

Thank you for the insight + the additional information on
wsrep_causal_reads. This definitely helps explain the issues I've been
seeing. Will update + rerun.


Revision history for this message
Teemu Ollakka (teemu-ollakka) wrote :

Further investigation revealed that there actually is a bug which causes wsrep_causal_reads to be ignored for SHOW CREATE TABLE and some other queries, so simply setting 'wsrep_causal_reads=1' didn't have the desired effect in tests which use SHOW CREATE TABLE results. The patch above worked around this, but on the other hand it might be too much, as it may mask some real bugs. A better approach would probably be to pass the parameter

    init_command = 'SET wsrep_causal_reads=1'

to the MySQLdb.connect() call.
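
For example, a hypothetical connection setup along these lines (host/port/credentials are placeholders):

    import MySQLdb

    # Hypothetical example: enable causal reads for every query on this
    # connection by setting the session variable at connect time.
    conn = MySQLdb.connect(host='127.0.0.1',
                           port=3306,
                           user='root',
                           passwd='',
                           init_command="SET wsrep_causal_reads=1")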

Revision history for this message
Patrick Crews (patrick-crews) wrote :

Incidentally, I am seeing node propagation die if I try to issue a "SHOW STATUS LIKE 'wsrep_ready'" query while the nodes are still doing their initial work. If I do a 3-5 second sleep before touching the servers, things are a-ok. However, any attempt to just start the servers and do anything before that initial work is done results in the slave nodes terminating as follows:

120207 17:07:07 [Note] WSREP: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
120207 17:07:07 [Note] WSREP: gcomm: joining thread
120207 17:07:07 [Note] WSREP: Quorum results:
 version = 2,
 component = NON-PRIMARY,
 conf_id = -1,
 members = 1/1 (joined/total),
 act_id = -1,
 last_appl. = -1,
 protocols = -1/-1/-1 (gcs/repl/appl),
 group UUID = 00000000-0000-0000-0000-000000000000
120207 17:07:07 [Note] WSREP: gcomm: closing backend
120207 17:07:07 [Note] WSREP: GMCast::handle_stable_view: view((empty))
120207 17:07:07 [Note] WSREP: gcomm: closed
120207 17:07:07 [Note] WSREP: Flow-control interval: [8, 16]
120207 17:07:07 [Note] WSREP: Received NON-PRIMARY.
120207 17:07:07 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 0)
120207 17:07:07 [Note] WSREP: Received self-leave message.
120207 17:07:07 [Note] WSREP: Flow-control interval: [0, 0]
120207 17:07:07 [Note] WSREP: Received SELF-LEAVE. Closing connection.
120207 17:07:07 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 0)
120207 17:07:07 [Note] WSREP: RECV thread exiting 0: Success
120207 17:07:07 [Note] WSREP: recv_thread() joined.
120207 17:07:07 [Note] WSREP: Closing slave action queue.
120207 17:07:09 [ERROR] WSREP: Requesting state transfer failed: -77(File descriptor in bad state)
120207 17:07:09 [ERROR] WSREP: State transfer request failed unrecoverably: 77 (File descriptor in bad state). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
120207 17:07:09 [Note] WSREP: Closing send monitor...

Revision history for this message
Patrick Crews (patrick-crews) wrote :

Test failures were due to starting test work before the slave servers were in a ready state. Added a polling routine to the test infrastructure to ensure this won't happen again. Currently all tests are passing with the 5.5 branch + the latest galera provider library, for both codership-mysql 5.5 and percona-xtradb-cluster 5.5.
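
A rough sketch of the kind of poll involved, assuming kewpie's execute_query() helper and tuple-of-rows results (the names and return format here are assumptions, not the actual implementation):

    import time

    def wait_for_wsrep_ready(self, server, timeout=60, poll_interval=1):
        """ Poll a server until wsrep_ready reports 'ON' or the timeout
            expires (galera specific; illustrative sketch only)
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            retcode, result = self.execute_query("SHOW STATUS LIKE 'wsrep_ready'", server)
            if retcode == 0 and result and result[0][1] == 'ON':
                return True
            time.sleep(poll_interval)
        return False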

Marking bug as invalid

Changed in codership-mysql:
status: New → Invalid
Changed in percona-xtradb-cluster:
status: New → Invalid
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1206
