RBR error on IST not zeroing grastate

Bug #1180791 reported by Jay Janssen on 2013-05-16
This bug affects 1 person
Affects                  Status                  Importance   Assigned to   Milestone
Galera                   Status tracked in 3.x
  2.x                                            High         Yan Zhang
  3.x                                            High         Yan Zhang
Percona XtraDB Cluster   Status tracked in 5.6
  5.5                                            Undecided    Unassigned
  5.6                                            Undecided    Unassigned

Bug Description

130516 10:02:30 [Note] WSREP: SST received: f9ae5241-be23-11e2-0800-9610321e6dbf:43045
130516 10:02:30 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4.r3843
130516 10:02:30 [Note] WSREP: Receiving IST: 24484 writesets, seqnos 43045-67529
130516 10:02:30 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.sbtest1; Can't find record in 'sbtest1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1193, Error_code: 1032
130516 10:02:30 [Warning] WSREP: RBR event 6 Delete_rows apply warning: 120, 43046
130516 10:02:30 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply app buffer: seqno: 43046, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118

I was able to get a node stuck in this state where it continued to retry IST on every restart and got this error. The grastate.dat was not getting zeroed appropriately in this case.
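For context, grastate.dat on a healthy Galera 2.x node looks roughly like the following (layout assumed from the standard saved-state format, uuid taken from the log above); "zeroing" it means resetting the uuid to all zeroes and the seqno to -1 so that the node requests a full SST instead of retrying IST on the next start:

# GALERA saved state
version: 2.1
uuid:    f9ae5241-be23-11e2-0800-9610321e6dbf
seqno:   67529
cert_index: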

[root@perconadbt mysql]# rpm -qa | grep -i percona
percona-release-0.0-1.x86_64
Percona-XtraDB-Cluster-server-5.5.30-23.7.4.406.rhel6.x86_64
Percona-XtraDB-Cluster-client-5.5.30-23.7.4.406.rhel6.x86_64
percona-xtrabackup-2.0.7-552.rhel6.x86_64
Percona-XtraDB-Cluster-galera-2.5-1.150.rhel6.x86_64
Percona-XtraDB-Cluster-shared-5.5.30-23.7.4.406.rhel6.x86_64

Alex Yurchenko (ayurchen) wrote:

This seems to be a Galera bug: grastate invalidation code does not cover all code paths.

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → High
milestone: none → 24.2.5
status: New → Confirmed

Yes, looks like the Exception handler in ReplicatorSMM::recv_IST could call mark_unsafe in addition to gu_abort, or, alternatively, st_.mark_safe could be called only after IST is fully complete (a rough sketch of the two options follows below).
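A minimal standalone sketch of those two options, using stand-in types rather than the real Galera classes (SavedState, apply_trx and the exit call here are simplified placeholders, not the actual API):

// Illustrative sketch only -- stand-ins for the grastate writer and the IST
// apply loop, showing the two places where invalidation could happen.
#include <cstdlib>
#include <iostream>
#include <stdexcept>

struct SavedState                       // stand-in for the grastate.dat writer
{
    void mark_unsafe() { std::cout << "grastate: seqno -> -1\n"; }
    void mark_safe()   { std::cout << "grastate: seqno persisted\n"; }
};

static SavedState st_;

static void apply_trx(long seqno)       // stand-in for applying one writeset
{
    if (seqno == 43046) throw std::runtime_error("HA_ERR_KEY_NOT_FOUND");
}

static void recv_ist(long first, long last)
{
    st_.mark_unsafe();                  // option 2: stay unsafe for the whole IST
    try
    {
        for (long s = first; s <= last; ++s) apply_trx(s);
        st_.mark_safe();                // option 2: mark safe only when IST completed
    }
    catch (const std::exception& e)
    {
        std::cerr << "receiving IST failed, node restart required: "
                  << e.what() << "\n";
        st_.mark_unsafe();              // option 1: invalidate in the handler
        std::exit(EXIT_FAILURE);        // stand-in for gu_abort()
    }
}

int main()
{
    recv_ist(43045, 67529);             // fails at 43046, leaves grastate invalid
}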

Changed in percona-xtradb-cluster:
milestone: none → 5.5.31-25
Changed in percona-xtradb-cluster:
milestone: 5.5.31-25 → 5.5.31-24.8
Changed in percona-xtradb-cluster:
milestone: 5.5.31-23.7.5 → 5.5.31-25
Changed in galera:
milestone: 23.2.6 → 23.2.7
Changed in percona-xtradb-cluster:
milestone: 5.5.33-23.7.6 → future-5.5

Tested with:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 10:57:57 +0000
@@ -766,6 +766,7 @@
     {
         log_fatal << "receiving IST failed, node restart required: "
                   << e.what();
+        st_.mark_corrupt();
         gcs_.close();
         gu_abort();
     }

and it zeroed the grastate correctly on IST error.

However, as the error message states, there may be other exceptions that a node restart can fix - network issues, for instance.

So, it is better to mark the state corrupt closer to where the failure happens:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 11:46:20 +0000
@@ -752,7 +752,15 @@
                     // processed on donor, just adjust states here
                     trx->set_state(TrxHandle::S_REPLICATING);
                     trx->set_state(TrxHandle::S_CERTIFYING);
-                    apply_trx(recv_ctx, trx);
+                    try
+                    {
+                        apply_trx(recv_ctx, trx);
+                    }
+                    catch (gu::Exception& e)
+                    {
+                        st_.mark_corrupt();
+                        throw;
+                    }
                 }
             }
             else

Yan Zhang (yan.zhang) wrote:

@raghu

I don't understand the second patch. If ```apply_trx``` raises gu::Exception, the exception will be caught by the outer try-catch clause, which marks the state file corrupt immediately (that's your first patch).
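As a side note, here is a tiny standalone example (nothing in it is Galera code) of the control flow under discussion: the inner catch rethrows to the outer handler either way, but only apply failures pass through the inner block, so marking corruption there is narrower than marking it in the outer handler, which also sees failures a plain restart could fix:

#include <iostream>
#include <stdexcept>

// Toy stand-ins: mark_corrupt() represents invalidating grastate.dat, and the
// two failure functions represent an apply error vs. any other error.
static void mark_corrupt()    { std::cout << "grastate invalidated\n"; }
static void apply_failure()   { throw std::runtime_error("apply_trx failed"); }
static void network_failure() { throw std::runtime_error("connection lost"); }

static void recv_ist(bool fail_in_apply)
{
    try
    {
        if (!fail_in_apply) network_failure();  // error outside the apply path

        try
        {
            apply_failure();                    // stands in for apply_trx()
        }
        catch (const std::exception&)
        {
            mark_corrupt();                     // second patch: apply errors only
            throw;                              // rethrown to the outer handler
        }
    }
    catch (const std::exception& e)
    {
        // outer handler (where the first patch added mark_corrupt()):
        // reached for every failure, including restart-fixable ones
        std::cerr << "receiving IST failed: " << e.what() << "\n";
    }
}

int main()
{
    recv_ist(true);   // apply failure   -> invalidated, then outer message
    recv_ist(false);  // network failure -> outer message only
}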

Our fix has been reverted in favor of the fix in 78, since it covers more code paths.
