RBR error on IST not zeroing grastate

Bug #1180791 reported by Jay Janssen on 2013-05-16
Affects: Galera (status tracked in 3.x) and Percona XtraDB Cluster (status tracked in 5.6)
Assigned to: Yan Zhang
Status: Fix Released in both

Bug Description

130516 10:02:30 [Note] WSREP: SST received: f9ae5241-be23-11e2-0800-9610321e6dbf:43045
130516 10:02:30 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4
130516 10:02:30 [Note] WSREP: Receiving IST: 24484 writesets, seqnos 43045-67529
130516 10:02:30 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.sbtest1; Can't find record in 'sbtest1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1193, Error_code: 1032
130516 10:02:30 [Warning] WSREP: RBR event 6 Delete_rows apply warning: 120, 43046
130516 10:02:30 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply app buffer: seqno: 43046, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118

I was able to get a node stuck in this state: it retried IST on every restart and hit this error each time. The grastate.dat was not being zeroed appropriately in this case.
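For context, grastate.dat records the node's last known cluster position; "zeroing" it forces a full SST on the next start instead of another doomed IST attempt. A rough sketch of the file (field layout varies by Galera version; the uuid and seqno here are taken from the log above), first as saved after a clean run and then zeroed:

```
# GALERA saved state        (normal, after a clean run)
version: 2.1
uuid:    f9ae5241-be23-11e2-0800-9610321e6dbf
seqno:   43045

# zeroed -- node will request a full SST on next start
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
```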

[root@perconadbt mysql]# rpm -qa | grep -i percona

Alex Yurchenko (ayurchen) wrote :

This seems to be a Galera bug: grastate invalidation code does not cover all code paths.

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → High
milestone: none → 24.2.5
status: New → Confirmed

Yes, it looks like the exception handler in ReplicatorSMM::recv_IST could call mark_unsafe() in addition to gu_abort(), or alternatively st_.mark_safe() could be called only after IST is fully complete.

Changed in percona-xtradb-cluster:
milestone: none → 5.5.31-25
Changed in percona-xtradb-cluster:
milestone: 5.5.31-25 → 5.5.31-24.8
Changed in percona-xtradb-cluster:
milestone: 5.5.31-23.7.5 → 5.5.31-25
Changed in galera:
milestone: 23.2.6 → 23.2.7
Changed in percona-xtradb-cluster:
milestone: 5.5.33-23.7.6 → future-5.5

Tested with:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 10:57:57 +0000
@@ -766,6 +766,7 @@
         log_fatal << "receiving IST failed, node restart required: "
                   << e.what();
+        st_.mark_corrupt();

and it zeroed the grastate correctly on IST error.

However, as the error states, there may be other exceptions that a node restart may fix (network issues, for instance).

So it is better to mark this closer to where it happens:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 11:46:20 +0000
@@ -752,7 +752,15 @@
                     // processed on donor, just adjust states here
-                    apply_trx(recv_ctx, trx);
+                    try
+                    {
+                        apply_trx(recv_ctx, trx);
+                    }
+                    catch (gu::Exception& e)
+                    {
+                        st_.mark_corrupt();
+                        throw;
+                    }

Yan Zhang (yan.zhang) wrote :


I don't understand the second patch. If `apply_trx` raises gu::Exception, the exception will be caught by the outer try-catch clause, which marks the state file corrupt immediately anyway (that's your first patch).

Our fix has been reverted in favor of the fix in 78, since it covers more cases.
