RBR error on IST not zeroing grastate

Bug #1180791 reported by Jay Janssen on 2013-05-16
This bug affects 1 person
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status tracked in 3.x: Fix Released - assigned to Yan Zhang
Status tracked in 5.6: Fix Released - assigned to Yan Zhang

Bug Description

130516 10:02:30 [Note] WSREP: SST received: f9ae5241-be23-11e2-0800-9610321e6dbf:43045
130516 10:02:30 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4
130516 10:02:30 [Note] WSREP: Receiving IST: 24484 writesets, seqnos 43045-67529
130516 10:02:30 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.sbtest1; Can't find record in 'sbtest1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1193, Error_code: 1032
130516 10:02:30 [Warning] WSREP: RBR event 6 Delete_rows apply warning: 120, 43046
130516 10:02:30 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply app buffer: seqno: 43046, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118

I was able to get a node stuck in this state where it continued to retry IST on every restart and got this error. The grastate.dat was not getting zeroed appropriately in this case.
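For context, grastate.dat is a small text file recording the node's last known cluster position. A sketch of what a valid file looks like (uuid and seqno taken from the log above; the exact version field may differ by Galera release), versus the invalidated form that makes the node request a full SST on the next start instead of retrying IST:

```
# GALERA saved state (valid - node may attempt IST from this position)
version: 2.1
uuid:    f9ae5241-be23-11e2-0800-9610321e6dbf
seqno:   43045

# GALERA saved state (zeroed/invalidated - node must request full SST)
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
```

The bug is that on this IST failure the file kept its old valid position, so every restart retried the same doomed IST.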

[root@perconadbt mysql]# rpm -qa | grep -i percona

Alex Yurchenko (ayurchen) wrote :

This seems to be a Galera bug: grastate invalidation code does not cover all code paths.

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → High
milestone: none → 24.2.5
status: New → Confirmed

Yes, it looks like the exception handler in ReplicatorSMM::recv_IST could call mark_unsafe() in addition to gu_abort(), or alternatively st_.mark_safe() could be called only after IST is fully complete.

Changed in percona-xtradb-cluster:
milestone: none → 5.5.31-25
Changed in percona-xtradb-cluster:
milestone: 5.5.31-25 → 5.5.31-24.8
Changed in percona-xtradb-cluster:
milestone: 5.5.31-23.7.5 → 5.5.31-25
Changed in galera:
milestone: 23.2.6 → 23.2.7
Changed in percona-xtradb-cluster:
milestone: 5.5.33-23.7.6 → future-5.5

Tested with:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 10:57:57 +0000
@@ -766,6 +766,7 @@
         log_fatal << "receiving IST failed, node restart required: "
                   << e.what();
+        st_.mark_corrupt();

and it zeroed the grastate correctly on IST error.

However, as the error message suggests, there may be other exceptions that a node restart alone can fix - network issues, for instance.

So it is better to mark the state corrupt closer to where the failure actually happens:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 11:46:20 +0000
@@ -752,7 +752,15 @@
                     // processed on donor, just adjust states here
-                    apply_trx(recv_ctx, trx);
+                    try
+                    {
+                        apply_trx(recv_ctx, trx);
+                    }
+                    catch (gu::Exception& e)
+                    {
+                        st_.mark_corrupt();
+                        throw;
+                    }

Yan Zhang (yan.zhang) wrote :


I don't understand the second patch. If `apply_trx` raises gu::Exception, the exception will be caught by the outer try-catch clause, which marks the state file corrupt immediately (that's your first patch).

Our fix has been reverted in favor of the fix in 78, since it covers more code paths.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1348
