Percona XtraDB Cluster - HA scalable solution for MySQL

RBR error on IST not zeroing grastate

Reported by Jay Janssen on 2013-05-16
Affects                   Status                    Importance  Assigned to      Milestone
Galera                                              High        Alex Yurchenko
Percona XtraDB Cluster    Status tracked in Trunk
  5.6                                               Undecided   Unassigned
  Trunk                                             Undecided   Unassigned

Bug Description

130516 10:02:30 [Note] WSREP: SST received: f9ae5241-be23-11e2-0800-9610321e6dbf:43045
130516 10:02:30 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4.r3843
130516 10:02:30 [Note] WSREP: Receiving IST: 24484 writesets, seqnos 43045-67529
130516 10:02:30 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.sbtest1; Can't find record in 'sbtest1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1193, Error_code: 1032
130516 10:02:30 [Warning] WSREP: RBR event 6 Delete_rows apply warning: 120, 43046
130516 10:02:30 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply app buffer: seqno: 43046, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118

I was able to get a node stuck in this state, where it kept retrying IST on every restart and hitting this same error. The grastate.dat was not being zeroed appropriately in this case.
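For reference, the joiner's saved state in grastate.dat is what determines whether it asks for IST or a full SST. On this node the file kept a stale state roughly like the sample below (UUID and seqno taken from the "SST received" line in the log above), instead of being invalidated, that is, rewritten with an all-zero UUID and seqno -1, which is what forces a full SST on the next start. The exact layout may differ slightly between Galera versions; this is only a sketch:

# GALERA saved state
version: 2.1
uuid:    f9ae5241-be23-11e2-0800-9610321e6dbf
seqno:   43045
cert_index: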

[root@perconadbt mysql]# rpm -qa | grep -i percona
percona-release-0.0-1.x86_64
Percona-XtraDB-Cluster-server-5.5.30-23.7.4.406.rhel6.x86_64
Percona-XtraDB-Cluster-client-5.5.30-23.7.4.406.rhel6.x86_64
percona-xtrabackup-2.0.7-552.rhel6.x86_64
Percona-XtraDB-Cluster-galera-2.5-1.150.rhel6.x86_64
Percona-XtraDB-Cluster-shared-5.5.30-23.7.4.406.rhel6.x86_64

Alex Yurchenko (ayurchen) wrote:

This seems to be a Galera bug: grastate invalidation code does not cover all code paths.

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → High
milestone: none → 24.2.5
status: New → Confirmed

Yes, looks like the exception handler in ReplicatorSMM::recv_IST can call mark_unsafe() in addition to gu_abort(), or st_.mark_safe() can be called only after IST is fully complete.
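A minimal standalone sketch of that pattern follows. The SavedState type, apply_writeset() and the seqno loop are illustrative stand-ins, not Galera's actual classes: the saved state stays unsafe for the whole transfer, is invalidated if an apply throws, and is marked safe only once the full IST range has been applied.

// Illustrative stand-ins only; not Galera's real SavedState/IST code.
#include <cstdio>
#include <stdexcept>

struct SavedState
{
    void mark_unsafe()         { std::puts("grastate: seqno = -1 (unsafe)"); }
    void mark_corrupt()        { std::puts("grastate: zeroed, full SST required"); }
    void mark_safe(long seqno) { std::printf("grastate: seqno = %ld (safe)\n", seqno); }
};

// Simulated write-set application; throws like the RBR failure in the log.
static void apply_writeset(long seqno)
{
    if (seqno == 43046)
        throw std::runtime_error("HA_ERR_KEY_NOT_FOUND applying seqno 43046");
}

int main()
{
    SavedState st;
    st.mark_unsafe();                 // stay "unsafe" for the whole IST
    long last = 43045;
    try
    {
        for (long seqno = 43046; seqno <= 67529; ++seqno)
        {
            apply_writeset(seqno);    // may throw on an apply error
            last = seqno;
        }
    }
    catch (const std::exception& e)
    {
        st.mark_corrupt();            // invalidate the saved state *before* aborting,
        std::printf("receiving IST failed, node restart required: %s\n", e.what());
        return 1;                     // so the next start falls back to a full SST
    }
    st.mark_safe(last);               // only reached when IST completed fully
    return 0;
}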

Changed in percona-xtradb-cluster:
milestone: none → 5.5.31-25
Changed in percona-xtradb-cluster:
milestone: 5.5.31-25 → 5.5.31-24.8
Changed in percona-xtradb-cluster:
milestone: 5.5.31-23.7.5 → 5.5.31-25
Changed in galera:
milestone: 23.2.6 → 23.2.7
Changed in percona-xtradb-cluster:
milestone: 5.5.33-23.7.6 → future-5.5

Tested with:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 10:57:57 +0000
@@ -766,6 +766,7 @@
     {
         log_fatal << "receiving IST failed, node restart required: "
                   << e.what();
+        st_.mark_corrupt();
         gcs_.close();
         gu_abort();
     }

and it zeroed the grastate correctly on IST error.

However, as the error message says, there may be other exceptions that a node restart could fix, network issues for instance.

So it is better to mark the state corrupt closer to where the failure actually happens:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 11:46:20 +0000
@@ -752,7 +752,15 @@
                     // processed on donor, just adjust states here
                     trx->set_state(TrxHandle::S_REPLICATING);
                     trx->set_state(TrxHandle::S_CERTIFYING);
-                    apply_trx(recv_ctx, trx);
+                    try
+                    {
+                        apply_trx(recv_ctx, trx);
+                    }
+                    catch (gu::Exception& e)
+                    {
+                        st_.mark_corrupt();
+                        throw;
+                    }
                 }
             }
             else
