RBR error on IST not zeroing grastate

Bug #1180791 reported by Jay Janssen
This bug affects 1 person
Affects                 Status         Importance  Assigned to  Milestone
Galera (status tracked in 3.x)
  2.x                   Fix Committed  High        Yan Zhang
  3.x                   Fix Released   High        Yan Zhang
Percona XtraDB Cluster, moved to https://jira.percona.com/projects/PXC (status tracked in 5.6)
  5.5                   Fix Released   Undecided   Unassigned
  5.6                   Fix Released   Undecided   Unassigned

Bug Description

130516 10:02:30 [Note] WSREP: SST received: f9ae5241-be23-11e2-0800-9610321e6dbf:43045
130516 10:02:30 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4.r3843
130516 10:02:30 [Note] WSREP: Receiving IST: 24484 writesets, seqnos 43045-67529
130516 10:02:30 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.sbtest1; Can't find record in 'sbtest1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1193, Error_code: 1032
130516 10:02:30 [Warning] WSREP: RBR event 6 Delete_rows apply warning: 120, 43046
130516 10:02:30 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply app buffer: seqno: 43046, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118

I was able to get a node stuck in this state where it continued to retry IST on every restart and got this error. The grastate.dat was not getting zeroed appropriately in this case.
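For readers who want to check a node for this condition, here is a minimal C++ sketch that decides whether the text of a grastate.dat file has been zeroed. It assumes the file is made of `key: value` lines and that "zeroed" means an all-zero uuid together with seqno -1; both are assumptions about the file layout rather than something stated in this report, and the names `ToyGrastate`, `parse_grastate` and `is_zeroed` are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical in-memory view of the two grastate.dat fields we care about.
struct ToyGrastate {
    std::string uuid;
    long long   seqno = 0;
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse "key: value" lines; comment lines start with '#'.
// Assumption: this mirrors the grastate.dat layout, it is not Galera's parser.
ToyGrastate parse_grastate(const std::string& text) {
    ToyGrastate st;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        const std::string key   = trim(line.substr(0, colon));
        const std::string value = trim(line.substr(colon + 1));
        if (key == "uuid") {
            st.uuid = value;
        } else if (key == "seqno" && !value.empty()) {
            st.seqno = std::stoll(value);
        }
    }
    return st;
}

// Assumption: a zeroed state is an all-zero uuid with seqno -1.
bool is_zeroed(const ToyGrastate& st) {
    return st.uuid == "00000000-0000-0000-0000-000000000000" && st.seqno == -1;
}
```

A node in the stuck state described above would be expected to fail this check after the IST error, since it keeps a real uuid and therefore requests IST again on the next start.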

[root@perconadbt mysql]# rpm -qa | grep -i percona
percona-release-0.0-1.x86_64
Percona-XtraDB-Cluster-server-5.5.30-23.7.4.406.rhel6.x86_64
Percona-XtraDB-Cluster-client-5.5.30-23.7.4.406.rhel6.x86_64
percona-xtrabackup-2.0.7-552.rhel6.x86_64
Percona-XtraDB-Cluster-galera-2.5-1.150.rhel6.x86_64
Percona-XtraDB-Cluster-shared-5.5.30-23.7.4.406.rhel6.x86_64

Alex Yurchenko (ayurchen) wrote :

This seems to be a Galera bug: grastate invalidation code does not cover all code paths.

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → High
milestone: none → 24.2.5
status: New → Confirmed
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Yes, it looks like the exception handler in ReplicatorSMM::recv_IST could call mark_unsafe in addition to gu_abort, or alternatively st_.mark_safe could be called only after IST is fully complete.
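The second option can be sketched as follows. This is a toy model, not Galera code: `ToySavedState`, `toy_receive_ist` and `toy_state_safe_after_ist` are hypothetical names, and the only idea carried over from the comment is that the state is flagged unsafe before the first writeset and flagged safe only after the last one applies.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the saved-state object; only the safe flag is modelled.
class ToySavedState {
public:
    void mark_unsafe() { safe_ = false; }
    void mark_safe()   { safe_ = true; }
    bool safe() const  { return safe_; }
private:
    bool safe_ = true;
};

// Apply every writeset in order. The state is marked unsafe up front and
// marked safe only if the whole range applied without throwing.
void toy_receive_ist(ToySavedState& state,
                     const std::vector<long long>& seqnos,
                     const std::function<void(long long)>& apply) {
    state.mark_unsafe();
    for (long long seqno : seqnos) {
        apply(seqno);  // may throw; mark_safe() below is then skipped
    }
    state.mark_safe();
}

// Run a three-writeset IST in which applying failing_seqno throws
// (pass -1 for a clean run) and report whether the state ended up safe.
bool toy_state_safe_after_ist(long long failing_seqno) {
    ToySavedState state;
    try {
        toy_receive_ist(state, {43045, 43046, 43047},
                        [failing_seqno](long long seqno) {
                            if (seqno == failing_seqno) {
                                throw std::runtime_error("Failed to apply app buffer");
                            }
                        });
    } catch (const std::runtime_error&) {
        // A real node would log the failure and abort here.
    }
    return state.safe();
}
```

With this ordering, a failure partway through leaves the state unsafe without any explicit handling on the error path.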

Changed in percona-xtradb-cluster:
milestone: none → 5.5.31-25
Changed in percona-xtradb-cluster:
milestone: 5.5.31-25 → 5.5.31-24.8
Changed in percona-xtradb-cluster:
milestone: 5.5.31-23.7.5 → 5.5.31-25
Changed in galera:
milestone: 23.2.6 → 23.2.7
Changed in percona-xtradb-cluster:
milestone: 5.5.33-23.7.6 → future-5.5
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Tested with:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 10:57:57 +0000
@@ -766,6 +766,7 @@
     {
         log_fatal << "receiving IST failed, node restart required: "
                   << e.what();
+        st_.mark_corrupt();
         gcs_.close();
         gu_abort();
     }

and it zeroed the grastate correctly on IST error.

However, as the error message says, there may be other exceptions that a node restart would fix, network issues for instance.

So it is better to mark the state corrupt closer to where the failure happens:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 11:46:20 +0000
@@ -752,7 +752,15 @@
                     // processed on donor, just adjust states here
                     trx->set_state(TrxHandle::S_REPLICATING);
                     trx->set_state(TrxHandle::S_CERTIFYING);
-                    apply_trx(recv_ctx, trx);
+                    try
+                    {
+                        apply_trx(recv_ctx, trx);
+                    }
+                    catch (gu::Exception& e)
+                    {
+                        st_.mark_corrupt();
+                        throw;
+                    }
                 }
             }
             else

Yan Zhang (yan.zhang) wrote :

@raghu

I don't understand the second patch. If ```apply_trx``` raises gu::Exception, the exception will be caught by the outer try-catch clause, which marks the state file corrupt (that's your first patch) immediately.
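The control flow in question can be reproduced with a small self-contained C++ sketch. It assumes both patches are applied together, as this comment reads them; `ToyState`, `toy_apply` and `toy_receive` are hypothetical stand-ins and std::exception takes the place of gu::Exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the state file; it only counts mark_corrupt() calls.
struct ToyState {
    int corrupt_marks = 0;
    void mark_corrupt() { ++corrupt_marks; }
};

// Stand-in for apply_trx(): always fails, like the Delete_rows event in the report.
void toy_apply() {
    throw std::runtime_error("Failed to apply app buffer");
}

// Outer handler models the first patch. The inner handler, enabled by
// inner_guard, models the second patch. Returns how many times the state
// was marked corrupt.
int toy_receive(bool inner_guard) {
    ToyState st;
    try {
        if (inner_guard) {
            try {
                toy_apply();
            } catch (const std::exception&) {
                st.mark_corrupt();  // second patch
                throw;              // rethrow to the outer handler
            }
        } else {
            toy_apply();
        }
    } catch (const std::exception&) {
        st.mark_corrupt();          // first patch; the real code then aborts
    }
    return st.corrupt_marks;
}
```

In this model the outer handler marks the state in both cases, so the inner handler only adds a second, redundant mark. It would make a difference only if the outer handler did not mark the state itself.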

Yan Zhang (yan.zhang) wrote :
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Our fix has been reverted in favor of the fix in 78, since that one covers more ground.

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1348
