Point-in-time recovery failure: unable to apply binlogs because of certification failures
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Invalid | Undecided | Unassigned |
Bug Description
We have a customer with a three-node cluster, currently running XtraDB Cluster 5.6.37. There is also an asynchronous dedicated backup slave running Percona Server 5.6.37, performing nightly backups using mydumper. GTID replication is in use to minimize the need for resyncs of the slave. The customer has approximately 2200 schemas.
Recently the customer needed an older version of a schema which we'll call 'example' restored as 'example_restore', over the top of an existing more recent restored copy of the same schema. As is our standard practice, we restored the schema from the backup slave using 'myloader -h <cluster node 3> -oed <dump directory> -s example', but accidentally omitted '-B example_restore'.
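The intended invocation, with the '-B' flag that was omitted, would look roughly like this. The host name and dump directory are placeholders, not the customer's actual values, and the sketch only prints the command rather than running it:

```shell
# Hypothetical sketch of the intended myloader invocation: '-s' selects the
# source schema inside the dump, while '-B' redirects the restore into a
# differently named target schema. Host and directory are placeholders.
SOURCE_SCHEMA="example"
TARGET_SCHEMA="example_restore"
CMD="myloader -h node3.cluster.local -o -e -d /backups/latest -s ${SOURCE_SCHEMA} -B ${TARGET_SCHEMA}"
echo "${CMD}"
```

Without '-B', myloader restores into the schema named by '-s', which is exactly the mistake described above.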
As a result of this error, we accidentally began restoring over 'example' instead of over 'example_restore'. We therefore needed to do a point-in-time recovery of 'example'. We did a full restore of the most recent backup, then moved to cluster node 3 and attempted to replay binlogs using mysqlbinlog:
mysqlbinlog node3-bin.* --database=example --start-
Time and again, we could see the correct transactions being replayed, but the tables were never updated. We could not understand why this was not working.
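For reference, a complete replay invocation of the kind attempted above typically looks like the following. The datetime bounds and binlog names are placeholders, not the values actually used (the original command is truncated above), and the sketch only prints the pipeline rather than executing it:

```shell
# Hedged sketch of a typical point-in-time replay pipeline. The start bound
# is the time of the last good backup and the stop bound is the moment just
# before the accidental restore; both are placeholders here.
START="2017-09-01 00:00:00"
STOP="2017-09-01 12:00:00"
REPLAY_CMD="mysqlbinlog node3-bin.* --database=example --start-datetime='${START}' --stop-datetime='${STOP}'"
# The decoded statements are then piped into the local server:
echo "${REPLAY_CMD} | mysql"
```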
After several tries, nodes 1 and 2 went down uncleanly. At this point we realized that the following was happening:
1. The binlog replay transactions were failing certification (we have no idea why).
2. Because they failed certification, they were not being applied.
3. After sufficient certification failures, nodes 1 and 2 declared themselves inconsistent and aborted.
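Step 1 above could be confirmed while a replay is running by watching Galera's certification-failure counter on the node receiving the binlogs. A sketch, with the host name as a placeholder (it only prints the query rather than running it):

```shell
# Hedged sketch: Galera exposes a per-node counter of local certification
# failures, which should climb while the replayed transactions are being
# rejected. Host name is a placeholder.
CHECK="mysql -h node3 -e \"SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures'\""
echo "${CHECK}"
```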
At this point the cluster was down to one node, and NOW, with no certification happening, we were able to successfully replay the binlogs and complete the point-in-time recovery. Once binlog replay was complete, we were able to bring the other two nodes back online (which required a full SST for both nodes, at almost 3 hours each).
There fairly clearly seems to be something wrong with replaying binary logs back into the cluster, possibly only when GTIDs are in use. At present this seems to mean that we must manually bring the cluster down to a single node before performing any point-in-time recovery.
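The workaround the report arrives at can be sketched as follows: stop the other two nodes so the surviving node applies the binlogs without certification, then bring them back (accepting the SST cost noted above). Hostnames are placeholders, and the script only prints the steps rather than executing them:

```shell
# Hedged sketch of the single-node workaround described above. The stopped
# nodes will rejoin afterwards, likely via a full SST as the report notes.
SURVIVOR="node3"
OTHERS="node1 node2"
for node in ${OTHERS}; do
  # Shut each other node down cleanly before starting the binlog replay.
  echo "ssh ${node} 'service mysql stop'"
done
# Confirm the survivor is alone before replaying:
echo "mysql -h ${SURVIVOR} -e \"SHOW STATUS LIKE 'wsrep_cluster_size'\""
```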
* If I understood the problem correctly: when you tried restoring on node-3, node-1 and node-2 didn't have any active workload, so they should have replicated the changes directly from node-3, yet they raised certification failures. This is a bit odd and needs more investigation.
* Can you share the log files and configuration files?
* If possible, can you try to reproduce this on a smaller scale, without production data, and share the steps?