Point-in-time recovery failure: unable to apply binlogs because of certification failures

Bug #1727444 reported by Phil Stracchino on 2017-10-25
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Bug Description

We have a customer with a three-node cluster, currently running XtraDB Cluster 5.6.37. There is also an asynchronous dedicated backup slave running Percona Server 5.6.37, performing nightly backups using mydumper. GTID replication is in use to minimize the need for resyncs of the slave. The customer has approximately 2200 schemas.

Recently the customer needed an older version of a schema which we'll call 'example' restored as 'example_restore', over the top of an existing more recent restored copy of the same schema. As is our standard practice, we restored the schema from the backup slave using 'myloader -h <cluster node 3> -oed <dump directory> -s example', but accidentally omitted '-B example_restore'.

As a result of this error, we accidentally began restoring over 'example' instead of over 'example_restore'. We therefore needed to do a point-in-time recovery of 'example'. We did a full restore of the most recent backup, then moved to cluster node 3 and attempted to replay binlogs using mysqlbinlog:

mysqlbinlog node3-bin.* --database=example --start-datetime='start time of backup' --stop-datetime='last moment before restore error' --base64-output=AUTO | mysql

Time and again, we could see the correct transactions being replayed, but the tables were never updated. We could not understand why this was not working.

After several tries, nodes 1 and 2 went down uncleanly. At this point we realized that the following was happening:
1. The binlog replay transactions were failing certification (we have no idea why).
2. Because they failed certification, they were not being applied.
3. After sufficient certification failures, nodes 1 and 2 declared themselves inconsistent and aborted.

At this point the cluster was down to one node, and NOW, with no certification happening, we were able to successfully replay the binlogs and complete the point-in-time recovery. Once binlog replay was complete, we were able to bring the other two nodes back online (which required a full SST for both nodes, at almost 3 hours each).

There fairly clearly seems to be something wrong with replaying binary logs back into the cluster, possibly only when GTIDs are in use. At present this seems to mean that we must manually bring the cluster down to a single node before performing any point-in-time recovery.

Krunal Bauskar (krunal-bauskar) wrote :

* If I understood the problem correctly when you tried restoring on node-3, node-1 and node-2 didn't had any active workload so they should have replicated things directly from node-3 but they raised certification failure. This is bit weird and need more investigation.

* Can you share log files and configuration file.

* If possible can you try to reproduce this on smaller scale w/o production data and hand over the said steps.

I think I have found at least a part of the root cause, and it is operator error.

While trying to extract the needed data from the cluster logs, I spotted some errors further back in the logs that I had previously missed, which seemed to point towards inconsistent data across the cluster nodes. A further search of the shell history on the backup slave armed with that information allowed me to find that the previous night, a restore had been performed to a temporary schema on node 3, without enabling replication (the -e option to myloader was omitted). So that temporary schema now existed only on node 3.

This in itself didn't cause a problem at first, because nothing was writing to that temporary schema. The problem arose when we went to reload the *original* schema again, first dropping all of the data from that temporary schema and moving all of the new schema's tables to it. The temporary schema's tables could not be dropped on nodes 1 and 2 because they didn't exist there, and the tables from the original schema couldn't be moved to the temporary schema because it didn't exist either. This is where things started to go pear-shaped. Then we started trying to play back binary logs on node 3, not knowing that the other two nodes were inconsistent, and eventually nodes 1 and 2 just threw in the towel. Once that happened, the inconsistency was no longer a factor, the binlog replay worked, and the SSTs required to rejoin the other two nodes made everything consistent again.

We will try to do some tests of binlog replay with known consistent data on the cluster, but my guess is that without that data inconsistency, we're not going to be able to reproduce the problem. The root cause was the data inconsistency that we didn't know about, which came about because of the incorrectly-performed restore, and it didn't show up until we started trying to make cluster-wide writes to the inconsistent schema.

( ... er, excuse me ... I meant "first dropping all of the data from that temporary schema and moving all of the ORIGINAL schema's tables to it".)

Krunal Bauskar (krunal-bauskar) wrote :

Good to know Phil you traced the problem.

As you discovered cluster is sensitive to data-inconsistency. Any such inconsistency is handled in cluster through a shutdown.

I will close this ticket for now. Do let us know if you face any other issues in future.

BTW if you are trying PXC then why not evaluate new PXC-5.7.
It is feature rich and performance optimized with around 10x performance improvement (compared to 5.6).

Changed in percona-xtradb-cluster:
status: New → Invalid

Krunal, I know, unfortunately this client is unable to update to 5.7 at this time due to application compatibility issues.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-2013

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers