xtrabackup-v2 SST is broken on MariaDB 10.1/Galera w/ binary logs enabled

Bug #1570560 reported by Andrew Garner
30
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Percona XtraBackup moved to https://jira.percona.com/projects/PXB
Status tracked in 2.4
2.3
Invalid
High
Unassigned
2.4
Invalid
High
Unassigned

Bug Description

With SST innobackupex is run with --galera-info and XtraBackup will flush the binary logs and copy the latest log under a FTWRL lock. This fails under MariaDB 10.1 (at least) as there seems to be a race condition where FLUSH BINARY LOGS starts a new log, but there is a period of time before a "binlog checkpoint" is written. On first start after prepare on the joiner node, MariaDB fails with an error along the lines of:

[Note] Recovering after a crash using /var/log/mysql/mysql-bin
[ERROR] Binlog file '/var/log/mysql/mysql-bin.XXXXXX' not found in binlog index, needed for recovery. Aborting.
[ERROR] Crash recovery failed. Either correct the problem (if it's, for example, out of memory error and restart, or delete (or rename) binary log and start mysqld with --tc-heuristic-recover={commit|rollback}
[ERROR] Can't init tc log
[ERROR] Aborting

There are further details in https://jira.mariadb.org/browse/MDEV-9423

This problem appears to be resolved by having xtrabackup perform an extra "FLUSH LOCAL ENGINE LOGS" prior to copying the binary log.

Currently xtrabackup 2.3 (and 2.4) does not enable FLUSH ENGINE LOGS on MariaDB due to simple check that fails on MariaDB:

          if (strcmp(version_var, "5.5") >= 0) {
                  have_flush_engine_logs = true;
          }

And further, even when enabled, FLUSH ENGINE LOGS is currently only run after copying the binlog for Galera, so this wouldn't help MariaDB.

A simple patch again the percona-xtrabackup 2.3 branches seems to fix this problem reliably in my environment, but I think this could be probably improved to avoid the double flush engine logs under mariadb.

$ git rev-parse HEAD
6d3a30519702ba9c98281b705fe83c0cbd301f92

$ git diff
diff --git a/storage/innobase/xtrabackup/src/backup_mysql.cc b/storage/innobase/xtrabackup/src/backup_mysql.cc
index 14d93ab..45350d4 100644
--- a/storage/innobase/xtrabackup/src/backup_mysql.cc
+++ b/storage/innobase/xtrabackup/src/backup_mysql.cc
@@ -1205,6 +1205,9 @@ write_current_binlog_file(MYSQL *connection)
                        goto cleanup;
                }

+ /* MariaDB requires FLUSH ENGINE LOGS or we may get an incomplete binlog */
+ xb_mysql_query(connection, "/*M!50500 FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS*/", false);
+
                ut_snprintf(filepath, sizeof(filepath), "%s/%s",
                                log_bin_dir, log_bin_file);
                result = copy_file(filepath, log_bin_file, 0);

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :
Revision history for this message
Andrew Garner (muzazzi) wrote :

With this PR, I see the FLUSH ENGINE LOGS being executed now, but xtrabackup-v2 based SST still fails under MariaDB 10.1.13 with a similar rate - failing about 99% of the time if I have sysbench write workload running in the background on some other node.

However, additionally moving the "FLUSH ENGINE LOGS" to occur *before* the galera_info bits get run gives me a 100% pass rate over 1000 SST attempts. That change looks like this (based off the gl-sergei/2.3-xb-bug1570560 branch):

diff --git a/storage/innobase/xtrabackup/src/backup_copy.cc b/storage/innobase/xtrabackup/src/backup_copy.cc
index 22f28ec..2014835 100644
--- a/storage/innobase/xtrabackup/src/backup_copy.cc
+++ b/storage/innobase/xtrabackup/src/backup_copy.cc
@@ -1258,6 +1258,12 @@ backup_start()
   }
  }

+ if (have_flush_engine_logs) {
+ msg_ts("Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...\n");
+ xb_mysql_query(mysql_connection,
+ "FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS", false);
+ }
+
  /* The only reason why Galera/binlog info is written before
  wait_for_ibbackup_log_copy_finish() is that after that call the xtrabackup
  binary will start streamig a temporary copy of REDO log to stdout and
@@ -1277,11 +1283,6 @@ backup_start()
   write_binlog_info(mysql_connection);
  }

- if (have_flush_engine_logs) {
- msg_ts("Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...\n");
- xb_mysql_query(mysql_connection,
- "FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS", false);
- }

  return(true);
 }
diff --git a/storage/innobase/xtrabackup/src/backup_mysql.cc b/storage/innobase/xtrabackup/src/backup_mysql.cc
index 26a0870..3d43551 100644
--- a/storage/innobase/xtrabackup/src/backup_mysql.cc
+++ b/storage/innobase/xtrabackup/src/backup_mysql.cc
@@ -374,7 +374,7 @@ get_mysql_vars(MYSQL *connection)
   have_galera_enabled = true;
  }

- if (strcmp(version_var, "5.5") >= 0) {
+ if (fnmatch("5.[123].*", version_var, FNM_PATHNAME) != 0) {
   have_flush_engine_logs = true;
  }

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

Any thoughts why version 2.2 works fine (based on comment in https://jira.mariadb.org/browse/MDEV-9423)?

Revision history for this message
Andrew Garner (muzazzi) wrote :

v2.2 is also broken - I tested this at the time, and under a heavy sysbench write workload this was failing similar to 2.3 / 2.4. This condition is racy, so SST will sometimes work - especially under a light write workload.

I suspect v2.2 might be somewhat more successful since this is handled by the perl frontend, which is slower so there may be more cases where the delay between FTWRL / FLUSH BINARY LOGS and the actual binlog copy can be higher giving more time for the natural background flush to complete even w/o FLUSH ENGINE LOGS. Just adding a sleep of 1s (or I suppose maybe up to $innodb_flush_log_at_timeout) before the binlog copy would probably also work. :)

I just added a comment to the mariadb issue noting my experience w/ v2.2 as well.

Revision history for this message
errogaht (7018407-o) wrote :

until 2.3.5 is released please help with this bug.
How i can manually fix it?
Cluster not correctly worked now. Maybe someone knows how can i fix it manually?
may be copy bin logs from another nodes? or something like that?

Revision history for this message
Andrew Garner (muzazzi) wrote :

@errogaht I managed to get MariaDB started in my (development-only) environment after a successful SST but failed MariaDB binlog recovery by copying the previous binlog, adding it to the binlog index and updating grastate.dat to match the logged uuid:seqno from the MySQL error log. This is a fairly sensitive operation and easy to get wrong - I would test it heavily before trying in production.

I will note other SST methods (well, rsync anyway) aren't affected here and handle the MariaDB GTID bits correctly. Of course if you're running into this problem with xtrabackup-v2, rsync SST locking is probably impractical for the environment. :(

Revision history for this message
Andrew Garner (abg) wrote :
Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

Hey there,

As I understand, the last binary log file is required for PXC 5.6 with GTID enabled. It is needed in order to obtain Executed_Gtid_Set from it. I am not familiar with MariaDB implementation of GTID, but it is different from Oracle MySQL. Is last binary log needed for MariaDB?

MySQL 5.7 introduced mysql.gtid_executed table and last binary log file may not be longer needed for PXC 5.7.

Revision history for this message
Andrew Garner (abg) wrote :

Yes, the last binary log needs to be copied for similar reasons as MySQL 5.6. That's why this feature was implemented some time ago here:

https://bugs.launchpad.net/percona-xtrabackup/+bug/1329539

https://github.com/percona/percona-xtrabackup/commit/60f9795cdbb7ae13a554f8cadcb48bcb2a5e3a9d

That initial implementation for MariaDB missed the "FLUSH ENGINE LOGS" ordering I have been attempting to fix here - that is a subtle race condition that I assume was simply overlooked at the time.

I am not aware of any present or planned MariaDB equivalent feature to the MySQL 5.7 mysql.gtid_executed table that might avoid needing to copy around these binlogs. Perhaps someone else will be able to comment in that regard.

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

Hi Andrew,

Looking at PR#201 I think it does the wrong thing. I don't see how binary log can be flushed by FLUSH ENGINE LOGS. FLUSH ENGINE LOGS tells storage engines to flush their logs. InnoDB, for instance, invokes log_buffer_flush_to_disk, which flushes redo log buffer from memory to disk and thats what xtrabackup is using it for. It doesn't write into binary logs. Possibly it adds some delay, so binlog background thread has time to write checkpoint event.

For the record, this bug seems to be addressed in MariaDB. Can you please give it a try?

revision-id: de67ceaf61cca413063edbd32b3fba7603ac7fd5 (mariadb-10.1.14-44-gde67cea)
parent(s): 64c115b835e4108c089fee7f0bb2f8e4022eae7b
author: Nirbhay Choubey
committer: Nirbhay Choubey
timestamp: 2016-06-27 21:57:45 -0400
message:

MDEV-9423: cannot add new node to the cluser: Binlog..

.. file '/var/log/mysql/mariadb-bin.000001' not found in binlog
index, needed for recovery. Aborting.

In Galera cluster, while preparing for SST, the donor node takes
an FTWRL followed by REFRESH_ENGINE_LOG & REFRESH_BINARY_LOG. The
latter rotates the binary log and logs Binlog_checkpoint_log_event
corresponding to the penultimate binary log file into the new file.
The checkpoint event for the current file is later logged asynchronously
by binlog_background_thread.

Now, since in rsync/xtrabackup based snapshot state transfer methods,
only the last binary log file is transferred to the joiner node; the
file could get transferred even before the checkpoint event for the
same file gets written to it. As a result, the joiner node would fail
to start complaining about the missing binlog file needed for recovery.

In order to fix this, a mechanism has been put in place to make
REFRESH_BINARY_LOG operation wait for Binlog_checkpoint_log_event
to be logged for the current binary log file if the node is part of
a Galera cluster. As further safety, the donor node now acquires and
owns LOCK_log for the duration of file transfer during SST.

Revision history for this message
Andrew Garner (abg) wrote :

Point taken - although FLUSH ENGINE LOGS does reliably solve this for MariaDB, this is a MariaDB bug, not an xtrabackup bug. It was convenient to hack around it with xtrabackup.

Looks like the MariaDB fix will need some work too to get us where we need to be.

Thank you for your time.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXB-474

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.