SST fails if donor has to send keyring file

Bug #1696273 reported by Marcelo Altmann
This bug affects 1 person
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned

Affects: 5.7
Status: Fix Committed
Importance: Undecided
Assigned to: Kenn Takara

Bug Description

SST fails if the donor has to send the keyring file. It looks like the donor tries to send the file while socat is still opening port 4444 on the joiner:

 20170606 09:00:15.294 WSREP_SST: [INFO] Streaming GTID file before SST
 20170606 09:00:18.368 WSREP_SST: [INFO] Streaming donor-keyring file before SST
2017/06/06 09:00:18 socat[15464] E connect(4, AF=2 10.131.17.240:4444, 16): Connection refused
 20170606 09:00:18.376 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
 20170606 09:00:18.377 WSREP_SST: [ERROR] Error while sending data to joiner node: exit codes: 0 1
 20170606 09:00:18.379 WSREP_SST: [ERROR] ******************************************************
 20170606 09:00:18.380 WSREP_SST: [ERROR] Cleanup after exit with status:32

The donor reports "Connection refused", but port 4444 is open on the joiner.

How to repeat:

1) Create a PXC cluster with 2 nodes using 5.7
2) Stop node1 and add the config below to my.cnf:
in [mysqld] section:
ssl-ca=/var/lib/mysql-files/ca.pem
ssl-cert=/var/lib/mysql-files/server-cert.pem
ssl-key=/var/lib/mysql-files/server-key.pem
early-plugin-load=keyring_file.so
keyring_file_data=/var/lib/mysql-keyring/keyring
in [sst] section:
streamfmt = xbstream
encrypt=4
ssl-ca=/var/lib/mysql-files/ca.pem
ssl-cert=/var/lib/mysql-files/server-cert.pem
ssl-key=/var/lib/mysql-files/server-key.pem
in [xtrabackup] section:
keyring-file-data=/var/lib/mysql-keyring/keyring

3) Move the *.pem files from /var/lib/mysql to /var/lib/mysql-files
4) scp /var/lib/mysql-files/* to node2
5) Start node1 and wait until it joins the cluster
6) Repeat step 2 on node2
7) Force an SST from node2; it fails with the error above

How to fix:
I was able to fix it by increasing the time the donor waits for the joiner to be ready to receive the file.
Edit /usr/bin/wsrep_sst_xtrabackup-v2 around line 1286:
++++++++++++++++++++++++++++++++++++++++
1285 # joiner need to wait to receive the file.
1286 sleep 3
1287
1288 cp $keyring $KEYRING_DIR/$XB_DONOR_KEYRING_FILE
1289
1290 wsrep_log_info "Streaming donor-keyring file before SST"
1291 keyringbackupopt=" --keyring-file-data=${KEYRING_DIR}/${XB_DONOR_KEYRING_FILE} --server-id=$keyringsid "
1292 FILE_TO_STREAM=$XB_DONOR_KEYRING_FILE
1293 send_data_from_donor_to_joiner "${KEYRING_DIR}" "${stagemsg}-keyring"
++++++++++++++++++++++++++++++++++++++++

Increase sleep 3 to sleep 10

Tested on 5.7.18-15-57 Percona XtraDB Cluster (GPL), Release rel15, Revision 7693d6e, WSREP version 29.20, wsrep_29.20

Tags: i192112
Changed in percona-xtradb-cluster:
status: New → Confirmed
Kenn Takara (kenn-takara) wrote :

This will happen any time we are sending data (not just the keyring). Whenever the joiner is slow to start listening on the server socket, the donor fails, because it only tries to connect once.

I am changing the code so that the donor retries the connection several times before returning an error.

Kenn Takara (kenn-takara) wrote :

Fix committed for the next release of PXC 5.6/5.7. We now retry 30 times (with the default 1-second intervals).
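
For anyone who needs a stopgap on an older release, the retry idea can be sketched in bash as below. This is only an illustration of the approach, not the code that was committed: retry_command is a hypothetical helper name, and the 30 attempts and 1-second interval in the usage comment are just the numbers mentioned above. It assumes the wrapped command re-opens its own input on every attempt.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of wsrep_sst_xtrabackup-v2):
# run a command until it succeeds or the attempts run out.
#
#   retry_command <max_attempts> <interval_seconds> <command> [args...]
#
# Returns 0 as soon as the command succeeds, 1 if every attempt failed.
retry_command() {
    local max_attempts=$1
    local interval=$2
    shift 2

    local attempt=1
    while true; do
        # Run the command exactly as given; stop on the first success.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # Give the joiner time to start listening before the next try.
        sleep "$interval"
    done
}

# Usage sketch (send_keyring_to_joiner is a placeholder for whatever
# function actually streams the file to the joiner):
#
#   retry_command 30 1 send_keyring_to_joiner
```

Unlike a longer fixed sleep, this only waits as long as the joiner actually needs, and it still returns an error if the joiner never starts listening.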

Changed in percona-xtradb-cluster:
status: Confirmed → Fix Committed
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-833
