SST fails if donor has to send keyring file

Bug #1696273 reported by Marcelo Altmann on 2017-06-06
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to
Fix Committed
Fix Committed
Kenn Takara

Bug Description

SST fails if the donor has to send the keyring file. It looks like the donor tries to send the file while socat on the joiner is still opening port 4444:

 20170606 09:00:15.294 WSREP_SST: [INFO] Streaming GTID file before SST
 20170606 09:00:18.368 WSREP_SST: [INFO] Streaming donor-keyring file before SST
2017/06/06 09:00:18 socat[15464] E connect(4, AF=2, 16): Connection refused
 20170606 09:00:18.376 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
 20170606 09:00:18.377 WSREP_SST: [ERROR] Error while sending data to joiner node: exit codes: 0 1
 20170606 09:00:18.379 WSREP_SST: [ERROR] ******************************************************
 20170606 09:00:18.380 WSREP_SST: [ERROR] Cleanup after exit with status:32

The donor reports "Connection refused", but port 4444 is open on the joiner.

How to repeat:

1) Create a PXC cluster with 2 nodes running 5.7
2) Stop node1 and add the config below to my.cnf:
in [mysqld] section:
in [sst] section:
streamfmt = xbstream
in [xtrabackup] section:

3) Move the *.pem files from /var/lib/mysql to /var/lib/mysql-files
4) Copy /var/lib/mysql-files/* to node2 with scp
5) Start node1 and wait until it joins the cluster
6) Repeat step 2 on node2
7) Force an SST on node2; it fails with the error above

How to fix:
I was able to fix it by increasing the time the donor waits for the joiner to be ready to receive the file.
Edit /usr/bin/wsrep_sst_xtrabackup-v2 around line 1286:
1285 # joiner need to wait to receive the file.
1286 sleep 3
1290 wsrep_log_info "Streaming donor-keyring file before SST"
1291 keyringbackupopt=" --keyring-file-data=${KEYRING_DIR}/${XB_DONOR_KEYRING_FILE} --server-id=$keyringsid "
1293 send_data_from_donor_to_joiner "${KEYRING_DIR}" "${stagemsg}-keyring"

Increase sleep 3 to sleep 10.
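
The edit above could be applied with sed, as a rough sketch. The function name bump_keyring_sleep is hypothetical, and the sketch assumes the script still contains the comment line quoted above followed by a line ending in "sleep 3". It is a stopgap on a package-managed file, so a backup copy (.bak) is kept next to the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lengthen the donor's wait before it streams the keyring.
# Assumes the target script contains the comment
#   "joiner need to wait to receive the file."
# followed shortly by a line ending in "sleep 3".
bump_keyring_sleep() {
    local script=$1
    # Only touch the first "sleep 3" at or after the comment line;
    # any other "sleep 3" earlier in the script is left alone.
    # -i.bak edits in place and keeps the original as "$script.bak".
    sed -i.bak \
        '/joiner need to wait to receive the file/,/sleep 3/ s/sleep 3$/sleep 10/' \
        "$script"
}
```

It would be called as bump_keyring_sleep /usr/bin/wsrep_sst_xtrabackup-v2, after which a diff against the .bak file shows whether the pattern matched.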

Tested on 5.7.18-15-57 Percona XtraDB Cluster (GPL), Release rel15, Revision 7693d6e, WSREP version 29.20, wsrep_29.20

tags: added: i192112
description: updated
Changed in percona-xtradb-cluster:
status: New → Confirmed
Kenn Takara (kenn-takara) wrote :

This can happen any time we are sending data (not just the keyring). Whenever the joiner is slow to start listening on the server socket, the donor fails, because it only tries to connect once.

I am changing the code to have the donor retry the connection (so it will try more than once before returning an error).

Kenn Takara (kenn-takara) wrote :

Fix committed for the next release of PXC 5.6/5.7. We now retry 30 times (with the default 1-second intervals).
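
The retry behaviour described here can be sketched as a small shell helper. This is an illustration only, not the code committed to wsrep_sst_xtrabackup-v2; the function name retry and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the PXC SST script.
# Usage: retry <max_attempts> <interval_seconds> <command> [args...]
# Runs the command until it succeeds or max_attempts is reached.
retry() {
    local max_attempts=$1 interval=$2
    shift 2
    local attempt=1
    while true; do
        # Success on any attempt ends the loop immediately.
        if "$@"; then
            return 0
        fi
        if (( attempt >= max_attempts )); then
            echo "retry: giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        # Wait before the next attempt so a slow listener can come up.
        sleep "$interval"
    done
}
```

With the numbers from the comment above, a call would look like retry 30 1 followed by the connect command, giving the joiner up to roughly 30 seconds to start listening. The wrapped command has to be safe to run again from the start, which matters when it consumes a stream.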

Changed in percona-xtradb-cluster:
status: Confirmed → Fix Committed

Percona now uses JIRA for bug reports so this bug report is migrated to:
