Incompatible SST between 5.5.29 and 5.5.33

Bug #1228618 reported by Tomasz Klekot
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
New
Undecided
Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
Fix Released
Undecided
Unassigned
5.6
Fix Released
Undecided
Unassigned

Bug Description

We are running a Percona XtraDB Cluster with three nodes and two garbd instances(on 64bit CentOS 6.3), all running PXC 5.5.29. XtraBackup is used for SST.
Due to some unexplained cluster-wide crashes, we decided to perform upgrade to the latest version(5.5.33). After updating one of PXC nodes, it did not manage to sync with the cluster.

Domains: co1.ourdomain.com is donor; co3.ourdomain.com is the upgraded node(joiner).

1. Seems like the new version of PXC(5.5.33) is set by default to expect SST to be done in xb_stream, while the donor(still running 5.5.29) sends it in tar. There is no warning about that in changelog and it totally breaks backward compatibility.

WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20130921 02:41:07.101)
130921 2:41:07 [Note] WSREP: Prepared SST request: xtrabackup|10.32.150.244:4444/xtrabackup_sst
130921 2:41:07 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 2:41:07 [Note] WSREP: Assign initial position for certification: 407853908, protocol version: 2
130921 2:41:07 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.32.150.244:4568
130921 2:41:07 [Note] WSREP: Node 0 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 1 (co1.ourdomain.com)(SYNCED) as donor.
130921 2:41:07 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 407855552)
130921 2:41:07 [Note] WSREP: Requesting state transfer: success, donor: 1
xb_stream_read_chunk(): wrong chunk magic at offset 0x0.
2013/09/21 02:41:09 socat[32276] E write(1, 0x16e93e0, 2896): Broken pipe
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 1 1 (20130921 02:41:09.742)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20130921 02:41:09.744)
WSREP_SST: [INFO] Removing the sst_in_progress file (20130921 02:41:09.745)

2. After changing SST settings in my.cnf at joiner (streamfmt=tar), the first data transfer goes through fine. However, joiner gets stuck forever while waiting for donor to send another part of data(I believe binlogs generated during state transfer?):

Joiner log:
130921 16:38:35 [Note] WSREP: New cluster view: global state: cffdaa53-542e-11e2-0800-c339f0c57558:410643189, view# 37: Primary, number of nodes: 5, my index: 4, protocol version 2
130921 16:38:35 [Warning] WSREP: Gap in state sequence. Need state transfer.
130921 16:38:37 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.32.150.244' --auth 'wsrep_sst:xxxxxxxxxxxxxxxxx' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --parent '3727''
WSREP_SST: [INFO] Streaming with tar (20130921 16:38:37.872)
WSREP_SST: [INFO] Using netcat as streamer (20130921 16:38:37.874)
WSREP_SST: [INFO] Evaluating nc -dl 4444 | pv -f -i 10 -N joiner | tar xfi - ; RC=( ${PIPESTATUS[@]} ) (20130921 16:38:37.881)
130921 16:38:38 [Note] WSREP: Prepared SST request: xtrabackup|10.32.150.244:4444/xtrabackup_sst
130921 16:38:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 16:38:38 [Note] WSREP: Assign initial position for certification: 410643189, protocol version: 2
130921 16:38:38 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (cffdaa53-542e-11e2-0800-c339f0c57558): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():436. IST will be unavailable.
130921 16:38:38 [Note] WSREP: Node 4 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 0 (co1.ourdomain.com)(SYNCED) as donor.
130921 16:38:38 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 410643189)
130921 16:38:38 [Note] WSREP: Requesting state transfer: success, donor: 0
   joiner: 96MB 0:00:10 [ 9.6MB/s] [<=> ]
   joiner: 198MB 0:00:20 [10.2MB/s] [ <=> ]
   joiner: 296MB 0:00:30 [9.74MB/s] [ <=> ]
   joiner: 376MB 0:00:40 [8.02MB/s] [ <=> ]
   joiner: 381MB 0:00:40 [9.34MB/s] [ <=> ]
WSREP_SST: [INFO] NOTE: Joiner-Recv-gtid took 41 seconds (20130921 16:39:18.699)
130921 16:39:18 [Note] WSREP: 0 (co1.ourdomain.com): State transfer to 4 (co3.ourdomain.com) complete.
130921 16:39:18 [Note] WSREP: Member 0 (co1.ourdomain.com) synced with group.
WSREP_SST: [INFO] Proceeding with SST (20130921 16:39:18.706)
WSREP_SST: [INFO] Cleaning the existing datadir (20130921 16:39:18.707)
WSREP_SST: [INFO] Evaluating nc -dl 4444 | pv -f -i 10 -N joiner | tar xfi - ; RC=( ${PIPESTATUS[@]} ) (20130921 16:39:18.710)
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
// it continues like that forever

Donor log:
130921 16:38:35 [Note] WSREP: Assign initial position for certification: 410643189, protocol version: 2
130921 16:38:38 [Note] WSREP: Node 4 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 0 (co1.ourdomain.com)(SYNCED) as donor.
130921 16:38:38 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 410643189)
130921 16:38:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 16:38:38 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address '10.32.150.244:4444/xtrabackup_sst' --auth 'wsrep_sst:xxxxxxxxxxxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --gtid 'cffdaa53-542e-11e2-0800-c339f0c57558:410643189''
130921 16:38:38 [Note] WSREP: sst_donor_thread signaled with 0
130921 16:39:17 [Note] WSREP: Provider paused at cffdaa53-542e-11e2-0800-c339f0c57558:410650116
130921 16:39:18 [Note] WSREP: Provider resumed.
130921 16:39:18 [Note] WSREP: 0 (co1.ourdomain.com): State transfer to 4 (co3.ourdomain.com) complete.
130921 16:39:18 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 410650116)
130921 16:39:18 [Note] WSREP: Member 0 (co1.ourdomain.com) synced with group.
130921 16:39:18 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 410650116)
130921 16:39:18 [Note] WSREP: Synchronized with group, ready for connections
130921 16:39:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

I am not running any kind of firewalls or anything that could prevent packets reaching their destination. Before updating the node to the latest PXC, SST was working just fine.
As the joiner was waiting for data, I checked the donor. It was happily working in SYNCED state without any traces of any SST being in progress.
No processes like tar/nc/socat/xtrabackup were running on it.

3. At the moment the donor changes status(DONOR/DESYNCED->JOINED or JOINED->SYNCED, not really sure), a weird entry appears in log of garbd instances:
2013-09-21 16:39:18.709 WARN: Protocol violation. JOIN message sender 0 (co1.ourdomain.com) is not in state transfer (SYNCED). Message ignored.

Revision history for this message
Tomasz Klekot (tomksoft) wrote :
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

"JOIN message sender 0 (co1.ourdomain.com) is not in state transfer (SYNCED)."
- this message can be safely ignored. There can be several JOIN messages from the same sender.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :
tags: added: doc
Revision history for this message
Tomasz Klekot (tomksoft) wrote :

I can't really find it in the errata or the SST docs. I understand it is two staged now, but it does not clearly say that the SST compatibility is totally broken. Doesn't it mean that it now became impossible to do zero-downtime PXC upgrade?

Revision history for this message
Tomasz Klekot (tomksoft) wrote :

Ah, I can see the Errata has just been updated. Thank you!

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote : Re: [DOC] Incompatible SST between 5.5.29 and 5.5.33

Marking this 'fix committed' since it has been documented for now.

Changed in percona-xtradb-cluster:
status: New → Fix Committed
summary: - Incompatible SST between 5.5.29 and 5.5.33
+ [DOC] Incompatible SST between 5.5.29 and 5.5.33
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I have added the older xtrabackup as xtrabackup-v1 (so you can give as wsrep_sst_donor=xtrabackup-v1).

summary: - [DOC] Incompatible SST between 5.5.29 and 5.5.33
+ Incompatible SST between 5.5.29 and 5.5.33
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1444

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.