Percona XtraDB Cluster - HA scalable solution for MySQL

Incompatible SST between 5.5.29 and 5.5.33

Reported by Tomasz Klekot on 2013-09-21
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Undecided
Unassigned
Percona XtraDB Cluster
Status tracked in Trunk
5.6
Undecided
Unassigned
Trunk
Undecided
Unassigned

Bug Description

We are running a Percona XtraDB Cluster with three nodes and two garbd instances(on 64bit CentOS 6.3), all running PXC 5.5.29. XtraBackup is used for SST.
Due to some unexplained cluster-wide crashes, we decided to perform upgrade to the latest version(5.5.33). After updating one of PXC nodes, it did not manage to sync with the cluster.

Domains: co1.ourdomain.com is donor; co3.ourdomain.com is the upgraded node(joiner).

1. Seems like the new version of PXC(5.5.33) is set by default to expect SST to be done in xb_stream, while the donor(still running 5.5.29) sends it in tar. There is no warning about that in changelog and it totally breaks backward compatibility.

WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20130921 02:41:07.101)
130921 2:41:07 [Note] WSREP: Prepared SST request: xtrabackup|10.32.150.244:4444/xtrabackup_sst
130921 2:41:07 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 2:41:07 [Note] WSREP: Assign initial position for certification: 407853908, protocol version: 2
130921 2:41:07 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.32.150.244:4568
130921 2:41:07 [Note] WSREP: Node 0 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 1 (co1.ourdomain.com)(SYNCED) as donor.
130921 2:41:07 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 407855552)
130921 2:41:07 [Note] WSREP: Requesting state transfer: success, donor: 1
xb_stream_read_chunk(): wrong chunk magic at offset 0x0.
2013/09/21 02:41:09 socat[32276] E write(1, 0x16e93e0, 2896): Broken pipe
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 1 1 (20130921 02:41:09.742)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20130921 02:41:09.744)
WSREP_SST: [INFO] Removing the sst_in_progress file (20130921 02:41:09.745)

2. After changing SST settings in my.cnf at joiner (streamfmt=tar), the first data transfer goes through fine. However, joiner gets stuck forever while waiting for donor to send another part of data(I believe binlogs generated during state transfer?):

Joiner log:
130921 16:38:35 [Note] WSREP: New cluster view: global state: cffdaa53-542e-11e2-0800-c339f0c57558:410643189, view# 37: Primary, number of nodes: 5, my index: 4, protocol version 2
130921 16:38:35 [Warning] WSREP: Gap in state sequence. Need state transfer.
130921 16:38:37 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.32.150.244' --auth 'wsrep_sst:xxxxxxxxxxxxxxxxx' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --parent '3727''
WSREP_SST: [INFO] Streaming with tar (20130921 16:38:37.872)
WSREP_SST: [INFO] Using netcat as streamer (20130921 16:38:37.874)
WSREP_SST: [INFO] Evaluating nc -dl 4444 | pv -f -i 10 -N joiner | tar xfi - ; RC=( ${PIPESTATUS[@]} ) (20130921 16:38:37.881)
130921 16:38:38 [Note] WSREP: Prepared SST request: xtrabackup|10.32.150.244:4444/xtrabackup_sst
130921 16:38:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 16:38:38 [Note] WSREP: Assign initial position for certification: 410643189, protocol version: 2
130921 16:38:38 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (cffdaa53-542e-11e2-0800-c339f0c57558): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():436. IST will be unavailable.
130921 16:38:38 [Note] WSREP: Node 4 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 0 (co1.ourdomain.com)(SYNCED) as donor.
130921 16:38:38 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 410643189)
130921 16:38:38 [Note] WSREP: Requesting state transfer: success, donor: 0
   joiner: 96MB 0:00:10 [ 9.6MB/s] [<=> ]
   joiner: 198MB 0:00:20 [10.2MB/s] [ <=> ]
   joiner: 296MB 0:00:30 [9.74MB/s] [ <=> ]
   joiner: 376MB 0:00:40 [8.02MB/s] [ <=> ]
   joiner: 381MB 0:00:40 [9.34MB/s] [ <=> ]
WSREP_SST: [INFO] NOTE: Joiner-Recv-gtid took 41 seconds (20130921 16:39:18.699)
130921 16:39:18 [Note] WSREP: 0 (co1.ourdomain.com): State transfer to 4 (co3.ourdomain.com) complete.
130921 16:39:18 [Note] WSREP: Member 0 (co1.ourdomain.com) synced with group.
WSREP_SST: [INFO] Proceeding with SST (20130921 16:39:18.706)
WSREP_SST: [INFO] Cleaning the existing datadir (20130921 16:39:18.707)
WSREP_SST: [INFO] Evaluating nc -dl 4444 | pv -f -i 10 -N joiner | tar xfi - ; RC=( ${PIPESTATUS[@]} ) (20130921 16:39:18.710)
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
// it continues like that forever

Donor log:
130921 16:38:35 [Note] WSREP: Assign initial position for certification: 410643189, protocol version: 2
130921 16:38:38 [Note] WSREP: Node 4 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 0 (co1.ourdomain.com)(SYNCED) as donor.
130921 16:38:38 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 410643189)
130921 16:38:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 16:38:38 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address '10.32.150.244:4444/xtrabackup_sst' --auth 'wsrep_sst:xxxxxxxxxxxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --gtid 'cffdaa53-542e-11e2-0800-c339f0c57558:410643189''
130921 16:38:38 [Note] WSREP: sst_donor_thread signaled with 0
130921 16:39:17 [Note] WSREP: Provider paused at cffdaa53-542e-11e2-0800-c339f0c57558:410650116
130921 16:39:18 [Note] WSREP: Provider resumed.
130921 16:39:18 [Note] WSREP: 0 (co1.ourdomain.com): State transfer to 4 (co3.ourdomain.com) complete.
130921 16:39:18 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 410650116)
130921 16:39:18 [Note] WSREP: Member 0 (co1.ourdomain.com) synced with group.
130921 16:39:18 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 410650116)
130921 16:39:18 [Note] WSREP: Synchronized with group, ready for connections
130921 16:39:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

I am not running any kind of firewalls or anything that could prevent packets reaching their destination. Before updating the node to the latest PXC, SST was working just fine.
As the joiner was waiting for data, I checked the donor. It was happily working in SYNCED state without any traces of any SST being in progress.
No processes like tar/nc/socat/xtrabackup were running on it.

3. At the moment the donor changes status(DONOR/DESYNCED->JOINED or JOINED->SYNCED, not really sure), a weird entry appears in log of garbd instances:
2013-09-21 16:39:18.709 WARN: Protocol violation. JOIN message sender 0 (co1.ourdomain.com) is not in state transfer (SYNCED). Message ignored.

Alex Yurchenko (ayurchen) wrote :

"JOIN message sender 0 (co1.ourdomain.com) is not in state transfer (SYNCED)."
- this message can be safely ignored. There can be several JOIN messages from the same sender.

Tomasz Klekot (tomksoft) wrote :

I can't really find it in the errata or the SST docs. I understand it is two staged now, but it does not clearly say that the SST compatibility is totally broken. Doesn't it mean that it now became impossible to do zero-downtime PXC upgrade?

Tomasz Klekot (tomksoft) wrote :

Ah, I can see the Errata has just been updated. Thank you!

Marking this 'fix committed' since it has been documented for now.

Changed in percona-xtradb-cluster:
status: New → Fix Committed
summary: - Incompatible SST between 5.5.29 and 5.5.33
+ [DOC] Incompatible SST between 5.5.29 and 5.5.33

I have added the older xtrabackup as xtrabackup-v1 (so you can give as wsrep_sst_donor=xtrabackup-v1).

summary: - [DOC] Incompatible SST between 5.5.29 and 5.5.33
+ Incompatible SST between 5.5.29 and 5.5.33
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers