Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

garbd for backup doesn't work with lock for backup

Bug #1443154 reported by Frederic Descamps on 2015-04-12

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	New	Undecided	Unassigned

Bug Description

If we check the documentation of Codership (http://galeracluster.com/documentation-webpages/backingupthecluster.html) it's possible to backup your cluster using garbd, and several people is using this technique (I'm not 100% sure but Cluster Control does too I guess).

However since the implementation of locks for backup, this is not working any more:

[root@node3 backup]# garbd --address gcomm://192.168.90.2:4567 -o gmcast.listen_addr=tcp://0.0.0.0:444 --group LefredPXC --donor node1 --sst xtrabackup-v2
2015-04-12 01:06:29.865 INFO: CRC-32C: using "slicing-by-8" algorithm.
2015-04-12 01:06:29.866 INFO: Read config:
daemon: 0
name: garb
address: gcomm://192.168.90.2:4567
group: LefredPXC
sst: xtrabackup-v2
donor: node1
options: gmcast.listen_addr=tcp://0.0.0.0:444; gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes
cfg:
log:

2015-04-12 01:06:29.866 INFO: protonet asio version 0
2015-04-12 01:06:29.867 INFO: Using CRC-32C for message checksums.
2015-04-12 01:06:29.867 INFO: backend: asio
2015-04-12 01:06:29.867 WARN: access file(./gvwstate.dat) failed(No such file or directory)
2015-04-12 01:06:29.867 INFO: restore pc from disk failed
2015-04-12 01:06:29.867 INFO: GMCast version 0
2015-04-12 01:06:29.867 INFO: (63519dac, 'tcp://0.0.0.0:444') listening at tcp://0.0.0.0:444
2015-04-12 01:06:29.867 INFO: (63519dac, 'tcp://0.0.0.0:444') multicast: , ttl: 1
2015-04-12 01:06:29.868 INFO: EVS version 0
2015-04-12 01:06:29.868 INFO: gcomm: connecting to group 'LefredPXC', peer '192.168.90.2:4567'
2015-04-12 01:06:29.870 INFO: (63519dac, 'tcp://0.0.0.0:444') turning message relay requesting on, nonlive peers: tcp://192.168.90.3:4567
2015-04-12 01:06:30.370 INFO: declaring 3102fc82 at tcp://192.168.90.2:4567 stable
2015-04-12 01:06:30.370 INFO: declaring 8374222a at tcp://192.168.90.3:4567 stable
2015-04-12 01:06:30.371 INFO: Node 3102fc82 state prim
2015-04-12 01:06:30.372 INFO: view(view_id(PRIM,3102fc82,16) memb {
3102fc82,0
63519dac,0
8374222a,0
} joined {
} left {
} partitioned {
})
2015-04-12 01:06:30.372 INFO: save pc into disk
2015-04-12 01:06:30.869 INFO: gcomm: connected
2015-04-12 01:06:30.869 INFO: Changing maximum packet size to 64500, resulting msg size: 32636
2015-04-12 01:06:30.869 INFO: Shifting CLOSED -> OPEN (TO: 0)
2015-04-12 01:06:30.869 INFO: Opened channel 'LefredPXC'
2015-04-12 01:06:30.869 INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2015-04-12 01:06:30.869 INFO: STATE EXCHANGE: Waiting for state UUID.
2015-04-12 01:06:30.869 INFO: STATE EXCHANGE: sent state msg: 639e1f48-e09f-11e4-a700-db023c846e92
2015-04-12 01:06:30.869 INFO: STATE EXCHANGE: got state msg: 639e1f48-e09f-11e4-a700-db023c846e92 from 0 (node1)
2015-04-12 01:06:30.869 INFO: STATE EXCHANGE: got state msg: 639e1f48-e09f-11e4-a700-db023c846e92 from 2 (node2)
2015-04-12 01:06:30.870 INFO: STATE EXCHANGE: got state msg: 639e1f48-e09f-11e4-a700-db023c846e92 from 1 (garb)
2015-04-12 01:06:30.870 INFO: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 13,
members = 2/3 (joined/total),
act_id = 18800,
last_appl. = -1,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = b226add4-df80-11e4-a059-df2f970fa58d
2015-04-12 01:06:30.871 INFO: Flow-control interval: [9999999, 9999999]
2015-04-12 01:06:30.871 INFO: Shifting OPEN -> PRIMARY (TO: 18800)
2015-04-12 01:06:30.871 INFO: Sending state transfer request: 'xtrabackup-v2', size: 13
2015-04-12 01:06:30.872 INFO: Member 1.0 (garb) requested state transfer from 'node1'. Selected 0.0 (node1)(SYNCED) as donor.
2015-04-12 01:06:30.872 INFO: Shifting PRIMARY -> JOINER (TO: 18800)
2015-04-12 01:06:30.873 INFO: Closing send monitor...
2015-04-12 01:06:30.873 INFO: Closed send monitor.
2015-04-12 01:06:30.873 INFO: gcomm: terminating thread
2015-04-12 01:06:30.873 INFO: gcomm: joining thread
2015-04-12 01:06:30.873 INFO: gcomm: closing backend
2015-04-12 01:06:30.877 INFO: 1.0 (garb): State transfer from 0.0 (node1) complete.
2015-04-12 01:06:30.877 INFO: Shifting JOINER -> JOINED (TO: 18800)
2015-04-12 01:06:30.879 INFO: view(view_id(NON_PRIM,3102fc82,16) memb {
63519dac,0
} joined {
} left {
} partitioned {
3102fc82,0
8374222a,0
})
2015-04-12 01:06:30.879 INFO: view((empty))
2015-04-12 01:06:30.880 INFO: gcomm: closed
2015-04-12 01:06:30.880 WARN: 0x1f68500 down context(s) not set
2015-04-12 01:06:30.880 WARN: Failed to send SYNC signal: -107 (Transport endpoint is not connected)
2015-04-12 01:06:30.881 INFO: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2015-04-12 01:06:30.881 INFO: Flow-control interval: [9999999, 9999999]
2015-04-12 01:06:30.881 INFO: Received NON-PRIMARY.
2015-04-12 01:06:30.881 INFO: Shifting JOINED -> OPEN (TO: 18800)
2015-04-12 01:06:30.881 INFO: Received self-leave message.
2015-04-12 01:06:30.881 INFO: Flow-control interval: [9999999, 9999999]
2015-04-12 01:06:30.881 INFO: Received SELF-LEAVE. Closing connection.
2015-04-12 01:06:30.881 INFO: Shifting OPEN -> CLOSED (TO: 18800)
2015-04-12 01:06:30.881 INFO: RECV thread exiting 0: Success
2015-04-12 01:06:30.881 INFO: recv_thread() joined.
2015-04-12 01:06:30.881 INFO: Closing replication queue.
2015-04-12 01:06:30.881 INFO: Closing slave action queue.
2015-04-12 01:06:30.881 WARN: Attempt to close a closed connection
2015-04-12 01:06:30.881 INFO: Exiting main loop
2015-04-12 01:06:30.881 INFO: Shifting CLOSED -> DESTROYED (TO: 18800)

and on the DONOR:

2015-04-12 01:06:30 5087 [Note] WSREP: Service thread queue flushed.
WSREP_SST: [INFO] Streaming with xbstream (20150412 01:06:31.298)
WSREP_SST: [INFO] Using socat as streamer (20150412 01:06:31.299)
WSREP_SST: [ERROR] Upgrade joiner to 5.6.21 or higher for backup locks support (20150412 01:06:31.303)
WSREP_SST: [ERROR] The joiner is not supported for this version of donor (20150412 01:06:31.304)
WSREP_SST: [ERROR] Cleanup after exit with status:93 (20150412 01:06:31.306)
WSREP_SST: [INFO] Cleaning up temporary directories (20150412 01:06:31.308)
2015-04-12 01:06:31 5087 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '' --auth 'root:' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --binlog 'node1-bin' --gtid 'b226add4-df80-11e4-a059-df2f970fa58d:18800'

But of course you don't want to configure the nodes to not use locks for backup as then this will be the default behaviour for the nodes in the cluster when SST is needed.

I don't know if it wlll be possible to use then garbd for backup and in case it's impossible, then we should document it.