SST: Joiner stuck in 'joining', but error log reports 'synced'

Bug #1373796 reported by johan on 2014-09-25
28
This bug affects 6 people
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Status tracked in 5.6
5.5
Undecided
Unassigned
5.6
Undecided
Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
New
High
David Bennett
5.6
Fix Released
High
David Bennett

Bug Description

Environment:
* PXC 5.6.20-68.0-56
* Galera provider: 3.7(r7f44a18)
* xtrabackup 2.2.4
* EC2, m3.medium, ubuntu-trusty-14.04-amd64-server-20140607.1 (ami-0307d674)

Galera nodes, all are 'synced':
10.0.0.24
10.0.0.25
10.0.0.26

ssh 10.0.0.24
sudo killall -9 mysqld mysqld_safe
# remove mysql data files to have an SST:
sudo rm -rf /mnt/data/mysql/*
sudo mysqld_safe &

10.0.0.24 is now the JOINER and writes this in the mysql error log:

140925 07:53:55 mysqld_safe Starting mysqld daemon with databases from /mnt/data/mysql/
140925 07:53:55 mysqld_safe Skipping wsrep-recover for empty datadir: /mnt/data/mysql/
140925 07:53:55 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
2014-09-25 07:53:56 0 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
2014-09-25 07:53:56 28987 [Note] WSREP: Read nil XID from storage engines, skipping position init
2014-09-25 07:53:56 28987 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2014-09-25 07:53:56 28987 [Note] WSREP: wsrep_load(): Galera 3.7(r7f44a18) by Codership Oy <email address hidden> loaded successfully.
2014-09-25 07:53:56 28987 [Note] WSREP: CRC-32C: using hardware acceleration.
2014-09-25 07:53:56 28987 [Warning] WSREP: Could not open saved state file for reading: /mnt/data/mysql//grastate.dat
2014-09-25 07:53:56 28987 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2014-09-25 07:53:56 28987 [Note] WSREP: Passing config to GCS: base_host = 10.0.0.24; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /mnt/data/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/data/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.wei
2014-09-25 07:53:56 28987 [Note] WSREP: Service thread queue flushed.
2014-09-25 07:53:56 28987 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2014-09-25 07:53:56 28987 [Note] WSREP: wsrep_sst_grab()
2014-09-25 07:53:56 28987 [Note] WSREP: Start replication
2014-09-25 07:53:56 28987 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2014-09-25 07:53:56 28987 [Note] WSREP: protonet asio version 0
2014-09-25 07:53:56 28987 [Note] WSREP: Using CRC-32C for message checksums.
2014-09-25 07:53:56 28987 [Note] WSREP: backend: asio
2014-09-25 07:53:56 28987 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
2014-09-25 07:53:56 28987 [Note] WSREP: restore pc from disk failed
2014-09-25 07:53:56 28987 [Note] WSREP: GMCast version 0
2014-09-25 07:53:56 28987 [Note] WSREP: (1a021a43, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2014-09-25 07:53:56 28987 [Note] WSREP: (1a021a43, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2014-09-25 07:53:56 28987 [Note] WSREP: EVS version 0
2014-09-25 07:53:56 28987 [Note] WSREP: PC version 0
2014-09-25 07:53:56 28987 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.0.0.25:4567,10.0.0.26:4567,10.0.0.24:4567'
2014-09-25 07:53:56 28987 [Warning] WSREP: (1a021a43, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.24:4567' points to own listening address, blacklisting
2014-09-25 07:53:56 28987 [Note] WSREP: (1a021a43, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.24:4567' pointing to uuid 1a021a43 is blacklisted, skipping
2014-09-25 07:53:56 28987 [Note] WSREP: (1a021a43, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2014-09-25 07:53:56 28987 [Note] WSREP: (1a021a43, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.24:4567' pointing to uuid 1a021a43 is blacklisted, skipping
2014-09-25 07:53:56 28987 [Note] WSREP: (1a021a43, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.24:4567' pointing to uuid 1a021a43 is blacklisted, skipping
2014-09-25 07:53:56 28987 [Note] WSREP: declaring 1076e92e at tcp://10.0.0.25:4567 stable
2014-09-25 07:53:56 28987 [Note] WSREP: declaring 35b8b436 at tcp://10.0.0.26:4567 stable
2014-09-25 07:53:56 28987 [Note] WSREP: Node 1076e92e state prim
2014-09-25 07:53:56 28987 [Note] WSREP: save pc into disk
2014-09-25 07:53:57 28987 [Note] WSREP: gcomm: connected
2014-09-25 07:53:57 28987 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2014-09-25 07:53:57 28987 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2014-09-25 07:53:57 28987 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2014-09-25 07:53:57 28987 [Note] WSREP: Waiting for SST to complete.
2014-09-25 07:53:57 28987 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2014-09-25 07:53:57 28987 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2014-09-25 07:53:57 28987 [Note] WSREP: STATE EXCHANGE: sent state msg: 1a372f69-4489-11e4-9d92-c2831d253d7e
2014-09-25 07:53:57 28987 [Note] WSREP: STATE EXCHANGE: got state msg: 1a372f69-4489-11e4-9d92-c2831d253d7e from 0 (10.0.0.25)
2014-09-25 07:53:57 28987 [Note] WSREP: STATE EXCHANGE: got state msg: 1a372f69-4489-11e4-9d92-c2831d253d7e from 2 (10.0.0.26)
2014-09-25 07:53:57 28987 [Note] WSREP: STATE EXCHANGE: got state msg: 1a372f69-4489-11e4-9d92-c2831d253d7e from 1 (10.0.0.24)
2014-09-25 07:53:57 28987 [Note] WSREP: Quorum results:
 version = 3,
 component = PRIMARY,
 conf_id = 20,
 members = 2/3 (joined/total),
 act_id = 10,
 last_appl. = -1,
 protocols = 0/6/3 (gcs/repl/appl),
 group UUID = f7d5ff68-4482-11e4-9f6c-5f76e5087b34
2014-09-25 07:53:57 28987 [Note] WSREP: Flow-control interval: [28, 28]
2014-09-25 07:53:57 28987 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 10)
2014-09-25 07:53:57 28987 [Note] WSREP: State transfer required:
 Group state: f7d5ff68-4482-11e4-9f6c-5f76e5087b34:10
 Local state: 00000000-0000-0000-0000-000000000000:-1
2014-09-25 07:53:57 28987 [Note] WSREP: New cluster view: global state: f7d5ff68-4482-11e4-9f6c-5f76e5087b34:10, view# 21: Primary, number of nodes: 3, my index: 1, protocol version 3
2014-09-25 07:53:57 28987 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-09-25 07:53:57 28987 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.0.24' --auth 'root:password' --datadir '/mnt/data/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '28987' '' '
WSREP_SST: [INFO] Streaming with xbstream (20140925 07:53:57.315)
WSREP_SST: [INFO] Using socat as streamer (20140925 07:53:57.329)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140925 07:53:57.799)
2014-09-25 07:53:57 28987 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.0.0.24:4444/xtrabackup_sst
2014-09-25 07:53:57 28987 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-25 07:53:57 28987 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-09-25 07:53:57 28987 [Note] WSREP: Service thread queue flushed.
2014-09-25 07:53:57 28987 [Note] WSREP: Assign initial position for certification: 10, protocol version: 3
2014-09-25 07:53:57 28987 [Note] WSREP: Service thread queue flushed.
2014-09-25 07:53:57 28987 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (f7d5ff68-4482-11e4-9f6c-5f76e5087b34): 1 (Operation not permitted)
  at galera/src/replicator_str.cpp:prepare_for_IST():455. IST will be unavailable.
2014-09-25 07:53:57 28987 [Note] WSREP: Member 1.0 (10.0.0.24) requested state transfer from '*any*'. Selected 0.0 (10.0.0.25)(SYNCED) as donor.
2014-09-25 07:53:57 28987 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 10)
2014-09-25 07:53:57 28987 [Note] WSREP: Requesting state transfer: success, donor: 0
WSREP_SST: [INFO] Proceeding with SST (20140925 07:53:58.763)
WSREP_SST: [INFO] Cleaning the existing datadir (20140925 07:53:58.765)
removed ‘/mnt/data/mysql/gvwstate.dat’
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140925 07:53:58.782)
2014-09-25 07:53:59 28987 [Note] WSREP: (1a021a43, 'tcp://0.0.0.0:4567') turning message relay requesting off
2014-09-25 07:54:12 28987 [Note] WSREP: 0.0 (10.0.0.25): State transfer to 1.0 (10.0.0.24) complete.
2014-09-25 07:54:12 28987 [Note] WSREP: Member 0.0 (10.0.0.25) synced with group.
WSREP_SST: [INFO] Preparing the backup at /mnt/data/mysql/ (20140925 07:54:12.694)
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20140925 07:54:12.696)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20140925 07:54:23.226)
WSREP_SST: [INFO] Removing the sst_in_progress file (20140925 07:54:23.229)
2014-09-25 07:54:23 28987 [Note] WSREP: SST complete, seqno: 7
2014-09-25 07:54:23 28987 [Note] Plugin 'FEDERATED' is disabled.
2014-09-25 07:54:23 28987 [Note] InnoDB: Using atomics to ref count buffer pool pages
2014-09-25 07:54:23 28987 [Note] InnoDB: The InnoDB memory heap is disabled
2014-09-25 07:54:23 28987 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2014-09-25 07:54:23 28987 [Note] InnoDB: Memory barrier is not used
2014-09-25 07:54:23 28987 [Note] InnoDB: Compressed tables use zlib 1.2.8
2014-09-25 07:54:23 28987 [Note] InnoDB: Using Linux native AIO
2014-09-25 07:54:23 28987 [Note] InnoDB: Using CPU crc32 instructions
2014-09-25 07:54:23 28987 [Note] InnoDB: Initializing buffer pool, size = 2.4G
2014-09-25 07:54:23 28987 [Note] InnoDB: Completed initialization of buffer pool
2014-09-25 07:54:23 28987 [Note] InnoDB: Highest supported file format is Barracuda.
2014-09-25 07:54:23 28987 [Note] InnoDB: 128 rollback segment(s) are active.
2014-09-25 07:54:23 28987 [Note] InnoDB: Waiting for purge to start
2014-09-25 07:54:23 28987 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.20-68.0 started; log sequence number 1627158
2014-09-25 07:54:23 28987 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: 2a2ea9d3-4489-11e4-a0f4-02bb8ea7691b.
2014-09-25 07:54:23 28987 [Note] RSA private key file not found: /mnt/data/mysql//private_key.pem. Some authentication plugins will not work.
2014-09-25 07:54:23 28987 [Note] RSA public key file not found: /mnt/data/mysql//public_key.pem. Some authentication plugins will not work.
2014-09-25 07:54:23 28987 [Note] Server hostname (bind-address): '*'; port: 3306
2014-09-25 07:54:23 28987 [Note] IPv6 is available.
2014-09-25 07:54:23 28987 [Note] - '::' resolves to '::';
2014-09-25 07:54:23 28987 [Note] Server socket created on IP: '::'.
2014-09-25 07:54:23 28987 [Warning] 'proxies_priv' entry '@ root@ip-10-0-0-24' ignored in --skip-name-resolve mode.
2014-09-25 07:54:23 28987 [Note] Event Scheduler: Loaded 0 events
2014-09-25 07:54:23 28987 [Note] WSREP: Signalling provider to continue.
2014-09-25 07:54:23 28987 [Note] WSREP: inited wsrep sidno 1
2014-09-25 07:54:23 28987 [Note] WSREP: SST received: f7d5ff68-4482-11e4-9f6c-5f76e5087b34:7
2014-09-25 07:54:23 28987 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.20-68.0-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release 25.7, wsrep_25.7.r4126
2014-09-25 07:54:23 28987 [Note] WSREP: 1.0 (10.0.0.24): State transfer from 0.0 (10.0.0.25) complete.
2014-09-25 07:54:23 28987 [Note] WSREP: Shifting JOINER -> JOINED (TO: 10)
2014-09-25 07:54:23 28987 [Note] WSREP: Member 1.0 (10.0.0.24) synced with group.
2014-09-25 07:54:23 28987 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 10)

mysql> SHOW GLOBAL STATUS LIKE 'wsrep%';
+------------------------------+----------------------------------------------+
| Variable_name | Value |
+------------------------------+----------------------------------------------+
| wsrep_local_state_uuid | f7d5ff68-4482-11e4-9f6c-5f76e5087b34 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 7 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 290 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_cached_downto | 18446744073709551615 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 1 |
| wsrep_local_state_comment | Joining |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_incoming_addresses | 10.0.0.25:3306,10.0.0.24:3306,10.0.0.26:3306 |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_cluster_conf_id | 21 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | f7d5ff68-4482-11e4-9f6c-5f76e5087b34 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 3.7(r7f44a18) |
| wsrep_ready | OFF |
+------------------------------+----------------------------------------------+
52 rows in set (0.00 sec)

**** There are no traces of xtrabackup still running:

ubuntu@ip-10-0-0-24:~$ ps -ef |grep inno
ubuntu 29349 28112 0 07:55 pts/0 00:00:00 grep --color=auto inno
ubuntu@ip-10-0-0-24:~$ ps -ef |grep mysqld
root 28143 28112 0 07:53 pts/0 00:00:00 sudo mysqld_safe
root 28144 28143 0 07:53 pts/0 00:00:00 /bin/sh /usr/bin/mysqld_safe
mysql 28987 28144 1 07:53 pts/0 00:00:01 /usr/sbin/mysqld --basedir=/usr/ --datadir=/mnt/data/mysql/ --plugin-dir=/usr//lib/mysql/plugin --user=mysql --wsrep-provider=/usr/lib/libgalera_smm.so --log-error=/var/log/mysql.log --pid-file=/mnt/data/mysql//mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1

**** Looking at the innobackup prepare log:

ubuntu@ip-10-0-0-24:~$ sudo cat /mnt/data/mysql/innobackup.prepare.log
sudo: unable to resolve host ip-10-0-0-24

InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
and Percona LLC and/or its affiliates 2009-2013. All Rights Reserved.

This software is published under
the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.

Get the latest version of Percona XtraBackup, documentation, and help resources:
http://www.percona.com/xb/p

IMPORTANT: Please check that the apply-log run completes successfully.
           At the end of a successful apply-log run innobackupex
           prints "completed OK!".

140925 07:54:13 innobackupex: Starting ibbackup with command: xtrabackup --defaults-file="/mnt/data/mysql/backup-my.cnf" --defaults-group="mysqld" --prepare --target-dir=/mnt/data/mysql --tmpdir=/tmp

xtrabackup version 2.2.4 based on MySQL server 5.6.17 Linux (x86_64) (revision id: )
xtrabackup: cd to /mnt/data/mysql
xtrabackup: This target seems to be not prepared yet.
xtrabackup: xtrabackup_logfile detected: size=2097152, start_lsn=(1627118)
xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:100M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 1
xtrabackup: innodb_log_file_size = 2097152
xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:100M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 1
xtrabackup: innodb_log_file_size = 2097152
xtrabackup: Starting InnoDB instance for recovery.
xtrabackup: Using 104857600 bytes for buffer pool (set by --use-memory parameter)
InnoDB: Using atomics to ref count buffer pool pages
InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins
InnoDB: Compressed tables use zlib 1.2.8
InnoDB: Using CPU crc32 instructions
InnoDB: Initializing buffer pool, size = 100.0M
InnoDB: Completed initialization of buffer pool
InnoDB: Highest supported file format is Barracuda.
InnoDB: The log sequence numbers 1626646 and 1626646 in ibdata files do not match the log sequence number 1627118 in the ib_logfiles!
InnoDB: Database was not shutdown normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages
InnoDB: from the doublewrite buffer...
InnoDB: 128 rollback segment(s) are active.
InnoDB: Waiting for purge to start
InnoDB: 5.6.17 started; log sequence number 1627118

[notice (again)]
  If you use binary log and don't use any hack of group commit,
  the binary log position seems to be:

xtrabackup: Recovered WSREP position: f7d5ff68-4482-11e4-9f6c-5f76e5087b34:7
xtrabackup: starting shutdown with innodb_fast_shutdown = 1
InnoDB: FTS optimize thread exiting.
InnoDB: Starting shutdown...
InnoDB: Shutdown completed; log sequence number 1627128

140925 07:54:14 innobackupex: Restarting xtrabackup with command: xtrabackup --defaults-file="/mnt/data/mysql/backup-my.cnf" --defaults-group="mysqld" --prepare --target-dir=/mnt/data/mysql --tmpdir=/tmp
for creating ib_logfile*

xtrabackup version 2.2.4 based on MySQL server 5.6.17 Linux (x86_64) (revision id: )
xtrabackup: cd to /mnt/data/mysql
xtrabackup: This target seems to be already prepared.
xtrabackup: notice: xtrabackup_logfile was already used to '--prepare'.
xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:100M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 536870912
xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:100M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 536870912
xtrabackup: Starting InnoDB instance for recovery.
xtrabackup: Using 104857600 bytes for buffer pool (set by --use-memory parameter)
InnoDB: Using atomics to ref count buffer pool pages
InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins
InnoDB: Compressed tables use zlib 1.2.8
InnoDB: Using CPU crc32 instructions
InnoDB: Initializing buffer pool, size = 100.0M
InnoDB: Completed initialization of buffer pool
InnoDB: Setting log file ./ib_logfile101 size to 512 MB
InnoDB: Progress in MB: 100 200 300 400 500
InnoDB: Setting log file ./ib_logfile1 size to 512 MB
InnoDB: Progress in MB: 100 200 300 400 500
InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0
InnoDB: New log files created, LSN=1627128
InnoDB: Highest supported file format is Barracuda.
InnoDB: 128 rollback segment(s) are active.
InnoDB: Waiting for purge to start
InnoDB: 5.6.17 started; log sequence number 1627148

[notice (again)]
  If you use binary log and don't use any hack of group commit,
  the binary log position seems to be:

xtrabackup: Recovered WSREP position: f7d5ff68-4482-11e4-9f6c-5f76e5087b34:7
xtrabackup: starting shutdown with innodb_fast_shutdown = 1
InnoDB: FTS optimize thread exiting.
InnoDB: Starting shutdown...
InnoDB: Shutdown completed; log sequence number 1627158
140925 07:54:23 innobackupex: completed OK!

**** Looking at the DONOR (i hope i have got the most parts of the error log):
2014-09-25 07:53:47 10525 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-25 07:53:47 10525 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-09-25 07:53:47 10525 [Note] WSREP: Service thread queue flushed.
2014-09-25 07:53:47 10525 [Note] WSREP: Assign initial position for certification: 10, protocol version: 3
2014-09-25 07:53:47 10525 [Note] WSREP: Service thread queue flushed.
2014-09-25 07:53:50 10525 [Note] WSREP: cleaning up 9aced1b3 (tcp://10.0.0.24:4567)
2014-09-25 07:53:56 10525 [Note] WSREP: (1076e92e, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.25:4567' pointing to uuid 1076e92e is blacklisted, skipping
2014-09-25 07:53:56 10525 [Note] WSREP: (1076e92e, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.25:4567' pointing to uuid 1076e92e is blacklisted, skipping
2014-09-25 07:53:56 10525 [Note] WSREP: (1076e92e, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2014-09-25 07:53:56 10525 [Note] WSREP: (1076e92e, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.25:4567' pointing to uuid 1076e92e is blacklisted, skipping
2014-09-25 07:53:56 10525 [Note] WSREP: (1076e92e, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.25:4567' pointing to uuid 1076e92e is blacklisted, skipping
2014-09-25 07:53:56 10525 [Note] WSREP: (1076e92e, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.25:4567' pointing to uuid 1076e92e is blacklisted, skipping
2014-09-25 07:53:56 10525 [Note] WSREP: (1076e92e, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.25:4567' pointing to uuid 1076e92e is blacklisted, skipping
2014-09-25 07:53:56 10525 [Note] WSREP: declaring 1a021a43 at tcp://10.0.0.24:4567 stable
2014-09-25 07:53:56 10525 [Note] WSREP: declaring 35b8b436 at tcp://10.0.0.26:4567 stable
2014-09-25 07:53:56 10525 [Note] WSREP: Node 1076e92e state prim
2014-09-25 07:53:56 10525 [Note] WSREP: save pc into disk
2014-09-25 07:53:56 10525 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2014-09-25 07:53:56 10525 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 1a372f69-4489-11e4-9d92-c2831d253d7e
2014-09-25 07:53:56 10525 [Note] WSREP: STATE EXCHANGE: sent state msg: 1a372f69-4489-11e4-9d92-c2831d253d7e
2014-09-25 07:53:56 10525 [Note] WSREP: STATE EXCHANGE: got state msg: 1a372f69-4489-11e4-9d92-c2831d253d7e from 0 (10.0.0.25)
2014-09-25 07:53:56 10525 [Note] WSREP: STATE EXCHANGE: got state msg: 1a372f69-4489-11e4-9d92-c2831d253d7e from 2 (10.0.0.26)
2014-09-25 07:53:57 10525 [Note] WSREP: STATE EXCHANGE: got state msg: 1a372f69-4489-11e4-9d92-c2831d253d7e from 1 (10.0.0.24)
2014-09-25 07:53:57 10525 [Note] WSREP: Quorum results:
 version = 3,
 component = PRIMARY,
 conf_id = 20,
 members = 2/3 (joined/total),
 act_id = 10,
 last_appl. = 0,
 protocols = 0/6/3 (gcs/repl/appl),
 group UUID = f7d5ff68-4482-11e4-9f6c-5f76e5087b34
2014-09-25 07:53:57 10525 [Note] WSREP: Flow-control interval: [28, 28]
2014-09-25 07:53:57 10525 [Note] WSREP: New cluster view: global state: f7d5ff68-4482-11e4-9f6c-5f76e5087b34:10, view# 21: Primary, number of nodes: 3, my index: 0, protocol version 3
2014-09-25 07:53:57 10525 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-25 07:53:57 10525 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-09-25 07:53:57 10525 [Note] WSREP: Service thread queue flushed.
2014-09-25 07:53:57 10525 [Note] WSREP: Assign initial position for certification: 10, protocol version: 3
2014-09-25 07:53:57 10525 [Note] WSREP: Service thread queue flushed.
2014-09-25 07:53:57 10525 [Note] WSREP: Member 1.0 (10.0.0.24) requested state transfer from '*any*'. Selected 0.0 (10.0.0.25)(SYNCED) as donor.
2014-09-25 07:53:57 10525 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 10)
2014-09-25 07:53:57 10525 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-25 07:53:57 10525 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.0.24:4444/xtrabackup_sst' --auth 'root:password' --socket '/var/run/mysqld/mysqld.sock' --datadir '/mnt/data/mysql/' --defaults-file '/etc/mysql/my.cnf' '' --gtid 'f7d5ff68-4482-11e4-9f6c-5f76e5087b34:10''
2014-09-25 07:53:57 10525 [Note] WSREP: sst_donor_thread signaled with 0
WSREP_SST: [INFO] Streaming with xbstream (20140925 07:53:58.240)
WSREP_SST: [INFO] Using socat as streamer (20140925 07:53:58.250)
WSREP_SST: [INFO] Using /tmp/tmp.nnC6BnypNz as xtrabackup temporary directory (20140925 07:53:58.702)
WSREP_SST: [INFO] Using /tmp/tmp.o9h80diUhb as innobackupex temporary directory (20140925 07:53:58.704)
WSREP_SST: [INFO] Streaming GTID file before SST (20140925 07:53:58.716)
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio TCP:10.0.0.24:4444; RC=( ${PIPESTATUS[@]} ) (20140925 07:53:58.718)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20140925 07:53:58.723)
2014-09-25 07:53:59 10525 [Note] WSREP: (1076e92e, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Streaming the backup to joiner at 10.0.0.24 4444 (20140925 07:54:08.734)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/mysql/my.cnf --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio TCP:10.0.0.24:4444; RC=( ${PIPESTATUS[@]} ) (20140925 07:54:08.736)
2014-09-25 07:54:11 10525 [Note] WSREP: Provider paused at f7d5ff68-4482-11e4-9f6c-5f76e5087b34:10 (70)
2014-09-25 07:54:12 10525 [Note] WSREP: resuming provider at 70
2014-09-25 07:54:12 10525 [Note] WSREP: Provider resumed.
2014-09-25 07:54:12 10525 [Note] WSREP: 0.0 (10.0.0.25): State transfer to 1.0 (10.0.0.24) complete.
2014-09-25 07:54:12 10525 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 10)
2014-09-25 07:54:12 10525 [Note] WSREP: Member 0.0 (10.0.0.25) synced with group.
2014-09-25 07:54:12 10525 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 10)
2014-09-25 07:54:12 10525 [Note] WSREP: Synchronized with group, ready for connections
2014-09-25 07:54:12 10525 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
WSREP_SST: [INFO] Total time on donor: 0 seconds (20140925 07:54:12.670)
WSREP_SST: [INFO] Cleaning up temporary directories (20140925 07:54:12.672)
2014-09-25 07:54:23 10525 [Note] WSREP: 1.0 (10.0.0.24): State transfer from 0.0 (10.0.0.25) complete.
2014-09-25 07:54:23 10525 [Note] WSREP: Member 1.0 (10.0.0.24) synced with group.

mysql> show global status like 'wsrep%';
+------------------------------+----------------------------------------------+
| Variable_name | Value |
+------------------------------+----------------------------------------------+
| wsrep_local_state_uuid | f7d5ff68-4482-11e4-9f6c-5f76e5087b34 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 10 |
| wsrep_replicated | 3 |
| wsrep_replicated_bytes | 651 |
| wsrep_repl_keys | 5 |
| wsrep_repl_keys_bytes | 109 |
| wsrep_repl_data_bytes | 350 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 58 |
| wsrep_received_bytes | 6809 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_cached_downto | 2 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 1.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.000000 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 3 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_incoming_addresses | 10.0.0.25:3306,10.0.0.24:3306,10.0.0.26:3306 |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_cluster_conf_id | 21 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | f7d5ff68-4482-11e4-9f6c-5f76e5087b34 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 3.7(r7f44a18) |
| wsrep_ready | ON |
+------------------------------+----------------------------------------------+
52 rows in set (0.00 sec)

**** No traces of xtrabackup on the Donor:
ubuntu@ip-10-0-0-25:~$ ps -ef |grep inno
ubuntu 18975 18945 0 07:59 pts/0 00:00:00 grep --color=auto inno
ubuntu@ip-10-0-0-25:~$ ps -ef |grep xtra
ubuntu 18977 18945 0 07:59 pts/0 00:00:00 grep --color=auto xtra
ubuntu@ip-10-0-0-25:~$ ps -ef |grep mysqld
root 9682 1 0 07:10 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe
mysql 10525 9682 0 07:10 ? 00:00:04 /usr/sbin/mysqld --basedir=/usr/ --datadir=/mnt/data/mysql/ --plugin-dir=/usr//lib/mysql/plugin --user=mysql --wsrep-provider=/usr/lib/libgalera_smm.so --log-error=/var/log/mysql.log --pid-file=/mnt/data/mysql//mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1

*** innobackup.backup.log on the Donor:

ubuntu@ip-10-0-0-25:~$ sudo cat /mnt/data/mysql/innobackup.backup.log
sudo: unable to resolve host ip-10-0-0-25

InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
and Percona LLC and/or its affiliates 2009-2013. All Rights Reserved.

This software is published under
the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.

Get the latest version of Percona XtraBackup, documentation, and help resources:
http://www.percona.com/xb/p

140925 07:54:09 innobackupex: Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_file=/etc/mysql/my.cnf;mysql_read_default_group=xtrabackup;mysql_socket=/var/run/mysqld/mysqld.sock' as 'root' (using password: YES).
140925 07:54:09 innobackupex: Connected to MySQL server
IMPORTANT: Please check that the backup run completes successfully.
           At the end of a successful backup run innobackupex
           prints "completed OK!".

innobackupex: Using mysql server version 5.6.20-68.0-56

innobackupex: Created backup directory /tmp/tmp.o9h80diUhb

140925 07:54:09 innobackupex: Starting ibbackup with command: xtrabackup --defaults-file="/etc/mysql/my.cnf" --defaults-group="mysqld" --backup --suspend-at-end --target-dir=/tmp/tmp.nnC6BnypNz --tmpdir=/tmp/tmp.nnC6BnypNz --extra-lsndir='/tmp/tmp.nnC6BnypNz' --stream=xbstream
innobackupex: Waiting for ibbackup (pid=18437) to suspend
innobackupex: Suspend file '/tmp/tmp.nnC6BnypNz/xtrabackup_suspended_2'

xtrabackup version 2.2.4 based on MySQL server 5.6.17 Linux (x86_64) (revision id: )
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /mnt/data/mysql/
xtrabackup: open files limit requested 0, set to 5000
xtrabackup: using the following InnoDB configuration:
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:100M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 536870912
xtrabackup: using O_DIRECT
>> log scanned up to (1627118)
[01] Streaming ./ibdata1
[01] ...done
[01] Streaming ./mysql/innodb_index_stats.ibd
[01] ...done
[01] Streaming ./mysql/innodb_table_stats.ibd
[01] ...done
[01] Streaming ./mysql/slave_master_info.ibd
[01] ...done
[01] Streaming ./mysql/slave_relay_log_info.ibd
[01] ...done
[01] Streaming ./mysql/slave_worker_info.ibd
[01] ...done
>> log scanned up to (1627118)
xtrabackup: Creating suspend file '/tmp/tmp.nnC6BnypNz/xtrabackup_suspended_2' with pid '18438'

140925 07:54:11 innobackupex: Continuing after ibbackup has suspended
140925 07:54:11 innobackupex: Executing FLUSH TABLES WITH READ LOCK...
140925 07:54:11 innobackupex: All tables locked and flushed to disk

140925 07:54:11 innobackupex: Starting to backup non-InnoDB tables and files
innobackupex: in subdirectories of '/mnt/data/mysql/'
innobackupex: Backing up files '/mnt/data/mysql//performance_schema/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (53 files)
>> log scanned up to (1627118)
innobackupex: Backing up files '/mnt/data/mysql//mysql/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (74 files)
140925 07:54:11 innobackupex: Finished backing up non-InnoDB tables and files

140925 07:54:11 innobackupex: Executing FLUSH ENGINE LOGS...
140925 07:54:11 innobackupex: Waiting for log copying to finish

xtrabackup: The latest check point (for incremental): '1627118'
xtrabackup: Stopping log copying thread.
.>> log scanned up to (1627118)

xtrabackup: Creating suspend file '/tmp/tmp.nnC6BnypNz/xtrabackup_log_copied' with pid '18438'
xtrabackup: Transaction log of lsn (1627118) to (1627118) was copied.
140925 07:54:12 innobackupex: All tables unlocked

innobackupex: Backup created in directory '/tmp/tmp.o9h80diUhb'
140925 07:54:12 innobackupex: Connection to database server closed
140925 07:54:12 innobackupex: completed OK!
ubuntu@ip-10-0-0-25:~$

*********** Workaround: *************
- downgrade all nodes to percona xtrabackup 2.2.3, the additional message written in the mysql error log is the following when using xtrabackup 2.2.3:
2014-09-25 08:02:06 30369 [Note] WSREP: Synchronized with group, ready for connections

- use wsrep_sst_method=rsync

****** If it helps i can facilitate remote login (ssh), it is just simple test servers.

johan (johan-severalnines) wrote :

I should also say, that when using xtrabackup 2.2.4, and having writes on the 10.0.0.26 node (not the donor, not the joiner), then the joiner joins correctly.

BR
johan

Did xtrabackup 2.2.5 introduce this as a regression? I'm seeing the above behavior also. Last week didnt see any issues.

Teemu Ollakka (teemu-ollakka) wrote :

@johan,

Can you provide my.cnf for both donor/joiner?

Regarding

"I should also say, that when using xtrabackup 2.2.4, and having writes on the 10.0.0.26 node (not the donor, not the joiner), then the joiner joins correctly."

Do you mean by not writing to donor/joiner during/before/after SST?

David Bennett (dbpercona) wrote :

Updated from Johan:

Now i have teared down the install, and have a new setup:
* ubuntu 14.04
* rackspace uk cloud
* 3 nodes, + adding a 4th node.

* PXC version used on all nodes mentioned below:
- mysql Ver 14.14 Distrib 5.6.20-68.0, for debian-linux-gnu (x86_64) using EditLine wrapper
- xtrabackup version 2.2.5 based on MySQL server 5.6.21 Linux (x86_64) (revision id: )

The nodes i have are:
10.177.194.12
10.179.2.66
10.177.192.227

adding 10.178.128.144, which means:
   - installing pxc 5.6 sw from repo
   - mysql_install_db..,
   - starting it up, performing SST (idle cluster)

I also have a garbd installed.

Attached is a .tar.gz with the files you requested, but the datadir is now /var/lib/mysql/.

At this stage, node 10.178.128.144 is in the state of
| wsrep_local_state | 1 |
| wsrep_local_state_comment | Joining |

and will never leave this state.

Thus if you want to look at the logs and my.cnf/backup-my.cnf in the attached file, the JOINER is 10.178.128.144, and the DONOR is 10.177.192.227

David Bennett (dbpercona) wrote :
David Bennett (dbpercona) wrote :

Still trying to replicate the problem unsuccessfully on VM's on my local machine. I have noted a couple of other reports that appear related:

https://github.com/codership/galera/issues/143

https://bugs.launchpad.net/percona-xtrabackup/+bug/1378138

David Bennett (dbpercona) wrote :

More research: There is a very subtle difference between the logs in the VM that is exhibiting the stuck Joining problem that the VM that is successfully syncing:

-----
Good: Joining becomes Syncing
-----
2014-10-16 03:15:56 3437 [Note] WSREP: 0.0 (192.168.122.166): State transfer from 1.0 (192.168.122.184) complete.
2014-10-16 03:15:56 3437 [Note] WSREP: Shifting JOINER -> JOINED (TO: 1)
2014-10-16 03:15:56 3437 [Note] WSREP: Member 0.0 (192.168.122.166) synced with group.
2014-10-16 03:15:56 3437 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1)
2014-10-16 03:15:56 3437 [Note] WSREP: Synchronized with group, ready for connections

-----
Bad: Stucking in Joining
-----
2014-10-16 07:41:27 13120 [Note] WSREP: 1.0 (10.177.192.249): State transfer from 0.0 (10.177.193.113) complete.
2014-10-16 07:41:27 13120 [Note] WSREP: Shifting JOINER -> JOINED (TO: 1184793)
2014-10-16 07:41:27 13120 [Note] WSREP: Member 1.0 (10.177.192.249) synced with group.
2014-10-16 07:41:27 13120 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1184793)
-----

The last line 'Synchronized with group, ready for connections' comes from the application callback sql/wsrep_mysqld.cc:wsrep_synced_cb(void* app_ctx).

In the case of the stuck sync, Galera is setting it's state from JOINED to SYNCED but is not issuing the callback to the application.

I have traced this to the Galera method gcs/src/gcs.cpp:gcs_recv_thread(void *arg).

At this point: https://github.com/percona/galera/blob/3.x/gcs/src/gcs.cpp#L1218

The state change is being recorded by the galera engine. I suspect that something in the gcs_act_rcvd structure is preventing the subsequent callback from occuring in certain circumstances.

@David, @Johan,

Would it be possible to provide backtrace (on whatever box it is reproducible on) at the point where it is hung here

" 2014-10-16 07:41:27 13120 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1184793)"

You can attach gdb to mysqld to get the backtrace with https://gist.github.com/ronin13/8d403c2f88826fcc7aba gdb script.

Make sure debug symbol packages are installed before attaching gdb for that.

Download full text (39.3 KiB)

Hi,

Here is a trace:

http://paste.ubuntu.com/8420587/

Teemu@codership then said that "thread #28 seems suspicious".

Then no more info on this.
However, since upgrading to (done by David on one node) I have never seen
this issue.
Despite 100s of restarts.

Now i have update the all nodes to use:

Percona-XtraDB-Cluster-server-56-5.6.21-25.8.910.el6.x86_64
Percona-XtraDB-Cluster-galera-3-3.8-1.3323.rhel6.x86_64

And as i said, I have never seen the issue again.

I also have this version installed, on all nodes.

percona-xtrabackup-2.2.6-5042.el6.x86_64

I assume something was changed was changed in the problem area in one or
more of the version mentioned above

BR
johan

On Fri, Nov 7, 2014 at 6:23 PM, Raghavendra D Prabhu <
<email address hidden>> wrote:

> Make sure debug symbol packages are installed before attaching gdb for
> that.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1373796
>
> Title:
> SST: Joiner stuck in 'joining', but error log reports 'synced'
>
> Status in MySQL patches by Codership:
> New
> Status in MySQL patches by Codership 5.5 series:
> New
> Status in MySQL patches by Codership 5.6 series:
> New
> Status in Percona XtraDB Cluster - HA scalable solution for MySQL:
> New
> Status in Percona XtraDB Cluster 5.5 series:
> New
> Status in Percona XtraDB Cluster 5.6 series:
> New
>
> Bug description:
> Environment:
> * PXC 5.6.20-68.0-56
> * Galera provider: 3.7(r7f44a18)
> * xtrabackup 2.2.4
> * EC2, m3.medium, ubuntu-trusty-14.04-amd64-server-20140607.1
> (ami-0307d674)
>
> Galera nodes, all are 'synced':
> 10.0.0.24
> 10.0.0.25
> 10.0.0.26
>
> ssh 10.0.0.24
> sudo killall -9 mysqld mysqld_safe
> # remove mysql data files to have an SST:
> sudo rm -rf /mnt/data/mysql/*
> sudo mysqld_safe &
>
> 10.0.0.24 is now the JOINER and writes this in the mysql error log:
>
> 140925 07:53:55 mysqld_safe Starting mysqld daemon with databases from
> /mnt/data/mysql/
> 140925 07:53:55 mysqld_safe Skipping wsrep-recover for empty datadir:
> /mnt/data/mysql/
> 140925 07:53:55 mysqld_safe Assigning
> 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
> 2014-09-25 07:53:56 0 [Note] WSREP: wsrep_start_position var submitted:
> '00000000-0000-0000-0000-000000000000:-1'
> 2014-09-25 07:53:56 28987 [Note] WSREP: Read nil XID from storage
> engines, skipping position init
> 2014-09-25 07:53:56 28987 [Note] WSREP: wsrep_load(): loading provider
> library '/usr/lib/libgalera_smm.so'
> 2014-09-25 07:53:56 28987 [Note] WSREP: wsrep_load(): Galera
> 3.7(r7f44a18) by Codership Oy <email address hidden> loaded successfully.
> 2014-09-25 07:53:56 28987 [Note] WSREP: CRC-32C: using hardware
> acceleration.
> 2014-09-25 07:53:56 28987 [Warning] WSREP: Could not open saved state
> file for reading: /mnt/data/mysql//grastate.dat
> 2014-09-25 07:53:56 28987 [Note] WSREP: Found saved state:
> 00000000-0000-0000-0000-000000000000:-1
> 2014-09-25 07:53:56 28987 [Note] WSREP: Passing config to GCS: base_host
> = 10.0.0.24; base_port = 4567; cert.log_conflicts = no; debug = no;
> evs.inacti...

David Bennett (dbpercona) wrote :
Download full text (40.6 KiB)

Thanks Johan!

On 11/07/2014 12:00 PM, johan wrote:
> Hi,
>
> Here is a trace:
>
> http://paste.ubuntu.com/8420587/
>
> Teemu@codership then said that "thread #28 seems suspicious".
>
> Then no more info on this.
> However, since upgrading to (done by David on one node) I have never seen
> this issue.
> Despite 100s of restarts.
>
> Now i have update the all nodes to use:
>
> Percona-XtraDB-Cluster-server-56-5.6.21-25.8.910.el6.x86_64
> Percona-XtraDB-Cluster-galera-3-3.8-1.3323.rhel6.x86_64
>
> And as i said, I have never seen the issue again.
>
> I also have this version installed, on all nodes.
>
> percona-xtrabackup-2.2.6-5042.el6.x86_64
>
> I assume something was changed was changed in the problem area in one or
> more of the version mentioned above
>
> BR
> johan
>
>
>
> On Fri, Nov 7, 2014 at 6:23 PM, Raghavendra D Prabhu <
> <email address hidden>> wrote:
>
>> Make sure debug symbol packages are installed before attaching gdb for
>> that.
>>
>> --
>> You received this bug notification because you are subscribed to the bug
>> report.
>> https://bugs.launchpad.net/bugs/1373796
>>
>> Title:
>> SST: Joiner stuck in 'joining', but error log reports 'synced'
>>
>> Status in MySQL patches by Codership:
>> New
>> Status in MySQL patches by Codership 5.5 series:
>> New
>> Status in MySQL patches by Codership 5.6 series:
>> New
>> Status in Percona XtraDB Cluster - HA scalable solution for MySQL:
>> New
>> Status in Percona XtraDB Cluster 5.5 series:
>> New
>> Status in Percona XtraDB Cluster 5.6 series:
>> New
>>
>> Bug description:
>> Environment:
>> * PXC 5.6.20-68.0-56
>> * Galera provider: 3.7(r7f44a18)
>> * xtrabackup 2.2.4
>> * EC2, m3.medium, ubuntu-trusty-14.04-amd64-server-20140607.1
>> (ami-0307d674)
>>
>> Galera nodes, all are 'synced':
>> 10.0.0.24
>> 10.0.0.25
>> 10.0.0.26
>>
>> ssh 10.0.0.24
>> sudo killall -9 mysqld mysqld_safe
>> # remove mysql data files to have an SST:
>> sudo rm -rf /mnt/data/mysql/*
>> sudo mysqld_safe &
>>
>> 10.0.0.24 is now the JOINER and writes this in the mysql error log:
>>
>> 140925 07:53:55 mysqld_safe Starting mysqld daemon with databases from
>> /mnt/data/mysql/
>> 140925 07:53:55 mysqld_safe Skipping wsrep-recover for empty datadir:
>> /mnt/data/mysql/
>> 140925 07:53:55 mysqld_safe Assigning
>> 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
>> 2014-09-25 07:53:56 0 [Note] WSREP: wsrep_start_position var submitted:
>> '00000000-0000-0000-0000-000000000000:-1'
>> 2014-09-25 07:53:56 28987 [Note] WSREP: Read nil XID from storage
>> engines, skipping position init
>> 2014-09-25 07:53:56 28987 [Note] WSREP: wsrep_load(): loading provider
>> library '/usr/lib/libgalera_smm.so'
>> 2014-09-25 07:53:56 28987 [Note] WSREP: wsrep_load(): Galera
>> 3.7(r7f44a18) by Codership Oy <email address hidden> loaded successfully.
>> 2014-09-25 07:53:56 28987 [Note] WSREP: CRC-32C: using hardware
>> acceleration.
>> 2014-09-25 07:53:56 28987 [Warning] WSREP: Could not open saved state
>> file for reading: /mnt/data/mysql//grastate.dat
>> 2014-09-25 07:53:56 28987 [Note] WSREP: Found saved state:
>> 00000000-0000-0000-0000-00000...

@Johan,

Thanks for confirming. I will mark this as fix released for now.

I've still been having this issue with the below packages. It always seemed like an intermittent problem. Recently I was able to track down something interesting. The node gets stuck in 'joining' whenever it gets SST from a node that was started with bootstrap-pxc.

Reproduce failure:

Build and start two servers.
Build a 3rd server and specify the donor to be the node started with bootstrap-pxc
It should get hung on joining

Reproduce success:

Build and start two servers.
Build a 3rd server and specify the donor to be the node started with a standard 'service mysql start'
It should connect fine.

Any thoughts?

percona-xtrabackup_2.2.10-1.precise_amd64.deb
percona-xtradb-cluster-56_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-client-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-common-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-server-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-galera-3.x_3.9.3494.precise_amd64.deb
percona-xtradb-cluster-galera-3_3.9.3494.precise_amd64.deb

johan (johan-severalnines) wrote :

Hi,

I still see it sometimes too.
Maybe it has been said before, but the workaround is to have write traffic on the galera cluster so an IST is triggered after the SST...
But it can be hard in some situations to have that.

BR
johan

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1016

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers