1 node hung after network connectivity loss
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Galera | New | Undecided | Unassigned |
Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Incomplete | Undecided | Unassigned |
Bug Description
Hi,
We have a 5-node cluster, all running percona-
data center 1:
dbe0 => writes
dbe1 => idle hot backup; backup SST/IST donor
dbe2 => mysqldump backups; main SST/IST donor
data center 2:
dbw0 => writes
dbw1 => idle hot backup; backup SST/IST donor
The network in data center 1 started flapping up and down. All nodes rejoined the cluster properly except for dbe2.
dbe2 has wsrep_sst_donor = dbe1,dbw1
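For context, `wsrep_sst_donor` takes a comma-separated preference list, and per the Galera documentation a terminating comma allows the provider to fall back to any other synced node if none of the named donors are available. A hypothetical my.cnf fragment (the trailing comma is the addition; the bug's actual setting has no trailing comma, which restricts donors to the named two):

```ini
[mysqld]
# Preferred donors in order. The trailing comma (not present in the
# reporter's config) would permit falling back to any other synced node.
wsrep_sst_donor = dbe1,dbw1,
```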
It looks like it started an IST:
131021 11:34:02 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 74,
members = 4/5 (joined/total),
act_id = 45289158,
last_appl. = 45286450,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 73a23f83-
131021 11:34:02 [Note] WSREP: Flow-control interval: [36, 36]
131021 11:34:02 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 45289158)
131021 11:34:02 [Note] WSREP: State transfer required:
Group state: 73a23f83-
Local state: 73a23f83-
131021 11:34:02 [Note] WSREP: New cluster view: global state: 73a23f83-
131021 11:34:02 [Warning] WSREP: Gap in state sequence. Need state transfer.
131021 11:34:02 [Note] WSREP: Setting wsrep_ready to 0
131021 11:34:02 [Note] WSREP: [debug]: closing client connections for PRIM
131021 11:34:02 [Note] WSREP: closing connection 106063
131021 11:34:02 [Note] WSREP: closing connection 106062
131021 11:34:02 [Note] WSREP: closing connection 106061
131021 11:34:03 [Note] WSREP: (2b9c06dd-
131021 11:34:04 [Note] WSREP: (2b9c06dd-
131021 11:34:04 [Note] WSREP: killing local connection: 106063
131021 11:34:04 [Note] WSREP: killing local connection: 106062
131021 11:34:04 [Note] WSREP: killing local connection: 106061
131021 11:34:04 [Note] WSREP: waiting for client connections to close: 36
131021 11:34:04 [Note] WSREP: You have configured 'rsync_wan' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
131021 11:34:04 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131021 11:34:04 [Note] WSREP: Assign initial position for certification: 45289158, protocol version: 2
131021 11:34:04 [Note] WSREP: Prepared IST receiver, listening at: tcp://69.
131021 11:34:05 [Note] WSREP: (2b9c06dd-
131021 11:34:05 [Note] WSREP: Node 1 (dbe2) requested state transfer from 'dbe1'. Selected 3 (dbe1)(SYNCED) as donor.
131021 11:34:05 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 45289457)
131021 11:34:05 [Note] WSREP: Requesting state transfer: success, donor: 3
131021 11:34:05 [Note] WSREP: GCache DEBUG: RingBuffer:
131021 11:34:05 [Note] WSREP: GCache DEBUG: RingBuffer:
131021 11:34:05 [Note] WSREP: (2b9c06dd-
131021 11:34:05 [Note] WSREP: Receiving IST: 2315 writesets, seqnos 45286843-45289158
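The writeset count in that last line can be cross-checked against the seqno range: the donor streams everything after the joiner's last applied seqno up to the group seqno, so the count is the plain difference. A minimal sketch (the helper name `ist_writesets` is illustrative, not a real tool):

```shell
# Number of writesets an IST must transfer, given the joiner's last
# applied seqno and the group seqno, assuming the donor sends the
# half-open range (local_seqno, group_seqno].
ist_writesets() {
  local local_seqno=$1 group_seqno=$2
  echo $(( group_seqno - local_seqno ))
}

ist_writesets 45286843 45289158   # prints 2315, matching the log line
```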
But there were more network problems which caused the IST to stop:
131021 11:36:24 [Warning] WSREP: Could not find peer:
131021 11:36:24 [Warning] WSREP: 3 (dbe1): State transfer to -1 (left the group) failed: -110 (Connection timed out)
131021 11:36:24 [Warning] WSREP: Protocol violation. JOIN message sender 3 (dbe1) is not in state transfer (JOINED). Message ignored.
131021 11:36:25 [Note] WSREP: Member 3 (dbe1) synced with group.
About 30 minutes later, it was still waiting for the IST:
+------
| Variable_name | Value |
+------
| wsrep_local_
| wsrep_protocol_
| wsrep_last_
| wsrep_replicated | 0 |
| wsrep_replicate
| wsrep_received | 13667170 |
| wsrep_received_
| wsrep_local_commits | 0 |
| wsrep_local_
| wsrep_local_
| wsrep_local_replays | 0 |
| wsrep_local_
| wsrep_local_
| wsrep_local_
| wsrep_local_
| wsrep_flow_
| wsrep_flow_
| wsrep_flow_
| wsrep_cert_
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 1 |
| wsrep_local_
| wsrep_cert_
| wsrep_causal_reads | 0 |
| wsrep_incoming_
| wsrep_cluster_
| wsrep_cluster_size | 5 |
| wsrep_cluster_
| wsrep_cluster_
| wsrep_connected | ON |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_
| wsrep_provider_
| wsrep_ready | OFF |
+------
40 rows in set (0.00 sec)
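The telling values above are `wsrep_local_state = 1` together with `wsrep_ready = OFF`: per the Galera documentation, state 1 is Joining, so the node still considered itself mid-state-transfer half an hour after the IST stalled. A small sketch mapping the documented state codes:

```shell
# Map wsrep_local_state codes to names, per the Galera documentation.
# The stuck node above reports 1, i.e. permanently "Joining".
wsrep_state_name() {
  case "$1" in
    1) echo "Joining" ;;
    2) echo "Donor/Desynced" ;;
    3) echo "Joined" ;;
    4) echo "Synced" ;;
    *) echo "Unknown" ;;
  esac
}

wsrep_state_name 1   # prints Joining
```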
I tried 'sudo /etc/init.d/mysql restart':
131021 12:49:28 [Note] /usr/sbin/mysqld: Normal shutdown
131021 12:49:28 [Note] WSREP: Stop replication
131021 12:49:28 [Note] WSREP: Provider disconnect
131021 12:49:28 [Note] WSREP: Closing send monitor...
131021 12:49:28 [Note] WSREP: Closed send monitor.
131021 12:49:28 [Note] WSREP: gcomm: terminating thread
131021 12:49:28 [Note] WSREP: gcomm: joining thread
131021 12:49:28 [Note] WSREP: gcomm: closing backend
131021 12:49:28 [Note] WSREP: view(view_
2b9c06dd-
} joined {
} left {
} partitioned {
125d676e-
5f847da2-
8e2be836-
a81b66cd-
})
131021 12:49:28 [Note] WSREP: view((empty))
131021 12:49:28 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
131021 12:49:28 [Note] WSREP: gcomm: closed
131021 12:49:28 [Note] WSREP: Flow-control interval: [16, 16]
131021 12:49:28 [Note] WSREP: Received NON-PRIMARY.
131021 12:49:28 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 45355416)
131021 12:49:28 [Note] WSREP: Received self-leave message.
131021 12:49:28 [Note] WSREP: Flow-control interval: [0, 0]
131021 12:49:28 [Note] WSREP: Received SELF-LEAVE. Closing connection.
131021 12:49:28 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 45355416)
131021 12:49:28 [Note] WSREP: RECV thread exiting 0: Success
131021 12:49:28 [Note] WSREP: recv_thread() joined.
131021 12:49:28 [Note] WSREP: Closing replication queue.
131021 12:49:28 [Note] WSREP: Closing slave action queue.
131021 12:49:28 [Note] WSREP: closing connection 106460
131021 12:49:29 [Note] WSREP: Before Lock_thread_count
131021 12:49:30 [Note] WSREP: killing local connection: 106460
131021 12:49:30 [Note] WSREP: waiting for client connections to close: 34
It hung here for a few minutes; the lingering client connection was probably from our custom monitoring script.
I finally kill -9'd it, and it restarted okay:
Killed
131021 12:53:21 mysqld_safe Number of processes running now: 0
131021 12:53:21 mysqld_safe WSREP: not restarting wsrep node automatically
131021 12:53:21 mysqld_safe mysqld from pid file /var/lib/
131021 12:53:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
131021 12:53:23 mysqld_safe Skipping wsrep-recover for 73a23f83-
131021 12:53:23 mysqld_safe Assigning 73a23f83-
131021 12:53:23 [Note] WSREP: wsrep_start_
131021 12:53:23 [Note] WSREP: Setting wsrep_ready to 0
131021 12:53:23 [Note] WSREP: Read nil XID from storage engines, skipping position init
131021 12:53:23 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/
131021 12:53:23 [Note] WSREP: wsrep_load(): Galera 2.7(r157) by Codership Oy <email address hidden> loaded succesfully.
131021 12:53:23 [Note] WSREP: Found saved state: 73a23f83-
131021 12:53:23 [Note] WSREP: Reusing existing '/var/lib/
131021 12:53:23 [Note] WSREP: Passing config to GCS: base_host = 69.164.216.206; base_port = 4567; cert.log_conflicts = no; evs.debug_log_mask = EVS_DEBUG; evs.inactive_
131021 12:53:23 [Note] WSREP: Assign initial position for certification: 45286843, protocol version: -1
131021 12:53:23 [Note] WSREP: wsrep_sst_grab()
131021 12:53:23 [Note] WSREP: Start replication
131021 12:53:23 [Note] WSREP: Setting initial position to 73a23f83-
131021 12:53:23 [Note] WSREP: protonet asio version 0
131021 12:53:23 [Note] WSREP: backend: asio
131021 12:53:23 [Note] WSREP: GMCast version 0
131021 12:53:23 [Note] WSREP: (ae2dc584-
131021 12:53:23 [Note] WSREP: (ae2dc584-
131021 12:53:23 [Note] WSREP: EVS version 0
131021 12:53:23 [Note] WSREP: PC version 0
131021 12:53:23 [Note] WSREP: gcomm: connecting to group 'c1production', peer '162.216.
Then I messed up and accidentally ran 'sudo /etc/init.d/mysql restart' again while in the middle of the IST.
Thankfully, it restarted okay and did a successful IST and rejoined the cluster.
I expected the initial IST to time out or fail and then try the next wsrep_sst_donor, dbw1.
Is this warning about rsync_wan relevant here, or does it only apply to SST?
131021 11:34:04 [Note] WSREP: You have configured 'rsync_wan' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
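The warning only matters if the provider has to fall back from IST to a full SST: rsync-based SST cannot run on an already-started joiner, so a restart would be needed. As a sketch of the usual workaround on Percona XtraDB Cluster of this era (the credentials below are hypothetical), an xtrabackup-based SST avoids that restriction:

```ini
[mysqld]
# rsync/rsync_wan SST cannot be performed on a running joiner;
# xtrabackup-based SST does not have that limitation.
wsrep_sst_method = xtrabackup
wsrep_sst_auth   = sstuser:sstpassword   # hypothetical credentials
```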
The full log should be attached.
Changed in percona-xtradb-cluster:
status: New → Incomplete
============
131021 11:36:24 [Warning] WSREP: Could not find peer:
131021 11:36:24 [Warning] WSREP: 3 (dbe1): State transfer to -1 (left the group) failed: -110 (Connection timed out)
131021 11:36:24 [Warning] WSREP: Protocol violation. JOIN message sender 3 (dbe1) is not in state transfer (JOINED). Message ignored.
131021 11:36:25 [Note] WSREP: Member 3 (dbe1) synced with group.
==================
Looks like the IST timed out here during the network issues. This is related to lp:1202241.
Adding Galera, since this is a Galera issue.