Galera

1 node hung after network connectivity loss

Bug #1242839 reported by jolan on 2013-10-21

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Galera	New	Undecided	Unassigned
	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Incomplete	Undecided	Unassigned

Bug Description

Hi,

We have a 5 node cluster all running percona-xtradb-cluster 5.5.33-23.7.6-496.raring.

data center 1:
dbe0 => writes
dbe1 => idle hot backup; backup SST/IST donor
dbe2 => mysqldump backups; main SST/IST donor

data center 2:
dbw0 => writes
dbw1 => idle hot backup; backup SST/IST donor

The network in data center 1 started flapping up and down. All nodes rejoined the cluster properly except for dbe2.

dbe2 has wsrep_sst_donor = dbe1,dbw1

It looks like it started an IST:

131021 11:34:02 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 74,
members = 4/5 (joined/total),
act_id = 45289158,
last_appl. = 45286450,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 73a23f83-0c81-11e3-9fc4-a2a855d3a912
131021 11:34:02 [Note] WSREP: Flow-control interval: [36, 36]
131021 11:34:02 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 45289158)
131021 11:34:02 [Note] WSREP: State transfer required:
Group state: 73a23f83-0c81-11e3-9fc4-a2a855d3a912:45289158
Local state: 73a23f83-0c81-11e3-9fc4-a2a855d3a912:45286843
131021 11:34:02 [Note] WSREP: New cluster view: global state: 73a23f83-0c81-11e3-9fc4-a2a855d3a912:45289158, view# 75: Primary, number of nodes: 5, my index: 1, protocol version 2
131021 11:34:02 [Warning] WSREP: Gap in state sequence. Need state transfer.
131021 11:34:02 [Note] WSREP: Setting wsrep_ready to 0
131021 11:34:02 [Note] WSREP: [debug]: closing client connections for PRIM
131021 11:34:02 [Note] WSREP: closing connection 106063
131021 11:34:02 [Note] WSREP: closing connection 106062
131021 11:34:02 [Note] WSREP: closing connection 106061
131021 11:34:03 [Note] WSREP: (2b9c06dd-2d03-11e3-b52f-cf8cbfed0f77, 'tcp://0.0.0.0:4567') reconnecting to 8e2be836-2d04-11e3-bbd4-f34c587f0b39 (tcp://162.216.17.69:4567), attempt 0
131021 11:34:04 [Note] WSREP: (2b9c06dd-2d03-11e3-b52f-cf8cbfed0f77, 'tcp://0.0.0.0:4567') address 'tcp://69.164.216.206:4567' pointing to uuid 2b9c06dd-2d03-11e3-b52f-cf8cbfed0f77 is blacklisted, skipping
131021 11:34:04 [Note] WSREP: killing local connection: 106063
131021 11:34:04 [Note] WSREP: killing local connection: 106062
131021 11:34:04 [Note] WSREP: killing local connection: 106061
131021 11:34:04 [Note] WSREP: waiting for client connections to close: 36
131021 11:34:04 [Note] WSREP: You have configured 'rsync_wan' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
131021 11:34:04 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131021 11:34:04 [Note] WSREP: Assign initial position for certification: 45289158, protocol version: 2
131021 11:34:04 [Note] WSREP: Prepared IST receiver, listening at: tcp://69.164.216.206:4568
131021 11:34:05 [Note] WSREP: (2b9c06dd-2d03-11e3-b52f-cf8cbfed0f77, 'tcp://0.0.0.0:4567') address 'tcp://69.164.216.206:4567' pointing to uuid 2b9c06dd-2d03-11e3-b52f-cf8cbfed0f77 is blacklisted, skipping
131021 11:34:05 [Note] WSREP: Node 1 (dbe2) requested state transfer from 'dbe1'. Selected 3 (dbe1)(SYNCED) as donor.
131021 11:34:05 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 45289457)
131021 11:34:05 [Note] WSREP: Requesting state transfer: success, donor: 3
131021 11:34:05 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 1073575052 bytes
131021 11:34:05 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 0/300 locked buffers
131021 11:34:05 [Note] WSREP: (2b9c06dd-2d03-11e3-b52f-cf8cbfed0f77, 'tcp://0.0.0.0:4567') reconnecting to a81b66cd-2c64-11e3-830d-33220e91c9d3 (tcp://50.116.43.161:4567), attempt 0
131021 11:34:05 [Note] WSREP: Receiving IST: 2315 writesets, seqnos 45286843-45289158

But there were more network problems which caused the IST to stop:

131021 11:36:24 [Warning] WSREP: Could not find peer:
131021 11:36:24 [Warning] WSREP: 3 (dbe1): State transfer to -1 (left the group) failed: -110 (Connection timed out)
131021 11:36:24 [Warning] WSREP: Protocol violation. JOIN message sender 3 (dbe1) is not in state transfer (JOINED). Message ignored.
131021 11:36:25 [Note] WSREP: Member 3 (dbe1) synced with group.

About 30 minutes after, it was still waiting for the IST:

+----------------------------+------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+----------------------------+------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 73a23f83-0c81-11e3-9fc4-a2a855d3a912 |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 45286843 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_received | 13667170 |
| wsrep_received_bytes | 6101658256 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 56122 |
| wsrep_local_recv_queue_avg | 56089.500000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 1 |
| wsrep_local_state_comment | Joining: receiving State Transfer |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_incoming_addresses | 23.92.31.38:3306,69.164.216.206:3306,162.216.19.139:3306,162.216.17.69:3306,50.116.43.161:3306 |
| wsrep_cluster_conf_id | 75 |
| wsrep_cluster_size | 5 |
| wsrep_cluster_state_uuid | 73a23f83-0c81-11e3-9fc4-a2a855d3a912 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 2.7(r157) |
| wsrep_ready | OFF |
+----------------------------+------------------------------------------------------------------------------------------------+
40 rows in set (0.00 sec)

I tried 'sudo /etc/init.d/mysql restart':

131021 12:49:28 [Note] /usr/sbin/mysqld: Normal shutdown

131021 12:49:28 [Note] WSREP: Stop replication
131021 12:49:28 [Note] WSREP: Provider disconnect
131021 12:49:28 [Note] WSREP: Closing send monitor...
131021 12:49:28 [Note] WSREP: Closed send monitor.
131021 12:49:28 [Note] WSREP: gcomm: terminating thread
131021 12:49:28 [Note] WSREP: gcomm: joining thread
131021 12:49:28 [Note] WSREP: gcomm: closing backend
131021 12:49:28 [Note] WSREP: view(view_id(NON_PRIM,125d676e-2d17-11e3-92c9-26db22af44e7,84) memb {
2b9c06dd-2d03-11e3-b52f-cf8cbfed0f77,
} joined {
} left {
} partitioned {
125d676e-2d17-11e3-92c9-26db22af44e7,
5f847da2-2c57-11e3-946b-fb8673da7ec6,
8e2be836-2d04-11e3-bbd4-f34c587f0b39,
a81b66cd-2c64-11e3-830d-33220e91c9d3,
})
131021 12:49:28 [Note] WSREP: view((empty))
131021 12:49:28 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
131021 12:49:28 [Note] WSREP: gcomm: closed
131021 12:49:28 [Note] WSREP: Flow-control interval: [16, 16]
131021 12:49:28 [Note] WSREP: Received NON-PRIMARY.
131021 12:49:28 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 45355416)
131021 12:49:28 [Note] WSREP: Received self-leave message.
131021 12:49:28 [Note] WSREP: Flow-control interval: [0, 0]
131021 12:49:28 [Note] WSREP: Received SELF-LEAVE. Closing connection.
131021 12:49:28 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 45355416)
131021 12:49:28 [Note] WSREP: RECV thread exiting 0: Success
131021 12:49:28 [Note] WSREP: recv_thread() joined.
131021 12:49:28 [Note] WSREP: Closing replication queue.
131021 12:49:28 [Note] WSREP: Closing slave action queue.
131021 12:49:28 [Note] WSREP: closing connection 106460
131021 12:49:29 [Note] WSREP: Before Lock_thread_count
131021 12:49:30 [Note] WSREP: killing local connection: 106460
131021 12:49:30 [Note] WSREP: waiting for client connections to close: 34

It was hanging here for a few minutes. There probably was a connection from our custom monitoring script.

I finally kill -9'd it and it restarted ok:

Killed
131021 12:53:21 mysqld_safe Number of processes running now: 0
131021 12:53:21 mysqld_safe WSREP: not restarting wsrep node automatically
131021 12:53:21 mysqld_safe mysqld from pid file /var/lib/mysql/dbe2.connected.cc.pid ended
131021 12:53:23 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
131021 12:53:23 mysqld_safe Skipping wsrep-recover for 73a23f83-0c81-11e3-9fc4-a2a855d3a912:45286843 pair
131021 12:53:23 mysqld_safe Assigning 73a23f83-0c81-11e3-9fc4-a2a855d3a912:45286843 to wsrep_start_position
131021 12:53:23 [Note] WSREP: wsrep_start_position var submitted: '73a23f83-0c81-11e3-9fc4-a2a855d3a912:45286843'
131021 12:53:23 [Note] WSREP: Setting wsrep_ready to 0
131021 12:53:23 [Note] WSREP: Read nil XID from storage engines, skipping position init
131021 12:53:23 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
131021 12:53:23 [Note] WSREP: wsrep_load(): Galera 2.7(r157) by Codership Oy <email address hidden> loaded succesfully.
131021 12:53:23 [Note] WSREP: Found saved state: 73a23f83-0c81-11e3-9fc4-a2a855d3a912:45286843
131021 12:53:23 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
131021 12:53:23 [Note] WSREP: Passing config to GCS: base_host = 69.164.216.206; base_port = 4567; cert.log_conflicts = no; evs.debug_log_mask = EVS_DEBUG; evs.inactive_check_period = PT10S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M; evs.keepalive_period = PT3S; evs.send_window = 512; evs.suspect_timeout = PT30S; evs.user_send_window = 512; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 1G; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
131021 12:53:23 [Note] WSREP: Assign initial position for certification: 45286843, protocol version: -1
131021 12:53:23 [Note] WSREP: wsrep_sst_grab()
131021 12:53:23 [Note] WSREP: Start replication
131021 12:53:23 [Note] WSREP: Setting initial position to 73a23f83-0c81-11e3-9fc4-a2a855d3a912:45286843
131021 12:53:23 [Note] WSREP: protonet asio version 0
131021 12:53:23 [Note] WSREP: backend: asio
131021 12:53:23 [Note] WSREP: GMCast version 0
131021 12:53:23 [Note] WSREP: (ae2dc584-3a79-11e3-9120-af07533ae064, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
131021 12:53:23 [Note] WSREP: (ae2dc584-3a79-11e3-9120-af07533ae064, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
131021 12:53:23 [Note] WSREP: EVS version 0
131021 12:53:23 [Note] WSREP: PC version 0
131021 12:53:23 [Note] WSREP: gcomm: connecting to group 'c1production', peer '162.216.19.139:,162.216.17.69:,69.164.216.206:,23.92.31.38:,50.116.43.161:'

Then I messed up and accidentally ran 'sudo /etc/init.d/mysql restart' again while in the middle of the IST.

Thankfully, it restarted okay and did a successful IST and rejoined the cluster.

I expected the initial IST to timeout or fail and try the next wsrep_sst_donor, dbw1.

Is this warning about rsync_wan relevant or is that only for SST?

131021 11:34:04 [Note] WSREP: You have configured 'rsync_wan' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.

The full log should be attached.

Revision history for this message

jolan (jolan) wrote on 2013-10-21:

dbe2 log Edit (162.5 KiB, text/plain)

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-10-22:

============
:

==================

Looks like IST timed out here during network issues. This is related to issue - lp:1202241

Adding galera, since this is a galera issue.

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-10-25:

what's in between those lines? Receiving 2315 writesets is usually instantaneous.

> 131021 11:34:05 [Note] WSREP: Receiving IST: 2315 writesets, seqnos 45286843-45289158
>
> But there were more network problems which caused the IST to stop:
>
> 131021 11:36:24 [Warning] WSREP: Could not find peer:

Donor log would have been nice as well - for the same time period, starting with 131021 11:34:02 [Note] WSREP: Quorum results:

Valerii Kravchuk (valerii-kravchuk) on 2014-07-03