Problems reconnecting after network failure

Bug #1153656 reported by Sean Fulton
This bug affects 3 people
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Fix Committed
Importance: Medium
Assigned to: Unassigned

Bug Description

We set up a four-node cluster for testing, with two nodes in each of two data centers. A router problem in one data center knocked both nodes in that center offline. The other two remain online.

When attempting to start either of the two crashed nodes, we get the following before the cluster software shuts itself down:

130311 11:49:24 mysqld_safe mysqld from pid file /var/lib/mysql/chicago-gcn2.pid ended
130311 11:51:58 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
130311 11:51:58 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.L2GwaYnFTh
130311 11:52:04 mysqld_safe WSREP: Recovered position eb6e22bd-880e-11e2-0800-c522d0f79a65:5661
130311 11:52:04 [Note] WSREP: wsrep_start_position var submitted: 'eb6e22bd-880e-11e2-0800-c522d0f79a65:5661'
130311 11:52:04 [Note] WSREP: Read nil XID from storage engines, skipping position init
130311 11:52:04 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
130311 11:52:04 [Note] WSREP: wsrep_load(): Galera 2.3(r143) by Codership Oy <email address hidden> loaded succesfully.
130311 11:52:04 [Warning] WSREP: Could not open saved state file for reading: /var/lib/mysql//grastate.dat
130311 11:52:04 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
130311 11:52:04 [Note] WSREP: Preallocating 34359739688/34359739688 bytes in '/var/lib/mysql//galera.cache'...
130311 11:52:04 [Note] WSREP: Passing config to GCS: base_host = 50.31.163.199; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 1G; gcache.size = 32G; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
130311 11:52:04 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
130311 11:52:04 [Note] WSREP: wsrep_sst_grab()
130311 11:52:04 [Note] WSREP: Start replication
130311 11:52:04 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
130311 11:52:04 [Note] WSREP: protonet asio version 0
130311 11:52:04 [Note] WSREP: backend: asio
130311 11:52:04 [Note] WSREP: GMCast version 0
130311 11:52:04 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
130311 11:52:04 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
130311 11:52:04 [Note] WSREP: EVS version 0
130311 11:52:04 [Note] WSREP: PC version 0
130311 11:52:04 [Note] WSREP: gcomm: connecting to group 'gcnmedia_db_cluster', peer '74.201.38.114:,74.201.39.114:,50.31.163.135:,50.31.163.199:'
130311 11:52:04 [Warning] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' points to own listening address, blacklisting
130311 11:52:04 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:04 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:04 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:04 [Note] WSREP: declaring 597c1e1b-890c-11e2-0800-6682279acf81 stable
130311 11:52:04 [Note] WSREP: declaring 7b00d434-8909-11e2-0800-2ebd244f30d6 stable
130311 11:52:05 [Note] WSREP: view(view_id(NON_PRIM,597c1e1b-890c-11e2-0800-6682279acf81,222) memb {
 597c1e1b-890c-11e2-0800-6682279acf81,
 7b00d434-8909-11e2-0800-2ebd244f30d6,
 9ef8b875-8a63-11e2-0800-5524043be1f1,
} joined {
} left {
} partitioned {
 2c87fd2b-8a63-11e2-0800-c672a249200b,
 8ddcc044-89d8-11e2-0800-7750514952a9,
 bc3c9605-89d0-11e2-0800-2987439b0a42,
 c1724cd4-8a62-11e2-0800-ff44ef33acde,
 f991ab7d-8a62-11e2-0800-69d0bed5b9d2,
})
130311 11:52:06 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:06 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:07 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:07 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:09 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:09 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:10 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:10 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:12 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:12 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:13 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:13 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:15 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:15 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:16 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:16 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:17 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:17 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:19 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:19 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:20 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:20 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:22 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:22 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:23 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:23 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:25 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:25 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:26 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:26 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:28 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:28 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:29 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:29 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:31 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:31 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:32 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:32 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:34 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:34 [Note] WSREP: (9ef8b875-8a63-11e2-0800-5524043be1f1, 'tcp://0.0.0.0:4567') address 'tcp://50.31.163.199:4567' pointing to uuid 9ef8b875-8a63-11e2-0800-5524043be1f1 is blacklisted, skipping
130311 11:52:35 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
  at gcomm/src/pc.cpp:connect():139
130311 11:52:35 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
130311 11:52:35 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'gcnmedia_db_cluster' at 'gcomm://74.201.38.114,74.201.39.114,50.31.163.135,50.31.163.199': -110 (Connection timed out)
130311 11:52:35 [ERROR] WSREP: gcs connect failed: Connection timed out
130311 11:52:35 [ERROR] WSREP: wsrep::connect() failed: 6
130311 11:52:35 [ERROR] Aborting

130311 11:52:35 [Note] WSREP: Service disconnected.
130311 11:52:36 [Note] WSREP: Some threads may fail to exit.
130311 11:52:36 [Note] /usr/sbin/mysqld: Shutdown complete

130311 11:52:36 mysqld_safe mysqld from pid file /var/lib/mysql/chicago-gcn2.pid ended

Only two nodes are currently running.
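
The two entries in the log above that carry the diagnosis are the `NON_PRIM` view and the final `failed to reach primary view` timeout. As a minimal sketch, assuming the log format shown in this report, the check can be automated as below. The helper names are hypothetical, and the `PRIM` token for a primary view is an assumption; only `NON_PRIM` appears in the log above.

```python
import re

# Matches the component status in a Galera view line such as
# "WSREP: view(view_id(NON_PRIM,597c1e1b-...,222) memb {".
# "PRIM" is assumed to be the token logged for a primary view.
VIEW_RE = re.compile(r"view\(view_id\((PRIM|NON_PRIM),")


def last_view_status(log_text):
    """Return the status of the last view line in the log, or None."""
    matches = VIEW_RE.findall(log_text)
    return matches[-1] if matches else None


def timed_out_reaching_primary(log_text):
    """True if the node gave up waiting for a primary component."""
    return "failed to reach primary view" in log_text


if __name__ == "__main__":
    # Two lines copied from the log in this report.
    sample = (
        "130311 11:52:05 [Note] WSREP: view(view_id(NON_PRIM,"
        "597c1e1b-890c-11e2-0800-6682279acf81,222) memb {\n"
        "130311 11:52:35 [ERROR] WSREP: failed to open gcomm backend "
        "connection: 110: failed to reach primary view: 110 "
        "(Connection timed out)\n"
    )
    print(last_view_status(sample))
    print(timed_out_reaching_primary(sample))
```

A `NON_PRIM` view followed by that timeout suggests the joining node did reach its peers, but that those peers were not part of a primary component it could join.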

Here are the relevant portions of my.cnf:

# Galera stuff
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_provider_options="gcache.size=32G; gcache.page_size=1G"
wsrep_cluster_address=gcomm://74.201.38.114,74.201.39.114,50.31.163.135,50.31.163.199
wsrep_cluster_name='gcnmedia_db_cluster'
wsrep_node_name='chicago-gcn2'
wsrep_node_address='50.31.163.199'
wsrep_sst_method=xtrabackup
wsrep_slave_threads=8
socket.ssl_cert='/etc/ssl/galera-cert.pem'
socket.ssl_key='/etc/ssl/galera-key.pem'

We're having the same problem on the other crashed node, i.e., it repeatedly blacklists its own IP address before shutting down. We've tried removing the .dat file as well as the cache file to force it to reload from one of the two primary nodes, but to no effect.

A copy of the full logs from today, including the disconnect during the router issue, is attached.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Sean,

After 2 of your 4 servers crashed, the cluster lost quorum, because the two survivors are only half of the membership rather than a majority. Can you try running -- SET GLOBAL wsrep_provider_options="pc.bootstrap=1"; -- on any one of the surviving nodes and see if bringing up the other nodes causes them to connect successfully?
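
The quorum loss can be illustrated with a small calculation. This is a minimal sketch of the majority rule, assuming every node has equal weight and no node left the cluster gracefully; `has_quorum` is a hypothetical helper for illustration, not part of Galera.

```python
def has_quorum(reachable, previous_members):
    """Return True when the reachable nodes form a strict majority.

    Simplified model: each node has weight 1, and all nodes that are
    missing disappeared without leaving the cluster gracefully.
    """
    if previous_members <= 0 or not 0 <= reachable <= previous_members:
        raise ValueError("need 0 <= reachable <= previous_members, members > 0")
    # Exactly half is not enough: the group must hold more than half.
    return 2 * reachable > previous_members


if __name__ == "__main__":
    # The layout in this report: 4 nodes, 2 per data center.
    print(has_quorum(2, 4))  # False: an even split leaves no majority
    print(has_quorum(3, 4))  # True
```

Under this model an even split across two sites leaves neither half primary, which is why a surviving node has to be told explicitly to bootstrap a new primary component.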

Sean Fulton (sean-v) wrote :

Sorry it has been a while since I responded. This worked like a charm. Can you put it in the docs somewhere??

sean

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

It is already in the documentation (http://www.percona.com/doc/percona-xtradb-cluster/faq.html), but I will make it clearer.

Changed in percona-xtradb-cluster:
status: New → Fix Committed
Changed in percona-xtradb-cluster:
importance: Undecided → Medium
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1056
