Bug #1237816 “All nodes but one closes down” : Series 5.5 : Bugs : MySQL patches by Codership

Daniel Ylitalo (danigl) on 2013-10-10

description:

updated

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-10-10:

#1

I would agree about DELETE, and in general inconsistency policy should be configurable.

Having said that, such inconsistency is likely to be a result of a bug and not isolated, so ignoring this error on DELETE may not buy you much operational time. Upgrading to the latest release is more likely to solve your problem.

Revision history for this message

Jason Bryan (jbryan) wrote on 2014-01-01:

#2

Hi,

Our Bugzilla app is on a 3 node multi-master cluster and having the same issue. Can you provide a bit more understanding as to why this happens?

131231 18:16:43 [ERROR] Slave SQL: Could not execute Update_rows event on table bugzilla.attachments; Can't find record in 'attachments', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 279, Error_code: 1032
131231 18:16:43 [Warning] WSREP: RBR event 2 Update_rows apply warning: 120, 1593958
131231 18:16:43 [Note] WSREP: declaring 62a92d9f-7179-11e3-b4ca-de001387c324 stable
131231 18:16:43 [ERROR] WSREP: Failed to apply trx: source: 62a92d9f-7179-11e3-b4ca-de001387c324 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 32030272 trx_id: 8936786 seqnos (l: 550435, g: 1593958, s: 1593957, d: 1593718, ts: 1388535403140834735)
131231 18:16:43 [ERROR] WSREP: Failed to apply app buffer: seqno: 1593958, status: WSREP_FATAL
at galera/src/replicator_smm.cpp:apply_wscoll():52
at galera/src/replicator_smm.cpp:apply_trx_ws():118
131231 18:16:43 [ERROR] WSREP: Node consistency compromized, aborting...

On the front, there is a 4 node web cluster serving BZ. I had made a config change and restarted httpd on each node when the inconsistency occurred. I can't help to think that was a contributor.

MariaDB-Galera-server-5.5.34-1.x86_64
wsrep_provider_name Galera
wsrep_provider_vendor Codership Oy <email address hidden>
wsrep_provider_version 23.2.7(r157)

wsrep_replicate_myisam ON (all BZ tables InnoDB)
wsrep_retry_autocommit 1
wsrep_slave_threads 1
wsrep_sst_auth
wsrep_sst_donor
wsrep_sst_donor_rejects_queries OFF
wsrep_sst_method rsync
wsrep_sst_receive_address AUTO

Is this caused by some app race condition some other factor? I am also load balancing :3306 with haproxy on each of the 4 web nodes. haproxy.cfg:

listen mysql_proxy 127.0.0.1:3306
   timeout connect 5s
   timeout client 1m
   timeout server 1m
   mode tcp
   balance roundrobin
   option tcpka
   option mysql-check user haproxy
   server web-db1 10.0.0.1:3306 check weight 1
   server web-db2 10.0.0.2:3306 check weight 1
   server web-db3 10.0.0.3:3306 check weight 1

Thank you.

Hi,

Our Bugzilla app is on a 3 node multi-master cluster and having the same issue. Can you provide a bit more understanding as to why this happens?

131231 18:16:43 [ERROR] Slave SQL: Could not execute Update_rows event on table bugzilla.attachments; Can't find record in 'attachments', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 279, Error_code: 1032
131231 18:16:43 [Warning] WSREP: RBR event 2 Update_rows apply warning: 120, 1593958
131231 18:16:43 [Note] WSREP: declaring 62a92d9f-7179-11e3-b4ca-de001387c324 stable
131231 18:16:43 [ERROR] WSREP: Failed to apply trx: source: 62a92d9f-7179-11e3-b4ca-de001387c324 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 32030272 trx_id: 8936786 seqnos (l: 550435, g: 1593958, s: 1593957, d: 1593718, ts: 1388535403140834735)
131231 18:16:43 [ERROR] WSREP: Failed to apply app buffer: seqno: 1593958, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118
131231 18:16:43 [ERROR] WSREP: Node consistency compromized, aborting...

On the front, there is a 4 node web cluster serving BZ. I had made a config change and restarted httpd on each node when the inconsistency occurred. I can't help to think that was a contributor.

MariaDB-Galera-server-5.5.34-1.x86_64
wsrep_provider_name     Galera
wsrep_provider_vendor  Codership Oy <info@codership.com>
wsrep_provider_version  23.2.7(r157)

wsrep_replicate_myisam	ON  (all BZ tables InnoDB)
wsrep_retry_autocommit	1
wsrep_slave_threads	1
wsrep_sst_auth	
wsrep_sst_donor	
wsrep_sst_donor_rejects_queries	OFF
wsrep_sst_method	rsync
wsrep_sst_receive_address	AUTO

Is this caused by some app race condition some other factor? I am also load balancing :3306 with haproxy on each of the 4 web nodes. haproxy.cfg:

listen mysql_proxy 127.0.0.1:3306
   timeout connect 5s
   timeout client 1m
   timeout server 1m
   mode tcp
   balance roundrobin
   option tcpka
   option mysql-check user haproxy
   server web-db1 10.0.0.1:3306 check weight 1
   server web-db2 10.0.0.2:3306 check weight 1
   server web-db3 10.0.0.3:3306 check weight 1

Thank you.

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2014-01-02:

#3

Hi Jason,

As follows from the log, the node was missing a row to update. Assuming that all nodes but one had this situation it is logical to assume that the remaining node was inconsistent with the rest of the cluster.

It could happen for many reasons and arbitrarily long ago. But the most likely cause is human error when restarting the cluster. If you could carefully recall EXACTLY how you did it you may find a clue.

Restarting http servers and ha-proxy load balancing is not a CAUSE of inconsistency, unless there is some nasty bug. Such bugs proved to be rather rare (1 or 2 through the history), so you should look into how you restarted the nodes first. ha-proxy balancing could CONTRIBUTE to it if you start the nodes in wrong order.

Revision history for this message

Ivan Pantovic (ivan-pantovicc) wrote on 2014-04-28:

#4

Download full text (10.4 KiB)

Hi Alex, i've been strugling with a very similar problem on 5.6 for a month now.

so far, it is usually a delete statement like this:

Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: Retrying 4th time
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [ERROR] Slave SQL: Could not execute Delete_rows event on table gotcourts.Exception_Courts; Can't find record in 'Exception_Courts', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 176, Error_code: 1032
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 593017
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [ERROR] WSREP: Failed to apply trx: source: 4ba1d787-ceb4-11e3-a510-5fed256959fa version: 3 local: 0 state: APPLYING flags: 129 conn_id: 28595 trx_id: 59111915 seqnos (l: 1654, g: 593017, s: 593016, d: 593016, ts: 94625220425952)
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [ERROR] WSREP: Failed to apply trx 593017 4 times
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [ERROR] WSREP: Node consistency compromized, aborting...
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: Closing send monitor...
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: Closed send monitor.
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: gcomm: terminating thread
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: gcomm: joining thread
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: gcomm: closing backend
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: view(view_id(NON_PRIM,4ba1d787-ceb4-11e3-a510-5fed256959fa,172) memb {
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: #011c980479d-ceb3-11e3-961f-9bc7520e933a,0
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: } joined {
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: } left {
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: } partitioned {
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: #0114ba1d787-ceb4-11e3-a510-5fed256959fa,0
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: #011e2745e2c-ceb3-11e3-95dc-daaca2f50d15,0
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: })
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: view((empty))
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: gcomm: closed
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Flow-control interval: [16, 16]
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Received NON-PRIMARY.
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 593017)
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Received self-leave message.
Apr 28 17:24:49 vm-gotc-...

Hi Alex, i've been strugling with a very similar problem on 5.6 for a month now.

so far, it is usually a delete statement like this:

Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: Retrying 4th time
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [ERROR] Slave SQL: Could not execute Delete_rows event on table gotcourts.Exception_Courts; Can't find record in 'Exception_Courts', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 176, Error_code: 1032
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 593017
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [ERROR] WSREP: Failed to apply trx: source: 4ba1d787-ceb4-11e3-a510-5fed256959fa version: 3 local: 0 state: APPLYING flags: 129 conn_id: 28595 trx_id: 59111915 seqnos (l: 1654, g: 593017, s: 593016, d: 593016, ts: 94625220425952)
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [ERROR] WSREP: Failed to apply trx 593017 4 times
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [ERROR] WSREP: Node consistency compromized, aborting...
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: Closing send monitor...
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: Closed send monitor.
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: gcomm: terminating thread
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: gcomm: joining thread
Apr 28 17:24:48 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:48 28224 [Note] WSREP: gcomm: closing backend
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: view(view_id(NON_PRIM,4ba1d787-ceb4-11e3-a510-5fed256959fa,172) memb {
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: #011c980479d-ceb3-11e3-961f-9bc7520e933a,0
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: } joined {
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: } left {
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: } partitioned {
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: #0114ba1d787-ceb4-11e3-a510-5fed256959fa,0
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: #011e2745e2c-ceb3-11e3-95dc-daaca2f50d15,0
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: })
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: view((empty))
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: gcomm: closed
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Flow-control interval: [16, 16]
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Received NON-PRIMARY.
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 593017)
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Received self-leave message.
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Flow-control interval: [0, 0]
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Received SELF-LEAVE. Closing connection.
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 593017)
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: RECV thread exiting 0: Success
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: recv_thread() joined.
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Closing replication queue.
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: Closing slave action queue.
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld: 2014-04-28 17:24:49 28224 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld_safe: Number of processes running now: 0
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld_safe: WSREP: not restarting wsrep node automatically
Apr 28 17:24:49 vm-gotc-prod-mysql1 mysqld_safe: mysqld from pid file /var/lib/mysql/vm-gotc-prod-mysql1.localdomain.pid ended

i have noticed two things.

we have a stage enviroment . .quite the same as the production. the problem is present there also. First I'm unable to reproduce it without doing update or delete but then it just manifests itself. I can't locate what is the main trigger that puts the cluster in the pre fail state. Any ideas would be welcome.

The second thing is that when two out of three nodes fail, the one remaining node should fail to continue operating since there is no quorum but that is not the case. It is not the node started with gconn:// and event the first node that was started like that is later, when cluster is running three nodes, restarted without that param. (bootstrap-pxe).

And third question is how resilient is the cluster communication between the nodes? I see some errors bursts in logs from time to time like:

Apr 28 08:52:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.100.32:4567 
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') reconnecting to 313c9296-ce0c-11e3-87a2-62f56f8868cd (tcp://192.168.100.32:4567), attempt 0
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') cleaning up duplicate 0x7f82b8022a30 after established 0x7f82b800d700
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') turning message relay requesting off
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:52:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:52:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:57:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:57:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:57:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:57:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:57:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:57:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:57:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:57:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping
Apr 28 08:57:17 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:57:17 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.100.32:4567 
Apr 28 08:57:18 vm-gotc-prod-mysql1 mysqld: 2014-04-28 08:57:18 2081 [Note] WSREP: (9b1c6fe8-ce0c-11e3-9815-0a774155e7b5, 'tcp://0.0.0.0:4567') address 'tcp://192.168.100.31:4567' pointing to uuid 9b1c6fe8-ce0c-11e3-9815-0a774155e7b5 is 
blacklisted, skipping

Oh and this might be of some help, but on the stage enviroment, it is always the node1 that survives and on the producion enviroment it's always the node3 but they are all the same, bootstrapped from the same CentOS 6.5 with different virtual machines running on different nodes and load is balanced equally over all three nodes. I would expect to see random failuers but that is not the case.

Revision history for this message

fabe (fabe-e) wrote on 2014-07-15:

#5

Download full text (4.0 KiB)

Not sure if we have the same problem, but it is in production on 3 servers and two more are coming as soon as the user count rises, this is extremely bad, any suggestion is welcome. Two servers crash at same time when this duplicate occurs and main node brain splits, perhaps a tmp sollution if this will hapen more often to disable split brain until bug is resolved or whatever is happening?

from log:
140702 3:43:07 ...
140713 20:34:53 [ERROR] Slave SQL: Could not execute Write_rows event on table r.r_s; Duplicate entry '267444' for key 'subscriber_id_UNIQUE', Error_code: 1062; handler error HA_ER
R_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 311, Error_code: 1062
140713 20:34:53 [Warning] WSREP: RBR event 2 Write_rows apply warning: 121, 298496015
140713 20:34:53 [ERROR] WSREP: Failed to apply trx: source: b7f40923-da82-11e3-bbc0-7b766eddf37e version: 2 local: 0 state: APPLYING flags: 1 conn_id: 9243737 trx_id: 819005834 seqnos (l: 20583
3760, g: 298496015, s: 298496001, d: 298496014, ts: 1405272892721099371)
140713 20:34:53 [ERROR] WSREP: Failed to apply app buffer: seqno: 298496015, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118
140713 20:34:53 [ERROR] WSREP: Node consistency compromized, aborting...
140713 20:34:53 [Note] WSREP: Closing send monitor...
140713 20:34:53 [Note] WSREP: Closed send monitor.
140713 20:34:53 [Note] WSREP: gcomm: terminating thread
140713 20:34:53 [Note] WSREP: gcomm: joining thread
140713 20:34:53 [Note] WSREP: declaring b7f40923-da82-11e3-bbc0-7b766eddf37e stable
140713 20:34:53 [Note] WSREP: gcomm: closing backend
140713 20:34:53 [Warning] WSREP: Failed to report last committed 298495993, -77 (File descriptor in bad state)
140713 20:34:53 [Note] WSREP: Node 010ba639-dce3-11e3-a855-ee8794d217f0 state prim
140713 20:34:53 [Warning] WSREP: user message in state LEAVING
140713 20:34:53 [Warning] WSREP: 010ba639-dce3-11e3-a855-ee8794d217f0 sending install message failed: Transport endpoint is not connected
140713 20:34:53 [Note] WSREP: (010ba639-dce3-11e3-a855-ee8794d217f0, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://22.37.44.11:4567
140713 20:34:53 [Note] WSREP: view(view_id(NON_PRIM,010ba639-dce3-11e3-a855-ee8794d217f0,23) memb {
        010ba639-dce3-11e3-a855-ee8794d217f0,
} joined {
} left {
} partitioned {
        93536422-dcf3-11e3-a37b-1e9808fdc0f6,
        b7f40923-da82-11e3-bbc0-7b766eddf37e,
})
140713 20:34:53 [Note] WSREP: view((empty))
140713 20:34:53 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
140713 20:34:53 [Note] WSREP: gcomm: closed
140713 20:34:53 [Note] WSREP: Flow-control interval: [16, 16]
140713 20:34:53 [Note] WSREP: Received NON-PRIMARY.
140713 20:34:53 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 298496134)
140713 20:34:53 [Note] WSREP: Received self-leave message.
140713 20:34:53 [Note] WSREP: Flow-control interval: [0, 0]
140713 20:34:53 [Note] WSREP: Received SELF-LEAVE. Closing connection.
140713 20:34:53 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 298496134)
140713 20:34:53 [Note] WSREP: RECV thread exiting...

	Status	Importance	Assigned to
MySQL patches by Codership	Status tracked in 5.6
5.5	Confirmed	Low	Unassigned
5.6	Confirmed	Low	Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.5	Fix Committed	Medium	Unassigned
5.6	Fix Committed	Medium	Unassigned

MySQL patches by Codership

All nodes but one closes down

Bug Description

Other bug subscribers

Remote bug watches