Inconsistent behavior on data consistency abort

Bug #1704404 reported by Przemek on 2017-07-14
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Fix Committed
Undecided
Unassigned
5.7
Fix Committed
Undecided
Unassigned

Bug Description

Currently when nodes abort on data consistency problem, cluster may behave in different ways, depending on how many total nodes and how fast other nodes manage to quit the cluster.

For a two-node cluster, in all my tests the writer remains Primary, so it behaves like clean shutdown of the second node, so quorum is not lost.

However, in three-node cluster - sometimes writer remains primary, sometimes not - depending on how fast the other two nodes leave the cluster. In case both of them are seen by writer as leaving cluster in the same time - writer enters non-Primary state.

Tested on version:
mysql> select @@version,@@version_comment\G
*************************** 1. row ***************************
        @@version: 5.7.18-15-57-log
@@version_comment: Percona XtraDB Cluster (GPL), Release rel15, Revision 7693d6e, WSREP version 29.20, wsrep_29.20
1 row in set (0.00 sec)

Test case:

use test
create table t1 (id int primary key);
insert into t1 values (1),(2);
set wsrep_on=0; insert into t1 values (7); set wsrep_on=1; delete from t1 where id=7;

Result on writer node on 2-node cluster:

2017-07-14T14:23:00.832327Z 0 [Note] WSREP: forgetting 9cf75139 (tcp://192.168.70.4:4567)
2017-07-14T14:23:00.832470Z 0 [Note] WSREP: Node 97ba771e state primary
2017-07-14T14:23:00.832470Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,97ba771e,21)
memb {
        97ba771e,0
        }
joined {
        }
left {
        }
partitioned {
        9cf75139,0
        }
)
2017-07-14T14:23:00.832494Z 0 [Note] WSREP: Save the discovered primary-component to disk
2017-07-14T14:23:00.832988Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2017-07-14T14:23:00.835347Z 0 [Note] WSREP: forgetting 9cf75139 (tcp://192.168.70.4:4567)

Result on writer node on 3-node cluster, when both peers exit in the same time:

2017-07-14T14:28:52.250371Z 0 [Note] WSREP: declaring 39e4cef7 at tcp://192.168.70.4:4567 stable
2017-07-14T14:28:52.250450Z 0 [Note] WSREP: forgetting 71f2b1d2 (tcp://192.168.70.3:4567)
2017-07-14T14:28:52.252798Z 0 [Note] WSREP: Node 39e4cef7 state primary
2017-07-14T14:28:52.253022Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,39e4cef7,24)
memb {
        97ba771e,0
        }
joined {
        }
left {
        }
partitioned {
        39e4cef7,0
        71f2b1d2,0
        }
)
2017-07-14T14:28:52.253126Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-07-14T14:28:52.253164Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2017-07-14T14:28:52.253240Z 0 [Note] WSREP: Received NON-PRIMARY.
2017-07-14T14:28:52.253267Z 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 23)
2017-07-14T14:28:52.253354Z 1 [Note] WSREP: New cluster view: global state: f38c6685-46b6-11e7-a298-53039f0a6c1c:23, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
...
2017-07-14T14:28:52.253791Z 0 [Note] WSREP: forgetting 39e4cef7 (tcp://192.168.70.4:4567)
...
2017-07-14T14:28:57.276243Z 0 [Note] WSREP: cleaning up 71f2b1d2 (tcp://192.168.70.3:4567)
2017-07-14T14:28:57.276354Z 0 [Note] WSREP: cleaning up 39e4cef7 (tcp://192.168.70.4:4567)

In all cases, nodes which abort, seem to send the leave message properly:

2017-07-14T14:28:52.349750Z 9 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 23
2017-07-14T14:28:52.353441Z 9 [ERROR] WSREP: Failed to apply trx: source: 97ba771e-6896-11e7-9dcf-1ad30db837ab version: 3 local: 0 state: APPLYING flags: 1 conn_id: 58 trx_id: 788 seqnos (l: 10, g: 23, s: 22, d: 22, ts: 3731437472065252)
2017-07-14T14:28:52.353555Z 9 [ERROR] WSREP: Failed to apply trx 23 4 times
2017-07-14T14:28:52.354801Z 9 [ERROR] WSREP: Node consistency compromized, aborting...
...
2017-07-14T14:28:52.363364Z 9 [Warning] WSREP: user message in state LEAVING
...
2017-07-14T14:28:52.367454Z 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.

IMHO behavior should be consistent - either we treat the leaving nodes as if it was clean shutdown, and remaining node re-configures PC as single node, primary cluster, or we put it in non-primary state when it looses quorum (cluster doesn't have >50% of nodes up).

Krunal Bauskar (krunal-bauskar) wrote :

Przemek,

I just tried this multiple times including a scenario when both nodes leave at same time but couldn't reproduce the issue. Do you have anything specific in your my.cnf ?

Can you share your my.cnf ?

----------------------

2017-07-17T08:50:21.717558Z 0 [Note] WSREP: 2.2 (n3): State transfer from 0.0 (n1) complete.
2017-07-17T08:50:21.718225Z 0 [Note] WSREP: Member 2.2 (n3) synced with group.
2017-07-17T08:50:22.834585Z 0 [Note] WSREP: 1.0 (n2): State transfer from 0.0 (n1) complete.
2017-07-17T08:50:22.835420Z 0 [Note] WSREP: Member 1.0 (n2) synced with group.
2017-07-17T08:50:36.900232Z 0 [Note] WSREP: forgetting ed17fecc (tcp://127.0.0.1:5030)
2017-07-17T08:50:36.900323Z 0 [Note] WSREP: forgetting edcc146d (tcp://127.0.0.1:6030)
2017-07-17T08:50:36.900459Z 0 [Note] WSREP: Node ec95cd79 state primary
2017-07-17T08:50:36.900572Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,ec95cd79,4)
memb {
 ec95cd79,0
 }
joined {
 }
left {
 }
partitioned {
 ed17fecc,0
 edcc146d,2
 }
)

***************** BOTH NODE NOTIFIED TO LEAVE AT SAME TIME ******************************

2017-07-17T08:50:36.900598Z 0 [Note] WSREP: Save the discovered primary-component to disk
2017-07-17T08:50:36.900810Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2017-07-17T08:50:36.901330Z 0 [Note] WSREP: forgetting ed17fecc (tcp://127.0.0.1:5030)
2017-07-17T08:50:36.901356Z 0 [Note] WSREP: forgetting edcc146d (tcp://127.0.0.1:6030)
2017-07-17T08:50:36.901518Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 0099ca74-6acd-11e7-b21f-9715285cfec1
2017-07-17T08:50:36.901586Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 0099ca74-6acd-11e7-b21f-9715285cfec1
2017-07-17T08:50:36.901600Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 0099ca74-6acd-11e7-b21f-9715285cfec1 from 0 (n1)
2017-07-17T08:50:36.901612Z 0 [Note] WSREP: Quorum results:
 version = 4,
 component = PRIMARY,
 conf_id = 3,
 members = 1/1 (primary/total),
 act_id = 4,
 last_appl. = 0,
 protocols = 0/7/3 (gcs/repl/appl),
 group UUID = ec962b8c-6acc-11e7-9cca-5bbfe78342d4
2017-07-17T08:50:36.901624Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2017-07-17T08:50:36.901666Z 2 [Note] WSREP: New cluster view: global state: ec962b8c-6acc-11e7-9cca-5bbfe78342d4:4, view# 4: Primary, number of nodes: 1, my index: 0, protocol version 3
2017-07-17T08:50:36.901676Z 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-07-17T08:50:36.901686Z 2 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-07-17T08:50:36.901750Z 2 [Note] WSREP: Assign initial position for certification: 4, protocol version: 3
2017-07-17T08:50:36.901793Z 0 [Note] WSREP: Service thread queue flushed.

Przemek (pmalkowski) wrote :

Krunal,
I have very basic settings, this is from writer node and others have almost identical:

[client]
socket=/var/lib/mysql/mysql.sock

[mysqld]
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.70.2,192.168.70.3,192.168.70.4
binlog_format=ROW
default_storage_engine=InnoDB
wsrep_slave_threads= 8
wsrep_log_conflicts
innodb_autoinc_lock_mode=2
wsrep_node_address=192.168.70.2
wsrep_cluster_name=pxc-cluster
wsrep_node_name=pxc-cluster-node-1
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth="root:***"

server-id=1
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
log-bin
log_slave_updates
expire_logs_days=7
symbolic-links=0

Nodes are running on separate VirtualBox VMs. Tested again on 3-node cluster with same test case, and got immediately permanent non-primary state:

2017-08-02T11:19:52.527769Z 0 [Note] WSREP: declaring 89144fa3 at tcp://192.168.70.4:4567 stable
2017-08-02T11:19:52.527912Z 0 [Note] WSREP: forgetting aa147e7d (tcp://192.168.70.3:4567)
2017-08-02T11:19:52.531652Z 0 [Note] WSREP: Node 3a03d4b7 state primary
2017-08-02T11:19:52.531724Z 0 [Warning] WSREP: 3a03d4b7 sending install message failed: Resource temporarily unavailable
2017-08-02T11:19:52.531955Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,3a03d4b7,85)
memb {
        3a03d4b7,0
        }
joined {
        }
left {
        }
partitioned {
        89144fa3,0
        aa147e7d,0
        }
)
2017-08-02T11:19:52.532078Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-08-02T11:19:52.532122Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2017-08-02T11:19:52.532139Z 0 [Note] WSREP: Received NON-PRIMARY.

Krunal Bauskar (krunal-bauskar) wrote :

commit 28b60e8f723ab3d641cf7e14ea24ecad22f2bc9f
Author: Krunal Bauskar <email address hidden>
Date: Fri Aug 11 08:59:51 2017 +0530

    - PXC#850: Inconsistent behavior on data consistency abort

      Problem:
      -------

      Say out of 3 node cluster, 2 nodes leave cluster due to
      data inconsistency then sole pending node should also turn
      non-Primary given that
      a. It doesn't have majority
      b. It is difficult to tell if pending node has good data
         or node that left.

      Currently, if node detects inconsistency then it leaves
      cluster gracefully and depending on timing
      sole pending node can turn non-Primary or Primary.

      Solution:
      --------

      Ensure that node leaving cluster due to data inconsistency,
      first isolate itself from cluster there-by turning non-Primary
      and forcing re-evaluation of quorum on pending nodes.

      This will ensure if minority of the nodes has inconsistent data
      and majority of the nodes are shutting down, miniority will turn non-Primary.

      (If user still would like to operate with minority nodes
       given these are writer nodes or more reliable nodes
       user can pre-adjust the pc.weight to maintain quorum).

Changed in percona-xtradb-cluster:
status: New → Fix Committed

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-850

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers