Comment 6 for bug 1037380

Przemek (pmalkowski) wrote :

I can confirm that the whole cluster can go down in such a case; at least, no node remains available for access.
Let's look at the following example.

node1>show variables like '%myisam';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| wsrep_replicate_myisam | OFF   |
+------------------------+-------+
1 row in set (0.00 sec)

node1>show create table myisam1\G
*************************** 1. row ***************************
       Table: myisam1
Create Table: CREATE TABLE `myisam1` (
  `id` int(11) DEFAULT NULL,
  KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

node1>select * from myisam1;
+------+
| id   |
+------+
|    0 |
|    1 |
|    2 |
|    3 |
|   99 |
|  100 |
+------+
6 rows in set (0.00 sec)

node2>select * from myisam1;
+------+
| id   |
+------+
|    0 |
|    1 |
|   99 |
|  500 |
+------+
4 rows in set (0.00 sec)

node3>select * from myisam1;
+------+
| id   |
+------+
|    0 |
|    1 |
|   99 |
+------+
3 rows in set (0.00 sec)
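
The tables differ because, with wsrep_replicate_myisam=OFF, writes to MyISAM tables are applied only on the local node and are not replicated. The exact statements that led to this state are not shown here, but a divergence like the one above could be produced with nothing more than, for example:

node1>insert into myisam1 values (2),(3),(100);  -- applied on node1 only, not replicated
node2>insert into myisam1 values (500);          -- likewise local to node2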

node1>alter table myisam1 engine=InnoDB;
Query OK, 6 rows affected (0.03 sec)
Records: 6 Duplicates: 0 Warnings: 0

node1>update myisam1 set id=300 where id=3;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
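
It is this UPDATE that breaks node2 and node3: the ALTER was replicated and executed on every node, but each node converted its own, already different, local copy of the table, so the contents remain inconsistent. The replicated row event for id=3 then has nothing to apply to on the other nodes; judging by the SELECT output above, they would have shown something like:

node2>select * from myisam1 where id = 3;
Empty set (0.00 sec)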

Now both node2 and node3 perform an emergency shutdown (not a crash) because a data inconsistency was detected:

130803 2:59:53 [ERROR] WSREP: Failed to apply trx: source: 40c408fb-f844-11e2-8f5a-02577ce11790 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 11 trx_id: 473664 seqnos (l: 13, g: 215593, s: 215592, d: 215591, ts: 1375513225722038951)
130803 2:59:53 [ERROR] WSREP: Failed to apply app buffer: seqno: 215593, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():52
         at galera/src/replicator_smm.cpp:apply_trx_ws():118
130803 2:59:53 [ERROR] WSREP: Node consistency compromized, aborting...

But after a while, node1 also does this:

130803 3:00:25 [Note] WSREP: declaring bbbbdab0-f844-11e2-8c18-73c48a26a0f8 stable
130803 3:00:25 [Note] WSREP: (40c408fb-f844-11e2-8f5a-02577ce11790, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.8.0.203:4567
130803 3:00:25 [Note] WSREP: Node 40c408fb-f844-11e2-8f5a-02577ce11790 state prim
130803 3:00:25 [Warning] WSREP: 40c408fb-f844-11e2-8f5a-02577ce11790 sending install message failed: Resource temporarily unavailable
130803 3:00:25 [Note] WSREP: view(view_id(NON_PRIM,40c408fb-f844-11e2-8f5a-02577ce11790,17) memb {
        40c408fb-f844-11e2-8f5a-02577ce11790,
} joined {
} left {
} partitioned {
        392c5097-fb84-11e2-bc39-43e986d05fbc,
        bbbbdab0-f844-11e2-8c18-73c48a26a0f8,
})
130803 3:00:25 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130803 3:00:25 [Note] WSREP: Flow-control interval: [16, 16]
130803 3:00:25 [Note] WSREP: Received NON-PRIMARY.
130803 3:00:25 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 215593)
130803 3:00:25 [Note] WSREP: New cluster view: global state: 36c05e7e-c781-11e2-0800-942090b7f498:215593, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130803 3:00:25 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130803 3:00:25 [Note] WSREP: view(view_id(NON_PRIM,40c408fb-f844-11e2-8f5a-02577ce11790,18) memb {
        40c408fb-f844-11e2-8f5a-02577ce11790,
} joined {
} left {
} partitioned {
        392c5097-fb84-11e2-bc39-43e986d05fbc,
        bbbbdab0-f844-11e2-8c18-73c48a26a0f8,
})
130803 3:00:25 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130803 3:00:25 [Note] WSREP: Flow-control interval: [16, 16]
130803 3:00:25 [Note] WSREP: Received NON-PRIMARY.
130803 3:00:25 [Note] WSREP: New cluster view: global state: 36c05e7e-c781-11e2-0800-942090b7f498:215593, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130803 3:00:25 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130803 3:00:27 [Note] WSREP: (40c408fb-f844-11e2-8f5a-02577ce11790, 'tcp://0.0.0.0:4567') reconnecting to bbbbdab0-f844-11e2-8c18-73c48a26a0f8 (tcp://10.8.0.202:4567), attempt 0
130803 3:00:27 [Note] WSREP: (40c408fb-f844-11e2-8f5a-02577ce11790, 'tcp://0.0.0.0:4567') reconnecting to 392c5097-fb84-11e2-bc39-43e986d05fbc (tcp://10.8.0.203:4567), attempt 0
130803 3:01:06 [Note] WSREP: (40c408fb-f844-11e2-8f5a-02577ce11790, 'tcp://0.0.0.0:4567') reconnecting to bbbbdab0-f844-11e2-8c18-73c48a26a0f8 (tcp://10.8.0.202:4567), attempt 30
130803 3:01:07 [Note] WSREP: (40c408fb-f844-11e2-8f5a-02577ce11790, 'tcp://0.0.0.0:4567') reconnecting to 392c5097-fb84-11e2-bc39-43e986d05fbc (tcp://10.8.0.203:4567), attempt 30
130803 3:01:48 [Note] WSREP: (40c408fb-f844-11e2-8f5a-02577ce11790, 'tcp://0.0.0.0:4567') reconnecting to bbbbdab0-f844-11e2-8c18-73c48a26a0f8 (tcp://10.8.0.202:4567), attempt 60
130803 3:01:49 [Note] WSREP: (40c408fb-f844-11e2-8f5a-02577ce11790, 'tcp://0.0.0.0:4567') reconnecting to 392c5097-fb84-11e2-bc39-43e986d05fbc (tcp://10.8.0.203:4567), attempt 60
130803 3:02:29 [Note] WSREP: (40c408fb-f844-11e2-8f5a-02577ce11790, 'tcp://0.0.0.0:4567') reconnecting to bbbbdab0-f844-11e2-8c18-73c48a26a0f8 (tcp://10.8.0.202:4567), attempt 90

etc.

And as a result:

node1>select * from myisam1;
ERROR 1047 (08S01): Unknown command

wsrep status (excerpt):

| wsrep_local_state          | 0                                    |
| wsrep_local_state_comment  | Initialized                          |
| wsrep_cert_index_size      | 7                                    |
| wsrep_causal_reads         | 0                                    |
| wsrep_incoming_addresses   | 10.8.0.201:3306                      |
| wsrep_cluster_conf_id      | 18446744073709551615                 |
| wsrep_cluster_size         | 1                                    |
| wsrep_cluster_state_uuid   | 36c05e7e-c781-11e2-0800-942090b7f498 |
| wsrep_cluster_status       | non-Primary                          |
| wsrep_connected            | ON                                   |
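
The excerpt above comes from the wsrep status variables; in this state node1 still answers SHOW statements, so the non-Primary status can be confirmed with something like:

node1>show global status like 'wsrep_cluster_status';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| wsrep_cluster_status | non-Primary |
+----------------------+-------------+
1 row in set (0.00 sec)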

Attempts to start node2 or node3 fail, as there is no node in Primary state. This is a log fragment from node3 being restarted and trying to connect to the cluster:

130803 3:16:43 [Note] WSREP: gcomm: connecting to group 'cluster1', peer 'node1:,node2:,node3:'
130803 3:16:43 [Warning] WSREP: (a64c939a-fc0c-11e2-9b1f-2fa595cede12, 'tcp://0.0.0.0:4567') address 'tcp://10.8.0.203:4567' points to own listening address, blacklisting
130803 3:16:43 [Note] WSREP: (a64c939a-fc0c-11e2-9b1f-2fa595cede12, 'tcp://0.0.0.0:4567') address 'tcp://10.8.0.203:4567' pointing to uuid a64c939a-fc0c-11e2-9b1f-2fa595cede12 is blacklisted, skipping
130803 3:16:43 [Note] WSREP: declaring 40c408fb-f844-11e2-8f5a-02577ce11790 stable
130803 3:16:43 [Note] WSREP: view(view_id(NON_PRIM,40c408fb-f844-11e2-8f5a-02577ce11790,21) memb {
        40c408fb-f844-11e2-8f5a-02577ce11790,
        a64c939a-fc0c-11e2-9b1f-2fa595cede12,
} joined {
} left {
} partitioned {
        1a6cd2fd-fc0b-11e2-b1cc-3a867f991b74,
        392c5097-fb84-11e2-bc39-43e986d05fbc,
        bbbbdab0-f844-11e2-8c18-73c48a26a0f8,
})
130803 3:16:44 [Note] WSREP: (a64c939a-fc0c-11e2-9b1f-2fa595cede12, 'tcp://0.0.0.0:4567') address 'tcp://10.8.0.203:4567' pointing to uuid a64c939a-fc0c-11e2-9b1f-2fa595cede12 is blacklisted, skipping
130803 3:16:45 [Note] WSREP: (a64c939a-fc0c-11e2-9b1f-2fa595cede12, 'tcp://0.0.0.0:4567') address 'tcp://10.8.0.203:4567' pointing to uuid a64c939a-fc0c-11e2-9b1f-2fa595cede12 is blacklisted, skipping
(...)
130803 3:17:14 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():139
130803 3:17:14 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
130803 3:17:14 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1289: Failed to open channel 'cluster1' at 'gcomm://node1,node2,node3': -110 (Connection timed out)
130803 3:17:14 [ERROR] WSREP: gcs connect failed: Connection timed out
130803 3:17:14 [ERROR] WSREP: wsrep::connect() failed: 6
130803 3:17:14 [ERROR] Aborting

The only way to restore the cluster is to re-bootstrap it from the most advanced node.
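
For reference, that re-bootstrap boils down to something like the following (a sketch only; the exact commands depend on the PXC/Galera version and init scripts, and which node really is the most advanced one has to be verified first):

# on the node chosen to keep (here node1, still running but non-Primary),
# force it back into a Primary component:
node1>set global wsrep_provider_options='pc.bootstrap=true';
# or, if it has to be restarted, bootstrap it as a new cluster, e.g.:
#   service mysql bootstrap-pxc
# then restart node2 and node3 normally so they rejoin and take SST from the bootstrapped node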

The above happened because node1 treated the situation as a split-brain condition. It would have been avoided if at least one other node had had data consistent with the primary node, or if we had an arbitrator node (http://www.codership.com/wiki/doku.php?id=galera_arbitrator).
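
For completeness, such an arbitrator is simply the garbd daemon running on an additional host; a minimal sketch, with the group name and node addresses taken from the logs above (option names as documented for the Galera arbitrator):

garbd --group=cluster1 --address="gcomm://10.8.0.201:4567,10.8.0.202:4567,10.8.0.203:4567" --daemon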

Now the question is whether an emergency shutdown of all other nodes should leave the primary node in a non-Primary state. This does not happen when we shut down all other nodes manually; in that case the single remaining node stays Primary...
I think it should, since this way we are able to investigate the data differences without allowing the primary node (which may in fact be the least advanced one) to diverge further, or to serve as an untrusted source of data for SST to the other nodes.

Given the above, I think this is not a bug but correct behaviour for a severe data-inconsistency situation.