Primary Component not restored after cluster partition during SST
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Status tracked in 5.6 | | |
5.6 | Invalid | Undecided | Unassigned |
Bug Description
When a donor node suffers connectivity problems during, and because of, SST (network saturation, IO overload, etc.), the other members may remove it from the cluster, so the SST fails. However, in the following scenario, the SST attempt leaves the cluster in a non-Primary state:
* the cluster has 3 members; node1 and node2 are up and node3 needs SST to join
* node3 joins the cluster and requests SST from node1
* SST starts, but node1 has problems communicating with the other nodes on port 4567
* node1 is removed from the cluster configuration by node2 and node3
* SST fails and node3 (joiner) has to abort
* node2 switches to non-Primary because it cannot keep quorum alone
* node1 restores connectivity with node2
* node1 and node2 cannot restore the Primary Component any more, until manual intervention
In the usual split-brain case, when node1 and node2 lose connectivity, they become non-Primary, but when the network is restored, they restore the Primary Component and continue to operate. In this scenario, that is not the case.
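The difference between the two cases can be sketched with a toy quorum model. This is only an illustration under stated assumptions, not Galera's actual algorithm and not a confirmed root cause: it assumes a component stays Primary only if it contains a strict majority of the members of the last Primary Component, and it ignores node weights and graceful departures. The function names `has_quorum` and `replay` are hypothetical.

```python
def has_quorum(last_primary, reachable):
    """Assumed rule: a strict majority of the last Primary Component is reachable."""
    present = set(last_primary) & set(reachable)
    return 2 * len(present) > len(last_primary)


def replay(initial_primary, views):
    """Replay successive sets of reachable members as seen by one node.

    Returns a list of (sorted_view, is_primary) pairs. The remembered
    Primary Component membership changes only when a view keeps quorum.
    """
    last_primary = set(initial_primary)
    results = []
    for view in views:
        primary = has_quorum(last_primary, view)
        if primary:
            last_primary = set(view)
        results.append((sorted(view), primary))
    return results


if __name__ == "__main__":
    # Scenario from this report, seen from node2:
    # node1 is cut off, then node3 aborts, then node1 comes back.
    sst_case = replay(
        {"node1", "node2", "node3"},
        [{"node2", "node3"}, {"node2"}, {"node1", "node2"}],
    )
    # Usual two-node split, seen from node2: node1 is cut off, then comes back.
    usual_case = replay({"node1", "node2"}, [{"node2"}, {"node1", "node2"}])
    for name, steps in (("SST case", sst_case), ("usual case", usual_case)):
        for view, primary in steps:
            print(name, view, "PRIMARY" if primary else "non-Primary")
```

In this simplified model, the last Primary Component in the SST case is {node2, node3}, so node1 and node2 together hold only one of its two members and cannot form a majority, while in the usual two-node split both original members are back and quorum is regained.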
Tested on PXC 5.6.30.
### Example test case ###
-- percona3 service start
2016-07-06 10:34:51 29280 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 47,
members = 2/3 (joined/total),
act_id = 3717,
last_appl. = -1,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = 405bb13f-
2016-07-06 10:34:51 29280 [Note] WSREP: Flow-control interval: [28, 28]
2016-07-06 10:34:51 29280 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 3717)
2016-07-06 10:34:51 29280 [Note] WSREP: State transfer required:
Group state: 405bb13f-
Local state: 00000000-
(...)
2016-07-06 10:34:53 29280 [Note] WSREP: Member 2.1 (percona3) requested state transfer from 'percona1'. Selected 0.1 (percona1)(SYNCED) as donor.
2016-07-06 10:34:53 29280 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3717)
2016-07-06 10:34:53 29280 [Note] WSREP: Requesting state transfer: success, donor: 0
WSREP_SST: [INFO] Proceeding with SST (20160706 10:34:54.924)
(...)
-- percona1 port 4567 blocked
2016-07-06 10:34:51 22287 [Note] WSREP: Member 2.1 (percona3) requested state transfer from 'percona1'. Selected 0.1 (percona1)(SYNCED) as donor.
2016-07-06 10:34:51 22287 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 3717)
(...)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20160706 10:34:53.527)
2016-07-06 10:34:56 22287 [Note] WSREP: (39892c46, 'tcp://
2016-07-06 10:34:57 22287 [Note] WSREP: (39892c46, 'tcp://
2016-07-06 10:34:59 22287 [Note] WSREP: (39892c46, 'tcp://
(...)
2016-07-06 10:35:01 22287 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2016-07-06 10:35:01 22287 [Note] WSREP: Flow-control interval: [16, 16]
2016-07-06 10:35:01 22287 [Note] WSREP: Received NON-PRIMARY.
2016-07-06 10:35:01 22287 [Note] WSREP: Shifting DONOR/DESYNCED -> OPEN (TO: 3717)
2016-07-06 10:35:01 22287 [Note] WSREP: New cluster view: global state: 405bb13f-
(...)
-- percona3 aborts due to failed SST
2016-07-06 10:35:06 29280 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 48,
members = 1/2 (joined/total),
act_id = 3717,
last_appl. = 0,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = 405bb13f-
2016-07-06 10:35:06 29280 [Warning] WSREP: Donor 39892c46-
2016-07-06 10:35:06 29280 [Note] WSREP: /usr/sbin/mysqld: Terminated.
160706 10:35:06 mysqld_safe mysqld from pid file /var/lib/
-- percona2 loses PC
2016-07-06 08:35:12 9346 [Note] WSREP: view(view_
464fbed4,2
} joined {
} left {
} partitioned {
81727c1c,1
})
2016-07-06 08:35:12 9346 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2016-07-06 08:35:12 9346 [Note] WSREP: Flow-control interval: [16, 16]
2016-07-06 08:35:12 9346 [Note] WSREP: Received NON-PRIMARY.
2016-07-06 08:35:12 9346 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 3717)
-- percona1 and percona2 communication is restored, but they cannot restore the original Primary Component
2016-07-06 08:35:12 9346 [Note] WSREP: New cluster view: global state: 405bb13f-
2016-07-06 08:35:12 9346 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-07-06 08:35:28 9346 [Note] WSREP: declaring 39892c46 at tcp://192.
2016-07-06 08:35:28 9346 [Note] WSREP: view(view_
39892c46,1
464fbed4,2
} joined {
} left {
} partitioned {
81727c1c,1
})
2016-07-06 08:35:28 9346 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
2016-07-06 08:35:28 9346 [Note] WSREP: Flow-control interval: [23, 23]
2016-07-06 08:35:28 9346 [Note] WSREP: Received NON-PRIMARY.
2016-07-06 08:35:28 9346 [Note] WSREP: New cluster view: global state: 405bb13f-
2016-07-06 08:35:28 9346 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-07-06 08:35:51 9346 [Note] WSREP: (464fbed4, 'tcp://
2016-07-06 08:36:35 9346 [Note] WSREP: (464fbed4, 'tcp://
2016-07-06 08:37:19 9346 [Note] WSREP: (464fbed4, 'tcp://
2016-07-06 08:38:04 9346 [Note] WSREP: (464fbed4, 'tcp://
2016-07-06 08:38:48 9346 [Note] WSREP: (464fbed4, 'tcp://
(...)
Complete logs and output of show status like 'ws%'; from another test.
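When going through logs like the excerpts above, the `New COMPONENT` lines are the quickest way to see when a node lost or regained the Primary Component. Below is a minimal sketch of a parser for them, assuming the line layout shown in this report (timestamp first, then the `primary`, `bootstrap`, `my_idx` and `memb_num` fields). Only `primary = no` appears in the excerpts; `yes` as the other value is an assumption. The name `parse_component_events` is hypothetical.

```python
import re

# Pattern modelled on the "New COMPONENT" lines quoted in this report.
COMPONENT_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"New COMPONENT: primary = (?P<primary>yes|no), "
    r"bootstrap = (?P<bootstrap>yes|no), "
    r"my_idx = (?P<my_idx>\d+), memb_num = (?P<memb_num>\d+)"
)


def parse_component_events(lines):
    """Return one dict per 'New COMPONENT' line; other lines are skipped."""
    events = []
    for line in lines:
        match = COMPONENT_RE.match(line.strip())
        if match is None:
            continue
        events.append({
            "timestamp": match.group("timestamp"),
            "primary": match.group("primary") == "yes",
            "bootstrap": match.group("bootstrap") == "yes",
            "my_idx": int(match.group("my_idx")),
            "memb_num": int(match.group("memb_num")),
        })
    return events


if __name__ == "__main__":
    sample = [
        "2016-07-06 08:35:12 9346 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1",
        "2016-07-06 08:35:12 9346 [Note] WSREP: Received NON-PRIMARY.",
        "2016-07-06 08:35:28 9346 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2",
    ]
    for event in parse_component_events(sample):
        print(event)
```

Applied to the percona2 excerpt, this yields two events, both with `primary` false: the component grows from 1 to 2 members once percona1 reconnects, but never becomes Primary again.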