Primary Component not restored after cluster partition during SST

Bug #1599440 reported by Przemek
This bug affects 5 people

Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC); status tracked in 5.6
Series 5.6: Status Invalid, Importance Undecided, Assigned to Unassigned

Bug Description

When a donor node suffers connectivity problems during, and because of, an SST (network saturation, IO overload, etc.), it may be removed from the cluster by the other members, so the SST will fail. However, in the following scenario, the failed SST attempt leaves the cluster in a non-Primary state:

* the cluster has 3 members; node1 and node2 are up and node3 needs SST to join
* node3 joins the cluster and requests the SST from node1
* the SST starts, but node1 has problems communicating with the other nodes on port 4567
* node1 is removed from the cluster configuration by node2 and node3
* the SST fails and node3 (the joiner) has to abort
* node2 switches to non-Primary as it cannot keep quorum alone
* node1 restores connectivity with node2
* neither node1 nor node2 can restore the Primary Component any more until manual intervention

In the usual split-brain case, when node1 and node2 lose connectivity they become non-Primary, but once the network is restored they re-form the Primary Component and continue to operate. In this scenario, however, that does not happen.
Tested on PXC 5.6.30.
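For reference, the manual intervention in such a stuck non-Primary state is typically to bootstrap a new Primary Component on one of the surviving nodes (a standard Galera recovery step, noted here for completeness):

set global wsrep_provider_options="pc.bootstrap=true";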

### Example test case ###

-- percona3 service start

2016-07-06 10:34:51 29280 [Note] WSREP: Quorum results:
        version = 3,
        component = PRIMARY,
        conf_id = 47,
        members = 2/3 (joined/total),
        act_id = 3717,
        last_appl. = -1,
        protocols = 0/7/3 (gcs/repl/appl),
        group UUID = 405bb13f-aa42-11e4-96e9-da7dd046b9dd
2016-07-06 10:34:51 29280 [Note] WSREP: Flow-control interval: [28, 28]
2016-07-06 10:34:51 29280 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 3717)
2016-07-06 10:34:51 29280 [Note] WSREP: State transfer required:
        Group state: 405bb13f-aa42-11e4-96e9-da7dd046b9dd:3717
        Local state: 00000000-0000-0000-0000-000000000000:-1
(...)
2016-07-06 10:34:53 29280 [Note] WSREP: Member 2.1 (percona3) requested state transfer from 'percona1'. Selected 0.1 (percona1)(SYNCED) as donor.
2016-07-06 10:34:53 29280 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3717)
2016-07-06 10:34:53 29280 [Note] WSREP: Requesting state transfer: success, donor: 0
WSREP_SST: [INFO] Proceeding with SST (20160706 10:34:54.924)
(...)

-- percona1 port 4567 blocked

2016-07-06 10:34:51 22287 [Note] WSREP: Member 2.1 (percona3) requested state transfer from 'percona1'. Selected 0.1 (percona1)(SYNCED) as donor.
2016-07-06 10:34:51 22287 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 3717)
(...)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20160706 10:34:53.527)
2016-07-06 10:34:56 22287 [Note] WSREP: (39892c46, 'tcp://192.168.3.2:4567') turning message relay requesting on, nonlive peers: tcp://192.168.3.3:4567
2016-07-06 10:34:57 22287 [Note] WSREP: (39892c46, 'tcp://192.168.3.2:4567') reconnecting to 464fbed4 (tcp://192.168.3.3:4567), attempt 0
2016-07-06 10:34:59 22287 [Note] WSREP: (39892c46, 'tcp://192.168.3.2:4567') reconnecting to 81727c1c (tcp://192.168.3.4:4567), attempt 0
(...)
2016-07-06 10:35:01 22287 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2016-07-06 10:35:01 22287 [Note] WSREP: Flow-control interval: [16, 16]
2016-07-06 10:35:01 22287 [Note] WSREP: Received NON-PRIMARY.
2016-07-06 10:35:01 22287 [Note] WSREP: Shifting DONOR/DESYNCED -> OPEN (TO: 3717)
2016-07-06 10:35:01 22287 [Note] WSREP: New cluster view: global state: 405bb13f-aa42-11e4-96e9-da7dd046b9dd:3717, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
(...)

-- percona3 aborts due to failed SST

2016-07-06 10:35:06 29280 [Note] WSREP: Quorum results:
        version = 3,
        component = PRIMARY,
        conf_id = 48,
        members = 1/2 (joined/total),
        act_id = 3717,
        last_appl. = 0,
        protocols = 0/7/3 (gcs/repl/appl),
        group UUID = 405bb13f-aa42-11e4-96e9-da7dd046b9dd
2016-07-06 10:35:06 29280 [Warning] WSREP: Donor 39892c46-4354-11e6-a224-92c5aa111752 is no longer in the group. State transfer cannot be completed, need to abort. Aborting...
2016-07-06 10:35:06 29280 [Note] WSREP: /usr/sbin/mysqld: Terminated.
160706 10:35:06 mysqld_safe mysqld from pid file /var/lib/mysql/percona3.pid ended

-- percona2 loses PC

2016-07-06 08:35:12 9346 [Note] WSREP: view(view_id(NON_PRIM,464fbed4,754) memb {
        464fbed4,2
} joined {
} left {
} partitioned {
        81727c1c,1
})
2016-07-06 08:35:12 9346 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2016-07-06 08:35:12 9346 [Note] WSREP: Flow-control interval: [16, 16]
2016-07-06 08:35:12 9346 [Note] WSREP: Received NON-PRIMARY.
2016-07-06 08:35:12 9346 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 3717)

-- communication between percona1 and percona2 is restored, but they cannot restore the original Primary Component

2016-07-06 08:35:12 9346 [Note] WSREP: New cluster view: global state: 405bb13f-aa42-11e4-96e9-da7dd046b9dd:3717, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2016-07-06 08:35:12 9346 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-07-06 08:35:28 9346 [Note] WSREP: declaring 39892c46 at tcp://192.168.3.2:4567 stable
2016-07-06 08:35:28 9346 [Note] WSREP: view(view_id(NON_PRIM,39892c46,756) memb {
        39892c46,1
        464fbed4,2
} joined {
} left {
} partitioned {
        81727c1c,1
})
2016-07-06 08:35:28 9346 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
2016-07-06 08:35:28 9346 [Note] WSREP: Flow-control interval: [23, 23]
2016-07-06 08:35:28 9346 [Note] WSREP: Received NON-PRIMARY.
2016-07-06 08:35:28 9346 [Note] WSREP: New cluster view: global state: 405bb13f-aa42-11e4-96e9-da7dd046b9dd:3717, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version 3
2016-07-06 08:35:28 9346 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-07-06 08:35:51 9346 [Note] WSREP: (464fbed4, 'tcp://192.168.3.3:4567') reconnecting to 81727c1c (tcp://192.168.3.4:4567), attempt 30
2016-07-06 08:36:35 9346 [Note] WSREP: (464fbed4, 'tcp://192.168.3.3:4567') reconnecting to 81727c1c (tcp://192.168.3.4:4567), attempt 60
2016-07-06 08:37:19 9346 [Note] WSREP: (464fbed4, 'tcp://192.168.3.3:4567') reconnecting to 81727c1c (tcp://192.168.3.4:4567), attempt 90
2016-07-06 08:38:04 9346 [Note] WSREP: (464fbed4, 'tcp://192.168.3.3:4567') reconnecting to 81727c1c (tcp://192.168.3.4:4567), attempt 120
2016-07-06 08:38:48 9346 [Note] WSREP: (464fbed4, 'tcp://192.168.3.3:4567') reconnecting to 81727c1c (tcp://192.168.3.4:4567), attempt 150
(...)

Tags: i70538
Przemek (pmalkowski) wrote :

Complete logs and output of show status like 'ws%'; from another test.

Krunal Bauskar (krunal-bauskar) wrote :

Finally I was able to reproduce this. (In fact, I found a simpler way to isolate a node.)

----------------------

1. Start node-1

2. Start node-2

3. Start node-3

4. While node-1 is waiting to send data to node-3 (node-1 is DONOR and node-3 is JOINER):
WSREP_SST: [INFO] Sleeping before data transfer for SST (20160816 15:50:18.820)

execute the following through a client terminal:
set global wsrep_provider_options="gmcast.isolate=1;";
(This isolates the node and stops communication between it and the other cluster nodes.)

node-1 turns into non-primary and the status is reflected on the other nodes too.

5. node-3 will eventually fail (as it fails to get the SST).

6. node-2 will turn into non-primary too, as it loses quorum.
(Before losing quorum, the primary component was made up of n2 and n3. Make a note of this, it is important.)

7. After that, re-introduce node-1 by setting gmcast.isolate=0.
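That is, the counterpart of the statement from step 4:
set global wsrep_provider_options="gmcast.isolate=0;";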

8. node-1 and node-2 are able to communicate with each other, but the cluster is still not formed.

9. node-1 and node-2 continue to try to reach node-3, and in the process keep emitting the following messages:

2016-08-16 16:00:50 8676 [Note] WSREP: (5ea6e131, 'tcp://192.168.1.5:4030') reconnecting to e6c44b07 (tcp://192.168.1.5:6030), attempt 0
2016-08-16 16:01:30 8676 [Note] WSREP: (5ea6e131, 'tcp://192.168.1.5:4030') reconnecting to e6c44b07 (tcp://192.168.1.5:6030), attempt 30
....

----------------------

Analysis:

a. I was able to reproduce this with PXC-5.6.30.

b. I was able to reproduce this with Codership-5.6.30 (confirmed to be an upstream issue).
    We should log it upstream.

c. These are some related articles (though not the exact same steps):

https://github.com/codership/galera/issues/249
https://github.com/codership/galera/issues/194

http://severalnines.com/blog/become-mysql-dba-blog-series-troubleshooting-galera-cluster-issues-part-1
"Above you can see a couple of failed attempts to bring the cluster back in sync. When a cluster is in ‘non-Primary’ state, it won’t rejoin unless all of the nodes are available again or one of the nodes is bootstrapped."

d. But it turns out that this is really a problem with interpreting how Galera evaluates quorum.

Check quorum details here:
http://galeracluster.com/documentation-webpages/weightedquorum.html

Going by the formula, quorum evaluation considers the last seen primary component, which in our case was formed by n2 and n3.
(n3 is still partitioned, so even though n1 rejoins, n1 and n2 together are unable to form a primary component; see the worked example below.)
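A rough worked reading of that evaluation for this scenario (my interpretation of the formula, assuming the default pc.weight=1 on every node):

  last seen primary component = {n2, n3}, total weight = 1 + 1 = 2
  reachable part of that component after n3 crashes = {n2}, weight = 1
  1 > 2/2 does not hold, so there is no quorum => non-Primary
  n1 was not a member of that last primary component, so its return does not add to the sum and quorum still fails
  (n3 crashed rather than leaving gracefully, so its weight is not discounted either)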

e. To validate this idea, I increased the weight of node-2 from the default 1 to 3 and then tried the same use-case.

As expected, node-2 did not turn into non-primary, and node-1 was able to re-join node-2 as a primary component, and node-3 too.
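For reference, the weight change used in (e) can be applied at runtime through the same provider-options interface (value as in the test above):

set global wsrep_provider_options="pc.weight=3";

With this setting, the last primary component weighs 3 + 1 = 4, and node-2 alone retains weight 3 > 4/2 = 2, so it keeps quorum on its own.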

----------------------------------------------------------------------------------------

So the flow is operating as per the semantics, and this is not a bug per se, but the confusing JOINER node quorum evaluation algorithm could be improved.

Przemek (pmalkowski) wrote :
Sveta Smirnova (svetasmirnova) wrote :

After the fix for bug #1684810, this one is not repeatable anymore.

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1912
