JOINER partition during SST causes cluster hang

Bug #1202241 reported by Jay Janssen
This bug affects 1 person

Affects: Galera
    Status: New, Importance: Undecided, Assigned to: Unassigned
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
    Status: Incomplete, Importance: High, Assigned to: Unassigned

Bug Description

This is PXC:
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4.r3843

JOINER: tamarind (XXX.XXX.XXX.206)
DONOR: tarragon (XXX.XXX.XXX.14)

Other cluster nodes:
tabasco (XXX.XXX.XXX.205), tandoori (XXX.XXX.XXX.207)

Timeline of events:

7:21AM - Cluster node started, enters JOINER state, DONOR starts Xtrabackup SST
7:44AM - JOINER gets partitioned from the cluster, cluster hangs (no write activity)
7:55AM - Xtrabackup takes FTWRL on DONOR
7:56AM - DONOR finishes Xtrabackup
7:57AM - Cluster write activity resumes, JOINER tries to become Synced, but gets error:
130613 7:57:28 [Warning] WSREP: Protocol violation. JOIN message sender 0 (tarragon) is not in state transfer (SYNCED). Message ignored.

The question is why the cluster paused between 7:44 and 7:56 (logs attached).
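
For what it's worth, one quick way to narrow down a stall like this while it is happening, assuming the stock wsrep status variables, is to sample each node's state and flow-control counters:

  # run on every node; all variable names are standard wsrep status variables
  mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
    ('wsrep_cluster_status','wsrep_local_state_comment','wsrep_flow_control_paused',
     'wsrep_local_recv_queue','wsrep_last_committed')"

A wsrep_flow_control_paused near 1.0, or a wsrep_last_committed that stops advancing on every node, separates a flow-control pause from a loss of the primary component.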

Alex Yurchenko (ayurchen) wrote :

Jay, it looks like the logs from tamarind and tarragon are identical and belong to a donor node. So we are missing the most important log - the one from the joiner (which was partitioned).

Anyway, the immediate reason for a stall is clear - everybody was waiting for the state exchange message from tamarind.
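
A rough way to confirm that from the logs, assuming each node's error log is at hand (the file names below are just placeholders), is to grep every node for the STATE EXCHANGE lines of that configuration change and see whose "got state msg" line never shows up:

  for f in tarragon.log tabasco.log tandoori.log tamarind.log; do
      echo "== $f =="
      grep 'STATE EXCHANGE' "$f"
  done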

Jay Janssen (jay-janssen) wrote : Re: [Bug 1202241] JOINER partition during SST causes cluster hang

Alex,
  The logs for tamarind and tarragon look correct to me:

jayj@~/Downloads/logs [517]$ head -n1 tamarind-7_44-7_57.out tarragon-7_44-7_57.out
==> tamarind-7_44-7_57.out <==
130613 7:21:40 [Note] WSREP: IST first seqno 8246011901 not found from cache, falling back to SST

==> tarragon-7_44-7_57.out <==
130613 7:21:40 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address 'XXX.XXX.XXX.206:4444/xtrabackup_sst' --auth 'sst_backup:lb5G58PAaeDhk' --socket '/var/lib/mysql/mysql.sock' --datadir '/mnt/ssd/mysql/' --defaults-file '/etc/my.cnf' --gtid '82ff21da-0811-11e2-0800-3771efde9244:8361077172''

  Tamarind is the donor, tarragon is the joiner. I do agree it's confusing why this is in tamarind's log:

130613 7:44:25 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 4
130613 7:44:25 [Note] WSREP: STATE_EXCHANGE: sent state UUID: be13da01-d437-11e2-0800-f3569a587936
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: sent state msg: be13da01-d437-11e2-0800-f3569a587936
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 0 (tarragon)
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 1 (tabasco)
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 3 (tandoori)
130613 7:54:55 [Note] WSREP: Provider paused at 82ff21da-0811-11e2-0800-3771efde9244:8363882737
130613 7:55:16 [Note] WSREP: Provider resumed.
130613 7:57:28 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 2 (tamarind)
130613 7:57:28 [Note] WSREP: Quorum results:
        version = 2,
        component = PRIMARY,
        conf_id = 285,
        members = 3/4 (joined/total),
        act_id = 8363882737,
        last_appl. = 8363881969,
        protocols = 0/4/2 (gcs/repl/appl),
        group UUID = 82ff21da-0811-11e2-0800-3771efde9244

  How could tamarind be stuck waiting for a message from itself?


Alex Yurchenko (ayurchen) wrote :

Jay, yes, the contents of these files are slightly different, but you said that they were "sanitized" and I strongly believe they come from the same original log:

tarragon-7_44-7_57.out:
=================
130613 7:21:40 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address 'XXX.XXX.XXX.206:4444/xtrabackup_sst' --auth 'sst_backup:lb5G58PAaeDhk' --socket '/var/lib/mysql/mysql.sock' --datadir '/mnt/ssd/mysql/' --defaults-file '/etc/my.cnf' --gtid '82ff21da-0811-11e2-0800-3771efde9244:8361077172''
...
130613 7:44:16 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
...
130613 7:44:16 [Note] WSREP: STATE EXCHANGE: got state msg: b92282d0-d437-11e2-0800-644a2bba7edc from 0 (tarragon)
130613 7:44:16 [Note] WSREP: STATE EXCHANGE: got state msg: b92282d0-d437-11e2-0800-644a2bba7edc from 1 (tabasco)
130613 7:44:16 [Note] WSREP: STATE EXCHANGE: got state msg: b92282d0-d437-11e2-0800-644a2bba7edc from 2 (tandoori)

tamarind-7_44-7_57.out:
=================
130613 7:21:40 [Note] WSREP: IST first seqno 8246011901 not found from cache, falling back to SST
130613 7:21:40 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130613 7:21:40 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address 'XXX.XXX.XXX.206:4444/xtrabackup_sst' --auth 'sst_backup:lb5G58PAaeDhk' --socket '/var/lib/mysql/mysql.sock' --datadir '/mnt/ssd/mysql/' --defaults-file '/etc/my.cnf' --gtid '82ff21da-0811-11e2-0800-3771efde9244:8361077172''
...
130613 7:44:16 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
...
130613 7:44:16 [Note] WSREP: STATE EXCHANGE: got state msg: b92282d0-d437-11e2-0800-644a2bba7edc from 0 (tarragon)
130613 7:44:16 [Note] WSREP: STATE EXCHANGE: got state msg: b92282d0-d437-11e2-0800-644a2bba7edc from 1 (tabasco)
130613 7:44:16 [Note] WSREP: STATE EXCHANGE: got state msg: b92282d0-d437-11e2-0800-644a2bba7edc from 2 (tandoori)
130613 7:44:16 [Note] WSREP: Quorum results:
        version = 2,
        component = PRIMARY,
        conf_id = 284,
        members = 3/3 (joined/total),
        act_id = 8363856198,
        last_appl. = 8363855423,
        protocols = 0/4/2 (gcs/repl/appl),
        group UUID = 82ff21da-0811-11e2-0800-3771efde9244

- Notice that "tarragon" speaks like a donor (even though you claim that tamarind was the donor and tarragon was the joiner) and says pretty much the same things as "tamarind". Also, "tamarind" never gets a state exchange message from itself, but instead gets them from tarragon, tabasco and tandoori in a three-node component, leaving no room for "tamarind" there.
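
One quick way to check whether two "sanitized" logs really came from the same original, assuming bash and the attached file names, is to diff their WSREP lines; anything beyond the masked IPs stands out immediately:

  diff <(grep 'WSREP' tamarind-7_44-7_57.out) <(grep 'WSREP' tarragon-7_44-7_57.out)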

Jay Janssen (jay-janssen) wrote :

You were right. When sanitizing, I just masked out the full IPs; I think this was just a dumb copy-paste error.

Anyway, I'm attaching the correct log for tamarind.


Alex Yurchenko (ayurchen) wrote :

Now we're talking... and the talk is that the main Galera recv loop was absent from the logs from 7:44 to 7:57. Apparently it was stuck somewhere. I can tell that it was stuck at least 3 events before it had to send the state exchange message, but I don't know where exactly. In any case, this was caused by some extraordinary conditions on the box.
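
If this happens again, a thread dump of mysqld taken during the stall would show exactly where that loop is blocked. One way to capture it, assuming gdb is available on the box (the output path is just an example):

  gdb -batch -ex 'thread apply all bt' -p "$(pidof mysqld)" > /tmp/mysqld-stacks.txt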

Just in case: what was the transaction rate there during SST and how much RAM was there?

Jay Janssen (jay-janssen) wrote :

On Jul 31, 2013, at 12:57 PM, Alex Yurchenko <email address hidden> wrote:

> Just in case: what was the transaction rate there during SST

I can't tell that for sure. They have a relatively high rate of transactions/writesets; currently it is 2-3k per second, and I think it's safe to assume it was in that range during this issue.

> and how
> much RAM was there?

Tamarind has 48G of RAM -- all of these servers are reasonably provisioned.
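
For the record, a rough way to pin the writeset rate down, assuming the stock wsrep counters, is to sample the cumulative counters twice and divide the deltas by the interval:

  mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_replicated','wsrep_received')"
  sleep 10
  mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_replicated','wsrep_received')"
  # (second sample - first sample) / 10 = locally originated and received writesets per second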


Krunal Bauskar (krunal-bauskar) wrote :

@jay

Can you check this with the latest 5.6 and update whether the issue still exists.

Changed in percona-xtradb-cluster:
status: New → Incomplete
importance: Undecided → High
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-970
