Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Bug #1202241
Comment #3

Jay Janssen (jay-janssen) wrote on 2013-07-31: Re: [Bug 1202241] JOINER partition during SST causes cluster hang

Alex,
The logs for tamarind and tarragon look correct to me:

jayj@~/Downloads/logs [517]$ head -n1 tamarind-7_44-7_57.out tarragon-7_44-7_57.out
==> tamarind-7_44-7_57.out <==
130613 7:21:40 [Note] WSREP: IST first seqno 8246011901 not found from cache, falling back to SST

==> tarragon-7_44-7_57.out <==
130613 7:21:40 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address 'XXX.XXX.XXX.206:4444/xtrabackup_sst' --auth 'sst_backup:lb5G58PAaeDhk' --socket '/var/lib/mysql/mysql.sock' --datadir '/mnt/ssd/mysql/' --defaults-file '/etc/my.cnf' --gtid '82ff21da-0811-11e2-0800-3771efde9244:8361077172''

Tamarind is the donor, tarragon is the joiner. I do agree it's confusing why this is in tamarind's log:

130613 7:44:25 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 4
130613 7:44:25 [Note] WSREP: STATE_EXCHANGE: sent state UUID: be13da01-d437-11e2-0800-f3569a587936
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: sent state msg: be13da01-d437-11e2-0800-f3569a587936
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 0 (tarragon)
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 1 (tabasco)
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 3 (tandoori)
130613 7:54:55 [Note] WSREP: Provider paused at 82ff21da-0811-11e2-0800-3771efde9244:8363882737
130613 7:55:16 [Note] WSREP: Provider resumed.
130613 7:57:28 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 2 (tamarind)
130613 7:57:28 [Note] WSREP: Quorum results:
        version = 2,
        component = PRIMARY,
        conf_id = 285,
        members = 3/4 (joined/total),
        act_id = 8363882737,
        last_appl. = 8363881969,
        protocols = 0/4/2 (gcs/repl/appl),
        group UUID = 82ff21da-0811-11e2-0800-3771efde9244
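
The gap in that excerpt can be made explicit with a small script. What follows is only a sketch: it assumes the log line layout shown above, the function names `state_msg_arrivals` and `arrival_lag` are made up for illustration and are not part of any Galera or PXC tooling, and it only reads the "got state msg" lines pasted here.

```python
import re
from datetime import datetime

# Matches lines such as (layout assumed from the excerpt above):
# 130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: <uuid> from 0 (tarragon)
GOT_MSG = re.compile(
    r"^(\d{6})\s+(\d{1,2}:\d{2}:\d{2}) \[Note\] WSREP: STATE EXCHANGE: "
    r"got state msg: (\S+) from (\d+) \((\w+)\)"
)

def state_msg_arrivals(lines):
    """Map member name -> (member index, arrival time) for 'got state msg' lines."""
    arrivals = {}
    for line in lines:
        m = GOT_MSG.match(line.strip())
        if m:
            date, clock, _uuid, idx, name = m.groups()
            when = datetime.strptime(f"{date} {clock}", "%y%m%d %H:%M:%S")
            arrivals[name] = (int(idx), when)
    return arrivals

def arrival_lag(arrivals):
    """Seconds each member's state message arrived after the earliest one."""
    first = min(when for _, when in arrivals.values())
    return {name: (when - first).total_seconds()
            for name, (_, when) in arrivals.items()}

# Lines copied from the excerpt above.
LOG = """\
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 0 (tarragon)
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 1 (tabasco)
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 3 (tandoori)
130613 7:57:28 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 2 (tamarind)
"""

lags = arrival_lag(state_msg_arrivals(LOG.splitlines()))
for name, seconds in sorted(lags.items(), key=lambda kv: kv[1]):
    print(f"{name}: +{seconds:.0f}s")
```

On these four lines the script shows the messages from tarragon, tabasco and tandoori arriving together at 7:44:25, and the one from index 2 (tamarind) arriving about 13 minutes later at 7:57:28.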

How could tamarind be stuck waiting for a message from itself?

On Jul 25, 2013, at 1:01 PM, Alex Yurchenko <email address hidden> wrote:

> Jay, looks like logs from tamarind and tarragon are identical and belong
> to a donor node. So we are missing the most important log - from the
> joiner (which partitioned).
>
> Anyway, the immediate reason for a stall is clear - everybody was
> waiting for the state exchange message from tamarind.

Jay Janssen, MySQL Consulting Lead, Percona
http://about.me/jay.janssen
