Alex,
The logs for tamarind and tarragon look correct to me:
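jayj@~/Downloads/logs [517]$ head -n1 tamarind-7_44-7_57.out tarragon-7_44-7_57.out
==> tamarind-7_44-7_57.out <==
130613 7:21:40 [Note] WSREP: IST first seqno 8246011901 not found from cache, falling back to SST
==> tarragon-7_44-7_57.out <==
130613 7:21:40 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address 'XXX.XXX.XXX.206:4444/xtrabackup_sst' --auth 'sst_backup:lb5G58PAaeDhk' --socket '/var/lib/mysql/mysql.sock' --datadir '/mnt/ssd/mysql/' --defaults-file '/etc/my.cnf' --gtid '82ff21da-0811-11e2-0800-3771efde9244:8361077172''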
Tamarind is the donor, tarragon is the joiner. I do agree it's confusing why this is in tamarind's log:
130613 7:44:25 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 4
130613 7:44:25 [Note] WSREP: STATE_EXCHANGE: sent state UUID: be13da01-d437-11e2-0800-f3569a587936
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: sent state msg: be13da01-d437-11e2-0800-f3569a587936
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 0 (tarragon)
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 1 (tabasco)
130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 3 (tandoori)
130613 7:54:55 [Note] WSREP: Provider paused at 82ff21da-0811-11e2-0800-3771efde9244:8363882737
130613 7:55:16 [Note] WSREP: Provider resumed.
130613 7:57:28 [Note] WSREP: STATE EXCHANGE: got state msg: be13da01-d437-11e2-0800-f3569a587936 from 2 (tamarind)
130613 7:57:28 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 285,
members = 3/4 (joined/total),
act_id = 8363882737,
last_appl. = 8363881969,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 82ff21da-0811-11e2-0800-3771efde9244
How could tamarind be stuck waiting for a message from itself?
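To put a number on the gap, here is a rough Python sketch that pulls the arrival time of each "got state msg" line out of a log and reports how long each member's message lagged behind the first one. It assumes the line format is exactly what is pasted above; the names state_msg_arrivals and delays_after_first are made up for illustration and are not part of any Galera tooling.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt above:
# 130613 7:57:28 [Note] WSREP: STATE EXCHANGE: got state msg: <uuid> from 2 (tamarind)
GOT_MSG = re.compile(
    r"^(\d{6})\s+(\d{1,2}:\d{2}:\d{2}) \[Note\] WSREP: STATE EXCHANGE: "
    r"got state msg: (\S+) from (\d+) \(([^)]+)\)"
)

def state_msg_arrivals(lines):
    """Return [(node_name, member_index, arrival_time)] in log order."""
    arrivals = []
    for line in lines:
        m = GOT_MSG.match(line.strip())
        if m:
            date, clock, _uuid, idx, name = m.groups()
            when = datetime.strptime(f"{date} {clock}", "%y%m%d %H:%M:%S")
            arrivals.append((name, int(idx), when))
    return arrivals

def delays_after_first(arrivals):
    """Map node name -> seconds between the earliest arrival and that node's message."""
    first = min(when for _, _, when in arrivals)
    return {name: (when - first).total_seconds() for name, _, when in arrivals}

if __name__ == "__main__":
    uuid = "be13da01-d437-11e2-0800-f3569a587936"
    sample = [
        f"130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: {uuid} from 0 (tarragon)",
        f"130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: {uuid} from 1 (tabasco)",
        f"130613 7:44:25 [Note] WSREP: STATE EXCHANGE: got state msg: {uuid} from 3 (tandoori)",
        f"130613 7:57:28 [Note] WSREP: STATE EXCHANGE: got state msg: {uuid} from 2 (tamarind)",
    ]
    for name, seconds in delays_after_first(state_msg_arrivals(sample)).items():
        print(f"{name}: +{seconds:.0f}s")
```

Run against the four lines above, this shows the message from tamarind arriving 783 seconds (13 minutes 3 seconds) after the other three.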
On Jul 25, 2013, at 1:01 PM, Alex Yurchenko <email address hidden> wrote:
> Jay, looks like logs from tamarind and tarragon are identical and belong
> to a donor node. So we are missing the most important log - from the
> joiner (which partitioned).
>
> Anyway, the immediate reason for a stall is clear - everybody was
> waiting for the state exchange message from tamarind.
Jay Janssen, MySQL Consulting Lead, Percona
http://about.me/jay.janssen