Percona XtraDB Cluster - HA scalable solution for MySQL

Node fails to gracefully leave the cluster

Reported by Alex Yurchenko on 2013-01-28
Affects                  Status         Importance   Assigned to     Milestone
Galera                   Fix Released   Medium       Teemu Ollakka   24.2.4
Percona XtraDB Cluster   Fix Released   Undecided    Unassigned      5.5.30-23.7.4

Bug Description

This happened when 2 nodes in a 3-node cluster had to abort due to an inconsistency introduced by lp:587170. One of them failed to send a LEAVE message, which caused the surviving node (master) to lose the primary component and led to subsequent downtime.
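Why a missing LEAVE message matters can be sketched with a simplified quorum check. The rule below is an assumption for illustration: it is unweighted (it ignores any per-node weights), and the function name `has_quorum` is hypothetical rather than part of the Galera API. The idea is that nodes which leave gracefully stop counting against the survivor, while nodes that simply disappear are treated as partitioned.

```python
def has_quorum(members: int, left: int, previous_members: int) -> bool:
    """Simplified, unweighted majority check (an assumed rule, not the real API).

    members:          nodes still present in the new view, including this one
    left:             nodes that announced a graceful leave
    previous_members: size of the last primary component
    """
    return members * 2 + left > previous_members


# One survivor out of a 3-node cluster:
both_left = has_quorum(members=1, left=2, previous_members=3)  # both peers sent LEAVE
one_left = has_quorum(members=1, left=1, previous_members=3)   # one peer failed to send LEAVE
none_left = has_quorum(members=1, left=0, previous_members=3)  # both peers just vanished

print(both_left, one_left, none_left)
```

Under this assumed rule the survivor stays primary only when both aborting nodes leave gracefully; a single missing LEAVE is enough to drop it to non-primary, which matches the behaviour described in this report.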

Master:
=======
130128 18:24:06 [Warning] IP address '190.129.175.58' could not be resolved: Name or service not known
130128 18:57:36 [Note] WSREP: (62f840e0-6642-11e2-0800-9cfc9b518a6c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.54.94.195:4567
130128 18:57:37 [Note] WSREP: (62f840e0-6642-11e2-0800-9cfc9b518a6c, 'tcp://0.0.0.0:4567') reconnecting to bcf36816-6642-11e2-0800-64742e6f03f5 (tcp://10.54.94.195:4567), attempt 0
130128 18:57:40 [Note] WSREP: (62f840e0-6642-11e2-0800-9cfc9b518a6c, 'tcp://0.0.0.0:4567') reconnecting to fd286470-6642-11e2-0800-350da336b5e7 (tcp://10.53.65.239:4567), attempt 0
130128 18:57:41 [Note] WSREP: evs::proto(62f840e0-6642-11e2-0800-9cfc9b518a6c, INSTALL, view_id(REG,62f840e0-6642-11e2-0800-9cfc9b518a6c,3)) suspecting node: bcf36816-6642-11e2-0800-64742e6f03f5
130128 18:57:41 [Note] WSREP: evs::proto(62f840e0-6642-11e2-0800-9cfc9b518a6c, INSTALL, view_id(REG,62f840e0-6642-11e2-0800-9cfc9b518a6c,3)) suspecting node: fd286470-6642-11e2-0800-350da336b5e7
130128 18:57:42 [Note] WSREP: view(view_id(NON_PRIM,62f840e0-6642-11e2-0800-9cfc9b518a6c,3) memb {
        62f840e0-6642-11e2-0800-9cfc9b518a6c,
} joined {
} left {
} partitioned {
        bcf36816-6642-11e2-0800-64742e6f03f5,
        fd286470-6642-11e2-0800-350da336b5e7,
})
130128 18:57:42 [Note] WSREP: view(view_id(NON_PRIM,62f840e0-6642-11e2-0800-9cfc9b518a6c,4) memb {
        62f840e0-6642-11e2-0800-9cfc9b518a6c,
} joined {
} left {
} partitioned {
        bcf36816-6642-11e2-0800-64742e6f03f5,
        fd286470-6642-11e2-0800-350da336b5e7,
})
130128 18:57:42 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

Slave 1:
========
130128 18:57:34 [ERROR] WSREP: Node consistency compromized, aborting...
130128 18:57:34 [Note] WSREP: Closing send monitor...
130128 18:57:34 [Note] WSREP: Closed send monitor.
130128 18:57:34 [Note] WSREP: gcomm: terminating thread
130128 18:57:34 [Note] WSREP: gcomm: joining thread
130128 18:57:34 [Note] WSREP: gcomm: closing backend
130128 18:57:34 [Note] WSREP: view(view_id(NON_PRIM,62f840e0-6642-11e2-0800-9cfc9b518a6c,3) memb {
        bcf36816-6642-11e2-0800-64742e6f03f5,
} joined {
} left {
} partitioned {
        62f840e0-6642-11e2-0800-9cfc9b518a6c,
        fd286470-6642-11e2-0800-350da336b5e7,
})
130128 18:57:34 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130128 18:57:34 [Note] WSREP: view((empty))
130128 18:57:34 [Note] WSREP: gcomm: closed
130128 18:57:34 [Note] WSREP: Flow-control interval: [16, 16]
130128 18:57:34 [Note] WSREP: Received NON-PRIMARY.
130128 18:57:34 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 18533923)
130128 18:57:34 [Note] WSREP: Received self-leave message.

Slave 2:
========
130128 18:57:44 [ERROR] WSREP: Node consistency compromized, aborting...
130128 18:57:44 [Note] WSREP: Closing send monitor...
130128 18:57:44 [Note] WSREP: Closed send monitor.
130128 18:57:44 [Note] WSREP: gcomm: terminating thread
130128 18:57:44 [Note] WSREP: gcomm: joining thread
130128 18:57:44 [Note] WSREP: (fd286470-6642-11e2-0800-350da336b5e7, 'tcp://0.0.0.0:4567') address 'tcp://10.53.65.239:4567' pointing to uuid fd286470-6642-11e2-0800-350da336b5e7 is blacklisted, skipping
130128 18:57:44 [Note] WSREP: (fd286470-6642-11e2-0800-350da336b5e7, 'tcp://0.0.0.0:4567') address 'tcp://10.53.65.239:4567' pointing to uuid fd286470-6642-11e2-0800-350da336b5e7 is blacklisted, skipping
130128 18:57:44 [Note] WSREP: gcomm: closing backend
130128 18:57:44 [ERROR] WSREP: failed to close gcomm backend connection: 131: Forbidden state transition: INSTALL -> LEAVING (FATAL)
         at gcomm/src/evs_proto.cpp:shift_to():2149
130128 18:57:44 [Note] WSREP: Received self-leave message.

Alex Yurchenko (ayurchen) wrote :

Internal trac reference: #607

Changed in galera:
importance: Undecided → Medium
milestone: none → 24.2.4
status: New → Confirmed
Alex Yurchenko (ayurchen) wrote :

Avoid shifting to S_LEAVE while in S_INSTALL; instead, raise a boolean flag to denote that the shift to S_LEAVE should be done on reaching S_OPERATIONAL. Also increased the pc.linger default to 20 sec to give evs more time to attempt a graceful leave. Fixed in r147.
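The deferred-leave idea can be illustrated with a toy state machine. This is only a sketch of the approach described in the comment, not the actual change in r147: the `Proto` class, the `pending_leave` flag, and the `State` enum are hypothetical names, and the real evs protocol has more states and transitions than shown here.

```python
from enum import Enum, auto


class State(Enum):
    INSTALL = auto()
    OPERATIONAL = auto()
    LEAVING = auto()


class Proto:
    """Toy protocol object; all names here are illustrative assumptions."""

    def __init__(self, state: State = State.OPERATIONAL) -> None:
        self.state = state
        self.pending_leave = False  # set when a leave is requested during INSTALL

    def shift_to(self, new_state: State) -> None:
        # The transition that failed in the Slave 2 log is still forbidden.
        if self.state is State.INSTALL and new_state is State.LEAVING:
            raise RuntimeError("Forbidden state transition: INSTALL -> LEAVING")
        self.state = new_state
        # Once operational, carry out a leave that was requested earlier.
        if new_state is State.OPERATIONAL and self.pending_leave:
            self.pending_leave = False
            self.shift_to(State.LEAVING)

    def close(self) -> None:
        if self.state is State.INSTALL:
            # Defer instead of attempting the forbidden transition.
            self.pending_leave = True
        else:
            self.shift_to(State.LEAVING)


proto = Proto(State.INSTALL)
proto.close()                        # leave requested mid-install: deferred
state_after_close = proto.state      # still State.INSTALL
proto.shift_to(State.OPERATIONAL)    # install completes, deferred leave runs
state_after_install = proto.state    # State.LEAVING
```

In this sketch a node asked to close during view installation finishes the install first and only then starts leaving, so it still gets a chance to announce its departure to the remaining members.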

Changed in galera:
assignee: nobody → Teemu Ollakka (teemu-ollakka)
status: Confirmed → Fix Committed
Changed in galera:
status: Fix Committed → Fix Released
Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-23.7.4
Changed in percona-xtradb-cluster:
status: New → Fix Released