Node fails to gracefully leave the cluster

Bug #1108165 reported by Alex Yurchenko
Affects                  Status        Importance  Assigned to    Milestone
Galera                   Fix Released  Medium      Teemu Ollakka
Percona XtraDB Cluster   Fix Released  Undecided   Unassigned
  (moved to https://jira.percona.com/projects/PXC)

Bug Description

This happened when 2 nodes in a 3-node cluster had to abort due to an inconsistency introduced by lp:587170. One of them failed to send a LEAVE message, which caused the surviving node (master) to lose the primary component and led to subsequent downtime.
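To illustrate why a missing LEAVE message matters, here is a minimal sketch of a majority check. It assumes a simplified, unweighted rule in which nodes that left gracefully are subtracted from the expected membership before the majority is computed. The function name `has_quorum` and the node labels are hypothetical and are not taken from Galera's code.

```python
def has_quorum(previous_members, left_gracefully, current_members):
    """Simplified, unweighted majority rule (an assumption for illustration).

    Nodes that announced a graceful leave are no longer expected, so they
    do not count against the surviving members.
    """
    expected = set(previous_members) - set(left_gracefully)
    present = set(current_members) & expected
    return 2 * len(present) > len(expected)


previous = {"master", "slave1", "slave2"}

# As in the master's log below: nothing under "left", both peers under
# "partitioned", so one node out of three is not a majority.
print(has_quorum(previous, set(), {"master"}))

# If both peers had been recorded as having left gracefully, the single
# remaining node would be the whole expected membership.
print(has_quorum(previous, {"slave1", "slave2"}, {"master"}))
```

Under this simplified rule the first call reports no quorum and the second reports quorum, which matches the difference between a partitioned peer and one that left cleanly.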

Master:
=======
130128 18:24:06 [Warning] IP address '190.129.175.58' could not be resolved: Name or service not known
130128 18:57:36 [Note] WSREP: (62f840e0-6642-11e2-0800-9cfc9b518a6c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.54.94.195:4567
130128 18:57:37 [Note] WSREP: (62f840e0-6642-11e2-0800-9cfc9b518a6c, 'tcp://0.0.0.0:4567') reconnecting to bcf36816-6642-11e2-0800-64742e6f03f5 (tcp://10.54.94.195:4567), attempt 0
130128 18:57:40 [Note] WSREP: (62f840e0-6642-11e2-0800-9cfc9b518a6c, 'tcp://0.0.0.0:4567') reconnecting to fd286470-6642-11e2-0800-350da336b5e7 (tcp://10.53.65.239:4567), attempt 0
130128 18:57:41 [Note] WSREP: evs::proto(62f840e0-6642-11e2-0800-9cfc9b518a6c, INSTALL, view_id(REG,62f840e0-6642-11e2-0800-9cfc9b518a6c,3)) suspecting node: bcf36816-6642-11e2-0800-64742e6f03f5
130128 18:57:41 [Note] WSREP: evs::proto(62f840e0-6642-11e2-0800-9cfc9b518a6c, INSTALL, view_id(REG,62f840e0-6642-11e2-0800-9cfc9b518a6c,3)) suspecting node: fd286470-6642-11e2-0800-350da336b5e7
130128 18:57:42 [Note] WSREP: view(view_id(NON_PRIM,62f840e0-6642-11e2-0800-9cfc9b518a6c,3) memb {
        62f840e0-6642-11e2-0800-9cfc9b518a6c,
} joined {
} left {
} partitioned {
        bcf36816-6642-11e2-0800-64742e6f03f5,
        fd286470-6642-11e2-0800-350da336b5e7,
})
130128 18:57:42 [Note] WSREP: view(view_id(NON_PRIM,62f840e0-6642-11e2-0800-9cfc9b518a6c,4) memb {
        62f840e0-6642-11e2-0800-9cfc9b518a6c,
} joined {
} left {
} partitioned {
        bcf36816-6642-11e2-0800-64742e6f03f5,
        fd286470-6642-11e2-0800-350da336b5e7,
})
130128 18:57:42 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

Slave 1:
========
130128 18:57:34 [ERROR] WSREP: Node consistency compromized, aborting...
130128 18:57:34 [Note] WSREP: Closing send monitor...
130128 18:57:34 [Note] WSREP: Closed send monitor.
130128 18:57:34 [Note] WSREP: gcomm: terminating thread
130128 18:57:34 [Note] WSREP: gcomm: joining thread
130128 18:57:34 [Note] WSREP: gcomm: closing backend
130128 18:57:34 [Note] WSREP: view(view_id(NON_PRIM,62f840e0-6642-11e2-0800-9cfc9b518a6c,3) memb {
        bcf36816-6642-11e2-0800-64742e6f03f5,
} joined {
} left {
} partitioned {
        62f840e0-6642-11e2-0800-9cfc9b518a6c,
        fd286470-6642-11e2-0800-350da336b5e7,
})
130128 18:57:34 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130128 18:57:34 [Note] WSREP: view((empty))
130128 18:57:34 [Note] WSREP: gcomm: closed
130128 18:57:34 [Note] WSREP: Flow-control interval: [16, 16]
130128 18:57:34 [Note] WSREP: Received NON-PRIMARY.
130128 18:57:34 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 18533923)
130128 18:57:34 [Note] WSREP: Received self-leave message.

Slave 2:
========
130128 18:57:44 [ERROR] WSREP: Node consistency compromized, aborting...
130128 18:57:44 [Note] WSREP: Closing send monitor...
130128 18:57:44 [Note] WSREP: Closed send monitor.
130128 18:57:44 [Note] WSREP: gcomm: terminating thread
130128 18:57:44 [Note] WSREP: gcomm: joining thread
130128 18:57:44 [Note] WSREP: (fd286470-6642-11e2-0800-350da336b5e7, 'tcp://0.0.0.0:4567') address 'tcp://10.53.65.239:4567' pointing to uuid fd286470-6642-11e2-0800-350da336b5e7 is blacklisted, skipping
130128 18:57:44 [Note] WSREP: (fd286470-6642-11e2-0800-350da336b5e7, 'tcp://0.0.0.0:4567') address 'tcp://10.53.65.239:4567' pointing to uuid fd286470-6642-11e2-0800-350da336b5e7 is blacklisted, skipping
130128 18:57:44 [Note] WSREP: gcomm: closing backend
130128 18:57:44 [ERROR] WSREP: failed to close gcomm backend connection: 131: Forbidden state transition: INSTALL -> LEAVING (FATAL)
         at gcomm/src/evs_proto.cpp:shift_to():2149
130128 18:57:44 [Note] WSREP: Received self-leave message.

Alex Yurchenko (ayurchen) wrote :

Internal trac reference: #607

Changed in galera:
importance: Undecided → Medium
milestone: none → 24.2.4
status: New → Confirmed
Alex Yurchenko (ayurchen) wrote :

Avoid shifting to S_LEAVE while in S_INSTALL; instead, set a boolean flag to denote that the shift to S_LEAVE should be done upon reaching S_OPERATIONAL. Also increased the pc.linger default to 20 sec to give evs more time to attempt a graceful leave. Fixed in r147.
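The deferred-leave idea described in this comment can be sketched as a toy state machine. This is only an illustration of the technique, not the code from gcomm/src/evs_proto.cpp: the class `ToyProto`, its methods, and the three-state enum are hypothetical, and the real protocol has more states and transitions.

```python
from enum import Enum, auto


class State(Enum):
    OPERATIONAL = auto()
    INSTALL = auto()
    LEAVING = auto()


class ToyProto:
    """Toy model of deferring a leave request made during INSTALL."""

    def __init__(self, state):
        self.state = state
        self.pending_leave = False  # hypothetical flag name

    def close(self):
        if self.state is State.INSTALL:
            # INSTALL -> LEAVING is forbidden, so only remember the request.
            self.pending_leave = True
        else:
            self.state = State.LEAVING

    def on_install_complete(self):
        # The new view is installed: the node becomes operational, then
        # performs the leave that was requested earlier, if any.
        self.state = State.OPERATIONAL
        if self.pending_leave:
            self.pending_leave = False
            self.state = State.LEAVING


proto = ToyProto(State.INSTALL)
proto.close()                # request is deferred, state stays INSTALL
proto.on_install_complete()  # leave is carried out once operational
print(proto.state)
```

In this sketch a close request never triggers the forbidden INSTALL -> LEAVING transition seen in the Slave 2 log; the request is held until the node is operational again.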

Changed in galera:
assignee: nobody → Teemu Ollakka (teemu-ollakka)
status: Confirmed → Fix Committed
Changed in galera:
status: Fix Committed → Fix Released
Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-23.7.4
Changed in percona-xtradb-cluster:
status: New → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1286
