Removing unit from hacluster doesn't properly remove node from corosync

Bug #1400481 reported by Billy Olsen
This bug affects 18 people
Affects                              Status        Importance  Assigned to   Milestone
OpenStack Charm Guide                Fix Released  High        Unassigned
OpenStack HA Cluster Charm           Fix Released  Critical    Felipe Reyes
hacluster (Juju Charms Collection)   Invalid       Undecided   Unassigned

Bug Description

[Description]
The hacluster charm doesn't properly support the hanode-relation-departed hook. This is also noted in the charm's own TODO list. This relation needs to be handled in order to set the appropriate quorum count.

When destroying or removing a service unit from the hacluster service, the node remains listed as offline in the corosync status output. To fully remove the node, the corosync service should first be stopped on the node being removed, and the node should then be removed from the cluster resource manager on one of the remaining nodes.

Note that when a unit is added after removing one or more units, the charm does correctly adjust the nodelist or the expected_votes count to match the number of votes expected in the cluster.

[Impact]
The number of nodes required for quorum may be incorrect, making it impossible to form quorum in clusters with a small number of nodes. The two-node special case may not be enabled when the number of nodes is 2.

[Test Case]
1. Deploy a service with 3 units that accepts the hacluster subordinate charm (e.g. keystone)
2. Relate the service and hacluster
3. Remove one of the service units (either juju destroy-unit or juju remove-unit)

Observe:
- /etc/corosync/corosync.conf still contains an incorrect nodelist (unicast) or expected_votes (multicast); see the example fragment below
- the two_node option is not specified in the quorum section
- sudo crm status continues to report the removed unit as offline
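
For reference, these settings live in /etc/corosync/corosync.conf. A minimal fragment showing what the file should contain once the cluster has been scaled back to two members (addresses and node ids below are purely illustrative):

# Unicast case: peers are listed explicitly (multicast deployments carry
# "expected_votes: N" in the quorum section instead of a nodelist).
nodelist {
    node {
        ring0_addr: 10.5.9.10
        nodeid: 1000
    }
    node {
        ring0_addr: 10.5.9.11
        nodeid: 1001
    }
}
quorum {
    provider: corosync_votequorum
    # Should be set once only two members remain:
    two_node: 1
}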

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Yes, can confirm. After following the Test Case above, crm status gives:

sudo crm status
Last updated: Thu Jan 5 13:46:18 2017 Last change: Thu Jan 5 12:55:37 2017 by hacluster via crmd on juju-0388cc-default-1
Stack: corosync
Current DC: juju-0388cc-default-1 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ juju-0388cc-default-1 juju-0388cc-default-2 ]
OFFLINE: [ juju-0388cc-default-3 ]

Full list of resources:

 Resource Group: grp_ks_vips
     res_ks_ens2_vip (ocf::heartbeat:IPaddr2): Started juju-0388cc-default-1
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-0388cc-default-1 juju-0388cc-default-2 ]
     Stopped: [ juju-0388cc-default-3 ]

Changed in hacluster (Juju Charms Collection):
status: New → Confirmed
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Note there is an additional problem on (at least) 16.04 xenial: the 'parallax' module is required for 'crm cluster health' (or any cluster operation that involves ssh). Its absence makes crm blow up, which makes a fix a bit more awkward.

I'm exploring whether 'crm node delete' can be used instead.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Just some thoughts on how to deal with it:

If the unit is removed using "juju remove-unit ..." or "juju destroy-machine ...", then the other hacluster subordinates (attached to the remaining units) get a 'hanode-relation-departed' hook call with relation data such as {u'ready': u'True', u'private-address': u'10.5.9.224'}

i.e. they are notified that the unit is leaving while it is still active, and they receive the IP address of the departing unit.

However, due to the async nature of the removal, a "crm node delete ..." will hang until the unit has actually been deleted and corosync/pacemaker notice. So it might be best to record the departing node in the kv() store during the departed hook, and then check that the machine has gone and delete it during an update-status?
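
A rough sketch of that idea (purely illustrative, not the fix that eventually landed): it assumes charmhelpers' unitdata kv() store, that reverse DNS maps the departed address back to the pacemaker node name, and a particular 'crm_node -l' output format. Since a later comment objects to doing mutations from update-status, the pruning step is shown as a standalone helper:

import socket
import subprocess

from charmhelpers.core import unitdata
from charmhelpers.core.hookenv import relation_get

def hanode_relation_departed():
    # Remember which peer is leaving; at this point it is typically still
    # alive, so deleting it from the cluster right now would hang.
    db = unitdata.kv()
    departed = db.get('departed-nodes', [])
    departed.append(relation_get('private-address'))
    db.set('departed-nodes', departed)
    db.flush()

def prune_departed_nodes():
    # Run later, from whichever hook is deemed acceptable, once
    # corosync/pacemaker have noticed that the peer is gone.
    db = unitdata.kv()
    pending = []
    for addr in db.get('departed-nodes', []):
        name = socket.gethostbyaddr(addr)[0]  # assumption: reverse DNS yields the node name
        nodes = subprocess.check_output(['crm_node', '-l']).decode()
        if '%s lost' % name in nodes:  # assumption about 'crm_node -l' output
            subprocess.check_call(['crm', '-w', '-F', 'node', 'delete', name])
        else:
            pending.append(addr)
    db.set('departed-nodes', pending)
    db.flush()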

James Page (james-page)
Changed in charm-hacluster:
status: New → Confirmed
Changed in hacluster (Juju Charms Collection):
status: Confirmed → Invalid
Revision history for this message
James Page (james-page) wrote :

The stop hook makes an attempt:

@hooks.hook()
def stop():
    # Try to remove this node from the cluster before tearing it down.
    cmd = 'crm -w -F node delete %s' % socket.gethostname()
    pcmk.commit(cmd)
    # Then purge the cluster packages from the unit.
    apt_purge(['corosync', 'pacemaker'], fatal=True)

However, I suspect that by this point the knowledge of the other units in the cluster has already been removed from its configuration file, so this won't actually work.

Revision history for this message
James Page (james-page) wrote :

-1 on mutations or side-effects during update-status.

Changed in charm-hacluster:
importance: Undecided → Medium
status: Confirmed → Triaged
tags: added: canonical-bootstack
Tytus Kurek (tkurek)
tags: added: 4010 cpe-onsite
Revision history for this message
Drew Freiberger (afreiberger) wrote :

It was noted in duplicate bug 1806505 that this should be non-impacting, as the config files are updated and the only issue is the lingering nodes in the running config. However, there is a use case where this creates a critical outage for services.

In the use case where you have a deployed 3-node application and you have to deploy 3 new units and remove the original 3 (e.g. while migrating the entire application from metal to lxd, from lxd to kvm, or from old hardware to new), you end up with a loss of quorum in the running corosync environment.

Consider the quorum counts as you add nodes:
3 node cluster, quorum min = 2
4 node cluster, quorum min = 3
5 node cluster, quorum min = 3
6 node cluster, quorum min = 4
Remove 3 nodes from the 6-node cluster and there is now no way to reach quorum without cleanup; the VIP resource goes offline, with crm showing the cluster without quorum (see the arithmetic sketch below).
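
A back-of-the-envelope sketch of that arithmetic, assuming the usual votequorum majority rule of floor(votes / 2) + 1 (without the two_node special case):

def quorum(expected_votes):
    # Majority quorum: more than half of the expected votes.
    return expected_votes // 2 + 1

for n in (3, 4, 5, 6):
    print(n, quorum(n))   # 3 -> 2, 4 -> 3, 5 -> 3, 6 -> 4

# After removing 3 of the 6 units, expected_votes is still 6, so the
# required quorum stays at 4 while only 3 votes remain: quorum can never
# be reached until the stale nodes are cleaned up.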

I believe the current workaround is to remove the dead nodes with 'crm node remove' and to run the config-changed hook on each hacluster unit to update the running corosync.

More notes in the duplicate bug 1821109.

Revision history for this message
David Ames (thedac) wrote :

Raised the priority. Note the 19.10 milestone is a lie. We don't have the 20.01 milestone created yet.

Changed in charm-hacluster:
importance: Medium → Critical
milestone: none → 19.10
Ryan Beisner (1chb1n)
tags: added: scaleback
Revision history for this message
Andrea Ieri (aieri) wrote :

As an extension to what Drew described, consider the case of a 3-unit percona cluster in which you have replaced one unit. Everything seems to be working fine, but you're actually sitting on a time bomb: as soon as a single unit fails, corosync quorum is lost, all the resources get stopped, and your DB stops responding. Even though quorum in a 3-node cluster should be 2 votes, it has been inflated to 3 in a way that is not at all obvious.
The above is also described in more detail in the proposed nrpe check in LP#1835418.

David Ames (thedac)
Changed in charm-hacluster:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-hacluster:
milestone: 20.01 → 20.05
Revision history for this message
Chris Sanders (chris.sanders) wrote :

I'm subscribing ~field-high as this appears to have slipped several releases, and it is going to impact a major re-architecture/migration event we have planned in the near future. If this is in 20.05 we'll be fine; if it's later, it could be problematic.

David Ames (thedac)
Changed in charm-hacluster:
milestone: 20.05 → 20.08
Alvaro Uria (aluria)
Changed in charm-hacluster:
assignee: nobody → Alvaro Uria (aluria)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/741592

Changed in charm-hacluster:
status: Triaged → In Progress
James Page (james-page)
Changed in charm-hacluster:
milestone: 20.08 → none
Changed in charm-hacluster:
assignee: Alvaro Uria (aluria) → Aurelien Lourot (aurelien-lourot)
tags: added: sts
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

I'm not actively working on this anymore. The last state is https://review.opendev.org/741592. The review is in good shape but there are two items left to be addressed.

Changed in charm-hacluster:
assignee: Aurelien Lourot (aurelien-lourot) → nobody
Felipe Reyes (freyes)
Changed in charm-hacluster:
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Drew Freiberger (afreiberger) wrote :

I did recently watch this happen live.

hacluster properly removed the node from crm, then uninstalled pacemaker/corosync, but they ended up re-installed. During the stop hook, it looks like pacemaker is restarted with the on-disk configs, which causes the node to re-join the cluster.

https://pastebin.canonical.com/p/yFc8BJKtDV/

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

This has landed \o/ I'll now resurrect the release notes review [0] and move it forward.

[0] https://review.opendev.org/#/c/741626/

Changed in charm-hacluster:
status: In Progress → Fix Committed
milestone: none → 21.04
Changed in charm-guide:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

FYI, this introduced a regression: lp:1920124

Changed in charm-guide:
status: In Progress → Fix Committed
assignee: Aurelien Lourot (aurelien-lourot) → nobody
milestone: none → 21.04
Changed in charm-hacluster:
status: Fix Committed → Fix Released
Changed in charm-guide:
status: Fix Committed → Fix Released
Revision history for this message
Trent Lloyd (lathiat) wrote :

I ran into an issue related to this one and want to note the workaround for future travellers.

After pausing 1 unit, all services went down and all VIP/haproxy resources were showing as Stopped in "crm status". In syslog we can see we had no quorum:

[syslog]
pacemaker-schedulerd[PID]: warning: Fencing and resource management disabled due to lack of quorum
pacemaker-schedulerd[PID]: notice: * Start res_neutron_xxxxxx_vip ( hostname1 ) due to no quorum (blocked)

Pacemaker commands, including "crm status", "crm configure show" and "crm_node -l", all showed 3 nodes as expected (2 online, 1 offline). However, we had no quorum.

The stale (removed) node only shows up in corosync status commands. corosync.conf had been corrected and synchronised on all 3 nodes.

The solution was simply to run "corosync-cfgtool -R" to reload the configuration. It updated the ring from the config file, and then quorum was achieved and the services started.

There have been a couple of fixes to the 'update-ring' action, and there is a newer 'delete-node-from-ring' action, but as we had 1 node down it wasn't clear whether those would work correctly, so we tried corosync-cfgtool, which worked.

So it seems we sometimes simply miss a reload somewhere.

From a sosreport we saw:

[sos_commands/corosync/corosync-quorumtool_-s]
Quorum information
------------------
Nodes: 2

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 2
Quorum: 3 Activity blocked

[sos_commands/corosync/corosync-cmapctl]
(this node no longer existed)
nodelist.node.0.nodeid (u32) = 1000
nodelist.node.0.ring0_addr (str) = X.X.X.1

(these are the correct nodes showing everywhere else)
nodelist.node.1.nodeid (u32) = 1001
nodelist.node.1.ring0_addr (str) = X.X.X.2
nodelist.node.2.nodeid (u32) = 1003
nodelist.node.2.ring0_addr (str) = X.X.X.3
nodelist.node.3.nodeid (u32) = 1002
nodelist.node.3.ring0_addr (str) = X.X.X.4

(the bad node 1000 also listed)
runtime.members.1000.config_version (u64) = 0
runtime.members.1000.ip (str) = r(0) ip(10.101.223.105)
runtime.members.1000.join_count (u32) = 1
runtime.members.1000.status (str) = left

runtime.votequorum.ev_barrier (u32) = 4
runtime.votequorum.highest_node_id (u32) = 1003
runtime.votequorum.lowest_node_id (u32) = 1001
runtime.votequorum.this_node_id (u32) = 1002
runtime.votequorum.two_node (u8) = 0
