update-ring action removes all nodes except the juju leader

Bug #1933223 reported by Chris MacNaughton
This bug affects 1 person
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

I attempted to run the update-ring action to remove node1 on a freshly deployed OpenStack, but it removed all nodes except the juju leader.

Versions:

charm-hacluster from openstack-charmers-next rev 176
focal-victoria deploy
juju controller: 2.8.10
juju client: 2.8.11-bionic-amd64

before running the action:

$ juju ssh hacluster/0 sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-875fec-zaza-fdb2897b5f67-4 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jun 22 09:20:19 2021
  * Last change: Tue Jun 22 08:43:07 2021 by root via cibadmin on juju-875fec-zaza-fdb2897b5f67-3
  * 4 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ juju-875fec-zaza-fdb2897b5f67-3 juju-875fec-zaza-fdb2897b5f67-4 juju-875fec-zaza-fdb2897b5f67-5 ]
  * OFFLINE: [ node1 ]

Full List of Resources:
  * res_ganesha_997cc80_vip (ocf::heartbeat:IPaddr2): Started juju-875fec-zaza-fdb2897b5f67-3
  * res_manila_share_manila_share (systemd:manila-share): Started juju-875fec-zaza-fdb2897b5f67-4
  * res_nfs_ganesha_nfs_ganesha (systemd:nfs-ganesha): FAILED juju-875fec-zaza-fdb2897b5f67-5

Failed Resource Actions:
  * res_nfs_ganesha_nfs_ganesha_start_0 on juju-875fec-zaza-fdb2897b5f67-5 'error' (1): call=373, status='complete', exitreason='', last-rc-change='2021-06-22 09:20:16Z', queued=0ms, exec=2681ms

Connection to 10.5.1.98 closed.

----

$ juju run-action hacluster/0 update-ring --wait i-really-mean-it=true
unit-hacluster-0:
  UnitId: hacluster/0
  id: "8"
  results:
    result: noop
  status: completed
  timing:
    completed: 2021-06-22 09:23:40 +0000 UTC
    enqueued: 2021-06-22 09:21:31 +0000 UTC
    started: 2021-06-22 09:23:39 +0000 UTC

after running the action:

$ juju ssh hacluster/0 sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: NONE
  * Last updated: Tue Jun 22 09:23:46 2021
  * Last change: Tue Jun 22 09:23:29 2021 by root via crm_node on juju-875fec-zaza-fdb2897b5f67-3
  * 1 node configured
  * 3 resource instances configured

Node List:
  * Online: [ juju-875fec-zaza-fdb2897b5f67-3 ]

Full List of Resources:
  * res_ganesha_997cc80_vip (ocf::heartbeat:IPaddr2): Started juju-875fec-zaza-fdb2897b5f67-3
  * res_manila_share_manila_share (systemd:manila-share): Started juju-875fec-zaza-fdb2897b5f67-3
  * res_nfs_ganesha_nfs_ganesha (systemd:nfs-ganesha): Stopped

Failed Resource Actions:
  * res_nfs_ganesha_nfs_ganesha_start_0 on juju-875fec-zaza-fdb2897b5f67-3 'error' (1): call=311, status='complete', exitreason='', last-rc-change='2021-06-22 09:23:43Z', queued=0ms, exec=2705ms

Connection to 10.5.1.98 closed.

Tags: seg
description: updated
Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → High
Felipe Reyes (freyes)
tags: added: seg
Revision history for this message
David Ames (thedac) wrote :

TRIAGE:

I have not yet re-created this myself; however, I suspect the bug is located here [0]:

    pcmk_nodes = set(pcmk.list_nodes())
    juju_nodes = {socket.gethostname()}
    juju_hanode_rel = get_ha_nodes()
    for corosync_id, addr in juju_hanode_rel.items():
        peer_node_name = utils.get_hostname(addr, fqdn=False)
        juju_nodes.add(peer_node_name)

In a production deployment with spaces, the peer relation may be bound to an interface that does not resolve to the same name that socket.gethostname() returns. In this scenario, none of the hostnames that utils.get_hostname(addr, fqdn=False) returns would match what is in pcmk_nodes, so everything but *this* node would be removed.
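A minimal sketch of that failure mode (hypothetical node names; pcmk.list_nodes() and utils.get_hostname() are replaced with hard-coded stand-ins):

```python
# Pacemaker knows the nodes by their corosync names.
pcmk_nodes = {"juju-875fec-3", "juju-875fec-4", "juju-875fec-5"}

# socket.gethostname() on the leader happens to match its pacemaker name...
juju_nodes = {"juju-875fec-3"}

# ...but reverse-resolving the peers' space-bound relation addresses yields
# names on another network that match nothing in pcmk_nodes.
juju_nodes |= {"host-10-5-2-151.maas", "host-10-5-2-152.maas"}

# The action treats every pacemaker node absent from juju_nodes as stale
# and deletes it from the ring -- i.e. everything except this node.
stale = pcmk_nodes - juju_nodes
```

With these stand-in names, `stale` contains both peers, which is exactly the mass-removal the reporter observed.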

"hostname" is set on the relation, so we already have that information. e.g.:
$ juju run --unit keystone-hacluster/0 -- "relation-get -r hanode:1 - keystone-hacluster/5"
egress-subnets: 10.5.2.151/32
hostname: juju-722a65-zaza-ed44fa5f8808-7
ingress-address: 10.5.2.151
member_ready: "True"
private-address: 10.5.2.151
ready: "True"

Rather than calling utils.get_hostname(addr, fqdn=False), use the hostname published on the relation. This will avoid the bug.
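A sketch of that approach (juju_node_names and the dict shape of the relation data are hypothetical; the real charm reads the hanode relation via hookenv):

```python
import socket

def juju_node_names(relation_data):
    """Collect expected node names from peer-published relation data.

    relation_data: {unit_name: {"hostname": ..., ...}} as each peer
    publishes it on the hanode relation (hypothetical shape).
    """
    # Start with this unit's own hostname.
    names = {socket.gethostname()}
    for unit, data in relation_data.items():
        # Trust the name each peer published rather than reverse-resolving
        # its (possibly space-bound) relation address.
        hostname = data.get("hostname")
        if hostname:
            names.add(hostname)
    return names
```

Because each unit publishes the same name that corosync/pacemaker uses for it, the computed set lines up with pcmk_nodes regardless of which network space the relation is bound to.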

A bigger question may be: do we want this destructive process automated? Would it make more sense to pass the hostname one wants to remove to the action?

[0] https://github.com/openstack/charm-hacluster/blob/master/hooks/utils.py#L1563-L1568

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)
Changed in charm-hacluster:
status: Triaged → In Progress
Revision history for this message
David Ames (thedac) wrote :

Fix up for discussion https://review.opendev.org/c/openstack/charm-hacluster/+/798018

I left (and fixed) the update-ring action with an appropriate warning, and added a delete-node-from-ring action that is safer.

Changed in charm-hacluster:
milestone: none → 21.10
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/798018
Committed: https://opendev.org/openstack/charm-hacluster/commit/102d463aa3439b6b4a9392757e1aa92b1ae63106
Submitter: "Zuul (22348)"
Branch: master

commit 102d463aa3439b6b4a9392757e1aa92b1ae63106
Author: David Ames <email address hidden>
Date: Thu Jun 24 13:14:58 2021 -0700

    Safely delete node from ring

    Provide the delete-node-from-ring action to safely remove a known node
    from the corosync ring.

    Update the less safe update-ring action to avoid LP Bug #1933223 and
    provide warnings in actions.yaml on its use.

    Change-Id: I56cf2360ac41b12fc0a508881897ba63a5e89dbd
    Closes-Bug: #1933223

Changed in charm-hacluster:
status: In Progress → Fix Committed
Changed in charm-hacluster:
status: Fix Committed → Fix Released