update-ring action removes all nodes except the juju leader
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack HA Cluster Charm | Fix Released | High | Unassigned | 21.10
Bug Description
I attempted to run the update-ring action to remove node1 on a freshly deployed OpenStack, but it removed all nodes except the juju leader.
Versions:
charm-hacluster from openstack-
focal-victoria deploy
juju controller: 2.8.10
juju client: 2.8.11-bionic-amd64
before running the action:
$ juju ssh hacluster/0 sudo crm status
Cluster Summary:
* Stack: corosync
* Current DC: juju-875fec-
* Last updated: Tue Jun 22 09:20:19 2021
* Last change: Tue Jun 22 08:43:07 2021 by root via cibadmin on juju-875fec-
* 4 nodes configured
* 3 resource instances configured
Node List:
* Online: [ juju-875fec-
* OFFLINE: [ node1 ]
Full List of Resources:
* res_ganesha_
* res_manila_
* res_nfs_
Failed Resource Actions:
* res_nfs_
change='2021-06-22 09:20:16Z', queued=0ms, exec=2681ms
Connection to 10.5.1.98 closed.
----
$ juju run-action hacluster/0 update-ring --wait i-really-mean-it=true
unit-hacluster-0:
UnitId: hacluster/0
id: "8"
results:
result: noop
status: completed
timing:
completed: 2021-06-22 09:23:40 +0000 UTC
enqueued: 2021-06-22 09:21:31 +0000 UTC
started: 2021-06-22 09:23:39 +0000 UTC
after running the action:
$ juju ssh hacluster/0 sudo crm status
Cluster Summary:
* Stack: corosync
* Current DC: NONE
* Last updated: Tue Jun 22 09:23:46 2021
* Last change: Tue Jun 22 09:23:29 2021 by root via crm_node on juju-875fec-
* 1 node configured
* 3 resource instances configured
Node List:
* Online: [ juju-875fec-
Full List of Resources:
* res_ganesha_
* res_manila_
* res_nfs_
Failed Resource Actions:
* res_nfs_
Connection to 10.5.1.98 closed.
description: updated
Changed in charm-hacluster:
  status: New → Triaged
  importance: Undecided → High
tags: added: seg
Changed in charm-hacluster:
  milestone: none → 21.10
Changed in charm-hacluster:
  status: Fix Committed → Fix Released
TRIAGE:
I have not yet recreated this myself; however, I suspect the bug is located here [0]:
pcmk_nodes = set(pcmk.list_nodes())
juju_nodes = {socket.gethostname()}

juju_hanode_rel = get_ha_nodes()
for corosync_id, addr in juju_hanode_rel.items():
    peer_node_name = utils.get_hostname(addr, fqdn=False)
    juju_nodes.add(peer_node_name)
In a production deployment with spaces, the peer relation may be bound to an interface that does not resolve to the same name that socket.gethostname() returns. In that scenario, none of the hostnames that utils.get_hostname(addr, fqdn=False) returns would match what is in pcmk_nodes. Therefore, everything but *this* node would be removed.
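To make the failure mode concrete, here is a minimal sketch of the suspected set arithmetic (plain Python with made-up hostnames, not charm code; the removal step is inferred from the behaviour described above):

    # Placeholder names, not from the reported deployment.
    pcmk_nodes = {'juju-node-0', 'juju-node-1', 'juju-node-2', 'node1'}

    # Only socket.gethostname() makes it into juju_nodes: every
    # utils.get_hostname(addr, fqdn=False) lookup resolved the space-bound
    # peer address to a name that is not in pcmk_nodes.
    juju_nodes = {'juju-node-0'}

    # Anything pacemaker knows but juju does not is treated as stale and
    # removed -- i.e. every node except *this* one.
    rm_nodes = pcmk_nodes - juju_nodes
    print(sorted(rm_nodes))  # ['juju-node-1', 'juju-node-2', 'node1']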
"hostname" is set on the relation. So we already have that information. ie: hacluster/ 0 -- "relation-get -r hanode:1 - keystone- hacluster/ 5" zaza-ed44fa5f88 08-7
uju run --unit keystone-
egress-subnets: 10.5.2.151/32
hostname: juju-722a65-
ingress-address: 10.5.2.151
member_ready: "True"
private-address: 10.5.2.151
ready: "True"
Rather than calling utils.get_hostname(addr, fqdn=False), use the hostname from the relation. This will avoid the bug.
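A sketch of that suggested fix, assuming the stock charmhelpers relation accessors (the helper name juju_node_names is invented for illustration):

    import socket

    from charmhelpers.core.hookenv import (
        related_units,
        relation_get,
        relation_ids,
    )

    def juju_node_names():
        # Start from this unit's own hostname, as the current code does.
        nodes = {socket.gethostname()}
        # Trust the 'hostname' key each peer already publishes on the
        # hanode relation instead of reverse-resolving its bound address.
        for rid in relation_ids('hanode'):
            for unit in related_units(rid):
                hostname = relation_get('hostname', unit=unit, rid=rid)
                if hostname:
                    nodes.add(hostname)
        return nodes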
A bigger question may be "Do we want this destructive process automated?" Would it make more sense to pass the hostname one wants to remove to the action?
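If the action were made explicit instead, the handler could simply refuse to guess, along these lines (the 'node' parameter is an assumption, not something the current actions.yaml defines):

    from charmhelpers.core.hookenv import action_fail, action_get

    def update_ring():
        # Hypothetical 'node' action parameter naming the host to remove.
        doomed = action_get('node')
        if not doomed:
            action_fail('update-ring: no node specified; refusing to guess')
            return
        # ... remove only the named node from the corosync ring ...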
[0] https://github.com/openstack/charm-hacluster/blob/master/hooks/utils.py#L1563-L1568