update-ring action removes all nodes except the juju leader

Bug #1933223 reported by Chris MacNaughton
This bug affects 1 person
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

I attempted to run the update-ring action to remove node1 on a freshly deployed OpenStack, but it removed all nodes except the juju leader.

Versions:

charm-hacluster from openstack-charmers-next rev 176
focal-victoria deploy
juju controller: 2.8.10
juju client: 2.8.11-bionic-amd64

before running the action:

$ juju ssh hacluster/0 sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-875fec-zaza-fdb2897b5f67-4 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jun 22 09:20:19 2021
  * Last change: Tue Jun 22 08:43:07 2021 by root via cibadmin on juju-875fec-zaza-fdb2897b5f67-3
  * 4 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ juju-875fec-zaza-fdb2897b5f67-3 juju-875fec-zaza-fdb2897b5f67-4 juju-875fec-zaza-fdb2897b5f67-5 ]
  * OFFLINE: [ node1 ]

Full List of Resources:
  * res_ganesha_997cc80_vip (ocf::heartbeat:IPaddr2): Started juju-875fec-zaza-fdb2897b5f67-3
  * res_manila_share_manila_share (systemd:manila-share): Started juju-875fec-zaza-fdb2897b5f67-4
  * res_nfs_ganesha_nfs_ganesha (systemd:nfs-ganesha): FAILED juju-875fec-zaza-fdb2897b5f67-5

Failed Resource Actions:
  * res_nfs_ganesha_nfs_ganesha_start_0 on juju-875fec-zaza-fdb2897b5f67-5 'error' (1): call=373, status='complete', exitreason='', last-rc-change='2021-06-22 09:20:16Z', queued=0ms, exec=2681ms

Connection to 10.5.1.98 closed.

----

$ juju run-action hacluster/0 update-ring --wait i-really-mean-it=true
unit-hacluster-0:
  UnitId: hacluster/0
  id: "8"
  results:
    result: noop
  status: completed
  timing:
    completed: 2021-06-22 09:23:40 +0000 UTC
    enqueued: 2021-06-22 09:21:31 +0000 UTC
    started: 2021-06-22 09:23:39 +0000 UTC

after running the action:

$ juju ssh hacluster/0 sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: NONE
  * Last updated: Tue Jun 22 09:23:46 2021
  * Last change: Tue Jun 22 09:23:29 2021 by root via crm_node on juju-875fec-zaza-fdb2897b5f67-3
  * 1 node configured
  * 3 resource instances configured

Node List:
  * Online: [ juju-875fec-zaza-fdb2897b5f67-3 ]

Full List of Resources:
  * res_ganesha_997cc80_vip (ocf::heartbeat:IPaddr2): Started juju-875fec-zaza-fdb2897b5f67-3
  * res_manila_share_manila_share (systemd:manila-share): Started juju-875fec-zaza-fdb2897b5f67-3
  * res_nfs_ganesha_nfs_ganesha (systemd:nfs-ganesha): Stopped

Failed Resource Actions:
  * res_nfs_ganesha_nfs_ganesha_start_0 on juju-875fec-zaza-fdb2897b5f67-3 'error' (1): call=311, status='complete', exitreason='', last-rc-change='2021-06-22 09:23:43Z', queued=0ms, exec=2705ms

Connection to 10.5.1.98 closed.

Tags: seg
description: updated
Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → High
Felipe Reyes (freyes)
tags: added: seg
Revision history for this message
David Ames (thedac) wrote :

TRIAGE:

I have not yet re-created this myself; however, I suspect the bug is located here [0]:

    pcmk_nodes = set(pcmk.list_nodes())
    juju_nodes = {socket.gethostname()}
    juju_hanode_rel = get_ha_nodes()
    for corosync_id, addr in juju_hanode_rel.items():
        peer_node_name = utils.get_hostname(addr, fqdn=False)
        juju_nodes.add(peer_node_name)

In a production deployment with spaces, the peer relation may be bound to an interface that does not resolve to the same name that socket.gethostname() returns. In this scenario, none of the hostnames that utils.get_hostname(addr, fqdn=False) returns would match what is in pcmk_nodes, so everything but *this* node would be removed.
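A minimal sketch of that failure mode (hypothetical node names; pcmk.list_nodes() and utils.get_hostname() are replaced with hard-coded stand-ins):

```python
# Pacemaker knows the nodes by their corosync names.
pcmk_nodes = {"juju-875fec-3", "juju-875fec-4", "juju-875fec-5"}

# socket.gethostname() on the leader happens to match its pacemaker name...
juju_nodes = {"juju-875fec-3"}

# ...but reverse-resolving the peers' space-bound relation addresses yields
# names on another network that match nothing in pcmk_nodes.
juju_nodes |= {"host-10-5-2-151.maas", "host-10-5-2-152.maas"}

# The action treats every pacemaker node absent from juju_nodes as stale
# and deletes it from the ring -- i.e. everything except this node.
stale = pcmk_nodes - juju_nodes
```

With these stand-in names, `stale` contains both peers, which is exactly the mass-removal the reporter observed.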

"hostname" is set on the relation, so we already have that information. e.g.:
$ juju run --unit keystone-hacluster/0 -- "relation-get -r hanode:1 - keystone-hacluster/5"
egress-subnets: 10.5.2.151/32
hostname: juju-722a65-zaza-ed44fa5f8808-7
ingress-address: 10.5.2.151
member_ready: "True"
private-address: 10.5.2.151
ready: "True"

Rather than calling utils.get_hostname(addr, fqdn=False), use the hostname published on the relation. This will avoid the bug.
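A sketch of that approach (juju_node_names and the dict shape of the relation data are hypothetical; the real charm reads the hanode relation via hookenv):

```python
import socket

def juju_node_names(relation_data):
    """Collect expected node names from peer-published relation data.

    relation_data: {unit_name: {"hostname": ..., ...}} as each peer
    publishes it on the hanode relation (hypothetical shape).
    """
    # Start with this unit's own hostname.
    names = {socket.gethostname()}
    for unit, data in relation_data.items():
        # Trust the name each peer published rather than reverse-resolving
        # its (possibly space-bound) relation address.
        hostname = data.get("hostname")
        if hostname:
            names.add(hostname)
    return names
```

Because each unit publishes the same name that corosync/pacemaker uses for it, the computed set lines up with pcmk_nodes regardless of which network space the relation is bound to.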

A bigger question may be: do we want this destructive process automated? Would it make more sense to pass the hostname one wants to remove to the action?

[0] https://github.com/openstack/charm-hacluster/blob/master/hooks/utils.py#L1563-L1568

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)
Changed in charm-hacluster:
status: Triaged → In Progress
Revision history for this message
David Ames (thedac) wrote :

Fix up for discussion https://review.opendev.org/c/openstack/charm-hacluster/+/798018

I left (and fixed) the update-ring action with an appropriate warning, and added a delete-node-from-ring action that is safer.

Changed in charm-hacluster:
milestone: none → 21.10
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/798018
Committed: https://opendev.org/openstack/charm-hacluster/commit/102d463aa3439b6b4a9392757e1aa92b1ae63106
Submitter: "Zuul (22348)"
Branch: master

commit 102d463aa3439b6b4a9392757e1aa92b1ae63106
Author: David Ames <email address hidden>
Date: Thu Jun 24 13:14:58 2021 -0700

    Safely delete node from ring

    Provide the delete-node-from-ring action to safely remove a known node
    from the corosync ring.

    Update the less safe update-ring action to avoid LP Bug #1933223 and
    provide warnings in actions.yaml on its use.

    Change-Id: I56cf2360ac41b12fc0a508881897ba63a5e89dbd
    Closes-Bug: #1933223

Changed in charm-hacluster:
status: In Progress → Fix Committed
Changed in charm-hacluster:
status: Fix Committed → Fix Released