[hirsute] update-ring action fails on removing unreachable nodes

Bug #1931588 reported by Aurelien Lourot
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: High
Assigned to: Aurelien Lourot
Milestone: 21.10

Bug Description

Seen twice on hirsute in our test gate [0][1]

## Running Test zaza.openstack.charm_tests.hacluster.tests.HaclusterScaleBackAndForthTest ##
looking at application: {'name': 'keystone', 'type': {'pkg': 'keystone', 'origin_setting': 'openstack-origin'}}
Using keystone API V3 (or later) for overcloud auth
looking at application: {'name': 'keystone', 'type': {'pkg': 'keystone', 'origin_setting': 'openstack-origin'}}
test_930_scaleback (zaza.openstack.charm_tests.hacluster.tests.HaclusterScaleBackAndForthTest)
Remove one unit, recalculate quorum and re-add one unit.
 ...
Pausing unit keystone-hacluster/1
Removing keystone/1
Waiting for model to settle
Checking that corosync considers at least one node to be offline
Updating corosync ring
ERROR
======================================================================
ERROR: test_930_scaleback (zaza.openstack.charm_tests.hacluster.tests.HaclusterScaleBackAndForthTest)
Remove one unit, recalculate quorum and re-add one unit.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/openstack/charm_tests/hacluster/tests.py", line 141, in test_930_scaleback
    raise_on_failure=True)
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/__init__.py", line 48, in _wrapper
    return run(_run_it())
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/__init__.py", line 36, in run
    return task.result()
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/__init__.py", line 47, in _run_it
    return await f(*args, **kwargs)
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/model.py", line 901, in async_run_action_on_leader
    raise ActionFailed(action_obj, output=output)
zaza.model.ActionFailed: Run of action "update-ring" with parameters "{'i-really-mean-it': True}" on "keystone-hacluster/0" failed with "Removing node1 from the cluster failed. Command '['crm', '-w', '-F', 'node', 'delete', 'node1']' returned non-zero exit status 1. output=b'Could not remove node1[0] from nodes: Transport endpoint is not connectedCould not remove node1[0] from status: Transport endpoint is not connectedWARNING: "crm_node --force -R node1" failed, rc=1\n'" (id=256 status=failed enqueued=2021-06-10T13:32:21Z started=2021-06-10T13:32:21Z completed=2021-06-10T13:32:24Z output={'Code': '0'})

Note that `node1` is wrongly created by pacemaker because of another bug [2]; the update-ring action is supposed to work around that by removing `node1`.

[0] https://review.opendev.org/c/openstack/charm-mysql-router/+/795226
[1] https://openstack-ci-reports.ubuntu.com/artifacts/74b/795226/3/check/full_model_ha-hirsute-full-ha/74b6249/job-output.txt
[2] lp:1874719

Changed in charm-hacluster:
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Changed in charm-hacluster:
status: New → In Progress
importance: Undecided → High
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

TL;DR: this seems to be an intended change of behavior in corosync and we need to adapt the charm.

Details:

On hirsute, `crm -w -F node delete <node-name>` fails with "Transport endpoint is not connected" if <node-name> isn't reachable, so the `update-ring` action is unusable. On groovy we don't have this issue.
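For illustration, a minimal sketch of the failing call; the Python helper below is hypothetical (not the charm's actual code), only the crm command line comes from the logs above:

import subprocess

def delete_node(node_name):
    """Delete a node from the corosync/pacemaker cluster via crmsh."""
    cmd = ['crm', '-w', '-F', 'node', 'delete', node_name]
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        # On hirsute this is where "Transport endpoint is not connected"
        # shows up when <node_name> is offline/unreachable.
        print('crm node delete failed: {}'.format(e.output))
        raise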

The most relevant packages are:

 crmsh | 4.2.0-3ubuntu1 | groovy
 crmsh | 4.2.0-4ubuntu1 | hirsute
 corosync | 3.0.3-2ubuntu3.1 | groovy-updates
 corosync | 3.1.0-2ubuntu3 | hirsute
 pacemaker | 2.0.4-2ubuntu3.2 | groovy-updates
 pacemaker | 2.0.5-2ubuntu1 | hirsute
 pacemaker-cli-utils | 2.0.4-2ubuntu3.2 | groovy-updates
 pacemaker-cli-utils | 2.0.5-2ubuntu1 | hirsute

If I deploy a groovy bundle [0], then upgrade all packages above **EXCEPT corosync** on each node:
# sed s/groovy/hirsute/g /etc/apt/sources.list > /etc/apt/sources.list.d/hirsute.list
# apt update
# apt install pacemaker libpe-status28 libpe-rules26 libcib27 libcrmservice28 libcrmcluster29 libpacemaker1 pacemaker-cli-utils crmsh liblrmd28
Get:1 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libfakeroot amd64 1.25.3-1.1ubuntu2 [28.1 kB]
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 fakeroot amd64 1.25.3-1.1ubuntu2 [62.9 kB]
Get:3 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 locales all 2.33-0ubuntu5 [3876 kB]
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libc6 amd64 2.33-0ubuntu5 [2690 kB]
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libc-bin amd64 2.33-0ubuntu5 [646 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libc-dev-bin amd64 2.33-0ubuntu5 [19.3 kB]
Get:7 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libc6-dev amd64 2.33-0ubuntu5 [2143 kB]
Get:8 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute-updates/main amd64 libnettle8 amd64 3.7-2.1ubuntu1.1 [146 kB]
Get:9 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libgnutls30 amd64 3.7.1-3ubuntu1 [902 kB]
Get:10 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libqb100 amd64 2.0.2-1 [66.9 kB]
Get:11 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libcrmcommon34 amd64 2.0.5-2ubuntu1 [175 kB]
Get:12 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libpe-rules26 amd64 2.0.5-2ubuntu1 [27.7 kB]
Get:13 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libcib27 amd64 2.0.5-2ubuntu1 [50.6 kB]
Get:14 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libcrmservice28 amd64 2.0.5-2ubuntu1 [38.1 kB]
Get:15 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libstonithd26 amd64 2.0.5-2ubuntu1 [39.6 kB]
Get:16 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 liblrmd28 amd64 2.0.5-2ubuntu1 [30.8 kB]
Get:17 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libpe-status28 amd64 2.0.5-2ubuntu1 [150 kB]
Get:18 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libpacemaker1 amd64 2.0.5-2ubuntu1 [167 kB]
Get:19 http://nova.clouds.archive.ubuntu.com/ubuntu hi...


summary: - [hirsute] update-ring action fails on removing node1
+ [hirsute] update-ring action fails on removing unreachable nodes
Revision history for this message
David Ames (thedac) wrote :

FWIW, I notice that this only happens when both node1 and the removed unit are offline. When the action succeeds, the removed node has already been removed from the cluster.

For example:

ubuntu@juju-aba261-zaza-0ce5519c4156-0:~$ sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-aba261-zaza-0ce5519c4156-0 (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Tue Jun 22 20:09:32 2021
  * Last change: Tue Jun 22 19:45:58 2021 by hacluster via crmd on juju-aba261-zaza-0ce5519c4156-0
  * 4 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ juju-aba261-zaza-0ce5519c4156-0 juju-aba261-zaza-0ce5519c4156-2 ]
  * OFFLINE: [ juju-aba261-zaza-0ce5519c4156-1 node1 ]

Full List of Resources:
  * Resource Group: grp_ks_vips:
    * res_ks_2cfb08e_vip (ocf::heartbeat:IPaddr2): Started juju-aba261-zaza-0ce5519c4156-0
  * Clone Set: cl_ks_haproxy [res_ks_haproxy]:
    * Started: [ juju-aba261-zaza-0ce5519c4156-0 juju-aba261-zaza-0ce5519c4156-2 ]
    * Stopped: [ juju-aba261-zaza-0ce5519c4156-1 node1 ]

Running the action `juju run-action --wait keystone-hacluster/0 update-ring i-really-mean-it=True` fails with "Could not remove juju-aba261-zaza-0ce5519c4156-1[0] from nodes: Trans...


Revision history for this message
David Ames (thedac) wrote :

Also, in the action, this innocuous-looking line [0]:

 diff_nodes = update_node_list()

Actually runs the set maintenance and delete commands [1].

I suspect the bug is at or near [1]. It is possible that the maintenance setting has not yet completed before we attempt the delete.

[0] https://github.com/openstack/charm-hacluster/blob/master/actions/actions.py#L123
[1] https://github.com/openstack/charm-hacluster/blob/master/hooks/utils.py#L1566-L1567
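To make the suspected race easier to follow, here is a rough sketch of the sequence that update_node_list() ends up running, with hypothetical helper names (the charm's real code is at [1]):

import subprocess

def set_node_maintenance(node):
    # Roughly "crm node maintenance <node>"; exact command form assumed.
    subprocess.check_call(['crm', 'node', 'maintenance', node])

def delete_node(node):
    subprocess.check_call(['crm', '-w', '-F', 'node', 'delete', node])

def remove_departed_nodes(departed):
    for node in departed:
        set_node_maintenance(node)
        # If the maintenance change has not fully propagated yet, the
        # delete below may race with it, as suspected above.
        delete_node(node)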

Revision history for this message
David Ames (thedac) wrote :

Log on leader node:

2021-06-22 21:02:24 INFO juju-log Setting node juju-90a54c-zaza-25e951f28151-1 to maintenance
2021-06-22 21:02:26 INFO juju-log Deleting node juju-90a54c-zaza-25e951f28151-1 from the cluster
2021-06-22 21:02:27 WARNING juju-log "crm -w -F node delete juju-90a54c-zaza-25e951f28151-1" failed with "ERROR: node juju-90a54c-zaza-25e951f28151-1 not found in the CIB"
2021-06-22 21:02:27 WARNING juju-log {} was already removed from the cluster, moving on
2021-06-22 21:02:27 INFO juju-log Setting node node1 to maintenance
2021-06-22 21:02:28 INFO juju-log Deleting node node1 from the cluster
2021-06-22 21:02:29 WARNING juju-log "crm -w -F node delete node1" failed with "Could not remove node1[0] from nodes: Transport endpoint is not connectedCould not remove node1[0] from status: Transport endpoint is not connectedWARNING: "crm_node --force -R node1" failed, rc=1"
2021-06-22 21:02:29 INFO juju-log DEPRECATION WARNING: Function action_fail is being removed : moved to function_fail()

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/797744
Committed: https://opendev.org/openstack/charm-hacluster/commit/3b872ff4d28a42dc6dd0fbb1a2725444ff8d07cb
Submitter: "Zuul (22348)"
Branch: master

commit 3b872ff4d28a42dc6dd0fbb1a2725444ff8d07cb
Author: David Ames <email address hidden>
Date: Wed Jun 23 13:13:21 2021 -0700

    Retry on "Transport endpoint is not connected"

    The crm node delete already handles some expected failure modes. Add
    "Transport endpoint is not connected" so that it retries the node
    delete.

    Change-Id: I9727e7b5babcfed1444f6d4821498fbc16e69297
    Closes-Bug: #1931588
    Co-authored-by: Aurelien Lourot <email address hidden>
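
For reference, a minimal sketch of the retry approach the commit message describes; the helper name and retry parameters below are illustrative, not the merged patch verbatim:

import subprocess
import time

def delete_node_with_retry(node, attempts=5, delay=5):
    for _ in range(attempts):
        try:
            subprocess.check_output(
                ['crm', '-w', '-F', 'node', 'delete', node],
                stderr=subprocess.STDOUT)
            return
        except subprocess.CalledProcessError as e:
            # Retry only on the transient error seen on hirsute;
            # anything else is re-raised immediately.
            if b'Transport endpoint is not connected' not in e.output:
                raise
            time.sleep(delay)
    raise RuntimeError('Removing {} from the cluster failed'.format(node))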

Changed in charm-hacluster:
status: In Progress → Fix Committed
Changed in charm-hacluster:
milestone: none → 21.10
Changed in charm-hacluster:
status: Fix Committed → Fix Released