[hirsute] update-ring action fails on removing unreachable nodes

Bug #1931588 reported by Aurelien Lourot
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: High
Assigned to: Aurelien Lourot
Milestone: 21.10

Bug Description

Seen twice on hirsute in our test gate [0][1]

## Running Test zaza.openstack.charm_tests.hacluster.tests.HaclusterScaleBackAndForthTest ##
looking at application: {'name': 'keystone', 'type': {'pkg': 'keystone', 'origin_setting': 'openstack-origin'}}
Using keystone API V3 (or later) for overcloud auth
looking at application: {'name': 'keystone', 'type': {'pkg': 'keystone', 'origin_setting': 'openstack-origin'}}
test_930_scaleback (zaza.openstack.charm_tests.hacluster.tests.HaclusterScaleBackAndForthTest)
Remove one unit, recalculate quorum and re-add one unit.
 ...
Pausing unit keystone-hacluster/1
Removing keystone/1
Waiting for model to settle
Checking that corosync considers at least one node to be offline
Updating corosync ring
ERROR
======================================================================
ERROR: test_930_scaleback (zaza.openstack.charm_tests.hacluster.tests.HaclusterScaleBackAndForthTest)
Remove one unit, recalculate quorum and re-add one unit.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/openstack/charm_tests/hacluster/tests.py", line 141, in test_930_scaleback
    raise_on_failure=True)
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/__init__.py", line 48, in _wrapper
    return run(_run_it())
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/__init__.py", line 36, in run
    return task.result()
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/__init__.py", line 47, in _run_it
    return await f(*args, **kwargs)
  File "/home/ubuntu/src/review.opendev.org/openstack/charm-mysql-router/build/builds/mysql-router/.tox/func-target/lib/python3.6/site-packages/zaza/model.py", line 901, in async_run_action_on_leader
    raise ActionFailed(action_obj, output=output)
zaza.model.ActionFailed: Run of action "update-ring" with parameters "{'i-really-mean-it': True}" on "keystone-hacluster/0" failed with "Removing node1 from the cluster failed. Command '['crm', '-w', '-F', 'node', 'delete', 'node1']' returned non-zero exit status 1. output=b'Could not remove node1[0] from nodes: Transport endpoint is not connectedCould not remove node1[0] from status: Transport endpoint is not connectedWARNING: "crm_node --force -R node1" failed, rc=1\n'" (id=256 status=failed enqueued=2021-06-10T13:32:21Z started=2021-06-10T13:32:21Z completed=2021-06-10T13:32:24Z output={'Code': '0'})

Note that `node1` is wrongly created by pacemaker because of another bug [2]; the update-ring action is supposed to work around that by removing `node1`.

[0] https://review.opendev.org/c/openstack/charm-mysql-router/+/795226
[1] https://openstack-ci-reports.ubuntu.com/artifacts/74b/795226/3/check/full_model_ha-hirsute-full-ha/74b6249/job-output.txt
[2] lp:1874719

Changed in charm-hacluster:
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Changed in charm-hacluster:
status: New → In Progress
importance: Undecided → High
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

TL;DR: this seems to be an intended change of behavior in corosync and we need to adapt the charm.

Details:

On hirsute, `crm -w -F node delete <node-name>` fails with "Transport endpoint is not connected" if <node-name> isn't reachable, so the `update-ring` action is unusable. On groovy we don't have this issue.
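For illustration, a minimal sketch of the failing call; the Python helper below is hypothetical (not the charm's actual code), only the crm command line comes from the logs above:

import subprocess

def delete_node(node_name):
    """Delete a node from the corosync/pacemaker cluster via crmsh."""
    cmd = ['crm', '-w', '-F', 'node', 'delete', node_name]
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        # On hirsute this is where "Transport endpoint is not connected"
        # shows up when <node_name> is offline/unreachable.
        print('crm node delete failed: {}'.format(e.output))
        raise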

The most relevant packages are:

 crmsh | 4.2.0-3ubuntu1 | groovy
 crmsh | 4.2.0-4ubuntu1 | hirsute
 corosync | 3.0.3-2ubuntu3.1 | groovy-updates
 corosync | 3.1.0-2ubuntu3 | hirsute
 pacemaker | 2.0.4-2ubuntu3.2 | groovy-updates
 pacemaker | 2.0.5-2ubuntu1 | hirsute
 pacemaker-cli-utils | 2.0.4-2ubuntu3.2 | groovy-updates
 pacemaker-cli-utils | 2.0.5-2ubuntu1 | hirsute

If I deploy a groovy bundle [0], then upgrade all packages above **EXCEPT corosync** on each node:
# sed s/groovy/hirsute/g /etc/apt/sources.list > /etc/apt/sources.list.d/hirsute.list
# apt update
# apt install pacemaker libpe-status28 libpe-rules26 libcib27 libcrmservice28 libcrmcluster29 libpacemaker1 pacemaker-cli-utils crmsh liblrmd28
Get:1 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libfakeroot amd64 1.25.3-1.1ubuntu2 [28.1 kB]
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 fakeroot amd64 1.25.3-1.1ubuntu2 [62.9 kB]
Get:3 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 locales all 2.33-0ubuntu5 [3876 kB]
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libc6 amd64 2.33-0ubuntu5 [2690 kB]
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libc-bin amd64 2.33-0ubuntu5 [646 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libc-dev-bin amd64 2.33-0ubuntu5 [19.3 kB]
Get:7 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libc6-dev amd64 2.33-0ubuntu5 [2143 kB]
Get:8 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute-updates/main amd64 libnettle8 amd64 3.7-2.1ubuntu1.1 [146 kB]
Get:9 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libgnutls30 amd64 3.7.1-3ubuntu1 [902 kB]
Get:10 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libqb100 amd64 2.0.2-1 [66.9 kB]
Get:11 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libcrmcommon34 amd64 2.0.5-2ubuntu1 [175 kB]
Get:12 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libpe-rules26 amd64 2.0.5-2ubuntu1 [27.7 kB]
Get:13 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libcib27 amd64 2.0.5-2ubuntu1 [50.6 kB]
Get:14 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libcrmservice28 amd64 2.0.5-2ubuntu1 [38.1 kB]
Get:15 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libstonithd26 amd64 2.0.5-2ubuntu1 [39.6 kB]
Get:16 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 liblrmd28 amd64 2.0.5-2ubuntu1 [30.8 kB]
Get:17 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libpe-status28 amd64 2.0.5-2ubuntu1 [150 kB]
Get:18 http://nova.clouds.archive.ubuntu.com/ubuntu hirsute/main amd64 libpacemaker1 amd64 2.0.5-2ubuntu1 [167 kB]
Get:19 http://nova.clouds.archive.ubuntu.com/ubuntu hi...


summary: - [hirsute] update-ring action fails on removing node1
+ [hirsute] update-ring action fails on removing unreachable nodes
Revision history for this message
David Ames (thedac) wrote :

FWIW, I notice that this only happens when both node1 and the removed unit are offline. When the action succeeds, the removed node has already been removed from the cluster.

For example:

ubuntu@juju-aba261-zaza-0ce5519c4156-0:~$ sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-aba261-zaza-0ce5519c4156-0 (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Tue Jun 22 20:09:32 2021
  * Last change: Tue Jun 22 19:45:58 2021 by hacluster via crmd on juju-aba261-zaza-0ce5519c4156-0
  * 4 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ juju-aba261-zaza-0ce5519c4156-0 juju-aba261-zaza-0ce5519c4156-2 ]
  * OFFLINE: [ juju-aba261-zaza-0ce5519c4156-1 node1 ]

Full List of Resources:
  * Resource Group: grp_ks_vips:
    * res_ks_2cfb08e_vip (ocf::heartbeat:IPaddr2): Started juju-aba261-zaza-0ce5519c4156-0
  * Clone Set: cl_ks_haproxy [res_ks_haproxy]:
    * Started: [ juju-aba261-zaza-0ce5519c4156-0 juju-aba261-zaza-0ce5519c4156-2 ]
    * Stopped: [ juju-aba261-zaza-0ce5519c4156-1 node1 ]

Running the action `juju run-action --wait keystone-hacluster/0 update-ring i-really-mean-it=True` fails with "Could not remove juju-aba261-zaza-0ce5519c4156-1[0] from nodes: Trans...


Revision history for this message
David Ames (thedac) wrote :

Also, in the action, this innocuous-looking line [0]:

 diff_nodes = update_node_list()

Actually runs the set maintenance and delete commands [1].

I suspect the bug is at or near [1]. It is possible that the maintenance setting has not yet completed before we attempt the delete.

[0] https://github.com/openstack/charm-hacluster/blob/master/actions/actions.py#L123
[1] https://github.com/openstack/charm-hacluster/blob/master/hooks/utils.py#L1566-L1567
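To make the suspected race easier to follow, here is a rough sketch of the sequence that update_node_list() ends up running, with hypothetical helper names (the charm's real code is at [1]):

import subprocess

def set_node_maintenance(node):
    # Roughly "crm node maintenance <node>"; exact command form assumed.
    subprocess.check_call(['crm', 'node', 'maintenance', node])

def delete_node(node):
    subprocess.check_call(['crm', '-w', '-F', 'node', 'delete', node])

def remove_departed_nodes(departed):
    for node in departed:
        set_node_maintenance(node)
        # If the maintenance change has not fully propagated yet, the
        # delete below may race with it, as suspected above.
        delete_node(node)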

Revision history for this message
David Ames (thedac) wrote :

Log on leader node:

2021-06-22 21:02:24 INFO juju-log Setting node juju-90a54c-zaza-25e951f28151-1 to maintenance
2021-06-22 21:02:26 INFO juju-log Deleting node juju-90a54c-zaza-25e951f28151-1 from the cluster
2021-06-22 21:02:27 WARNING juju-log "crm -w -F node delete juju-90a54c-zaza-25e951f28151-1" failed with "ERROR: node juju-90a54c-zaza-25e951f28151-1 not found in the CIB"
2021-06-22 21:02:27 WARNING juju-log {} was already removed from the cluster, moving on
2021-06-22 21:02:27 INFO juju-log Setting node node1 to maintenance
2021-06-22 21:02:28 INFO juju-log Deleting node node1 from the cluster
2021-06-22 21:02:29 WARNING juju-log "crm -w -F node delete node1" failed with "Could not remove node1[0] from nodes: Transport endpoint is not connectedCould not remove node1[0] from status: Transport endpoint is not connectedWARNING: "crm_node --force -R node1" failed, rc=1"
2021-06-22 21:02:29 INFO juju-log DEPRECATION WARNING: Function action_fail is being removed : moved to function_fail()

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/797744
Committed: https://opendev.org/openstack/charm-hacluster/commit/3b872ff4d28a42dc6dd0fbb1a2725444ff8d07cb
Submitter: "Zuul (22348)"
Branch: master

commit 3b872ff4d28a42dc6dd0fbb1a2725444ff8d07cb
Author: David Ames <email address hidden>
Date: Wed Jun 23 13:13:21 2021 -0700

    Retry on "Transport endpoint is not connected"

    The crm node delete already handles some expected failure modes. Add
    "Transport endpoint is not connected" so that it retries the node
    delete.

    Change-Id: I9727e7b5babcfed1444f6d4821498fbc16e69297
    Closes-Bug: #1931588
    Co-authored-by: Aurelien Lourot <email address hidden>
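
For reference, a minimal sketch of the retry approach the commit message describes; the helper name and retry parameters below are illustrative, not the merged patch verbatim:

import subprocess
import time

def delete_node_with_retry(node, attempts=5, delay=5):
    for _ in range(attempts):
        try:
            subprocess.check_output(
                ['crm', '-w', '-F', 'node', 'delete', node],
                stderr=subprocess.STDOUT)
            return
        except subprocess.CalledProcessError as e:
            # Retry only on the transient error seen on hirsute;
            # anything else is re-raised immediately.
            if b'Transport endpoint is not connected' not in e.output:
                raise
            time.sleep(delay)
    raise RuntimeError('Removing {} from the cluster failed'.format(node))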

Changed in charm-hacluster:
status: In Progress → Fix Committed
Changed in charm-hacluster:
milestone: none → 21.10
Changed in charm-hacluster:
status: Fix Committed → Fix Released