forget cluster node failed during cluster-relation-changed hook

Bug #1818260 reported by Alexander Balderson
Affects Status Importance Assigned to Milestone
OpenStack RabbitMQ Server Charm
Fix Released
Critical
David Ames

Bug Description

2019-03-01 13:32:41 WARNING juju-log cluster:29: min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.
2019-03-01 13:32:43 INFO juju.worker.uniter.operation runhook.go:138 ran "cluster-relation-changed" hook
2019-03-01 13:33:18 DEBUG juju-log Hardening function 'update_status'
2019-03-01 13:33:18 DEBUG juju-log No hardening applied to 'update_status'
2019-03-01 13:33:18 INFO juju-log Updating status.
2019-03-01 13:33:22 DEBUG juju-log check_cluster_memberships(): 'rabbit@landscapeamqp-1' in nodes but not in charm relations or running_nodes, telling RabbitMQ to forget about it.
2019-03-01 13:33:24 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'forget_cluster_node', 'rabbit@landscapeamqp-1']
2019-03-01 13:33:25 DEBUG update-status Removing node 'rabbit@landscapeamqp-1' from cluster
2019-03-01 13:33:25 DEBUG update-status Error: {failed_to_remove_node,'rabbit@landscapeamqp-1',
2019-03-01 13:33:25 DEBUG update-status {active,"Mnesia is running",
2019-03-01 13:33:25 DEBUG update-status 'rabbit@landscapeamqp-1'}}
2019-03-01 13:33:25 DEBUG update-status Traceback (most recent call last):
2019-03-01 13:33:25 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-1/charm/hooks/update-status", line 957, in <module>
2019-03-01 13:33:25 DEBUG update-status hooks.execute(sys.argv)
2019-03-01 13:33:25 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-1/charm/charmhelpers/core/hookenv.py", line 909, in execute
2019-03-01 13:33:25 DEBUG update-status self._hooks[hook_name]()
2019-03-01 13:33:25 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-1/charm/charmhelpers/contrib/hardening/harden.py", line 93, in _harden_inner2
2019-03-01 13:33:25 DEBUG update-status return f(*args, **kwargs)
2019-03-01 13:33:25 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-1/charm/hooks/update-status", line 953, in update_status
2019-03-01 13:33:25 DEBUG update-status rabbit.check_cluster_memberships()
2019-03-01 13:33:25 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-1/charm/hooks/rabbit_utils.py", line 512, in check_cluster_memberships
2019-03-01 13:33:25 DEBUG update-status forget_cluster_node(node)
2019-03-01 13:33:25 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-1/charm/hooks/rabbit_utils.py", line 522, in forget_cluster_node
2019-03-01 13:33:25 DEBUG update-status rabbitmqctl('forget_cluster_node', node)
2019-03-01 13:33:25 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-1/charm/hooks/rabbit_utils.py", line 348, in rabbitmqctl
2019-03-01 13:33:25 DEBUG update-status subprocess.check_call(cmd)
2019-03-01 13:33:25 DEBUG update-status File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
2019-03-01 13:33:25 DEBUG update-status raise CalledProcessError(retcode, cmd)
2019-03-01 13:33:25 DEBUG update-status subprocess.CalledProcessError: Command '['/usr/sbin/rabbitmqctl', 'forget_cluster_node', 'rabbit@landscapeamqp-1']' returned non-zero exit status 70.
2019-03-01 13:33:25 ERROR juju.worker.uniter.operation runhook.go:132 hook "update-status" failed: exit status 1
2019-03-01 13:33:25 INFO juju.worker.uniter resolver.go:124 awaiting error resolution for "update-status" hook
2019-03-01 13:33:27 INFO juju.worker.uniter resolver.go:124 awaiting error resolution for "update-status" hook
2019-03-01 13:33:30 INFO juju.worker.uniter resolver.go:124 awaiting error resolution for "update-status" hook
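The failure mode above is generic: the charm's rabbitmqctl wrapper uses subprocess.check_call, so any non-zero exit (here status 70, because Mnesia is still running on the target node) raises CalledProcessError and aborts the whole hook. A minimal sketch of that call path, plus a hypothetical defensive variant — the names mirror the traceback, but the code is a simplification, not the charm's actual source:

```python
import subprocess

def rabbitmqctl(*args, cmd="/usr/sbin/rabbitmqctl"):
    # check_call raises CalledProcessError on any non-zero exit,
    # which aborts the entire Juju hook (as seen in the traceback).
    subprocess.check_call([cmd] + list(args))

def forget_cluster_node(node, runner=rabbitmqctl):
    # Hypothetical defensive variant: report the failure instead of
    # crashing the hook when the node cannot be removed (e.g. while
    # Mnesia is still running on it).
    try:
        runner("forget_cluster_node", node)
        return True
    except subprocess.CalledProcessError as e:
        print("forget_cluster_node failed with exit status", e.returncode)
        return False
```

The charm's real code propagates the exception, which is exactly why a background cluster race surfaces as a failed update-status hook.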

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Full crash-dump logs

Revision history for this message
Ashley Lai (alai) wrote :

In the bundle, rabbitmq-server min-cluster-size is set to 3.

rabbitmq-server:
    charm: cs:rabbitmq-server
    bindings:
      "": *oam-space
      amqp: *internal-space
      cluster: *internal-space
    options:
      source: *openstack-origin
      min-cluster-size: 3
    num_units: 3

Revision history for this message
James Page (james-page) wrote :

The first log message in the bug report does not confirm that:

2019-03-01 13:32:41 WARNING juju-log cluster:29: min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.

are you sure that min-cluster-size was set for this deployment?

Revision history for this message
James Page (james-page) wrote :

Please check the settings for 'landscape-rabbitmq-server' not 'rabbitmq-server'.

Marking Incomplete.

Changed in charm-rabbitmq-server:
status: New → Incomplete
importance: Undecided → Low
Revision history for this message
David Ames (thedac) wrote :

QA caught this and Alex Kavanagh has also seen it:

From the traceback we can see the cluster is trying to remove a node. Then the charm comes along and runs an arbitrary hook which executes rabbitmqctl, and it fails:

2019-04-19 02:02:00 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'forget_cluster_node', 'rabbit@landscapeamqp-3']
2019-04-19 02:02:02 DEBUG update-status Removing node 'rabbit@landscapeamqp-3' from cluster
2019-04-19 02:02:02 DEBUG update-status Error: {failed_to_remove_node,'rabbit@landscapeamqp-3',
2019-04-19 02:02:02 DEBUG update-status {active,"Mnesia is running",
2019-04-19 02:02:02 DEBUG update-status 'rabbit@landscapeamqp-3'}}
2019-04-19 02:02:02 DEBUG update-status Traceback (most recent call last):
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/update-status", line 972, in <module>
2019-04-19 02:02:02 DEBUG update-status hooks.execute(sys.argv)
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/charmhelpers/core/hookenv.py", line 914, in execute
2019-04-19 02:02:02 DEBUG update-status self._hooks[hook_name]()
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/charmhelpers/contrib/hardening/harden.py", line 93, in _harden_inner2
2019-04-19 02:02:02 DEBUG update-status return f(*args, **kwargs)
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/update-status", line 968, in update_status
2019-04-19 02:02:02 DEBUG update-status rabbit.check_cluster_memberships()
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 554, in check_cluster_memberships
2019-04-19 02:02:02 DEBUG update-status forget_cluster_node(node)
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 564, in forget_cluster_node
2019-04-19 02:02:02 DEBUG update-status rabbitmqctl('forget_cluster_node', node)
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 376, in rabbitmqctl
2019-04-19 02:02:02 DEBUG update-status subprocess.check_call(cmd)
2019-04-19 02:02:02 DEBUG update-status File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
2019-04-19 02:02:02 DEBUG update-status raise CalledProcessError(retcode, cmd)
2019-04-19 02:02:02 DEBUG update-status subprocess.CalledProcessError: Command '['/usr/sbin/rabbitmqctl', 'forget_cluster_node', 'rabbit@landscapeamqp-3']' returned non-zero exit status 70.
2019-04-19 02:02:02 ERROR juju.worker.uniter.operation runhook.go:132 hook "update-status" failed: exit status 1

I suspect this is due to commit b74a50d3 [0]. At deploy time, when resources are scarce, the cluster believes a node is offline and attempts to remove it. Subsequently, a charm hook runs, leading to the traceback above.

I have asked QA to test with cluster-p...


Changed in charm-rabbitmq-server:
status: Incomplete → Confirmed
importance: Low → Critical
assignee: nobody → David Ames (thedac)
milestone: none → 19.04
Revision history for this message
David Ames (thedac) wrote :

The approach we need to take is the following:

During deploy, before {leader,client}_node_is_ready returns true, write the config file with cluster-partition-handling = ignore regardless of the juju config value; then, once the cluster is complete ({leader,client}_node_is_ready returns true), re-write the config with the value set in juju config.

This will avoid deploy time collisions of the rabbitmq cluster management and the charm cluster build process.
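The proposed toggle can be sketched as a tiny pure function (a hypothetical helper, not the charm's actual code), with cluster_ready standing in for the {leader,client}_node_is_ready check:

```python
def partition_handling(cluster_ready, configured_value):
    # Until the cluster is fully formed, force "ignore" so that
    # RabbitMQ's own partition handling cannot evict a node that is
    # still in the process of joining; afterwards, honour the
    # operator-configured juju value.
    if not cluster_ready:
        return "ignore"
    return configured_value
```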

Note: The problem on a very busy cloud still exists. If the rabbitmq cluster management happens to remove a node and a juju hook subsequently runs, we will get hook execution errors. A simple juju resolved will fix the issue.

Note: We may need to validate that {leader,client}_node_is_ready is a sufficient check.

James Page (james-page)
Changed in charm-rabbitmq-server:
status: Confirmed → Triaged
Ryan Beisner (1chb1n)
tags: added: scaleback
Revision history for this message
Marian Gasparovic (marosg) wrote :

We hit this again today in SQA.

Revision history for this message
Ashley Lai (alai) wrote :
Revision history for this message
Michael Skalka (mskalka) wrote :
Revision history for this message
Michael Skalka (mskalka) wrote :
Revision history for this message
David Ames (thedac) wrote :

UPDATE:

While I assumed this would be about cluster-partition-handling, it seems instead that code introduced a few years ago [0], which attempts to remove nodes the charm is unaware of, is responsible. This is almost certainly where the bug lies.

[0] https://github.com/openstack/charm-rabbitmq-server/commit/08b10513c5725fb740382668c47fc769a6f2936c

Revision history for this message
David Ames (thedac) wrote :

The fix [0] for a 2017 bug [1] is colliding during deployment. The leader node runs update-status at the same time as a non-leader node is joining the cluster, so check_cluster_memberships thinks the non-leader is a dead node and attempts to remove it.

# Leader juju-828573-15-lxd-10 Running update-status hook
2020-05-29 16:44:27 DEBUG juju-log check_cluster_memberships(): 'rabbit@juju-828573-11-lxd-19' in nodes but not in charm relations or running_nodes, telling RabbitMQ to forget about it.

# Non-leader juju-828573-11-lxd-19 joining cluster
2020-05-29 16:44:15 DEBUG juju-log cluster:40: Running ['/usr/sbin/rabbitmqctl', 'stop_app']
2020-05-29 16:44:18 DEBUG cluster-relation-changed Stopping rabbit application on node 'rabbit@juju-828573-11-lxd-19'
2020-05-29 16:44:23 DEBUG juju-log cluster:40: Running ['/usr/sbin/rabbitmqctl', 'start_app']
2020-05-29 16:44:28 DEBUG cluster-relation-changed Starting node 'rabbit@juju-828573-11-lxd-19'
2020-05-29 16:44:30 DEBUG juju-log cluster:40: Waiting for rabbitmq app to start: /<email address hidden>
2020-05-29 16:44:30 DEBUG juju-log cluster:40: Running ['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']
2020-05-29 16:44:35 DEBUG cluster-relation-changed Waiting for 'rabbit@juju-828573-11-lxd-19'
2020-05-29 16:44:35 DEBUG cluster-relation-changed pid is 29035
2020-05-29 16:44:35 DEBUG juju-log cluster:40: Confirmed rabbitmq app is running
2020-05-29 16:44:35 INFO juju-log cluster:40: Host clustered with rabbit@juju-828573-15-lxd-10.

[0] https://github.com/openstack/charm-rabbitmq-server/commit/08b10513c5725fb740382668c47fc769a6f2936c#diff-21870f9ed3cd89ae3ca8d1f237afc315R467
[1] https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1679449
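The eventual fix ("do not attempt to remove a unit during clustering") amounts to guarding the membership check so it stays quiet while a node is joining. A hedged sketch of that logic — the function name, signature, and flag are illustrative assumptions, not the charm's actual code:

```python
def nodes_to_forget(known_nodes, relation_nodes, running_nodes,
                    clustering_in_progress):
    # A node present in the Mnesia node list but absent from both the
    # charm relation and the running-node list looks "dead" to the
    # leader -- yet during cluster formation a joining node transiently
    # matches that description (it has stopped its app to join).
    if clustering_in_progress:
        return []
    return [n for n in known_nodes
            if n not in relation_nodes and n not in running_nodes]
```

With the guard in place, the leader's update-status hook no longer races a non-leader's stop_app/join_cluster/start_app sequence.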

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.opendev.org/732465

Changed in charm-rabbitmq-server:
status: Triaged → In Progress
Revision history for this message
David Ames (thedac) wrote :

I believe [0] will resolve this for SQA. However, I would like to discuss alternative approaches with Frode Nordahl, the original author of the check-and-removal code.

[0] https://review.opendev.org/732465

David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.04 → 20.08
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/732465
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=f1c3deac35138a5e0c8ecc5a3004a838f0322b95
Submitter: Zuul
Branch: master

commit f1c3deac35138a5e0c8ecc5a3004a838f0322b95
Author: David Ames <email address hidden>
Date: Mon Jun 1 14:00:05 2020 -0700

    Do not attempt to remove a unit during clustering

    Do not attempt to remove a node while in the act of clustering.

    Change-Id: Id30ba53b17fb015e8336127ab1479faf7d8f4779
    Partial-Bug: #1818260

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.opendev.org/732928

Revision history for this message
Michael Skalka (mskalka) wrote :
tags: added: field-high
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/20.05)

Fix proposed to branch: stable/20.05
Review: https://review.opendev.org/733285

Revision history for this message
David Ames (thedac) wrote :

Michael,

Looks like those runs are on stable, so here is the stable backport of the trivial fix, which should unblock you:
https://review.opendev.org/#/c/733285/

Also this is in next already: cs:~openstack-charmers-next/rabbitmq-server-369

For completeness: I am in the process of making forget_cluster_node an action, so that it is never called during hook execution. This will stay in master until the 20.08 release:
https://review.opendev.org/#/c/732928/
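The shape of such an action handler could look like the following sketch. This is purely illustrative: the parameter names, the clustering guard, and the return structure are assumptions, not the charm's actual implementation — the point is that removal only happens when an operator explicitly invokes the action, never from an arbitrary hook.

```python
def forget_cluster_node_action(params, forget, clustering_in_progress):
    # Hypothetical action entry point: validate input, refuse to run
    # mid-clustering, and only then delegate to the removal helper.
    node = params.get("node")
    if not node:
        return {"outcome": "failed", "message": "missing 'node' parameter"}
    if clustering_in_progress:
        return {"outcome": "failed",
                "message": "cluster formation in progress; retry later"}
    forget(node)
    return {"outcome": "success", "message": "forgot {}".format(node)}
```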

Revision history for this message
Michael Skalka (mskalka) wrote :

David,

Thank you for the update. We'll pick up the -next version of the charm in testing.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/20.05)

Reviewed: https://review.opendev.org/733285
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=2cd2962c2b3582339577b6aabddcb95e681c8f03
Submitter: Zuul
Branch: stable/20.05

commit 2cd2962c2b3582339577b6aabddcb95e681c8f03
Author: David Ames <email address hidden>
Date: Mon Jun 1 14:00:05 2020 -0700

    Do not attempt to remove a unit during clustering

    Do not attempt to remove a node while in the act of clustering.

    Change-Id: Id30ba53b17fb015e8336127ab1479faf7d8f4779
    Partial-Bug: #1818260
    (cherry picked from commit f1c3deac35138a5e0c8ecc5a3004a838f0322b95)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/732928
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=5318764cc6be62d54ba1bb4d998301f81fb2c63e
Submitter: Zuul
Branch: master

commit 5318764cc6be62d54ba1bb4d998301f81fb2c63e
Author: David Ames <email address hidden>
Date: Tue Jun 2 09:55:05 2020 -0700

    Forget cluster node as an action

    There are race conditions in which the forget_cluster_node can get
    executed against a node that is in the process of joining the cluster.

    This change moves forget_cluster_node to an action that can be
    performed by the administrator. The asses_cluster_status has been
    updated to check for departed nodes and set status pointing toward the
    use of the new forget-cluster-node action.

    Closes-Bug: #1818260
    Change-Id: I64bcdb9811a3816b394395fac19f5af5cc9f9006

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Revision history for this message
Michael Skalka (mskalka) wrote :
Changed in charm-rabbitmq-server:
status: Fix Committed → New
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/20.05)

Fix proposed to branch: stable/20.05
Review: https://review.opendev.org/741983

James Page (james-page)
Changed in charm-rabbitmq-server:
status: New → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-rabbitmq-server (stable/20.05)

Change abandoned by David Ames (<email address hidden>) on branch: stable/20.05
Review: https://review.opendev.org/741983
Reason: Released in 20.08

Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
no longer affects: ubuntu