Comment 5 for bug 1818260

David Ames (thedac) wrote :

QA caught this, and Alex Kavanagh has also seen it:

From the traceback we can see that the cluster is trying to remove a node. Then the charm runs an arbitrary hook (update-status in this case) which executes rabbitmqctl, and the command fails (a simplified sketch of this call path follows the log):

2019-04-19 02:02:00 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'forget_cluster_node', 'rabbit@landscapeamqp-3']
2019-04-19 02:02:02 DEBUG update-status Removing node 'rabbit@landscapeamqp-3' from cluster
2019-04-19 02:02:02 DEBUG update-status Error: {failed_to_remove_node,'rabbit@landscapeamqp-3',
2019-04-19 02:02:02 DEBUG update-status {active,"Mnesia is running",
2019-04-19 02:02:02 DEBUG update-status 'rabbit@landscapeamqp-3'}}
2019-04-19 02:02:02 DEBUG update-status Traceback (most recent call last):
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/update-status", line 972, in <module>
2019-04-19 02:02:02 DEBUG update-status hooks.execute(sys.argv)
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/charmhelpers/core/hookenv.py", line 914, in execute
2019-04-19 02:02:02 DEBUG update-status self._hooks[hook_name]()
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/charmhelpers/contrib/hardening/harden.py", line 93, in _harden_inner2
2019-04-19 02:02:02 DEBUG update-status return f(*args, **kwargs)
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/update-status", line 968, in update_status
2019-04-19 02:02:02 DEBUG update-status rabbit.check_cluster_memberships()
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 554, in check_cluster_memberships
2019-04-19 02:02:02 DEBUG update-status forget_cluster_node(node)
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 564, in forget_cluster_node
2019-04-19 02:02:02 DEBUG update-status rabbitmqctl('forget_cluster_node', node)
2019-04-19 02:02:02 DEBUG update-status File "/var/lib/juju/agents/unit-landscape-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 376, in rabbitmqctl
2019-04-19 02:02:02 DEBUG update-status subprocess.check_call(cmd)
2019-04-19 02:02:02 DEBUG update-status File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
2019-04-19 02:02:02 DEBUG update-status raise CalledProcessError(retcode, cmd)
2019-04-19 02:02:02 DEBUG update-status subprocess.CalledProcessError: Command '['/usr/sbin/rabbitmqctl', 'forget_cluster_node', 'rabbit@landscapeamqp-3']' returned non-zero exit status 70.
2019-04-19 02:02:02 ERROR juju.worker.uniter.operation runhook.go:132 hook "update-status" failed: exit status 1
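
For reference, the wrapper at the bottom of that traceback uses subprocess.check_call, so any non-zero exit from rabbitmqctl propagates as CalledProcessError and fails whatever hook happens to be running. A simplified sketch of the call path (function names match rabbit_utils.py in the traceback above; the bodies are paraphrased):

import subprocess

def rabbitmqctl(action, *args):
    # check_call raises CalledProcessError on any non-zero exit,
    # which aborts the running hook (update-status above).
    cmd = ['/usr/sbin/rabbitmqctl', action] + list(args)
    subprocess.check_call(cmd)

def forget_cluster_node(node):
    rabbitmqctl('forget_cluster_node', node)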

I suspect this is due to commit b74a50d3 [0]. At deploy time, when resources are scarce, the cluster believes a node is offline and attempts to remove it. Subsequently, a charm hook runs, leading to the traceback above.
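
If the race itself is confirmed, one mitigation on the charm side would be to treat a failed forget_cluster_node as non-fatal, so an unrelated hook does not abort while the node is still reported active ("Mnesia is running"). This is only an illustrative sketch, not a proposed patch; it reuses the function names from the traceback and assumes charmhelpers' log/WARNING helpers:

import subprocess

from charmhelpers.core.hookenv import log, WARNING

def forget_cluster_node(node):
    # Illustrative only: if the node is still active, rabbitmqctl
    # exits non-zero (70 above) and check_call raises. Swallow it
    # so the hook survives and the removal can be retried on a
    # later hook invocation.
    try:
        rabbitmqctl('forget_cluster_node', node)
    except subprocess.CalledProcessError as e:
        log('forget_cluster_node {} failed (rc={}), will retry '
            'later'.format(node, e.returncode), level=WARNING)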

I have asked QA to test with cluster-partition-handling=ignore. If that passes, we will revert [0].

[0] https://github.com/openstack/charm-rabbitmq-server/commit/b74a50d30f5d257a1061263426052ca830528b55