When running remove-unit, node information is not removed

Bug #1679449 reported by Yoshi Kadokawa

Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: Medium
Assigned to: Frode Nordahl
Milestone: 17.08

Bug Description

I have deployed a rabbitmq-server cluster, and the status looks like this:

$ juju status rabbitmq-server
Model Controller Cloud/Region Version
rabbitmq devmaas devmaas 2.1.2

App Version Status Scale Charm Store Rev OS Notes
rabbitmq-server 3.5.7 active 3 rabbitmq-server jujucharms 61 ubuntu

Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/0* active idle 0/lxd/6 10.12.1.174 5672/tcp Unit is ready and clustered
rabbitmq-server/1 active idle 1/lxd/6 10.12.1.178 5672/tcp Unit is ready and clustered
rabbitmq-server/2 active idle 2/lxd/6 10.12.1.193 5672/tcp Unit is ready and clustered

Machine State DNS Inst id Series AZ
0 started 10.12.1.248 7c3whf xenial default
0/lxd/6 started 10.12.1.174 juju-6bd42f-0-lxd-6 xenial
1 started 10.12.1.249 ww8nyf xenial default
1/lxd/6 started 10.12.1.178 juju-6bd42f-1-lxd-6 xenial
2 started 10.12.1.246 acdyn8 xenial default
2/lxd/6 started 10.12.1.193 juju-6bd42f-2-lxd-6 xenial

1. The RabbitMQ cluster status before removing the unit:
$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@juju-6bd42f-0-lxd-6' ...
[{nodes,[{disc,['rabbit@juju-6bd42f-0-lxd-6','rabbit@juju-6bd42f-1-lxd-6',
                'rabbit@juju-6bd42f-2-lxd-6']}]},
 {running_nodes,['rabbit@juju-6bd42f-2-lxd-6','rabbit@juju-6bd42f-1-lxd-6',
                 'rabbit@juju-6bd42f-0-lxd-6']},
 {cluster_name,<<"rabbit@juju-6bd42f-1-lxd-6">>},
 {partitions,[]}]

2. The RabbitMQ cluster status after removing the unit:
$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@juju-6bd42f-0-lxd-6' ...
[{nodes,[{disc,['rabbit@juju-6bd42f-0-lxd-6','rabbit@juju-6bd42f-1-lxd-6',
                'rabbit@juju-6bd42f-2-lxd-6']}]},
 {running_nodes,['rabbit@juju-6bd42f-1-lxd-6','rabbit@juju-6bd42f-0-lxd-6']},
 {cluster_name,<<"rabbit@juju-6bd42f-1-lxd-6">>},
 {partitions,[]}]

As you can see, the removed unit's node is still listed under "nodes".
This can cause problems, for instance if you re-add a unit with the same hostname.

For now, after running remove-unit on a rabbitmq-server unit, you need to run
$ sudo rabbitmqctl forget_cluster_node <rabbit@hostname>

I think this should be run in a hook when remove-unit is executed.
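
As a minimal sketch of what that could look like, assuming a hypothetical helper that a stop or -departed hook could call to wrap the manual rabbitmqctl command above (not the charm's actual code):

import subprocess


def forget_cluster_node(node):
    """Ask the local RabbitMQ broker to forget a departed cluster node.

    `node` is the Erlang node name, e.g. 'rabbit@juju-6bd42f-2-lxd-6'.
    """
    subprocess.check_call(['rabbitmqctl', 'forget_cluster_node', node])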

Tags: sts
Revision history for this message
Edward Hope-Morley (hopem) wrote:

This is definitely something that the charm should be handling. For reference, -departed behavior was completely removed in [1] since it was totally broken, and since then it has been a gap that needs fixing. As far as I can see we have two options here: we either use actions or we implement some safe logic in -departed hooks. For actions, we could add one that would need to be called on a remaining unit to clean up any departed units, or we could add an action to the unit that is about to be removed so that it removes itself from the cluster (the latter also potentially usable in a -departed hook). If you wanted to cover cases where a node dies suddenly and irreconcilably, then I guess either an action on an extant unit or a -departed hook clean-up on the cluster leader might be best (assuming that the hook fires after the leader has switched in the case where the leader died).

[1] https://github.com/openstack/charm-rabbitmq-server/commit/cba419897dab7e85c2baeb23120e3b8d1824f6c2
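
Purely for illustration, a rough sketch of the first option (an action invoked on a remaining unit), assuming a hypothetical action with a `node` parameter; this is not the charm's actual code, and the eventual fix took a different shape (see below):

import subprocess

from charmhelpers.core.hookenv import action_fail, action_get


def forget_unit_action():
    # 'node' would be an action parameter, e.g. 'rabbit@juju-6bd42f-2-lxd-6'.
    node = action_get('node')
    try:
        subprocess.check_call(['rabbitmqctl', 'forget_cluster_node', node])
    except subprocess.CalledProcessError as e:
        action_fail('forget_cluster_node failed: {}'.format(e))


if __name__ == '__main__':
    forget_unit_action()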

Changed in charm-rabbitmq-server:
milestone: none → 17.05
Frode Nordahl (fnordahl)
Changed in charm-rabbitmq-server:
assignee: nobody → Frode Nordahl (fnordahl)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.openstack.org/458491

Changed in charm-rabbitmq-server:
status: New → In Progress
Revision history for this message
Frode Nordahl (fnordahl) wrote:

Having a dormant, no-longer-existing stopped node (unit removed,
server shredded) lying around will also cause trouble in the event that
all RabbitMQ nodes are shut down. In such a situation the cluster will
most likely not start again without operator intervention, as RabbitMQ
will want to interrogate the now non-existent stopped node about any
queues it believes that node is most likely to have authoritative
knowledge about. Since the node has already been gone for quite some
time when this happens, this is of course not true.

I am addressing this in the proposed patch by gracefully leaving the
cluster on unit removal and by performing periodic clean-up that
forgets any abruptly and forcefully removed units.
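
As a rough illustration of those two mechanisms (a graceful leave on removal plus a periodic forget on the leader), here is a minimal Python sketch; the helper names and the way the expected node list would be obtained are assumptions for illustration, not the charm's actual implementation:

import subprocess


def leave_cluster():
    """Gracefully leave the cluster, e.g. on unit removal.

    Stopping the app and resetting the node removes it from the cluster
    it belongs to, so the remaining nodes no longer list it.
    """
    subprocess.check_call(['rabbitmqctl', 'stop_app'])
    subprocess.check_call(['rabbitmqctl', 'reset'])


def forget_departed_nodes(registered_nodes, expected_nodes):
    """Periodic clean-up run on the leader (e.g. from update-status).

    `registered_nodes` would come from the cluster status and
    `expected_nodes` from the node names of units that still exist.
    Only nodes whose units are gone are forgotten; a node that is
    merely down must not be forgotten.
    """
    for node in set(registered_nodes) - set(expected_nodes):
        subprocess.check_call(['rabbitmqctl', 'forget_cluster_node', node])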

Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/458491
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=08b10513c5725fb740382668c47fc769a6f2936c
Submitter: Jenkins
Branch: master

commit 08b10513c5725fb740382668c47fc769a6f2936c
Author: Frode Nordahl <email address hidden>
Date: Thu Apr 20 11:48:01 2017 +0200

    Leave RabbitMQ cluster gracefully on unit removal

    Make leader do periodic check for and forget nodes of abruptly
    removed units in the update-status hook. (See detailed explanation
    in the stop function docstring.)

    Add function to get list of all nodes registered with
    RabbitMQ. Function has modifier to limit list to nodes
    currently running.

    Change existing running_nodes() function to call new function
    with modifier.

    Update amulet test for beam process name with multiple CPUs. The
    test infrastructure now presents test instances with more than one
    CPU core.

    Change-Id: I7eacf9839cd69539d82a76b1ea023e29ba1f5df9
    Closes-Bug: #1679449
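
For illustration, a sketch of what the node-listing helpers described in the commit message could look like when run against the 3.5.x cluster_status output shown earlier in this report; the function names mirror the commit message, but the parsing here is a simplified assumption rather than the charm's actual implementation:

import re
import subprocess


def nodes(get_running=False):
    """Return node names registered with RabbitMQ.

    With get_running=True, only nodes currently running are returned.
    """
    output = subprocess.check_output(
        ['rabbitmqctl', 'cluster_status']).decode('utf-8')
    section = 'running_nodes' if get_running else 'nodes'
    match = re.search(r'\{%s,\[(.*?)\]\}' % section, output, re.DOTALL)
    if not match:
        return []
    return re.findall(r"'([^']+)'", match.group(1))


def running_nodes():
    """Nodes currently running, via the modifier described above."""
    return nodes(get_running=True)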

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 17.05 → 17.08
Changed in charm-rabbitmq-server:
importance: Undecided → Medium
James Page (james-page)
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released