Add controlled restart of RMQ services to avoid split-brain

Bug #1902793 reported by Paul Goins
This bug affects 2 people
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: 22.04

Bug Description

After an upgrade-charm on a customer's cloud, we found RabbitMQ to have gone into a split-brain situation.

It is believed that the services restarted at roughly the same time, and thus the race may have caused the split-brain situation.

We believe the charm should be reviewed to determine whether its service restart handling can be improved to avoid or reduce the chance of a split-brain occurrence, perhaps by making the nodes perform their restarts one by one rather than in parallel.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Are logs available from the juju units and systemd journal to help corroborate the restarts of the nodes? The charm has built-in logic to roll restarts through the cluster (since the 17.11 charm release). By default, it will wait 30 seconds between each node to do the restart.

It takes the unit number, performs modulo division by the number of units found, and multiplies the result by the known-wait time (defaulting to 30 seconds). I can see that this strategy could produce a collision and restart services at the same time on two different nodes in the cluster if unit numbers reduce to the same modulo value (e.g. units 0, 3, 6 or 1, 4, 7).

The charm also provides module-nodes and known-wait config options as tunable parameters to fine tune these values.
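
To illustrate the staggering described above, here is a minimal sketch of that delay calculation; the function and variable names are illustrative, not the charm's actual identifiers:

    # Sketch of the modulo-based stagger described above; names are
    # illustrative, not the charm's actual code.
    def restart_delay(unit_number, modulo_nodes=3, known_wait=30):
        """Seconds a unit waits before restarting rabbitmq-server."""
        return (unit_number % modulo_nodes) * known_wait

    # With modulo_nodes=3, units 0, 3 and 6 all compute a 0 second delay
    # and restart together; units 1, 4 and 7 all wait 30 seconds, and so
    # on, which is the collision scenario described above.
    for unit in (0, 1, 3, 4, 6, 7):
        print("rmq/%d waits %ds" % (unit, restart_delay(unit)))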

Changed in charm-rabbitmq-server:
status: New → Incomplete
Revision history for this message
Billy Olsen (billy-olsen) wrote :

Sorry, the charm also provides modulo-nodes and known-wait config options. There was a typo in the modulo-nodes option.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

I'll try to dig up logs, but simple sleep offsets for services that sometimes take between 2 seconds and 2 minutes to start can still create race conditions. I think a hand-off of the restarts, similar to the ceph charms, is in order instead of the simple offset: other charms on the host can delay the process through host locking and the like, which would foil a simple timing offset if there is no collaboration over the peer relation between the nodes.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Interesting tidbit to add: since we're not explicitly setting rabbitmq-server's cluster-partition-handling option, the charm's revert of that setting from autoheal to ignore seems to have come into play during this upgrade-charm.

The upgrade-charm was at 2020-11-02 16:51 for unit rmq/0. I see in the rabbit logs on the same node a log about autohealing:

rmq/0:
2020-07-14 19:32:20 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'autoheal'}
2020-11-02 16:51:49 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'ignore'}

=INFO REPORT==== 2-Nov-2020::16:45:53 ===
Autoheal: final node has stopped, starting...
--
=INFO REPORT==== 2-Nov-2020::16:54:08 ===
started TCP Listener on [::]:5672
--
=INFO REPORT==== 2-Nov-2020::16:54:08 ===
Server startup complete; 0 plugins started.

rmq/1:
2020-07-14 19:09:52 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'autoheal'}
2020-11-02 16:29:28 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'ignore'}

=INFO REPORT==== 2-Nov-2020::16:43:28 ===
started TCP Listener on [::]:5672
--
=INFO REPORT==== 2-Nov-2020::16:43:28 ===
Server startup complete; 0 plugins started.
--
=WARNING REPORT==== 2-Nov-2020::16:45:18 ===
Autoheal: we were selected to restart; winner is 'rabbit@juju-3e01e3-27-lxd-3'
--
=ERROR REPORT==== 2-Nov-2020::16:45:53 ===
Mnesia('rabbit@juju-3e01e3-28-lxd-4'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@juju-3e01e3-27-lxd-3'}
--
=INFO REPORT==== 2-Nov-2020::16:45:55 ===
started TCP Listener on [::]:5672
--
=INFO REPORT==== 2-Nov-2020::16:45:55 ===
Server startup complete; 0 plugins started.

rmq/2:
2020-07-14 19:23:00 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'autoheal'}
2020-11-02 16:43:46 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'ignore'}

=INFO REPORT==== 2-Nov-2020::16:45:07 ===
started TCP Listener on [::]:5672
--
=INFO REPORT==== 2-Nov-2020::16:45:07 ===
Server startup complete; 0 plugins started.

From this small bit of the logs, it almost appears as if rmq/1 and rmq/2 did the proper 30 second offset from each other and rmq/2 trailed by 5 minutes. It also appears that rmq/2 was still set to autoheal and was attempting to do such while rmq/1 and rmq/2 had already gotten set to ignore partitioning.

I'll collect these 6 full logs for your team to review in more depth.

Changed in charm-rabbitmq-server:
status: Incomplete → New
Revision history for this message
Billy Olsen (billy-olsen) wrote :

> but simple sleep offsets for services that sometimes take between 2 seconds and 2 minutes to start can still create race conditions.

Wholeheartedly agree with this statement. There is no doubt that this is racy in various scenarios. I was looking for the logs to get more data on the restarts in an effort to

a) provide evidence regarding the commentary "It is believed that the services restarted at roughly the same time"

b) show whether there was something else that needed to be chased

> It also appears that rmq/2 was still set to autoheal and was attempting to do such while rmq/1 and rmq/2 had already gotten set to ignore partitioning.

This is an example of b): there may be something to chase in how the partitioning strategy is handled, especially when it changes.

Thanks for the information.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Update: the full logs mentioned in comment #4 were provided out of band to the team on MM in my private-fileshare directory.

tags: added: openstack-upgrade
Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Trent Lloyd (lathiat) wrote :

I've seen similar behaviour. One of the things I found was that while known-wait/modulo-nodes is used to defer restarts by 30 seconds between units, the work often takes longer than 30 seconds and so collides with the other units. I also found that the extra work the leader does (and the leader is often the first unit in that list) takes even longer, making a collision with the other units even more likely.

I would suggest that, absent co-ordinating the restarts between nodes, each node could wait up to a longer default timeout (maybe 60-180 seconds) for the cluster to be healthy and all queues to be synchronised before doing a local restart. Or at least wait for the number of nodes online and queues synchronising to stop changing for 30+ seconds, indicating the cluster is likely in a steady state.
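
As a rough illustration of that "steady state" check, here is a sketch that polls `rabbitmqctl cluster_status` until its output stops changing; the command used as the stability signal, the intervals and the overall structure are assumptions for illustration, not what the charm does:

    # Sketch only: poll cluster_status until it is unchanged for `settle`
    # seconds, giving up after `timeout` seconds.
    import subprocess
    import time

    def wait_for_steady_state(settle=30, timeout=180, poll=5):
        """Wait until cluster_status output is unchanged for `settle` seconds."""
        deadline = time.time() + timeout
        last_output = None
        stable_since = time.time()
        while time.time() < deadline:
            output = subprocess.run(
                ["rabbitmqctl", "cluster_status"],
                capture_output=True, text=True).stdout
            if output != last_output:
                last_output = output
                stable_since = time.time()
            elif time.time() - stable_since >= settle:
                return True   # no changes for `settle` seconds; likely steady
            time.sleep(poll)
        return False          # timed out; caller decides whether to proceed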

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-rabbitmq-server (master)

Change abandoned by "gnuoy <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/821071
Reason: Abandoned in favour of a stacked patchset:

https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823799 Charmhelper sync pulling coordinator
https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823800 Coordination module for rabbit restarts
https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823801 Coordinate package upgrades across cluster
https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823802 Coordinate cluster join events

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823802
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/12de0d964c2c24613df50a8abd212c49c34117e4
Submitter: "Zuul (22348)"
Branch: master

commit 12de0d964c2c24613df50a8abd212c49c34117e4
Author: Liam Young <email address hidden>
Date: Fri Jan 7 11:25:59 2022 +0000

    Coordinate cluster join events

    Use the coordination module when cluster join events are called.
    The `cluster_wait` method has been removed as it is no longer used
    and `cluster_with` has been broken up into three new methods (
    `clustered_with_leader`, `update_peer_cluster_status` and
    `join_leader`) which can be called separately. The `modulo-nodes`
    and `known-wait` charm options have been removed as they are no
    longer needed.

    Closes-Bug: #1902793
    Change-Id: I136f5dcc855da329071e119b67df25d9045e86cc
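
For reference, a minimal sketch of the charmhelpers coordinator pattern that the stacked patchset adopts: each unit requests a restart lock and only restarts once the leader has granted it, so restarts are serialised across the cluster. The lock name and helper functions below are illustrative rather than the charm's exact code:

    from charmhelpers import coordinator
    from charmhelpers.core import host

    serial = coordinator.Serial()

    def request_restart():
        # Ask the leader for the 'restart' lock; it is granted to one
        # unit at a time.
        serial.acquire('restart')

    def maybe_restart():
        # Restart only if this unit currently holds the granted lock.
        if serial.granted('restart'):
            host.service_restart('rabbitmq-server')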

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Changed in charm-rabbitmq-server:
milestone: none → 22.04
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released