Add controlled restart of RMQ services to avoid split-brain

Bug #1902793 reported by Paul Goins
This bug affects 2 people
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: 22.04

Bug Description

After an upgrade-charm on a customer's cloud, we found RabbitMQ to have gone into a split-brain situation.

It is believed that the services restarted at roughly the same time, and thus the race may have caused the split-brain situation.

We believe the charm should be reviewed to determine whether its service restart handling can be improved to avoid or reduce the chance of a split-brain occurrence, perhaps by making the nodes perform their restarts one by one rather than in parallel.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Are logs available from the juju units and systemd journal to help corroborate the restarts of the nodes? The charm has built-in logic to roll restarts through the cluster (since the 17.11 charm release). By default, it will wait 30 seconds between each node to do the restart.

It takes the unit number, performs modulo division by the number of units found, and multiplies the result by the known-wait time (defaulting to 30 seconds). I can see that this strategy could produce a collision and restart services at the same time on two different nodes in the cluster if unit numbers reduce to the same modulo value (e.g. units 0, 3, 6 or 1, 4, 7).

The charm also provides module-nodes and known-wait config options as tunable parameters to fine tune these values.
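
To illustrate the staggering described above, here is a minimal sketch of that delay calculation; the function and variable names are illustrative, not the charm's actual identifiers:

    # Sketch of the modulo-based stagger described above; names are
    # illustrative, not the charm's actual code.
    def restart_delay(unit_number, modulo_nodes=3, known_wait=30):
        """Seconds a unit waits before restarting rabbitmq-server."""
        return (unit_number % modulo_nodes) * known_wait

    # With modulo_nodes=3, units 0, 3 and 6 all compute a 0 second delay
    # and restart together; units 1, 4 and 7 all wait 30 seconds, and so
    # on, which is the collision scenario described above.
    for unit in (0, 1, 3, 4, 6, 7):
        print("rmq/%d waits %ds" % (unit, restart_delay(unit)))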

Changed in charm-rabbitmq-server:
status: New → Incomplete
Revision history for this message
Billy Olsen (billy-olsen) wrote :

Sorry, the charm also provides modulo-nodes and known-wait config options. There was a typo in the modulo-nodes option.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

I'll try to dig up logs, but simple sleep offsets for services that sometimes take between 2 seconds and 2 minutes to start can still create race conditions. I think a hand-off of the restarts, similar to the ceph charms, is in order instead of the simple offset: other charms on the host can delay the process through host locking and the like, which would foil a simple timing offset if there is no collaboration over the peer relation between the nodes.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Interesting tidbit to add: since we're not explicitly setting rabbitmq-server's cluster-partition-handling option, the charm's revert of that setting from autoheal to ignore seems to have come into play during this upgrade-charm.

The upgrade-charm was at 2020-11-02 16:51 for unit rmq/0. I see in the rabbit logs on the same node a log about autohealing:

rmq/0:
2020-07-14 19:32:20 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'autoheal'}
2020-11-02 16:51:49 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'ignore'}

=INFO REPORT==== 2-Nov-2020::16:45:53 ===
Autoheal: final node has stopped, starting...
--
=INFO REPORT==== 2-Nov-2020::16:54:08 ===
started TCP Listener on [::]:5672
--
=INFO REPORT==== 2-Nov-2020::16:54:08 ===
Server startup complete; 0 plugins started.

rmq/1:
2020-07-14 19:09:52 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'autoheal'}
2020-11-02 16:29:28 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'ignore'}

=INFO REPORT==== 2-Nov-2020::16:43:28 ===
started TCP Listener on [::]:5672
--
=INFO REPORT==== 2-Nov-2020::16:43:28 ===
Server startup complete; 0 plugins started.
--
=WARNING REPORT==== 2-Nov-2020::16:45:18 ===
Autoheal: we were selected to restart; winner is 'rabbit@juju-3e01e3-27-lxd-3'
--
=ERROR REPORT==== 2-Nov-2020::16:45:53 ===
Mnesia('rabbit@juju-3e01e3-28-lxd-4'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@juju-3e01e3-27-lxd-3'}
--
=INFO REPORT==== 2-Nov-2020::16:45:55 ===
started TCP Listener on [::]:5672
--
=INFO REPORT==== 2-Nov-2020::16:45:55 ===
Server startup complete; 0 plugins started.

rmq/2:
2020-07-14 19:23:00 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'autoheal'}
2020-11-02 16:43:46 DEBUG juju-log writing config file: /etc/rabbitmq/rabbitmq.config , data: {'ssl_mode': 'off', 'cluster_partition_handling': 'ignore'}

=INFO REPORT==== 2-Nov-2020::16:45:07 ===
started TCP Listener on [::]:5672
--
=INFO REPORT==== 2-Nov-2020::16:45:07 ===
Server startup complete; 0 plugins started.

From this small bit of the logs, it almost appears as if rmq/1 and rmq/2 did the proper 30 second offset from each other and rmq/2 trailed by 5 minutes. It also appears that rmq/2 was still set to autoheal and was attempting to do such while rmq/1 and rmq/2 had already gotten set to ignore partitioning.

I'll collect these 6 full logs for your team to review in more depth.

Changed in charm-rabbitmq-server:
status: Incomplete → New
Revision history for this message
Billy Olsen (billy-olsen) wrote :

> but simple sleep offsets for services that sometimes take between 2 seconds and 2 minutes to start can still create race conditions.

Wholeheartedly agree with this statement. There is no doubt that this is racy in various scenarios. I was looking for the logs to get more data on the restarts in an effort to

a) provide evidence regarding the commentary "It is believed that the services restarted at roughly the same time"

b) show whether there was something else that needed to be chased

> It also appears that rmq/2 was still set to autoheal and was attempting to do such while rmq/1 and rmq/2 had already gotten set to ignore partitioning.

This is an example of b): there may be something to chase in how the partitioning strategy is handled, especially when it changes.

Thanks for the information.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Update: the full logs mentioned in comment #4 were provided out of band to the team on MM in my private-fileshare directory.

tags: added: openstack-upgrade
Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Trent Lloyd (lathiat) wrote :

I've seen similar behaviour. One of the things I found was that while known-wait/modulo-nodes is used to defer restarts by 30 seconds between units, the work often takes longer than 30 seconds and so collides with the other units. I also found that the extra work the leader does (and the leader is often the first unit in that list) takes even longer, making a collision with the other units even more likely.

I would suggest that, absent co-ordinating the restarts between nodes, each node could wait up to a longer default timeout (maybe 60-180 seconds) for the cluster to be healthy and all queues to be synchronised before doing a local restart. Or at least wait for the number of nodes online and queues synchronising to stop changing for 30+ seconds, indicating the cluster is likely in a steady state.
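
As a rough illustration of that "steady state" check, here is a sketch that polls `rabbitmqctl cluster_status` until its output stops changing; the command used as the stability signal, the intervals and the overall structure are assumptions for illustration, not what the charm does:

    # Sketch only: poll cluster_status until it is unchanged for `settle`
    # seconds, giving up after `timeout` seconds.
    import subprocess
    import time

    def wait_for_steady_state(settle=30, timeout=180, poll=5):
        """Wait until cluster_status output is unchanged for `settle` seconds."""
        deadline = time.time() + timeout
        last_output = None
        stable_since = time.time()
        while time.time() < deadline:
            output = subprocess.run(
                ["rabbitmqctl", "cluster_status"],
                capture_output=True, text=True).stdout
            if output != last_output:
                last_output = output
                stable_since = time.time()
            elif time.time() - stable_since >= settle:
                return True   # no changes for `settle` seconds; likely steady
            time.sleep(poll)
        return False          # timed out; caller decides whether to proceed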

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-rabbitmq-server (master)

Change abandoned by "gnuoy <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/821071
Reason: Abandoned in favour of a stacked patchset:

https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823799 Charmhelper sync pulling coordinator
https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823800 Coordination module for rabbit restarts
https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823801 Coordinate package upgrades across cluster
https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823802 Coordinate cluster join events

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/823802
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/12de0d964c2c24613df50a8abd212c49c34117e4
Submitter: "Zuul (22348)"
Branch: master

commit 12de0d964c2c24613df50a8abd212c49c34117e4
Author: Liam Young <email address hidden>
Date: Fri Jan 7 11:25:59 2022 +0000

    Coordinate cluster join events

    Use the coordination module when cluster join events are called.
    The `cluster_wait` method has been removed as it is no longer used
    and `cluster_with` has been broken up into three new methods (
    `clustered_with_leader`, `update_peer_cluster_status` and
    `join_leader`) which can be called separately. The `modulo-nodes`
    and `known-wait` charm options have been removed as they are no
    longer needed.

    Closes-Bug: #1902793
    Change-Id: I136f5dcc855da329071e119b67df25d9045e86cc
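
For reference, a minimal sketch of the charmhelpers coordinator pattern that the stacked patchset adopts: each unit requests a restart lock and only restarts once the leader has granted it, so restarts are serialised across the cluster. The lock name and helper functions below are illustrative rather than the charm's exact code:

    from charmhelpers import coordinator
    from charmhelpers.core import host

    serial = coordinator.Serial()

    def request_restart():
        # Ask the leader for the 'restart' lock; it is granted to one
        # unit at a time.
        serial.acquire('restart')

    def maybe_restart():
        # Restart only if this unit currently holds the granted lock.
        if serial.granted('restart'):
            host.service_restart('rabbitmq-server')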

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Changed in charm-rabbitmq-server:
milestone: none → 22.04
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released