Add controlled restart of RMQ services to avoid split-brain
Bug #1902793 reported by Paul Goins
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack RabbitMQ Server Charm | Fix Released | High | Unassigned | 22.04
Bug Description
After an upgrade-charm on a customer's cloud, we found RabbitMQ to have gone into a split-brain situation.
It is believed that the services restarted at roughly the same time, and thus the race may have caused the split-brain situation.
We believe the charm should be reviewed to determine whether its service-restart handling can be improved to avoid or reduce the chance of a split-brain occurrence, perhaps by making the nodes perform their restarts one by one rather than in parallel.
Changed in charm-rabbitmq-server:
status: Incomplete → New
tags: added: openstack-upgrade
Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → High
Changed in charm-rabbitmq-server:
milestone: none → 22.04
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
Are logs available from the juju units and systemd journal to help corroborate the restarts of the nodes? The charm has built-in logic to roll restarts through the cluster (since the 17.11 charm release). By default, it will wait 30 seconds between each node to do the restart.
It takes the unit number, performs modulo division by the number of units found, and multiplies the result by the known-wait time (defaulting to 30 seconds). I can see that this strategy could produce a collision and restart services at the same time on two different nodes in the cluster, if the unit numbers all yield the same remainder (e.g. units 0, 3, 6 or 1, 4, 7 in a three-node cluster).
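For illustration, here is a minimal Python sketch of that delay calculation, assuming the delay is simply (unit number modulo unit count) multiplied by known-wait; the actual charm code may differ in detail:

```python
# Hypothetical sketch of the modulo-based restart stagger described above.
# Assumes each unit's delay is (unit_number % number_of_units) * known_wait;
# not the charm's actual implementation.

def restart_delay(unit_number: int, n_units: int, known_wait: int = 30) -> int:
    """Seconds a unit waits before restarting its RabbitMQ service."""
    return (unit_number % n_units) * known_wait

# A 3-unit cluster whose unit numbers happen to be 0, 3 and 6:
# every unit computes the same delay, so all restart simultaneously.
for unit in (0, 3, 6):
    print(f"unit {unit}: delay {restart_delay(unit, 3)}s")
# unit 0: delay 0s
# unit 3: delay 0s
# unit 6: delay 0s
```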
The charm also provides modulo-nodes and known-wait config options as tunable parameters to fine-tune this behaviour.
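For example, in the 0/3/6 case above, choosing a divisor that does not evenly divide the unit numbers, such as `juju config rabbitmq-server modulo-nodes=5` (an illustrative value, not a recommendation; adjust the application name to your deployment), would yield remainders 0, 3 and 1 and thus delays of 0, 90 and 30 seconds, spreading the restarts apart.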