config-changed on non-service-affecting variables (known-wait, modulo-nodes, queue_thresholds) causes unexpected queue mirroring and cluster_wait
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack RabbitMQ Server Charm | Fix Released | High | Liam Young | |
Bug Description
On cs:rabbitmq-
On a cluster of 3 units, running the following:
juju config rabbitmq-server queue_threshold
Produces statuses like:
rabbitmq-server/0* active executing 0/lxd/0 10.0.0.1 5672/tcp (config-changed) Enabling queue mirroring
rabbitmq-server/1 maintenance executing 1/lxd/0 10.0.0.1 5672/tcp (config-changed) Waiting 30 seconds for operation ...
rabbitmq-server/2 maintenance executing 2/lxd/0 10.0.0.1 5672/tcp (config-changed) Waiting 60 seconds for operation ...
Here is a private link to the log from rabbitmq-server/0 (leader unit) during the above operation.
https:/
It appears that any config-changed kicks off cluster_with(), package installs (possibly updates?), and a reconfiguration of all of the related amqp client applications. Touching this much for config changes that are unrelated to the functioning of rabbitmq-server seems operationally dangerous.
This is further exacerbated by the potential need for higher modulo-nodes and known-wait values, which could hold machine locks hostage for a long period of time. For instance, with modulo-nodes = 6 and known-wait = 300, the rabbitmq-server unit matching modulo 5 could hold the host lock for 25+ minutes on an innocent config-changed because of cluster_with()'s cluster_wait() call. See related bug lp#1903771.
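The arithmetic behind the 25+ minute figure can be sketched as follows. This is only an illustration: it assumes the wait is computed as (unit number modulo modulo-nodes) multiplied by known-wait, which is consistent with the 30 and 60 second waits in the statuses above if known-wait is 30. The function name `cluster_wait_seconds` and the modulo-nodes value of 3 used for the three-unit case are assumptions, not taken from the charm.

```python
def cluster_wait_seconds(unit_number: int, modulo_nodes: int, known_wait: int) -> int:
    """Hypothetical helper: estimated staggered wait for one unit, in seconds.

    Assumes wait = (unit_number % modulo_nodes) * known_wait.
    """
    return (unit_number % modulo_nodes) * known_wait


# Three-unit cluster, assuming modulo-nodes = 3 and known-wait = 30.
for unit in range(3):
    print(f"rabbitmq-server/{unit}: {cluster_wait_seconds(unit, 3, 30)}s")

# The case from the description: modulo-nodes = 6, known-wait = 300,
# for the unit whose number is 5 modulo 6.
worst = cluster_wait_seconds(5, 6, 300)
print(f"worst case: {worst}s = {worst / 60} minutes")
```

Under that assumption the last unit in the modulo sequence waits 1500 seconds, which is where the 25 minutes comes from, before any time spent on the rest of the hook.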
description: updated
Changed in charm-rabbitmq-server:
assignee: nobody → Liam Young (gnuoy)
Changed in charm-rabbitmq-server:
milestone: none → 22.04
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
config_changed explicitly calls rabbit.set_all_mirroring_queues. This function always sets the policy and does not check whether that is needed. We have hit some upstream bugs that suggest avoiding this call during turbulent times, as it can cause some de-sync, but I don't have a reference to those bugs atm. In general, avoiding these types of operations when they are not needed is ideal.
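One way to avoid re-applying the policy when nothing has changed is a check-before-set wrapper. The sketch below is not the charm's code: `ensure_policy`, the `get_policy`/`set_policy` callables, and the in-memory `broker` dict are all hypothetical stand-ins for whatever reads and writes the broker's policies, and the policy keys are shown only as an example of a mirroring policy.

```python
from typing import Callable, Dict, Optional

Policy = Dict[str, str]


def ensure_policy(
    vhost: str,
    desired: Policy,
    get_policy: Callable[[str], Optional[Policy]],
    set_policy: Callable[[str, Policy], None],
) -> bool:
    """Apply `desired` only if it differs from the current policy.

    Returns True if the policy was written, False if it was left alone.
    """
    if get_policy(vhost) == desired:
        return False  # Already in place: do not touch the broker.
    set_policy(vhost, desired)
    return True


# In-memory stand-in for the broker, purely for demonstration.
broker: Dict[str, Policy] = {}
mirroring = {"ha-mode": "all", "ha-sync-mode": "automatic"}

first = ensure_policy("openstack", mirroring, broker.get, broker.__setitem__)
second = ensure_policy("openstack", mirroring, broker.get, broker.__setitem__)
print(first, second)
```

The first call writes the policy and the second is a no-op, so a repeated config-changed would not touch the queues again.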
config_changed then also calls cluster_changed "in case min-cluster-size has changed" and update_clients ("ensure all clients connections are up to date on upgrade").
cluster_changed then calls cluster_with. I couldn't see an obvious path for cluster_with to re-run the other code, but it does reset relations, which will fire relation-changed hooks.
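The general idea of skipping the heavy steps when only unrelated options change could be sketched like this. It is a hypothetical illustration, not the fix that was released: the option set is taken from the bug title, and `changed_keys` and `needs_full_reconfigure` are made-up names operating on plain dicts of previous and current config.

```python
from typing import Any, Dict, Set

# Options the bug title describes as non-service-affecting.
NON_SERVICE_AFFECTING = frozenset({"known-wait", "modulo-nodes", "queue_thresholds"})


def changed_keys(previous: Dict[str, Any], current: Dict[str, Any]) -> Set[str]:
    """Return the option names whose values differ between the two configs."""
    keys = set(previous) | set(current)
    return {k for k in keys if previous.get(k) != current.get(k)}


def needs_full_reconfigure(previous: Dict[str, Any], current: Dict[str, Any]) -> bool:
    """True only if something other than a non-service-affecting option changed."""
    return bool(changed_keys(previous, current) - NON_SERVICE_AFFECTING)


old = {"known-wait": 30, "modulo-nodes": 3, "min-cluster-size": 3}
print(needs_full_reconfigure(old, {**old, "known-wait": 300}))
print(needs_full_reconfigure(old, {**old, "min-cluster-size": 5}))
```

With a guard like this, a change to known-wait alone would skip clustering and client updates, while a change to min-cluster-size would still take the full path.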