rabbitmq_feature_flag check fails for quorum_queue

Bug #2047297 reported by Boris Lukashev
Affects: kolla-ansible
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Upgrading from Zed to Antelope (Jammy hosts & containers), the RabbitMQ quorum queue feature-flag check fails in an odd manner (whether or not the option is enabled):
```
(item=quorum_queue) => {"action": "community.rabbitmq.rabbitmq_feature_flag", "ansible_loop_var": "item", "changed": false, "cmd": "/usr/sbin/rabbitmqctl -q -n rabbit list_feature_flags", "item": "quorum_queue", "msg": "Error: {:badrpc, :nodedown}\nArguments given:\n\t-q -n rabbit list_feature_flags\n\n\u001b[1mUsage\u001b[0m\n\nrabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]", "rc": 64, "stderr": "Error: {:badrpc, :nodedown}\nArguments given:\n\t-q -n rabbit list_feature_flags\n\n\u001b[1mUsage\u001b[0m\n\nrabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]\n", "stderr_lines": ["Error: {:badrpc, :nodedown}", "Arguments given:", "\t-q -n rabbit list_feature_flags", "", "\u001b[1mUsage\u001b[0m", "", "rabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]"], "stdout": "", "stdout_lines": []}
```

which is what one might expect if RabbitMQ were actually down, yet all nodes in the cluster happily respond to:
```
docker exec -ti rabbitmq /usr/sbin/rabbitmqctl -q -n rabbit list_feature_flags
name state
classic_mirrored_queue_version enabled
classic_queue_type_delivery_support enabled
drop_unroutable_metric enabled
empty_basic_get_metric enabled
implicit_default_bindings enabled
maintenance_mode_status enabled
quorum_queue enabled
stream_queue enabled
user_limits enabled
virtual_host_metadata enabled

```

... it does sort of block completion of the upgrade, though, given that it happens whether `om_enable_rabbitmq_quorum_queues` is set or not.
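
For reference, the module call behind that failure can be reproduced with a task like the one below. This is a hedged sketch, not the actual kolla-ansible task: the module name, the `quorum_queue` item and the `rabbit` node name are taken from the error output above, everything else is illustrative.
```
# Hedged sketch, not the actual kolla-ansible task: a looped call to the
# same module shown in the failing action above.
- name: Check/enable RabbitMQ feature flags
  community.rabbitmq.rabbitmq_feature_flag:
    name: "{{ item }}"
    node: rabbit
  loop:
    - quorum_queue
```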

Boris Lukashev (rageltman) wrote (last edit ):

Something's rotten in the state of Rabbit... any action taken by Kolla-Ansible for RabbitMQ fails:
```
TASK [service-rabbitmq : nova | Ensure RabbitMQ users exist] **************************************************************************************************************************************************************************************************
skipping: [ctl01] => (item=None)
skipping: [ctl01]
skipping: [ctl02] => (item=None)
skipping: [ctl02]
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (5 retries left).
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (4 retries left).
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (3 retries left).
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (2 retries left).
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (1 retries left).
failed: [ctl00] (item=None) => {"attempts": 5, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
fatal: [ctl00 -> {{ service_rabbitmq_delegate_host }}]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
```
while running `list_users` on the host returns:
```
docker exec -ti rabbitmq /usr/sbin/rabbitmqctl -q -n rabbit list_users
user tags
openstack [administrator]
```
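
For context, the failing task has roughly the shape below. This is a hedged sketch of a `community.rabbitmq.rabbitmq_user` call, not the real service-rabbitmq task: the user name comes from the output above, the password variable and permissions are purely illustrative, and the real task runs with `no_log: true`, which is why the error output is censored.
```
# Hedged sketch only; the real task lives in the service-rabbitmq role and its
# parameters differ. Password variable and permissions here are illustrative.
- name: nova | Ensure RabbitMQ users exist
  community.rabbitmq.rabbitmq_user:
    user: openstack
    password: "{{ rabbitmq_password }}"
    permissions:
      - vhost: /
        configure_priv: ".*"
        read_priv: ".*"
        write_priv: ".*"
  delegate_to: "{{ service_rabbitmq_delegate_host }}"
  no_log: true
```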

Kolla-Ansible cannot reconfigure RabbitMQ either:
```
TASK [rabbitmq : Put RabbitMQ node into maintenance mode] *****************************************************************************************************************************************************************************************************
fatal: [ctl00]: FAILED! => {"action": "community.rabbitmq.rabbitmq_upgrade", "changed": false, "cmd": "/usr/sbin/rabbitmqctl list_feature_flags -q", "msg": "Error: {:badrpc, :nodedown}\nArguments given:\n\tlist_feature_flags -q\n\n\u001b[1mUsage\u001b[0m\n\nrabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]", "rc": 64, "stderr": "Error: {:badrpc, :nodedown}\nArguments given:\n\tlist_feature_flags -q\n\n\u001b[1mUsage\u001b[0m\n\nrabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]\n", "stderr_lines": ["Error: {:badrpc, :nodedown}", "Arguments given:", "\tlist_feature_flags -q", "", "\u001b[1mUsage\u001b[0m", "", "rabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]"], "stdout": "", "stdout_lines": []}
```
leaving the cloud in a somewhat strange and very down state, while the cluster reports itself as up, not in maintenance, and with no alarms showing:
```
Basics

Cluster name: rabbit@ctl01
Total CPU cores available cluster-wide: 96

Disk Nodes

rabbit@ctl00
rabbit@ctl01
rabbit@ctl02

Running Nodes

rabbit@ctl00
rabbit@ctl01
rabbit@ctl02

Versions

rabbit@ctl00: RabbitMQ 3.10.24 on Erlang 25.3.2.3
rabbit@ctl01: RabbitMQ 3.10.24 on Erlang 25.3.2.3
rabbit@ctl02: RabbitMQ 3.10.24 on Erlang 25.3.2.3

CPU Cores

Node: rabbit@ctl00, available CPU cores: 32
Node: rabbit@ctl01, available CPU cores: 32
Node: rabbit@ctl02, availab...


Boris Lukashev (rageltman) wrote (last edit ):

Tracked the `reconfigure` piece down to:
```
    def is_maint_flag_enabled(self):
        feature_flags = self._exec('rabbitmqctl', ['list_feature_flags', '-q'], True)
        for param_item in feature_flags:
            name, state = param_item.split('\t')
            if name == 'maintenance_mode_status' and state == 'enabled':
                return True
        return False

```
inside `ansible_collections/community/rabbitmq/plugins/modules/rabbitmq_upgrade.py`, and there doesn't seem to be any magic going on here, except maybe in its handling of `self`, because running that exact command (`rabbitmqctl list_feature_flags -q`) against all rabbitmq containers works fine.
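
As a hedged diagnostic (not something kolla-ansible ships), running the same rabbitmqctl command through Ansible instead of an interactive `docker exec` can show whether the failure follows the execution path rather than the command itself:
```
# Hedged diagnostic sketch: run the same rabbitmqctl command, but via Ansible,
# and print the result. If this succeeds while the module fails, the problem
# is in the module's execution environment, not the command.
- name: Run list_feature_flags via Ansible for comparison
  command: docker exec rabbitmq /usr/sbin/rabbitmqctl list_feature_flags -q
  register: flags_via_ansible
  changed_when: false

- name: Show feature flags as seen by Ansible
  debug:
    var: flags_via_ansible.stdout_lines
```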

Boris Lukashev (rageltman) wrote :

... Ceph is the culprit.
Looks like the Ceph stack re-introduced 127.0.1.1 into the hosts file during its lifecycle, right under the explicitly commented-out prior entry. No idea why the CLI commands work while the Ansible-run ones don't, but we need a safety check for that hosts-file entry, because it can apparently be re-introduced by other things fairly quietly and then cause nondescript failures in the MQ managed here. A sketch of such a check follows below.
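
A hedged sketch of what such a safety check could look like; the task name and placement are hypothetical, not an existing kolla-ansible task:
```
# Hypothetical safety check: drop any 127.0.1.1 entry that could shadow the
# host's real address and break RabbitMQ's node-name resolution.
- name: Ensure 127.0.1.1 does not shadow the hostname in /etc/hosts
  become: true
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '^127\.0\.1\.1\s'
    state: absent
```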
