rabbitmq_feature_flag check fails for quorum_queue

Bug #2047297 reported by Boris Lukashev
Affects: kolla-ansible
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Upgrading from Zed to Antelope (Jammy hosts & containers), the RabbitMQ quorum queue feature-flag check fails in an odd manner (whether or not the option is enabled):
```
(item=quorum_queue) => {"action": "community.rabbitmq.rabbitmq_feature_flag", "ansible_loop_var": "item", "changed": false, "cmd": "/usr/sbin/rabbitmqctl -q -n rabbit list_feature_flags", "item": "quorum_queue", "msg": "Error: {:badrpc, :nodedown}\nArguments given:\n\t-q -n rabbit list_feature_flags\n\n\u001b[1mUsage\u001b[0m\n\nrabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]", "rc": 64, "stderr": "Error: {:badrpc, :nodedown}\nArguments given:\n\t-q -n rabbit list_feature_flags\n\n\u001b[1mUsage\u001b[0m\n\nrabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]\n", "stderr_lines": ["Error: {:badrpc, :nodedown}", "Arguments given:", "\t-q -n rabbit list_feature_flags", "", "\u001b[1mUsage\u001b[0m", "", "rabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]"], "stdout": "", "stdout_lines": []}
```

which is what one might expect if RabbitMQ were actually down, yet all nodes in the cluster happily respond to:
```
docker exec -ti rabbitmq /usr/sbin/rabbitmqctl -q -n rabbit list_feature_flags
name state
classic_mirrored_queue_version enabled
classic_queue_type_delivery_support enabled
drop_unroutable_metric enabled
empty_basic_get_metric enabled
implicit_default_bindings enabled
maintenance_mode_status enabled
quorum_queue enabled
stream_queue enabled
user_limits enabled
virtual_host_metadata enabled

```

... it does sort of block completion of the upgrade, though, given that it happens whether `om_enable_rabbitmq_quorum_queues` is set or not.
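
For reference, the module call behind that failure can be reproduced with a task like the one below. This is a hedged sketch, not the actual kolla-ansible task: the module name, the `quorum_queue` item and the `rabbit` node name are taken from the error output above, everything else is illustrative.
```
# Hedged sketch, not the actual kolla-ansible task: a looped call to the
# same module shown in the failing action above.
- name: Check/enable RabbitMQ feature flags
  community.rabbitmq.rabbitmq_feature_flag:
    name: "{{ item }}"
    node: rabbit
  loop:
    - quorum_queue
```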

Boris Lukashev (rageltman) wrote (last edit ):

Something's rotten in the state of Rabbit... any action taken by Kolla-Ansible for RabbitMQ fails:
```
TASK [service-rabbitmq : nova | Ensure RabbitMQ users exist] **************************************************************************************************************************************************************************************************
skipping: [ctl01] => (item=None)
skipping: [ctl01]
skipping: [ctl02] => (item=None)
skipping: [ctl02]
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (5 retries left).
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (4 retries left).
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (3 retries left).
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (2 retries left).
FAILED - RETRYING: [ctl00]: nova | Ensure RabbitMQ users exist (1 retries left).
failed: [ctl00] (item=None) => {"attempts": 5, "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
fatal: [ctl00 -> {{ service_rabbitmq_delegate_host }}]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
```
while running `list_users` on the host returns:
```
docker exec -ti rabbitmq /usr/sbin/rabbitmqctl -q -n rabbit list_users
user tags
openstack [administrator]
```
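
For context, the failing task has roughly the shape below. This is a hedged sketch of a `community.rabbitmq.rabbitmq_user` call, not the real service-rabbitmq task: the user name comes from the output above, the password variable and permissions are purely illustrative, and the real task runs with `no_log: true`, which is why the error output is censored.
```
# Hedged sketch only; the real task lives in the service-rabbitmq role and its
# parameters differ. Password variable and permissions here are illustrative.
- name: nova | Ensure RabbitMQ users exist
  community.rabbitmq.rabbitmq_user:
    user: openstack
    password: "{{ rabbitmq_password }}"
    permissions:
      - vhost: /
        configure_priv: ".*"
        read_priv: ".*"
        write_priv: ".*"
  delegate_to: "{{ service_rabbitmq_delegate_host }}"
  no_log: true
```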

Kolla-Ansible cannot reconfigure RabbitMQ either:
```
TASK [rabbitmq : Put RabbitMQ node into maintenance mode] *****************************************************************************************************************************************************************************************************
fatal: [ctl00]: FAILED! => {"action": "community.rabbitmq.rabbitmq_upgrade", "changed": false, "cmd": "/usr/sbin/rabbitmqctl list_feature_flags -q", "msg": "Error: {:badrpc, :nodedown}\nArguments given:\n\tlist_feature_flags -q\n\n\u001b[1mUsage\u001b[0m\n\nrabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]", "rc": 64, "stderr": "Error: {:badrpc, :nodedown}\nArguments given:\n\tlist_feature_flags -q\n\n\u001b[1mUsage\u001b[0m\n\nrabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]\n", "stderr_lines": ["Error: {:badrpc, :nodedown}", "Arguments given:", "\tlist_feature_flags -q", "", "\u001b[1mUsage\u001b[0m", "", "rabbitmqctl [--node <node>] [--longnames] [--quiet] list_feature_flags [<column> ...] [--timeout <timeout>]"], "stdout": "", "stdout_lines": []}
```
leaving the cloud in a somewhat strange and very down state, while the cluster reports itself as up, not in maintenance, and with no alarms showing:
```
Basics

Cluster name: rabbit@ctl01
Total CPU cores available cluster-wide: 96

Disk Nodes

rabbit@ctl00
rabbit@ctl01
rabbit@ctl02

Running Nodes

rabbit@ctl00
rabbit@ctl01
rabbit@ctl02

Versions

rabbit@ctl00: RabbitMQ 3.10.24 on Erlang 25.3.2.3
rabbit@ctl01: RabbitMQ 3.10.24 on Erlang 25.3.2.3
rabbit@ctl02: RabbitMQ 3.10.24 on Erlang 25.3.2.3

CPU Cores

Node: rabbit@ctl00, available CPU cores: 32
Node: rabbit@ctl01, available CPU cores: 32
Node: rabbit@ctl02, availab...


Boris Lukashev (rageltman) wrote (last edit ):

Tracked the `reconfigure` piece down to:
```
    def is_maint_flag_enabled(self):
        feature_flags = self._exec('rabbitmqctl', ['list_feature_flags', '-q'], True)
        for param_item in feature_flags:
            name, state = param_item.split('\t')
            if name == 'maintenance_mode_status' and state == 'enabled':
                return True
        return False

```
inside `ansible_collections/community/rabbitmq/plugins/modules/rabbitmq_upgrade.py`, and there doesn't seem to be any magic going on here, except maybe in its handling of `self`, because running that exact command (`rabbitmqctl list_feature_flags -q`) against all rabbitmq containers works fine.
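
As a hedged diagnostic (not something kolla-ansible ships), running the same rabbitmqctl command through Ansible instead of an interactive `docker exec` can show whether the failure follows the execution path rather than the command itself:
```
# Hedged diagnostic sketch: run the same rabbitmqctl command, but via Ansible,
# and print the result. If this succeeds while the module fails, the problem
# is in the module's execution environment, not the command.
- name: Run list_feature_flags via Ansible for comparison
  command: docker exec rabbitmq /usr/sbin/rabbitmqctl list_feature_flags -q
  register: flags_via_ansible
  changed_when: false

- name: Show feature flags as seen by Ansible
  debug:
    var: flags_via_ansible.stdout_lines
```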

Boris Lukashev (rageltman) wrote :

... Ceph is the culprit.
Looks like the Ceph stack re-introduced 127.0.1.1 into the hosts file during its lifecycle, right under the explicitly commented-out prior entry. No idea why the CLI commands work while the Ansible-run ones don't, but we need a safety check for that hosts-file entry, because it can apparently be re-introduced by other things fairly quietly and then cause nondescript failures in the MQ managed here. A sketch of such a check follows below.
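
A hedged sketch of what such a safety check could look like; the task name and placement are hypothetical, not an existing kolla-ansible task:
```
# Hypothetical safety check: drop any 127.0.1.1 entry that could shadow the
# host's real address and break RabbitMQ's node-name resolution.
- name: Ensure 127.0.1.1 does not shadow the hostname in /etc/hosts
  become: true
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '^127\.0\.1\.1\s'
    state: absent
```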
