VMs stay stuck in scheduling when rabbitmq leader unit is down

Bug #1993149 reported by Aqsa Malik
This bug affects 16 people
Affects                          Status        Importance  Assigned to    Milestone
OpenStack RabbitMQ Server Charm  Invalid       Low         Unassigned
Ubuntu Cloud Archive             Fix Released  Critical    Unassigned
  Yoga                           Fix Released  Critical    Unassigned
  Zed                            Fix Released  Critical    Unassigned
oslo.messaging                   Fix Released  Undecided   Unassigned
python-oslo.messaging (Ubuntu)   Fix Released  Critical    Corey Bryant
  Jammy                          Fix Released  Critical    Unassigned
  Kinetic                        Fix Released  Critical    Corey Bryant

Bug Description

When testing rabbitmq-server HA in our OpenStack Yoga cloud environment (rabbitmq-server 3.9/stable), we faced the following issues:

- When the leader unit is down we are unable to launch any VMs and the launched ones stay stuck in the 'BUILD' state.

- While checking the logs we see that several OpenStack services have issues communicating with the rabbitmq-server.

- After restarting all the services using rabbitmq (Nova, Cinder, Neutron, etc.), the issue is resolved and VMs can be launched successfully.

The corresponding logs are available at: https://pastebin.ubuntu.com/p/Bk3yktR8tp/

We also observed the same behaviour for the rabbitmq-server unit that is listed first in 'nova.conf' (see the illustrative transport_url snippet below); after restarting that rabbitmq unit, scheduling of VMs works fine again.

This can also be seen in this part of the log:
"Reconnected to AMQP server on 192.168.34.251:5672 via [amqp] client with port 41922."

====== Ubuntu SRU Details =======

[Impact]
Active/active HA for rabbitmq is broken when a node goes down.

[Test Case]
Deploy openstack with 3 units of rabbitmq in active/active HA.
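
A minimal sketch of the verification steps, pieced together from the reproduction steps and commands reported in this bug (unit names, image, flavor and network are assumptions for illustration):

```
# assuming rabbitmq-server/0 is the leader (or the unit listed first in nova.conf)
juju ssh rabbitmq-server/0 "sudo shutdown now"

# attempt to boot an instance; with the bug present it stays in BUILD and
# eventually fails with a MessagingTimeout fault
openstack server create --image jammy --flavor m1.small --network private vm-test
openstack server show vm-test -c status -c fault -f value

# with the fixed python3-oslo.messaging, clients fail over to the remaining
# rabbitmq-server units and the instance reaches ACTIVE
```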

[Regression Potential]
Due to the criticality of this issue, I've decided to revert the upstream change that is causing the problem as a stop-gap until a proper fix is in place. The reverted change originally landed as the fix for https://bugs.launchpad.net/oslo.messaging/+bug/1935864, so we may see the performance degradation in polling described in that bug.

Revision history for this message
Sreekanth Chowdary Sreeramineni (sreersr) wrote :

We followed either of the steps below to reproduce the bug:

1. Login to the leader rabbitmq-server unit and run "shutdown now".
2. Login to the rabbitmq-server unit that is listed first in 'nova.conf' and run "shutdown now".

Workaround applied to restore VM scheduling when hit by this bug:

juju ssh cinder/0 "sudo systemctl restart cinder-scheduler.service"
juju ssh cinder/0 "sudo systemctl restart cinder-volume.service"
juju ssh cinder/1 "sudo systemctl restart cinder-scheduler.service"
juju ssh cinder/1 "sudo systemctl restart cinder-volume.service"
juju ssh cinder/2 "sudo systemctl restart cinder-scheduler.service"
juju ssh cinder/2 "sudo systemctl restart cinder-volume.service"
juju ssh neutron-api/0 "sudo systemctl restart neutron-server.service"
juju ssh neutron-api/1 "sudo systemctl restart neutron-server.service"
juju ssh neutron-api/2 "sudo systemctl restart neutron-server.service"
juju ssh nova-cloud-controller/0 "sudo systemctl restart nova-conductor.service"
juju ssh nova-cloud-controller/0 "sudo systemctl restart nova-scheduler.service"
juju ssh nova-cloud-controller/0 "sudo systemctl restart nova-spiceproxy.service"
juju ssh nova-cloud-controller/1 "sudo systemctl restart nova-conductor.service"
juju ssh nova-cloud-controller/1 "sudo systemctl restart nova-scheduler.service"
juju ssh nova-cloud-controller/1 "sudo systemctl restart nova-spiceproxy.service"
juju ssh nova-cloud-controller/2 "sudo systemctl restart nova-conductor.service"
juju ssh nova-cloud-controller/2 "sudo systemctl restart nova-scheduler.service"
juju ssh nova-cloud-controller/2 "sudo systemctl restart nova-spiceproxy.service"
for i in {6..14}; do juju ssh $i " sudo systemctl restart nova-compute"; done
for i in {6..14}; do juju ssh $i " sudo systemctl restart nova-api-metadata"; done

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

I can confirm I was able to reproduce the issue in a separate Focal/Yoga environment.

I noticed that even without shutting down the rabbitmq-server leader unit, clients (e.g. nova-cloud-controller) keep disconnecting from the non-leader rabbitmq-server units, see /var/log/nova/nova-api-wsgi.log:

```
2022-10-18 11:33:58.928 207484 ERROR oslo.messaging._drivers.impl_rabbit [-] [e4a6f33c-f700-4aa0-b84a-4bc045ead67b] AMQP server on 192.168.30.233:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-18 11:36:50.937 207485 ERROR oslo.messaging._drivers.impl_rabbit [-] [dc9b1800-ed78-4276-bcec-c059a29c8f54] AMQP server on 192.168.30.255:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-18 11:36:51.024 207486 ERROR oslo.messaging._drivers.impl_rabbit [-] [1f708e80-fb6b-439c-914b-a8ab20c52f19] AMQP server on 192.168.30.255:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-18 11:38:23.586 207485 INFO oslo.messaging._drivers.impl_rabbit [-] [dc9b1800-ed78-4276-bcec-c059a29c8f54] Reconnected to AMQP server on 192.168.30.255:5672 via [amqp] client with port 58520.
2022-10-18 11:38:23.696 207484 ERROR oslo.messaging._drivers.impl_rabbit [-] [e4a6f33c-f700-4aa0-b84a-4bc045ead67b] AMQP server on 192.168.30.233:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-18 11:38:23.741 207487 ERROR oslo.messaging._drivers.impl_rabbit [-] [7e1c0354-dd18-4953-a70a-1bce9cfdc31e] AMQP server on 192.168.30.233:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-18 11:38:23.819 207486 INFO oslo.messaging._drivers.impl_rabbit [-] [1f708e80-fb6b-439c-914b-a8ab20c52f19] Reconnected to AMQP server on 192.168.30.255:5672 via [amqp] client with port 58522.
```

In the above snippet, only the two non-leader rabbitmq-server units (192.168.30.233 and 192.168.30.255) keep disconnecting every few minutes. I did not notice disconnections of the third (leader) unit.

Adding ~field-critical as this is blocking the customer deployment.

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

I'm attaching the output of `rabbitmqctl cluster_status` from all rabbitmq-server units before and after shutting down one unit.

The juju-crashdump and juju bundle contain sensitive information, so I'm not sharing them publicly: https://docs.google.com/document/d/1yKPw3vgQWy1rkamQ5YYaoFnWh7PO_TiUEl_r2cnlGaQ/

Steps I used to reproduce:

1. Power off rabbitmq-server/0.

2. Create a VM.

Building the VM fails:

```
$ openstack server show vm-2 -c fault -f value
{'code': 500, 'created': '2022-10-18T14:50:31Z', 'message': 'MessagingTimeout', 'details': 'Traceback (most recent call last):\n File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get\n return self._queues[msg_id].get(block=True, timeout=timeout)\n File "/usr/lib/python3/dist-packages/eventlet/queue.py", line 322, in get\n return waiter.wait()\n File "/usr/lib/python3/dist-packages/eventlet/queue.py", line 141, in wait\n return get_hub().switch()\n File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 313, in switch\n return self.greenlet.switch()\n_queue.Empty\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 1548, in schedule_and_build_instances\n host_lists = self._schedule_instances(context, request_specs[0],\n File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 908, in _schedule_instances\n host_lists = self.query_client.select_destinations(\n File "/usr/lib/python3/dist-packages/nova/scheduler/client/query.py", line 41, in select_destinations\n return self.scheduler_rpcapi.select_destinations(context, spec_obj,\n File "/usr/lib/python3/dist-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py", line 189, in call\n result = self.transport._send(\n File "/usr/lib/python3/dist-packages/oslo_messaging/transport.py", line 123, in _send\n return self._driver.send(target, ctxt, message,\n File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send\n return self._send(target, ctxt, message, wait_for_reply, timeout,\n File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 678, in _send\n result = self._waiter.wait(msg_id, timeout,\n File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in wait\n message = self.waiters.get(msg_id, timeout=timeout)\n File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 443, in get\n raise oslo_messaging.MessagingTimeout(\noslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 0bdb814301504f479abcafea2ee3f7b6\n'}
```

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello,

Thank you for reporting this bug.

Please can you confirm that rabbitmq-server's min-cluster-size config is set?

Thanks,
Corey

Revision history for this message
Przemyslaw Hausman (phausman) wrote (last edit ):

@corey.bryant, yes it is. Please see rabbitmq-server's config below:

```
$ juju config rabbitmq-server --format json | jq -r '.settings | keys[] as $k | "\($k): \(.[$k] | .value)"'
access-network: null
busiest_queues: 0
check-vhosts: null
cluster-network: null
cluster-partition-handling: pause_minority
connection-backlog: null
cron-timeout: 300
enable-auto-restarts: true
erl-vm-io-thread-multiplier: null
exclude_queues: []
ha-bindiface: eth0
ha-mcastport: 5406
ha-vip-only: false
harden: null
key: null
management_plugin: true
max-cluster-tries: 3
min-cluster-size: 3
mirroring-queues: true
mnesia-table-loading-retry-limit: 10
mnesia-table-loading-retry-timeout: 30000
nagios_context: juju
nagios_servicegroups:
notification-ttl: 3600000
prefer-ipv6: false
queue-master-locator: min-masters
queue_thresholds: [[\*, \*, 25000, 27500]]
source: cloud:focal-yoga
ssl: off
ssl_ca: null
ssl_cert: null
ssl_enabled: false
ssl_key: null
ssl_port: 5671
stats_cron_schedule: */5 * * * *
use-syslog: false
vip: null
vip_cidr: 24
vip_iface: eth0
```

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

The problem may be related to the CIS hardening that is applied on this cloud.

I have redeployed the same juju bundle, but this time without CIS hardening, and was not able to reproduce the problem.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Thanks @phausman. There's no /etc/nova/ on the nova-cloud-controller units in the crashdump. Is that expected? I was trying to recreate this this afternoon but I'm having issues getting a deployment up on serverstack. I'll try again in the morning.

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

@corey.bryant, yes, this is expected as the crashdump was taken with `--small /var/log`. Please let me know if you need any additional info and I'll pull it from the target environment.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

@phausman, that's interesting about CIS hardening. If you get any more details for recreating this, or specific logs etc. for access issues or anything else, that would be great.

Revision history for this message
Claudyson (claudyson) wrote :

In our OpenStack Yoga cloud environment (rabbitmq-server 3.9.13-1 / Ubuntu Jammy), deployed using OpenStack-Ansible, we are getting the same issue during RabbitMQ HA tests.
When we stop any rabbitmq cluster member and start a VM creation, the VM status remains in BUILD until we start the RabbitMQ member again.
We aren't using CIS hardening (not available on Ubuntu Jammy).

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I was able to recreate this as well on jammy-yoga. A few findings:

* transport_url is in the DEFAULT section as well as the oslo_messaging_notifications section for nova.conf, though I don't think this is a problem

* the cell_mappings table in the nova_api database contains the same transport url for cell1 as expected (note this doesn't get updated when a rabbit node goes down, probably as expected)

* relation_settings['ha_queues'] never gets set to True anymore in charm-rabbitmq-server for versions >= 3.0.1 [1]. This results in rabbit_ha_queues never getting set to True in clients such as nova [2]. Fixing this doesn't solve the problem, but it may also be required in order to properly mirror queues for failover.

[1] https://opendev.org/openstack/charm-rabbitmq-server/src/branch/master/hooks/rabbitmq_server_relations.py#L485

[2] https://opendev.org/openstack/charm-nova-cloud-controller/src/branch/master/charmhelpers/contrib/openstack/context.py#L703

* kombu_failover_strategy should be defaulting to round-robin according to [3]. I think this is the next place to dig into (see the example config snippet below).

[3] https://docs.openstack.org/oslo.messaging/latest/configuration/opts.html#oslo_messaging_rabbit.kombu_failover_strategy
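
For anyone experimenting with these client-side options, a hedged example of setting them explicitly in an oslo.messaging config (the failover strategy value shown is the documented default and ha_queues is what older charm revisions used to request; this is illustrative, not a confirmed fix for this bug):

```
[oslo_messaging_rabbit]
# documented default; shown only to make the expected failover behaviour explicit
kombu_failover_strategy = round-robin
# what older charm-rabbitmq-server revisions used to request via relation data
rabbit_ha_queues = true
```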

Changed in charm-rabbitmq-server:
assignee: nobody → Corey Bryant (corey.bryant)
Revision history for this message
DUFOUR Olivier (odufourc) wrote (last edit ):

I've run many tests in my lab. The following can be noted:

This is not reproducible on:
* Focal Ussuri
* Focal Wallaby

However, this has been reproduced on:
* Focal Yoga without CIS
* Focal Yoga with CIS

I think it is safe to say that the RabbitMQ cluster is not the root cause.
TL;DR: it seems to be an issue with python3-oslo.messaging on the control plane units.

One noticeable behaviour appears when looking at a unit that makes heavy use of rabbitmq, such as nova-cloud-controller, e.g. in /var/log/nova/nova-conductor.log:

On Yoga these lines would repeat indefinitely:
2022-10-19 11:12:02.457 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:05.529 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:08.600 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
...
2022-10-19 11:12:23.956 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:27.029 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:30.105 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>

Whereas on Ussuri or Wallaby OpenStack, the workers quickly move to another rabbitmq server in the cluster:
2022-10-20 07:50:13.142 73006 ERROR oslo.messaging._drivers.impl_rabbit [-] [9b752ed1-7e84-4618-bfd6-847867ff6fc4] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-10-20 07:50:13.370 73005 ERROR oslo.messaging._drivers.impl_rabbit [req-7648decf-3e3d-4dcf-946d-f9e75b0de335 - - - - -] [a2a5d71e-711f-4221-a50c-87d1e72e4481] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 104...


Revision history for this message
DUFOUR Olivier (odufourc) wrote :

I tested against Xena's version of python3-oslo.messaging and the issue was not reproducible either.

I think it is safe to assume that a regression was introduced between 12.9.1 and 12.13.0 of python3-oslo.messaging.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Olivier, thank you very much. These are some great data points. I'm going to quickly confirm your findings and will then dig into what's new in oslo.messaging since xena.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Upstream changes between 12.9.1 and 12.13.0:

 - [7e8acbf8] Adding support for rabbitmq quorum queues
 - [7b3968d9] [rabbit] use retry parameters during notification sending
 - [76182ec7] Update python testing classifier
 - [1db6de63] Reproduce bug 1917645
 - [02a38f50] amqp1: fix race when reconnecting
 - [d24edef1] Remove deprecation of heartbeat_in_pthread
 - [ca939fc0] rabbit: move stdlib_threading bits into _utils
 - [23040424] Add Python3 yoga unit tests
 - [2a052499] Update master for stable/xena
 - [129c2233] use message id cache for RPC listener
 - [bdcf915e] limit maximum timeout in the poll loop

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I'm fairly sure this has something to do with [bdcf915e]. I've tested a few times now successfully with that commit partially reverted via the following PPA (on top of focal-yoga). For comparison, I was also able to recreate the reported bug with the focal-yoga cloud archive without this PPA enabled.

Can anyone give this package a try with a focal-yoga deploy?

sudo add-apt-repository ppa:corey.bryant/focal-yoga --yes && sudo apt install python3-oslo.messaging=12.13.0-0ubuntu1.1~bpo20.04.1~ppa202210201553 --yes

Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
DUFOUR Olivier (odufourc) wrote :

I tried the package from the PPA on Focal Yoga and it did solve the issue; deploying resources works in my lab. So the culprit has been found, thank you Corey for that and for the PPA package.

Just in case, I also ran a test on the Jammy Zed OpenStack release. Although it uses a more recent version of python3-oslo.messaging, closer to the upstream release, the issue is still present, although in a slightly less severe manner.
--> the units would still spam every second about trying to reconnect indefinitely to the RabbitMQ unit.

This didn't happen when shutting down just any RabbitMQ unit, but only with one or two specific units among the 3 of them.
--> I think it most likely depends on which rabbitmq server the Nova/Cinder/Glance/Neutron workers have decided to connect to.
(Although this might be because, aside from mysql-innodb-cluster and rabbitmq-server, the control plane units are not in HA, so there are far fewer units involved in my test.)

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

Corey, I have also tried the package from the PPA on a target environment and I can confirm it works fine. Stopping any rabbitmq-server unit does not break the cloud anymore.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Great to hear, thanks very much for testing.

Changed in python-oslo.messaging (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Corey Bryant (corey.bryant)
Changed in charm-rabbitmq-server:
assignee: Corey Bryant (corey.bryant) → nobody
importance: Critical → Low
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've added the package and upstream oslo.messaging projects which need fixing. I've triaged the rabbitmq charm as low since it is affected but doesn't require a fix.

Changed in python-oslo.messaging (Ubuntu Jammy):
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Corey Bryant (corey.bryant) wrote :

New versions of the python-oslo.messaging Ubuntu package have been uploaded to the kinetic and jammy unapproved queue and are awaiting SRU team review. The new package versions revert commit [bdcf915e] as a stop-gap until a proper fix is in place.

description: updated
Revision history for this message
Khoi (khoinh5) wrote :

Hi guys. I have tested with oslo.messaging 12.9.4 and it has not fixed this problem. It happens with the cinder service too, not only nova.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Aqsa, or anyone else affected,

Accepted python-oslo.messaging into kinetic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.messaging/14.0.0-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-kinetic to verification-done-kinetic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-kinetic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in python-oslo.messaging (Ubuntu Kinetic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-kinetic
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Aqsa, or anyone else affected,

Accepted python-oslo.messaging into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.messaging/12.13.0-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in python-oslo.messaging (Ubuntu Jammy):
status: Triaged → Fix Committed
tags: added: verification-needed-jammy
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Aqsa, or anyone else affected,

Accepted python-oslo.messaging into yoga-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:yoga-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-yoga-needed to verification-yoga-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-yoga-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-archive:
status: Triaged → Fix Committed
tags: added: verification-yoga-needed
Revision history for this message
Aqsa Malik (aqsam) wrote :

Hi Corey,

Thank you for providing the fix.

I have tried the python3-oslo.messaging package (12.13.0-0ubuntu1.1) in a focal-yoga cloud environment and can confirm that it fixes the issue for us.

Stopping any rabbitmq-server unit any number of times doesn't impact the cloud in any way and all VMs are launched successfully.
I also checked all the logs while testing this and can see that clients readily fail over to another rabbitmq-server unit once a unit goes down.

tags: added: verification-yoga-done
removed: verification-yoga-needed
Revision history for this message
DUFOUR Olivier (odufourc) wrote :

I can confirm this fixes the issue encountered in my lab and on a bigger deployment on Focal-Yoga.
I will try to do a quick test on Jammy Yoga and Zed.

Just a quick comment: after installing the package update, it is necessary to restart many services so that they pick up the updated Python library, before testing any RabbitMQ interruption.
This is the list of services I restart to ensure the fix is working:
juju run -a nova-compute sudo systemctl restart nova-compute nova-api-metadata ceilometer-agent
juju run -a nova-cloud-controller sudo systemctl restart nova-scheduler nova-conductor apache2
juju run -a neutron-api sudo systemctl restart neutron-server
juju run -a glance sudo systemctl restart glance-api
juju run -a cinder sudo systemctl restart cinder-volume cinder-scheduler apache2
juju run -a octavia sudo systemctl restart octavia-worker
juju run -a masakari sudo systemctl restart masakari-engine apache2
juju run -a heat sudo systemctl restart heat-api heat-engine
juju run -a aodh sudo systemctl restart aodh-listener aodh-notifier
juju run -a designate sudo systemctl restart designate-api designate-mdns designate-worker designate-agent

Nobuto Murata (nobuto)
tags: added: verification-needed-zed
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (python-oslo.messaging/14.0.0-0ubuntu1.1)

All autopkgtests for the newly accepted python-oslo.messaging (14.0.0-0ubuntu1.1) for kinetic have finished running.
The following regressions have been reported in tests triggered by the package:

senlin/1:14.0.0-0ubuntu1 (s390x)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/kinetic/update_excuses.html#python-oslo.messaging

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
DUFOUR Olivier (odufourc) wrote :

Tested on Jammy Yoga successfully as well :

ubuntu@juju-0ec0fb-0-lxd-11:~$ apt policy python3-oslo.messaging
python3-oslo.messaging:
  Installed: 12.13.0-0ubuntu1.1
  Candidate: 12.13.0-0ubuntu1.1
  Version table:
 *** 12.13.0-0ubuntu1.1 500
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     12.13.0-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages

tags: added: verification-done-jammy
removed: verification-needed-jammy
Revision history for this message
DUFOUR Olivier (odufourc) wrote :

I can confirm the fix works on Jammy Zed deployment too.

ubuntu@juju-c8bcfc-0-lxd-4:~$ apt policy python3-oslo.messaging
python3-oslo.messaging:
  Installed: 14.0.0-0ubuntu1.1~cloud0
  Candidate: 14.0.0-0ubuntu1.1~cloud0
  Version table:
 *** 14.0.0-0ubuntu1.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu jammy-proposed/zed/main amd64 Packages
        100 /var/lib/dpkg/status
     14.0.0-0ubuntu1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu jammy-updates/zed/main amd64 Packages
     12.13.0-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages

tags: added: verification-done-zed
removed: verification-needed-zed
Revision history for this message
DUFOUR Olivier (odufourc) wrote :

The fix is working on a Kinetic Zed deployment too. (It was a bit painful to deploy and test.)

$ juju ssh nova-cloud-controller/0 sudo apt policy python3-oslo.messaging
python3-oslo.messaging:
  Installed: 14.0.0-0ubuntu1.1
  Candidate: 14.0.0-0ubuntu1.1
  Version table:
 *** 14.0.0-0ubuntu1.1 500
        500 http://archive.ubuntu.com/ubuntu kinetic-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     14.0.0-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu kinetic/main amd64 Packages

tags: added: verification-done-kinetic
removed: verification-needed-kinetic
tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-oslo.messaging - 14.0.0-0ubuntu1.1

---------------
python-oslo.messaging (14.0.0-0ubuntu1.1) kinetic; urgency=medium

  * d/p/revert-limit-maximum-timeout-in-the-poll-loop.patch: This reverts
    an upstream patch that is preventing active/active rabbitmq from
    failing over when a node goes down (LP: #1993149).

 -- Corey Bryant <email address hidden> Fri, 21 Oct 2022 16:45:11 -0400

Changed in python-oslo.messaging (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Tsutomu Kusanagi (t-kusanagi) wrote :

Hi,

We have the same trouble, and we are using Yoga on Ubuntu 20.04.2 LTS (Focal Fossa). So, do you have any plans to release this version of oslo.messaging for 20.04?

oslo_messaging package version is the following:

```bash
root@myhost:/usr/lib/python3/dist-packages# dpkg -l | grep oslo | grep messaging
ii python3-oslo.messaging 12.13.0-0ubuntu1~cloud0
```

We have confirmed that the patch from https://launchpad.net/ubuntu/+source/python-oslo.messaging/12.13.0-0ubuntu1.1 mentioned in #26 resolves the issue in our environment (Ubuntu 20.04.2) too.

Thanks in advance,

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-oslo.messaging - 14.0.0-0ubuntu1.1

---------------
python-oslo.messaging (14.0.0-0ubuntu1.1) kinetic; urgency=medium

  * d/p/revert-limit-maximum-timeout-in-the-poll-loop.patch: This reverts
    an upstream patch that is preventing active/active rabbitmq from
    failing over when a node goes down (LP: #1993149).

 -- Corey Bryant <email address hidden> Fri, 21 Oct 2022 16:45:11 -0400

Changed in python-oslo.messaging (Ubuntu Kinetic):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-oslo.messaging - 12.13.0-0ubuntu1.1

---------------
python-oslo.messaging (12.13.0-0ubuntu1.1) jammy; urgency=medium

  * d/gbp.conf: Create stable/yoga branch.
  * d/p/revert-limit-maximum-timeout-in-the-poll-loop.patch: This reverts
    an upstream patch that is preventing active/active rabbitmq from
    failing over when a node goes down (LP: #1993149).

 -- Corey Bryant <email address hidden> Thu, 20 Oct 2022 15:48:16 -0400

Changed in python-oslo.messaging (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package python-oslo.messaging - 14.0.0-0ubuntu1.1~cloud0
---------------

 python-oslo.messaging (14.0.0-0ubuntu1.1~cloud0) jammy-zed; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 python-oslo.messaging (14.0.0-0ubuntu1.1) kinetic; urgency=medium
 .
   * d/p/revert-limit-maximum-timeout-in-the-poll-loop.patch: This reverts
     an upstream patch that is preventing active/active rabbitmq from
     failing over when a node goes down (LP: #1993149).

Changed in cloud-archive:
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package python-oslo.messaging - 12.13.0-0ubuntu1.1~cloud0
---------------

 python-oslo.messaging (12.13.0-0ubuntu1.1~cloud0) focal-yoga; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 python-oslo.messaging (12.13.0-0ubuntu1.1) jammy; urgency=medium
 .
   * d/gbp.conf: Create stable/yoga branch.
   * d/p/revert-limit-maximum-timeout-in-the-poll-loop.patch: This reverts
     an upstream patch that is preventing active/active rabbitmq from
     failing over when a node goes down (LP: #1993149).

Revision history for this message
Andrew Norrie (apnorrie) wrote :

Commenting that we're deploying ~1000 nodes with kolla-ansible Xena containers
running oslo.messaging 12.9.3 and we have this bug.

We hand-coded the patch in for a test and it resolves this issue.

Really looking forward to seeing this patch implemented upstream.

Many thanks to everyone for all their terrific contributions.

Revision history for this message
Khoi (khoinh5) wrote :

Sorry, I use Kolla Ansible Xena and built images with oslo.messaging==14.0.0; it still has this problem.

Revision history for this message
Andrew Bogott (andrewbogott) wrote :

tl;dr: Adding this config seems to resolve the issue for me:

[oslo_messaging_rabbit]
kombu_reconnect_delay=0.1

long version:

I've been staring at [bdcf915e] off and on for several days, and it looks right to me, in theory. That section of code consists of rather a lot of nested timeouts, and this bug looks like an issue of inner-loop timeouts firing before their outer-loop timeouts have a chance to.

In particular, I think the issue is in this scrap of kombu.connection._ensure_connection:

        def on_error(exc, intervals, retries, interval=0):
            round = self.completes_cycle(retries)
            if round:
                interval = next(intervals)
            if errback:
                errback(exc, interval)
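            # [annotation, not in kombu source] if errback raises, the host
            # switch below is never reached and failover never happens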
            self.maybe_switch_next() # select next host

            return interval if round else 0

If errback (a callback passed in by the oslo driver) throws an exception 100% of the time (as it seems to post-[bdcf915e]) then failover never happens. I can prevent that by ensuring that oslo_messaging_rabbit->kombu_reconnect_delay is less than ACK_REQUEUE_EVERY_SECONDS_MAX (which is now one of our max timeouts thanks to [bdcf915e]).

I'm not 100% convinced that this is the correct fix since it's easy to luck your way out of a timing bug, but it has the advantage of not requiring a package upgrade.

I also note that kombu_reconnect_delay is only used in one section of code, prefaced with:

            # TODO(sileht): Check if this is useful since we
            # use kombu for HA connection, the interval_step
            # should sufficient, because the underlying kombu transport
            # connection object freed.

...so maybe we can rip out that code and remove kombu_reconnect_delay entirely (which would also resolve the timeout contention).

Revision history for this message
Khoi (khoinh5) wrote :

Andrew Bogott (andrewbogott),
I followed your guide and it resolved my problem with Kolla-Ansible Xena. Thank you.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)
Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
Andrew Bogott (andrewbogott) wrote :

I have proposed two different solutions to this issue... I prefer 866616 but it is also likely riskier.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Marking as Invalid as it wasn't a charm issue in the end.

Changed in charm-rabbitmq-server:
status: Triaged → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/866617
Committed: https://opendev.org/openstack/oslo.messaging/commit/0602d1a10ac20c48fa35ad711355c79ee5b0ec77
Submitter: "Zuul (22348)"
Branch: master

commit 0602d1a10ac20c48fa35ad711355c79ee5b0ec77
Author: Andrew Bogott <email address hidden>
Date: Mon Dec 5 09:25:00 2022 -0600

    Increase ACK_REQUEUE_EVERY_SECONDS_MAX to exceed default kombu_reconnect_delay

    Previously the two values were the same; this caused us
    to always exceed the timeout limit ACK_REQUEUE_EVERY_SECONDS_MAX
    which results in various code paths never being traversed
    due to premature timeout exceptions.

    Also apply min/max values to kombu_reconnect_delay so it doesn't
    exceed ACK_REQUEUE_EVERY_SECONDS_MAX and break things again.

    Closes-Bug: #1993149
    Change-Id: I103d2aa79b4bd2c331810583aeca53e22ee27a49

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/883533

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/883537

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/883538

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/883539

Revision history for this message
Sven Kieske (s-kieske) wrote :

Can someone reassign this bug to the correct component, please? This is a bug in oslo.messaging, not in charm-rabbitmq-server.

It seems I don't have the necessary rights to do so.

Also I'd like to know if someone is actively working on the proposed backports for xena, yoga and zed and if there is anything I can do to help there.

Thanks in advance

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by "Andrew Bogott <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/866616
Reason: fixed in https://review.opendev.org/c/openstack/oslo.messaging/+/866617

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/883533
Committed: https://opendev.org/openstack/oslo.messaging/commit/b4b49248bcfcb169f96ab2d47b5d207b1354ffa8
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit b4b49248bcfcb169f96ab2d47b5d207b1354ffa8
Author: Andrew Bogott <email address hidden>
Date: Mon Dec 5 09:25:00 2022 -0600

    Increase ACK_REQUEUE_EVERY_SECONDS_MAX to exceed default kombu_reconnect_delay

    Previously the two values were the same; this caused us
    to always exceed the timeout limit ACK_REQUEUE_EVERY_SECONDS_MAX
    which results in various code paths never being traversed
    due to premature timeout exceptions.

    Also apply min/max values to kombu_reconnect_delay so it doesn't
    exceed ACK_REQUEUE_EVERY_SECONDS_MAX and break things again.

    Closes-Bug: #1993149
    Change-Id: I103d2aa79b4bd2c331810583aeca53e22ee27a49
    (cherry picked from commit 0602d1a10ac20c48fa35ad711355c79ee5b0ec77)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 14.3.1

This issue was fixed in the openstack/oslo.messaging 14.3.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/883537
Committed: https://opendev.org/openstack/oslo.messaging/commit/fa3195a3459cae3f4e9be43f114ee2d5eb7a60f1
Submitter: "Zuul (22348)"
Branch: stable/zed

commit fa3195a3459cae3f4e9be43f114ee2d5eb7a60f1
Author: Andrew Bogott <email address hidden>
Date: Mon Dec 5 09:25:00 2022 -0600

    Increase ACK_REQUEUE_EVERY_SECONDS_MAX to exceed default kombu_reconnect_delay

    Previously the two values were the same; this caused us
    to always exceed the timeout limit ACK_REQUEUE_EVERY_SECONDS_MAX
    which results in various code paths never being traversed
    due to premature timeout exceptions.

    Also apply min/max values to kombu_reconnect_delay so it doesn't
    exceed ACK_REQUEUE_EVERY_SECONDS_MAX and break things again.

    Closes-Bug: #1993149
    Change-Id: I103d2aa79b4bd2c331810583aeca53e22ee27a49
    (cherry picked from commit 0602d1a10ac20c48fa35ad711355c79ee5b0ec77)
    (cherry picked from commit b4b49248bcfcb169f96ab2d47b5d207b1354ffa8)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/883538
Committed: https://opendev.org/openstack/oslo.messaging/commit/f20a905ea6f41399c1723f8f1cbd0bc1097b8672
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit f20a905ea6f41399c1723f8f1cbd0bc1097b8672
Author: Andrew Bogott <email address hidden>
Date: Mon Dec 5 09:25:00 2022 -0600

    Increase ACK_REQUEUE_EVERY_SECONDS_MAX to exceed default kombu_reconnect_delay

    Previously the two values were the same; this caused us
    to always exceed the timeout limit ACK_REQUEUE_EVERY_SECONDS_MAX
    which results in various code paths never being traversed
    due to premature timeout exceptions.

    Also apply min/max values to kombu_reconnect_delay so it doesn't
    exceed ACK_REQUEUE_EVERY_SECONDS_MAX and break things again.

    Closes-Bug: #1993149
    Change-Id: I103d2aa79b4bd2c331810583aeca53e22ee27a49
    (cherry picked from commit 0602d1a10ac20c48fa35ad711355c79ee5b0ec77)
    (cherry picked from commit b4b49248bcfcb169f96ab2d47b5d207b1354ffa8)
    (cherry picked from commit fa3195a3459cae3f4e9be43f114ee2d5eb7a60f1)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/883539
Committed: https://opendev.org/openstack/oslo.messaging/commit/ae7d6d28aad3a0490813dbb997b064fb5db7d5c4
Submitter: "Zuul (22348)"
Branch: stable/xena

commit ae7d6d28aad3a0490813dbb997b064fb5db7d5c4
Author: Andrew Bogott <email address hidden>
Date: Mon Dec 5 09:25:00 2022 -0600

    Increase ACK_REQUEUE_EVERY_SECONDS_MAX to exceed default kombu_reconnect_delay

    Previously the two values were the same; this caused us
    to always exceed the timeout limit ACK_REQUEUE_EVERY_SECONDS_MAX
    which results in various code paths never being traversed
    due to premature timeout exceptions.

    Also apply min/max values to kombu_reconnect_delay so it doesn't
    exceed ACK_REQUEUE_EVERY_SECONDS_MAX and break things again.

    Closes-Bug: #1993149
    Change-Id: I103d2aa79b4bd2c331810583aeca53e22ee27a49
    (cherry picked from commit 0602d1a10ac20c48fa35ad711355c79ee5b0ec77)
    (cherry picked from commit b4b49248bcfcb169f96ab2d47b5d207b1354ffa8)
    (cherry picked from commit fa3195a3459cae3f4e9be43f114ee2d5eb7a60f1)
    (cherry picked from commit ee6923c07c469183fc82cfda30ab78c295cc6a12)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 14.2.1

This issue was fixed in the openstack/oslo.messaging 14.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.13.1

This issue was fixed in the openstack/oslo.messaging 12.13.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 14.0.1

This issue was fixed in the openstack/oslo.messaging 14.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging xena-eom

This issue was fixed in the openstack/oslo.messaging xena-eom release.
