Comment 12 for bug 1835637

Robert Varjasi (robert.varjasi) wrote (last edit ):

We are hitting the same issue with an OpenStack-Ansible Wallaby (23.1.0) deployment on Ubuntu 20.04.3 LTS with RabbitMQ 3.8.23-1.

We have a 3-node controller cluster with an HA policy configured for the queues (a sketch of the policy follows the log excerpt below). After restarting one RabbitMQ node, a few neutron-ovs-agent services on the compute nodes stopped working and are reporting a down state. In addition, the Neutron network nodes are throwing errors like the one above:

Oct 07 09:00:24 network1-neutron-server-container-21d3a527 neutron-server[2559]: 2021-10-07 09:00:24.934 2559 ERROR oslo.messaging._drivers.impl_rabbit [req-3d24f101-1b3e-4e82-9e0a-ded770c6c9a8 14c7975d4b5c4eeea3b156ab25a5fc8d 8b91b8ed072b4cdaa71706c1055cfa07 - default default] Failed to publish message to topic 'neutron': Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications_designate.info' in vhost '/neutron' due to timeout: amqp.exceptions.NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications_designate.info' in vhost '/neutron' due to timeout
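
For context, the HA policy on the '/neutron' vhost was applied roughly like this (policy name and pattern are illustrative, not copied from our deployment; OpenStack-Ansible applies its own defaults):

rabbitmqctl set_policy -p /neutron HA '^(?!amq\.).*' '{"ha-mode":"all","ha-sync-mode":"automatic"}'
rabbitmqctl list_policies -p /neutron   # confirm the policy is in place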

The AMQP servers are up and running, and the cluster is healthy from the RabbitMQ perspective. Network links are up and monitoring doesn't report any problems. Tempest tests were successful after the deployment. Everything went wrong when we started testing our control-plane resiliency by restarting 1 node out of 3.
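
By "healthy" we mean checks along these lines came back clean on all three nodes (exact invocations are from memory, so treat them as a sketch):

rabbitmqctl cluster_status
rabbitmq-diagnostics check_running
rabbitmq-diagnostics check_local_alarms
rabbitmqctl list_queues -p /neutron name state | grep notifications_designate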

The Neutron services are now useless.

It seems RabbitMQ has a problem too:

controller2_rabbit_mq_container-a436cc12 | CHANGED | rc=0 >>
2021-10-07 12:46:54.098 [error] <0.10378.0> Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61751>},{info,[state]}} from <0.10378.0> to <0.10395.3> in an old incarnation (1633108660) of this node (1633603362)

2021-10-07 12:46:54.098 [error] emulator Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61750>},{info,[state]}} from <0.10378.0> to <0.31877.12> in an old incarnation (1633108660) of this node (1633603362)

2021-10-07 12:46:54.098 [error] <0.10378.0> Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61752>},{info,[state]}} from <0.10378.0> to <0.9749.19> in an old incarnation (1633108660) of this node (1633603362)

2021-10-07 12:46:54.099 [error] emulator Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61751>},{info,[state]}} from <0.10378.0> to <0.10395.3> in an old incarnation (1633108660) of this node (1633603362)

2021-10-07 12:46:54.099 [error] emulator Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61752>},{info,[state]}} from <0.10378.0> to <0.9749.19> in an old incarnation (1633108660) of this node (1633603362)

After shutting down the problematic node, which kept emitting the 'Discarding message' errors, the services started to report UP states again. Maybe it's a RabbitMQ bug.
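
For the record, the workaround looked roughly like this (the node name is from our environment; the openstack command is just how we verified recovery):

# on the node emitting the "Discarding message" errors (controller2 here)
rabbitmqctl shutdown          # or: systemctl stop rabbitmq-server
# then, from a utility host, check that the agents come back up
openstack network agent list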