(404) NOT_FOUND - failed to perform operation on queue 'notifications.info' in vhost '/nova' due to timeout

Bug #1835637 reported by Satish Patel
This bug affects 14 people
Affects: OpenStack Compute (nova)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

Running the Stein version, deployed by openstack-ansible.

============
[root@ostack-osa ~]# openstack keypair create --public-key ~/.ssh/id_rsa.pub magnum-key-bar -vvvv
Starting new HTTP connection (1): 172.28.8.9:8774
http://172.28.8.9:8774 "POST /v2.1/os-keypairs HTTP/1.1" 504 None
RESP: [504] Cache-Control: no-cache Connection: close Content-Type: text/html
RESP BODY: Omitted, Content-Type is set to text/html. Only application/json responses have their bodies logged.
Unknown Error (HTTP 504)
clean_up CreateKeypair: Unknown Error (HTTP 504)
END return value: 1
=============

I have verified that the haproxy load balancer is working fine and that nova-api is listening on 8774. When I checked the nova-api logs, I found the following error:

2019-07-06 19:35:31.200 56 ERROR oslo.messaging._drivers.impl_rabbit [req-242de4d8-9568-40d5-b4db-52d83a8d244f 63847837de444225accd1ae1db2b1f11 6297c04e9593466d9c6747874e379444 - default default] Failed to publish message to topic 'nova': Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'versioned_notifications.error' in vhost '/nova' due to timeout: NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'versioned_notifications.error' in vhost '/nova' due to timeout
2019-07-06 19:35:31.202 56 ERROR oslo.messaging._drivers.impl_rabbit [req-242de4d8-9568-40d5-b4db-52d83a8d244f 63847837de444225accd1ae1db2b1f11 6297c04e9593466d9c6747874e379444 - default default] Unable to connect to AMQP server on 172.28.15.248:5671 after inf tries: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'versioned_notifications.error' in vhost '/nova' due to timeout: NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'versioned_notifications.error' in vhost '/nova' due to timeout
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging [req-242de4d8-9568-40d5-b4db-52d83a8d244f 63847837de444225accd1ae1db2b1f11 6297c04e9593466d9c6747874e379444 - default default] Could not send notification to versioned_notifications. Payload={'_context_domain': None, '_context_request_id': 'req-242de4d8-9568-40d5-b4db-52d83a8d244f', '_context_global_request_id': None, '_context_quota_class': None, 'event_type': u'compute.exception', '_context_service_catalog': [{u'endpoints': [{u'adminURL': u'http://172.28.8.9:9311', u'region': u'RegionOne', u'internalURL': u'http://172.28.8.9:9311', u'publicURL': u'https://10.30.8.9:9311'}], u'type': u'key-manager', u'name': u'barbican'}, {u'endpoints': [{u'adminURL': u'http://172.28.8.9:8776/v3/6297c04e9593466d9c6747874e379444', u'region': u'RegionOne', u'internalURL': u'http://172.28.8.9:8776/v3/6297c04e9593466d9c6747874e379444', u'publicURL': u'https://10.30.8.9:8776/v3/6297c04e9593466d9c6747874e379444'}], u'type': u'volumev3', u'name': u'cinderv3'}, {u'endpoints': [{u'adminURL': u'http://172.28.8.9:8780', u'region': u'RegionOne', u'internalURL': u'http://172.28.8.9:8780', u'publicURL': u'https://10.30.8.9:8780'}], u'type': u'placement', u'name': u'placement'}, {u'endpoints': [{u'adminURL': u'http://172.28.8.9:9696', u'region': u'RegionOne', u'internalURL': u'http://172.28.8.9:9696', u'publicURL': u'https://10.30.8.9:9696'}], u'type': u'network', u'name': u'neutron'}, {u'endpoints': [{u'adminURL': u'http://172.28.8.9:9292', u'region': u'RegionOne', u'internalURL': u'http://172.28.8.9:9292', u'publicURL': u'https://10.30.8.9:9292'}], u'type': u'image', u'name': u'glance'}], 'timestamp': u'2019-07-06 23:34:28.146951', '_context_user': u'63847837de444225accd1ae1db2b1f11', '_unique_id': '36959759468747838bcf2cd94602da0a', '_context_resource_uuid': None, '_context_is_admin_project': True, '_context_read_deleted': 'no', '_context_user_id': u'63847837de444225accd1ae1db2b1f11', 'payload': {'nova_object.version': '1.1', 'nova_object.name': 'ExceptionPayload', 'nova_object.namespace': 'nova', 'nova_object.data': {'module_name': u'nova.objects.keypair', 'exception': u'KeyPairExists', 'traceback': u'Traceback (most recent call last):\n File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, in wrapped\n return f(self, context, *args, **kw)\n File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/compute/api.py", line 5841, in import_key_pair\n keypair.create()\n File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper\n return fn(self, *args, **kwargs)\n File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/objects/keypair.py", line 173, in create\n self._create()\n File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/objects/keypair.py", line 177, in _create\n db_keypair = self._create_in_db(self._context, updates)\n File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/objects/keypair.py", line 136, in _create_in_db\n return _create_in_db(context, values)\n File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1012, in wrapper\n return fn(*args, **kwargs)\n File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/nova/objects/keypair.py", line 73, in _create_in_db\n raise exception.KeyPairExists(key_name=values[\'name\'])\nKeyPairExists: Key pair 
\'magnum-key-foo\' already exists.\n', 'exception_message': u"Key pair 'magnum-key-foo' already exists.", 'function_name': u'_create_in_db'}}, '_context_project_name': u'admin', '_context_system_scope': None, '_context_user_identity': u'63847837de444225accd1ae1db2b1f11 6297c04e9593466d9c6747874e379444 - default default', '_context_auth_token': 'gAAAAABdIS9H21rHUX7kQ4xW-2vUVbFX3f8PQPQwyuY-TahQRxthQQg5vWDLQSR70eiz3iZi7u2IYLLFLz3_S3k0-il-vqcBoWs4kOiBwHofPs1CUyMJAO_zhHE1QtnlNMto0vXZ4UyYFjuXJOY3e7NibpQLL26qIJbZehbhn1s02ZwKBetSVm8', '_context_show_deleted': False, '_context_tenant': u'6297c04e9593466d9c6747874e379444', '_context_roles': [u'admin', u'reader', u'member'], 'priority': 'ERROR', '_context_read_only': False, '_context_is_admin': True, '_context_project_id': u'6297c04e9593466d9c6747874e379444', '_context_project_domain': u'default', '_context_timestamp': '2019-07-06T23:31:20.611554', '_context_user_domain': u'default', '_context_user_name': u'admin', 'publisher_id': u'nova-api:ostack-infra-2-1-nova-api-container-86de0dda', 'message_id': u'e2727710-d0fa-47b5-a136-ed01a8f10b5d', '_context_project': u'6297c04e9593466d9c6747874e379444', '_context_remote_address': '172.28.8.7'}: MessageDeliveryFailure: Unable to connect to AMQP server on 172.28.15.248:5671 after inf tries: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'versioned_notifications.error' in vhost '/nova' due to timeout
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging Traceback (most recent call last):
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/notify/messaging.py", line 70, in notify
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging retry=retry)
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/transport.py", line 135, in _send_notification
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging retry=retry)
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 649, in send_notification
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging envelope=(version == 2.0), notify=True, retry=retry)
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 613, in _send
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging conn.notify_send(exchange, target.topic, msg, retry=retry)
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1230, in notify_send
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging exchange, msg, routing_key=topic, retry=retry)
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1086, in _ensure_publishing
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging self.ensure(method, retry=retry, error_callback=_error_callback)
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging File "/openstack/venvs/nova-19.0.0.0rc3.dev6/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 755, in ensure
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging raise exceptions.MessageDeliveryFailure(msg)
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging MessageDeliveryFailure: Unable to connect to AMQP server on 172.28.15.248:5671 after inf tries: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'versioned_notifications.error' in vhost '/nova' due to timeout
2019-07-06 19:35:31.204 56 ERROR oslo_messaging.notify.messaging

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I think the problem is either with the AMQP config of nova or with the RabbitMQ server nova tries to connect to based on that configuration.

From your logs: "Unable to connect to AMQP server on 172.28.15.248:5671"

Please verify that the AMQP server is listening on the above address.
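
For reference, a minimal way to check that (a sketch, assuming shell access to the RabbitMQ node/container) could be:

```
# Is anything listening on the AMQP TLS port?
ss -tlnp | grep 5671

# Is RabbitMQ itself up, and does it see all cluster members?
rabbitmqctl status
rabbitmqctl cluster_status
```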

Changed in nova:
status: New → Incomplete
Revision history for this message
Satish Patel (satish-txt) wrote :

There is something bigger going on. I have reproduced this error again; I noticed that when I reboot my 3 controller nodes, I start seeing this kind of error.

I did a rolling restart of rabbitmq, and now my neutron service isn't happy, throwing the following errors in the rabbitmq logs.

I have two OpenStack clusters, running Queens and Stein (I have rebooted the Queens controllers many times and they always come back clean, but Stein always gives me a hard time with rabbitmq).

2019-08-01 19:14:58.460 [info] <0.7948.6> accepting AMQP connection <0.7948.6> (172.28.15.247:39370 -> 172.28.15.22:5671)
2019-08-01 19:14:58.463 [info] <0.7948.6> Connection <0.7948.6> (172.28.15.247:39370 -> 172.28.15.22:5671) has a client-provided name: neutron-server:1203:deb45768-d961-4375-a2ea-b1d8a47b44ff
2019-08-01 19:14:58.466 [info] <0.7948.6> connection <0.7948.6> (172.28.15.247:39370 -> 172.28.15.22:5671 - neutron-server:1203:deb45768-d961-4375-a2ea-b1d8a47b44ff): user 'neutron' authenticated and granted access to vhost '/neutron'
2019-08-01 19:14:58.479 [info] <0.7953.6> accepting AMQP connection <0.7953.6> (172.28.15.247:39372 -> 172.28.15.22:5671)
2019-08-01 19:14:58.482 [info] <0.7953.6> Connection <0.7953.6> (172.28.15.247:39372 -> 172.28.15.22:5671) has a client-provided name: neutron-server:1201:8a5feb3c-88cf-43ad-ba68-c3cf800e530e
2019-08-01 19:14:58.485 [info] <0.7953.6> connection <0.7953.6> (172.28.15.247:39372 -> 172.28.15.22:5671 - neutron-server:1201:8a5feb3c-88cf-43ad-ba68-c3cf800e530e): user 'neutron' authenticated and granted access to vhost '/neutron'
2019-08-01 19:14:58.614 [info] <0.7988.6> accepting AMQP connection <0.7988.6> (172.28.15.247:39380 -> 172.28.15.22:5671)
2019-08-01 19:14:58.617 [info] <0.7988.6> Connection <0.7988.6> (172.28.15.247:39380 -> 172.28.15.22:5671) has a client-provided name: neutron-server:1201:50b6cf86-9055-45d7-846d-d13201b406cf
2019-08-01 19:14:58.620 [info] <0.7988.6> connection <0.7988.6> (172.28.15.247:39380 -> 172.28.15.22:5671 - neutron-server:1201:50b6cf86-9055-45d7-846d-d13201b406cf): user 'neutron' authenticated and granted access to vhost '/neutron'
2019-08-01 19:15:02.316 [info] <0.8008.6> accepting AMQP connection <0.8008.6> (172.28.15.247:39402 -> 172.28.15.22:5671)
2019-08-01 19:15:02.319 [info] <0.8008.6> Connection <0.8008.6> (172.28.15.247:39402 -> 172.28.15.22:5671) has a client-provided name: neutron-server:1201:6bf1c5e5-90f8-4e65-995a-2b06442e0970
2019-08-01 19:15:02.322 [info] <0.8008.6> connection <0.8008.6> (172.28.15.247:39402 -> 172.28.15.22:5671 - neutron-server:1201:6bf1c5e5-90f8-4e65-995a-2b06442e0970): user 'neutron' authenticated and granted access to vhost '/neutron'
2019-08-01 19:25:23.173 [info] <0.8780.6> accepting AMQP connection <0.8780.6> (172.28.15.12:48242 -> 172.28.15.22:5671)
2019-08-01 19:25:23.176 [info] <0.8780.6> Connection <0.8780.6> (172.28.15.12:48242 -> 172.28.15.22:5671) has a client-provided name: neutron-server:221:a1f0e472-dd3d-4240-9438-e5c8aff75489
2019-08-01 19:25:23.179 [info] <0.8780.6> connection <0.8780.6> (172.28.15.12:48242 -> 172.28.15.22:5671 - neutron-server:221:a1f0e472-dd3d-4240-9438-e5c8aff75489): user 'neutron' authenticated and granted access to vhost '/neutron'
2019-08-0...


Revision history for this message
Satish Patel (satish-txt) wrote :

[root@ostack-infra-2-2-rabbit-mq-container-a760b2d0 root]# rabbitmqctl eval "rabbit_diagnostics:maybe_stuck()."
warning: the VM is running with native name encoding of latin1 which may cause Elixir to malfunction as it expects utf8. Please ensure your locale is set to UTF-8 (which can be verified by running "locale" in your shell)
2019-08-01 19:33:54 There are 10811 processes.
2019-08-01 19:33:54 Investigated 1 processes this round, 5000ms to go.
2019-08-01 19:33:55 Investigated 1 processes this round, 4500ms to go.
2019-08-01 19:33:55 Investigated 1 processes this round, 4000ms to go.
2019-08-01 19:33:56 Investigated 1 processes this round, 3500ms to go.
2019-08-01 19:33:56 Investigated 1 processes this round, 3000ms to go.
2019-08-01 19:33:57 Investigated 1 processes this round, 2500ms to go.
2019-08-01 19:33:57 Investigated 1 processes this round, 2000ms to go.
2019-08-01 19:33:58 Investigated 1 processes this round, 1500ms to go.
2019-08-01 19:33:58 Investigated 1 processes this round, 1000ms to go.
2019-08-01 19:33:59 Investigated 1 processes this round, 500ms to go.
2019-08-01 19:33:59 Found 1 suspicious processes.
2019-08-01 19:33:59 [{pid,<10459.1.0>},
                     {registered_name,erts_code_purger},
                     {current_stacktrace,
                         [{erts_code_purger,wait_for_request,0,[]}]},
                     {initial_call,{erts_code_purger,start,0}},
                     {message_queue_len,0},
                     {links,[]},
                     {monitors,[]},
                     {monitored_by,[]},
                     {heap_size,233}]
ok

Revision history for this message
Satish Patel (satish-txt) wrote :

rabbitmq version is 3.7.15

Revision history for this message
Russell Holloway (russell-holloway) wrote :

I also have a freshly deployed stable/stein instance from openstack-ansible. I began hitting these issues after a rolling restart of my infrastructure hosts (including the 3 rmq containers), with all sorts of errors showing up (for example, I cannot launch VMs due to errors in nova-scheduler).

Is there a workaround or known way to recover in the meantime? I'm not even sure how to get the cluster working again. All rmq instances are shown as running.

Revision history for this message
Fabian Zimmermann (dev-faz) wrote :

I'm also having this issue from time to time. The following normally works:

* delete the vhost and create a new (empty) one => you then have to re-create all users, passwords, ...

or

* use "delete_damaged_queue" from https://github.com/devfaz/os-debug to delete the broken queues.

Setting an ha-policy on every queue seems to reduce/eliminate the issue. I would be glad if someone could confirm this.
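
For anyone wanting to try that, a minimal sketch of such a policy (assuming classic mirrored queues and the /nova vhost; adjust the vhost and pattern to your deployment) would be:

```
# Mirror every non-exclusive queue in the /nova vhost across all cluster nodes
rabbitmqctl set_policy -p /nova HA '^(?!amq\.).*' \
  '{"ha-mode":"all","ha-sync-mode":"automatic"}' --apply-to queues

# Confirm the policy is applied
rabbitmqctl list_policies -p /nova
```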

Revision history for this message
Satish Patel (satish-txt) wrote :

@Fabian,

I do have an ha-policy for every queue, but I am still seeing the same issue.

Revision history for this message
Adrien Cunin (adri2000) wrote :

Experienced the same issue on an OpenStack-Ansible Stein deployment, on Ubuntu 18.04, after rebooting control plane nodes.

Here is the workaround I used successfully (it's kind of OpenStack-Ansible specific):

On each RabbitMQ node (container):

```
rabbitmqctl stop_app
rabbitmqctl reset
service rabbitmq-server stop
```

Then, redeploy RabbitMQ:

```
openstack-ansible rabbitmq-install.yml -e 'rabbitmq_upgrade=true'
```

Then, re-run playbooks for all the OpenStack services relying on RabbitMQ.
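
For example (a sketch; the exact list depends on which services are deployed, and these are the usual openstack-ansible playbook names):

```
openstack-ansible os-nova-install.yml
openstack-ansible os-neutron-install.yml
openstack-ansible os-cinder-install.yml
openstack-ansible os-glance-install.yml
```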

Revision history for this message
Mathias Lindberg (shambulance) wrote :

Sorry about reviving this thread, but we are experiencing this problem on a Rocky and a Stein installation.
I have found a workaround that might work for you or not (or destroy something); ymmv, "you have been warned".
I stop the rabbitmq services on all of the control plane nodes and, *IMPORTANT*, restart them in the same order, i.e. the last one to go down is the first one to start up (roughly as sketched below).
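
On a three-node control plane that looks roughly like this (a sketch; the controller names are placeholders and the service name may differ per distribution):

```
# Stop one node at a time; remember the order (controller3 is the last one down)
ssh controller1 systemctl stop rabbitmq-server
ssh controller2 systemctl stop rabbitmq-server
ssh controller3 systemctl stop rabbitmq-server

# Start in reverse: the last node to go down must be the first one back up
ssh controller3 systemctl start rabbitmq-server
ssh controller2 systemctl start rabbitmq-server
ssh controller1 systemctl start rabbitmq-server
```
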
On to a question: how can we fix this issue in our installations?
It seems to me to be a rabbitmq + mnesia issue; could we just update those packages to versions that are not affected by this? Thanks in advance for any pointers!

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I have just been randomly hit by this, albeit with the Neutron agent being unable to connect to RMQ. The workaround from the previous comment works if one does not mind the deployment freezing for a while (as RMQ goes down as if for an upgrade).

Revision history for this message
Robert Varjasi (robert.varjasi) wrote (last edit ):

We are hitting the same issue with an OpenStack-Ansible Wallaby (23.1.0) deployment on Ubuntu 20.04.3 LTS with rabbitmq 3.8.23-1.

We have a 3-node controller cluster with an HA policy configured for queues. After restarting one rabbitmq node, a few neutron-ovs-agent services stopped working on the compute nodes, reporting a down state. In addition, the Neutron network nodes are throwing errors like the ones above:

Oct 07 09:00:24 network1-neutron-server-container-21d3a527 neutron-server[2559]: 2021-10-07 09:00:24.934 2559 ERROR oslo.messaging._drivers.impl_rabbit [req-3d24f101-1b3e-4e82-9e0a-ded770c6c9a8 14c7975d4b5c4eeea3b156ab25a5fc8d 8b91b8ed072b4cdaa71706c1055cfa07 - default default] Failed to publish message to topic 'neutron': Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications_designate.info' in vhost '/neutron' due to timeout: amqp.exceptions.NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'notifications_designate.info' in vhost '/neutron' due to timeout

The AMQP servers are up and running, and the cluster is healthy from the rabbitmq perspective. Networking links are up and monitoring doesn't report any problems. Tempest tests were successful after the deployment. Everything went wrong when we started to test our control plane resiliency by restarting 1 node out of 3.

The Neutron service is now useless.

It seems RabbitMQ has some problems too:

controller2_rabbit_mq_container-a436cc12 | CHANGED | rc=0 >>
2021-10-07 12:46:54.098 [error] <0.10378.0> Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61751>},{info,[state]}} from <0.10378.0> to <0.10395.3> in an old incarnation (1633108660) of this node (1633603362)

2021-10-07 12:46:54.098 [error] emulator Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61750>},{info,[state]}} from <0.10378.0> to <0.31877.12> in an old incarnation (1633108660) of this node (1633603362)

2021-10-07 12:46:54.098 [error] <0.10378.0> Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61752>},{info,[state]}} from <0.10378.0> to <0.9749.19> in an old incarnation (1633108660) of this node (1633603362)

2021-10-07 12:46:54.099 [error] emulator Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61751>},{info,[state]}} from <0.10378.0> to <0.10395.3> in an old incarnation (1633108660) of this node (1633603362)

2021-10-07 12:46:54.099 [error] emulator Discarding message {'$gen_call',{<0.10378.0>,#Ref<0.2855445335.1402994689.61752>},{info,[state]}} from <0.10378.0> to <0.9749.19> in an old incarnation (1633108660) of this node (1633603362)

After shutting down the harmful node which was emitting the 'Discarding message' errors, services started to report UP states. Maybe it's a rabbitmq bug.

Revision history for this message
Alberto Molina Coballes (alb-molina) wrote :

We are facing the same issue with openstack-ansible Wallaby; this workaround has been used:

1. Stop all nova services
2. Delete the vhost /nova in rabbitmq:

rabbitmqctl delete_vhost /nova

3. Add the vhost /nova again:

rabbitmqctl add_vhost /nova

4. Set permissions on /nova for the nova user:

rabbitmqctl set_permissions -p /nova nova ".*" ".*" ".*"

5. Start all nova services

Revision history for this message
Satish Patel (satish-txt) wrote :

@Alb-molina, with your method you are not setting up the cluster HA policy, which will make all your queues non-HA.
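
One way to spot that situation (a sketch; assumes classic mirrored queues) is to check whether the recreated queues still carry a policy and have mirrors:

```
# Queues with an empty policy/slave_pids column are not mirrored
rabbitmqctl list_queues -p /nova name policy slave_pids
```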

Revision history for this message
Satish Patel (satish-txt) wrote :

After upgrading RabbitMQ from 3.7.x to 3.8.14, I am seeing much better stability. It's been over 2 weeks without a single crash, so I am assuming the 3.7 version has some bugs. I would highly recommend upgrading to 3.8 or 3.9.
