topic_send may lose messages if the queue does not exist

Bug #1661510 reported by JiaJunsu
This bug affects 3 people
Affects: oslo.messaging
Status: In Progress
Importance: Medium
Assigned to: Gabriele Santomaggio

Bug Description

If neutron agents are started before the server, the agents' messages (sent to the server) may be trashed by RabbitMQ.

oslo.messaging only declares the exchange, not the queue, when sending a 'topic' message [1]. The RabbitMQ tutorial says: 'If we send a message to non-existing location, RabbitMQ will just trash the message' [2].

We've found this can cause agents' messages to be lost when the server is not started. Worse, the agents will wait for replies to those messages until they time out, which means they cannot provide service until the wait times out and the messages are resent to the server. We expect agents to receive their replies and be ready to work as soon as the server starts.

There are three possible ways to solve this:
1. Do not declare the exchange when sending 'topic' messages, so that we get an exception if a message is sent to a non-existing exchange.
2. Make sure the queue exists before sending messages, and raise a QueueNotFound exception if it does not.
3. Declare the queue before sending messages, just like notify_send does.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1276
[2] https://www.rabbitmq.com/tutorials/tutorial-one-python.html
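
For illustration, a minimal sketch of the behaviour described above, using py-amqp directly (broker address, credentials, exchange and routing-key names are assumptions made up for this example, not taken from the neutron code):

    import amqp

    # Assumed local broker; 'neutron'/'q-plugin' are illustrative names only.
    conn = amqp.Connection(host='localhost:5672', userid='guest', password='guest')
    conn.connect()
    channel = conn.channel()

    # Like the topic send in impl_rabbit, only the exchange is declared.
    channel.exchange_declare('neutron', 'topic', durable=False, auto_delete=False)

    # No queue has been declared or bound yet (the server is not running),
    # so RabbitMQ silently drops this message -- no error comes back.
    channel.basic_publish(amqp.Message('{"method": "report_state"}'),
                          exchange='neutron', routing_key='q-plugin')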

JiaJunsu (jiajunsu)
description: updated
Revision history for this message
shen.zhixing (fooy5460) wrote :

Can the neutron agent wait for the server to start up? Like how "nova compute" calls wait_until_ready until the "conductor service" is started.

self.conductor_api.wait_until_ready(context.get_admin_context())

Revision history for this message
Ken Giusti (kgiusti) wrote :

One possible approach would be to set the 'mandatory' flag when sending the message. This should cause the broker to send back the message with a "NO ROUTE" error. Not sure exactly how well this is supported (or if it would even work).

Option 1 isn't guaranteed to work since different servers can use the same exchange. Another server may have already created it for its own queue.

Not sure exactly if option 2 is possible with the kombu library, and even if it was there'd be no guarantee that the queue won't be deleted before the client sends the message.

Option 3 will lead to stale RPC requests building up in rabbit's queues.
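
For what it's worth, a rough sketch of the 'mandatory' idea using pika rather than kombu (just to show the broker-side behaviour; all names here are illustrative, and whether kombu exposes this is a separate question):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = conn.channel()
    channel.confirm_delivery()  # so an unroutable publish surfaces as an error

    try:
        channel.basic_publish(exchange='neutron',
                              routing_key='q-plugin',
                              body='{"method": "report_state"}',
                              mandatory=True)  # broker returns the message if no queue is bound
    except pika.exceptions.UnroutableError:
        # basic.return with NO_ROUTE -- retry, log, or fail fast here.
        pass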

Changed in oslo.messaging:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
JiaJunsu (jiajunsu) wrote :

For option 2, py-amqp supports a `passive` argument in queue_declare [1].
And we could rewrite the `Queue.declare` function in kombu to accept a `passive` argument and pass it on to `queue_declare` [2].

[1] https://github.com/celery/py-amqp/blob/v2.3.2/amqp/channel.py#L1028
[2] https://github.com/celery/kombu/blob/v4.2.1/kombu/entity.py#L624
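
A minimal sketch of that check with py-amqp (broker address and queue name are assumptions for illustration):

    import amqp

    conn = amqp.Connection(host='localhost:5672', userid='guest', password='guest')
    conn.connect()
    channel = conn.channel()

    try:
        # passive=True only verifies that the queue exists; it never creates it.
        channel.queue_declare(queue='q-plugin', passive=True)
    except amqp.exceptions.NotFound:
        # The broker replies 404 NOT_FOUND and closes the channel; this is
        # where a QueueNotFound-style error could be raised before publishing.
        raise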

Revision history for this message
Gabriele Santomaggio (gsantomaggio) wrote :

The mandatory flag is the right one for this kind of situation, but it is not supported in Kombu [1].
Checking whether a queue exists for every single publish drops performance drastically.

I would work on implementing the mandatory flag and handling the important messages.

[1] https://github.com/celery/kombu/blob/master/kombu/messaging.py#L129

Ken Giusti (kgiusti)
Changed in oslo.messaging:
assignee: nobody → Gabriele Santomaggio (gsantomaggio)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.opendev.org/659078

Changed in oslo.messaging:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/660373

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by Gabriele Santomaggio (<email address hidden>) on branch: master
Review: https://review.opendev.org/659078
Reason: closed in favor of https://review.opendev.org/#/c/660373/

Revision history for this message
Gabriele Santomaggio (gsantomaggio) wrote :

We are working on it. It seems that Kombu's in-memory transport does not support the "on_return" function; see the issue [1].

I proposed this PR [2]; it does not implement "on_return", but at least it does not crash.

1- https://github.com/celery/kombu/issues/1050
2- https://github.com/celery/kombu/pull/1053

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Hi, I think on one hand this can now be closed, as I think
https://blueprints.launchpad.net/oslo.messaging/+spec/transport-options
has been completed:
https://review.opendev.org/#/q/topic:bp/transport-options+(status:open+OR+status:merged)

However, could you provide some guidance on how we would consume this feature and these fixes in nova, so that we can prevent this issue?

This can result in a number of issues in nova, the most common being a VM stuck in a specific state forever because the RPC was lost. Generally this is fixed by restarting the compute agents on the nova side after RabbitMQ is restarted, but it can be really hard for operators to figure out why this is happening.

I am basically wondering: should we log the failure and retry, or consider marking the host as down to indicate that nova can no longer deliver messages to the compute host's topic queue?

Restarting the compute agent recreates the topic queue, but it is not obvious that this needs to be done when we hit this failure mode, so a log message would help, as would marking the host as down, as a signal to the operator that they need to take some action.

Revision history for this message
Ben Nemec (bnemec) wrote :

I have a brief example of how to use TransportOptions in the project update for last cycle: https://docs.google.com/presentation/d/1nFN2qWQJF7sUmUBDnEU66MDl7z5d6aO2AmE8l2zmPJ0/edit#slide=id.g6221d147b8_1_10 (it's slide 5 if the link doesn't take you directly there).

I also have a bug open to document this better since there's really nothing in the published docs right now: https://bugs.launchpad.net/oslo.messaging/+bug/1849741

And if you want to listen to me ramble about this it's in the project update video: https://youtu.be/gNwWgedIDxc?t=238
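
In the meantime, roughly what consuming it looks like, as a sketch based on the blueprint (option and exception names below are my recollection and should be checked against the slides/docs above):

    import oslo_messaging
    from oslo_config import cfg

    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='compute', server='compute-1')

    # at_least_once=True asks the rabbit driver to publish with the mandatory
    # flag, so an unroutable message raises an error instead of vanishing.
    options = oslo_messaging.TransportOptions(at_least_once=True)
    client = oslo_messaging.RPCClient(transport, target,
                                      transport_options=options)

    # Use client.call(...)/client.cast(...) as usual and catch
    # oslo_messaging.MessageUndeliverable to handle the NO_ROUTE case.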
