duplicate messages are processed
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
oslo.messaging | Fix Released | Undecided | Nikita Kalyanov |
Bug Description
oslo.messaging version: Train (however the code in question is still in master)
Messaging Server: Rabbit MQ
Prior to https:/
However, now the cache is working only for Notifications. The code to check duplicates is still there but the cache is filled only in NotificationAMQ
In particular, we observe duplicate operations in Trove DBaaS. Although the operation itself is idempotent, when the message is processed the second time, the calling code already got the reply to the first message and thinks that the agent is not busy anymore. The subsequent operations may clash with the processing of the second message. This reproduces fairly reliably on our CI.
After some strace and tcpdump, the whole flow looks like this:
1. We receive a new message and ACK it.
2. However, the connection to the Rabbit MQ server gets broken and the server cannot get our ACK.
3. The kernel retries the sending of ACK packets and the Rabbit MQ server keeps re-sending the message to us.
4. We process the message and try to send the reply.
5. The ensure machinery detects that the connection is broken and re-establishes it.
6. Our ACK and our reply finally reach the server, but at the same time we get the same message a second time.
7. The caller side of an RPC call gets the response and thinks that we are free now and ready to do another operation while we are actually busy processing the same message a second time.
8. Our processing conflicts with the subsequent operations sent to us.
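The flow above can be reproduced in miniature with a toy model. This is only a sketch: `ToyBroker`, `link_up` and `run_flow` are hypothetical names made up for illustration, not oslo.messaging or RabbitMQ APIs, and the model assumes a broker that keeps redelivering a message until it actually sees the ACK.

```python
class ToyBroker:
    """Hypothetical broker that redelivers a message until it sees an ACK."""

    def __init__(self, message):
        self.message = message
        self.acked = False
        self.link_up = False  # the connection starts out broken (step 2)

    def deliver(self):
        # An un-ACKed message is handed out again on every delivery attempt.
        return None if self.acked else self.message

    def ack(self):
        # The ACK only reaches the broker while the link is healthy.
        if self.link_up:
            self.acked = True


def run_flow():
    broker = ToyBroker({"id": "m1", "op": "resize"})
    processed = []

    msg = broker.deliver()        # step 1: receive the message...
    broker.ack()                  # ...and ACK it, but the ACK is lost (steps 2-3)
    processed.append(msg["id"])   # step 4: process the message

    broker.link_up = True         # step 5: the connection is re-established
    again = broker.deliver()      # step 6: the redelivery arrives first...
    broker.ack()                  # ...and only then does our ACK land
    if again is not None:
        processed.append(again["id"])  # steps 7-8: processed a second time

    return processed


if __name__ == "__main__":
    print(run_flow())
```

In this model the same message id ends up in `processed` twice, which is the situation the caller cannot see once it has received the first reply.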
Changed in oslo.messaging: | |
assignee: | nobody → Nikita Kalyanov (nikitakalyanov) |
status: | In Progress → New |
status: | New → In Progress |
description: | updated |
One solution could be to re-enable message id caching for regular RPC messages, a useful feature that was once there. However, we should not reject the duplicate message, because it is likely already ACK'ed. The reject happens if we let the DuplicateMessageError exception propagate: it is caught and we then try to reject the message here: https://github.com/openstack/oslo.messaging/blob/5aa645b38b4c1cf08b00e687eb6c7c4b8a0211fc/oslo_messaging/_drivers/impl_rabbit.py#L386 That is why we should not raise DuplicateMessageError up the stack, but rather log it and continue.
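A minimal sketch of the "log it and continue" idea, assuming each message carries a unique id. `MessageIdCache`, `handle` and the `"unique_id"` key are hypothetical names used for illustration only; this is not the actual oslo.messaging code or the final patch.

```python
import collections
import logging

LOG = logging.getLogger(__name__)


class MessageIdCache:
    """Bounded cache of recently seen message ids (hypothetical helper)."""

    def __init__(self, max_size=16):
        self._seen = set()
        self._order = collections.deque()
        self._max_size = max_size

    def check_and_add(self, msg_id):
        """Return True if msg_id is new, False if it was already seen."""
        if msg_id in self._seen:
            return False
        if len(self._order) >= self._max_size:
            # Evict the oldest id so the cache does not grow without bound.
            self._seen.discard(self._order.popleft())
        self._seen.add(msg_id)
        self._order.append(msg_id)
        return True


def handle(message, cache, process):
    """Process a message once; log and skip duplicates instead of raising."""
    msg_id = message.get("unique_id")  # assumed key name
    if msg_id is not None and not cache.check_and_add(msg_id):
        # No exception here: the message was most likely ACK'ed already,
        # so there is nothing left to reject.
        LOG.debug("Skipping duplicate message %s", msg_id)
        return False
    process(message)
    return True
```

Because `handle` returns instead of raising, nothing up the stack gets a chance to reject a message that has already been ACK'ed. The cache is bounded, so a duplicate that arrives after its id has been evicted would still be processed again.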