duplicate messages are processed

Bug #1935883 reported by Nikita Kalyanov
30
This bug affects 7 people
Affects Status Importance Assigned to Milestone
oslo.messaging
Fix Released
Undecided
Nikita Kalyanov

Bug Description

oslo.messaging version: Train (however the code in question is still in master)
Messaging Server: Rabbit MQ

Prior to https://review.opendev.org/c/openstack/oslo.messaging/+/581417 we had a common message id cache mechanism that worked for both regular RPC messages and Notifications (because the common class was used). It prevented duplicate messages from processing.
However, now the cache is working only for Notifications. The code to check duplicates is still there but the cache is filled only in NotificationAMQPIncomingMessage class and not in AMQPIncomingMessage. This leads to the duplicate messages being processed sometimes, which may be not expected by the application-level code. It is possible if the connection gets broken while the server did not receive our ACK.
In particular, we observe duplicate operations in Trove DBaaS. Although the operation itself is idempotent, when the message is processed the second time, the calling code already got the reply to the first message and thinks that the agent is not busy anymore. The subsequent operations may clash with the processing of the second message. This reproduces fairly reliably on our CI.
After some strace and tcpdump, the whole flow looks like this:
1. We receive a new message and ACK it.
2. However, the connection to the Rabbit MQ server gets broken and the server cannot get our ACK.
3. The kernel retries the sending of ACK packets and the Rabbit MQ server keeps re-sending the message to us.
4. We process the message and try to send the reply.
5. The ensure machinery detects that the connection is broken and re-establishes it.
6. Our ACK and our reply finally reach the server, but at the same time we get the same message a second time.
7. The caller side of an RPC call gets the response and thinks that we are free now and ready to do another operation while we are actually busy processing the same message a second time.
8. Our processing conflicts with the subsequent operations sent to us.

Revision history for this message
Nikita Kalyanov (nikitakalyanov) wrote (last edit ):

One of the solutions could be to activate the message id caching for regular RPC messages, as a useful feature that once was there. However, we should not reject the duplicate message because it is likely already ACK'ed. The reject is happening if we let the DuplicateMessageError exception to raise up, then it will be caught and we will try to reject the message here: https://github.com/openstack/oslo.messaging/blob/5aa645b38b4c1cf08b00e687eb6c7c4b8a0211fc/oslo_messaging/_drivers/impl_rabbit.py#L386 That is why we should not throw DuplicateMessageError up the stack, but rather log it and continue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)
Changed in oslo.messaging:
status: New → In Progress
Changed in oslo.messaging:
assignee: nobody → Nikita Kalyanov (nikitakalyanov)
status: In Progress → New
status: New → In Progress
description: updated
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/800575
Committed: https://opendev.org/openstack/oslo.messaging/commit/129c223307141be1bc767dfca704b06192458d5b
Submitter: "Zuul (22348)"
Branch: master

commit 129c223307141be1bc767dfca704b06192458d5b
Author: Nikita Kalyanov <email address hidden>
Date: Tue Jul 13 03:31:27 2021 +0300

    use message id cache for RPC listener

    Return back the message id cache feature to RPC listener, it was
    removed while refactoring in I708c3d6676b974d8daac6817c15f596cdf35817b
    See attached bug for more info.

    We should not raise DuplicateMessageError to avoid rejecting the
    previously ACK'ed message.

    Closes-Bug: #1935883
    Change-Id: Ie237e9e3fdc3fc27b3deb18b94751cdc3afd190e

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/808878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.10.0

This issue was fixed in the openstack/oslo.messaging 12.10.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/808878
Committed: https://opendev.org/openstack/oslo.messaging/commit/4f459f94f85fd6e688cb03a981f4b2c7ab50c119
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 4f459f94f85fd6e688cb03a981f4b2c7ab50c119
Author: Nikita Kalyanov <email address hidden>
Date: Tue Jul 13 03:31:27 2021 +0300

    use message id cache for RPC listener

    Return back the message id cache feature to RPC listener, it was
    removed while refactoring in I708c3d6676b974d8daac6817c15f596cdf35817b
    See attached bug for more info.

    We should not raise DuplicateMessageError to avoid rejecting the
    previously ACK'ed message.

    Closes-Bug: #1935883
    Change-Id: Ie237e9e3fdc3fc27b3deb18b94751cdc3afd190e
    (cherry picked from commit 129c223307141be1bc767dfca704b06192458d5b)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.9.2

This issue was fixed in the openstack/oslo.messaging 12.9.2 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.