rabbit: messages are no longer acked before the application callback

Bug #1695746 reported by Mehdi Abaakouk
This bug affects 2 people
Affects: oslo.messaging
Status: Fix Released
Importance: High
Assigned to: Ken Giusti

Bug Description

Hi,

Since https://review.openstack.org/#/c/436958, messages are no longer acked before the RPC application callback. This introduces many race conditions when a network loss or HA failover occurs, and can lead to message loss or duplication.
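
For context, a minimal sketch of why the ack ordering matters (kombu-style pseudocode, not oslo.messaging code; handle_rpc_request is a hypothetical application callback):

    def on_message_ack_first(body, message):
        # Ack before dispatch: if the worker dies mid-processing the request
        # is lost, but a broker failover can never replay an already-acked
        # message, so the callback does not run twice.
        message.ack()
        handle_rpc_request(body)

    def on_message_ack_last(body, message):
        # Ack after dispatch: if the connection drops before the ack reaches
        # the broker (network loss, HA failover), the broker redelivers the
        # message and handle_rpc_request() runs a second time.
        handle_rpc_request(body)
        message.ack()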

Cheers,

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/470753

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/470759

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/newton)

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/470759

Mehdi Abaakouk (sileht)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/470762

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/ocata)

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/470753

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/470762
Reason: I will restore it when master is merged

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/newton)

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/466787

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/ocata)

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/470762
Reason: Sorry, I missed that one, so I am removing it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/474456

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/newton)

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/470759
Reason: This change has a new Change-Id: https://review.openstack.org/474456

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/474456
Reason: Thanks a lot for catching this. A good reason for me to stop providing backports in advance...

Also, changing all the Change-Ids has created some mess: git-review refuses to update this patch and forces me to update the old one: https://review.openstack.org/466787

So I have restored it and abandoned this one.

Revision history for this message
jazeltq (jazeltq-k) wrote :

I tested Newton 5.10.2, and in the ceilometer agent notification service the log shows:
2017-10-21 03:33:19.779 225636 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.
2017-10-21 03:33:19.779 225636 ERROR root Traceback (most recent call last):
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 250, in wrapper
2017-10-21 03:33:19.779 225636 ERROR root return infunc(*args, **kwargs)
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/base.py", line 203, in _runner
2017-10-21 03:33:19.779 225636 ERROR root batch_size=self.batch_size, batch_timeout=self.batch_timeout)
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/base.py", line 52, in wrapper
2017-10-21 03:33:19.779 225636 ERROR root msg = func(in_self, timeout=timeout_watch.leftover(True))
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 286, in poll
2017-10-21 03:33:19.779 225636 ERROR root self._message_operations_handler.process()
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 89, in process
2017-10-21 03:33:19.779 225636 ERROR root task()
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 251, in acknowledge
2017-10-21 03:33:19.779 225636 ERROR root self._raw_message.ack()
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2017-10-21 03:33:19.779 225636 ERROR root self.channel.basic_ack(self.delivery_tag)
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2017-10-21 03:33:19.779 225636 ERROR root self._send_method((60, 80), args)
2017-10-21 03:33:19.779 225636 ERROR root File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 50, in _send_method
2017-10-21 03:33:19.779 225636 ERROR root raise RecoverableConnectionError('connection already closed')
2017-10-21 03:33:19.779 225636 ERROR root RecoverableConnectionError: connection already closed

Ken Giusti (kgiusti)
Changed in oslo.messaging:
assignee: nobody → Ken Giusti (kgiusti)
Revision history for this message
Ken Giusti (kgiusti) wrote :

Some notes I've made while researching this issue:

1) original issue:
https://review.openstack.org/#/c/436958/
TL;DR - the rabbitmq driver was issuing the message acknowledge call from the wrong thread (the executor worker thread). This patch moves the message ack back to the I/O thread.
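
For illustration, a simplified sketch of that "queue the ack and run it on the polling thread" pattern; the class and method names here are made up, not the actual driver code:

    import queue

    class PendingMessageOperations(object):
        """Illustrative only - not the actual oslo.messaging class names."""

        def __init__(self):
            self._ops = queue.Queue()

        def schedule(self, op):
            # Called from an executor worker thread: never touch the AMQP
            # channel here, just record the operation.
            self._ops.put(op)

        def process(self):
            # Called from the I/O (polling) thread, which owns the channel.
            while not self._ops.empty():
                self._ops.get()()   # e.g. raw_message.ack() or .requeue()

    # worker thread:  pending.schedule(raw_message.ack)
    # I/O thread:     pending.process()   (any ack failure now surfaces here,
    #                                      not in the worker that ran the RPC)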

2) Bug caused by fix to #1:
https://review.openstack.org/#/c/463673/
TL;DR - the message ack is no longer synchronous with the processing of the RPC request. If the ack fails, the RPC server is designed to drop the request in order to avoid processing it twice (a failed ack leaves the request on the queue, so the broker will redeliver it). See:
https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/rpc/server.py#n156
Since the ack happens in a different thread, there is no way for the server to know that it failed and that the message should be dropped (the except: clause is never hit).
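
Roughly, the server logic linked above follows the pattern below (a paraphrase, not the verbatim code; dispatch() and LOG are placeholders):

    def _process_incoming(incoming):
        message = incoming[0]
        try:
            message.acknowledge()
        except Exception:
            # A failed ack means the broker still owns the message and will
            # redeliver it, so drop this copy to avoid running the RPC twice.
            LOG.exception("Can not acknowledge message, skipping processing")
            return
        dispatch(message)   # only reached when the ack succeeded

    # Once acknowledge() merely schedules the ack on another thread it can no
    # longer raise here, so the except clause becomes dead code and the
    # duplicate-avoidance behaviour is silently lost.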

While https://review.openstack.org/#/c/463673 makes the server wait for the ack to run, it does not re-raise any exceptions the ack may have raised, so duplication may still occur (see #1).

3) Bug caused by fix to #2:
https://bugs.launchpad.net/oslo.messaging/+bug/1734788
Making the ACK blocking had a negative effect on the Notification Listener. Unlike the RPC Server, notifications are ACK'ed _after_ they are processed. This is required since the processing returns the ACK/Requeue flag. See:
https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/notify/listener.py#n180
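
For reference, the ack/requeue decision comes from the endpoint's return value via the public oslo.messaging API, which is why the ack cannot be issued before dispatch; store_sample() below is a hypothetical application function:

    import oslo_messaging

    class MyNotificationEndpoint(object):
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            if not store_sample(payload):   # hypothetical application code
                # Ask the listener to leave the message on the queue.
                return oslo_messaging.NotificationResult.REQUEUE
            # HANDLED tells the listener it may ack the message.
            return oslo_messaging.NotificationResult.HANDLED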

Making the ACKs blocking created a problem for the batched notification listener: a batched listener has a longer poll timeout, so each ACK would 'hang' until the timeout was hit, causing unacceptable delays. The fix was to stop blocking on the ACK:

https://review.openstack.org/#/c/523431/

This was already the behavior of the AMQP 1.0 driver.

Current State
-------------

Currently the rabbitmq driver has corrected the multithreading problem: Ack/Requeue are sent on the proper I/O thread.

However, the RPC server still cannot detect the failure of an ACK and drop the request to avoid duplication. This affects both the rabbitmq and AMQP 1.0 drivers.

Since acks/requeues are no longer blocking, the batched notification listeners are working without slowdown.

Now What?
---------

To fix the RPC server we should attempt to ack before dispatching the request, specifically while still on the I/O thread. The ack should be blocking and raise an exception if it fails. Possible issue: a blocking ack will slow down the server's I/O thread - the performance impact will need quantifying.
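
Sketched, that ordering would look something like the following (an illustration of the proposal only, not the eventual patch; connection, executor, dispatch and LOG are placeholders):

    def poll_and_dispatch(connection, executor):
        msg = connection.poll()           # still on the I/O thread
        try:
            msg.acknowledge()             # blocking ack, raises on failure
        except Exception:
            # The broker will redeliver; drop this copy rather than dispatch it.
            LOG.exception("ack failed, dropping request to avoid a duplicate")
            return
        executor.submit(dispatch, msg)    # hand off only after a successful ack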

The AMQP 1.0 driver does not currently support blocking ACK - that would need to be fixed.

In the case of notifications the Ack/Requeue will have to be done on the dispatch thread, since the listener needs the result of processing to decide between them. The ack should be asynchronous so that it does not block batching. This assumes we can ignore ACK/Requeue failures (aside from logging) - which we are already doing.

Opinions? Any better ideas?

Revision history for this message
Mehdi Abaakouk (sileht) wrote :

Are you thinking of something like: https://review.openstack.org/577912

Revision history for this message
Ken Giusti (kgiusti) wrote :

Yes - that's exactly what I had in mind for the rabbitmq case.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/581417

Changed in oslo.messaging:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/581417
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=26b0be585a0b681d011edcb688750770bcdae199
Submitter: Zuul
Branch: master

commit 26b0be585a0b681d011edcb688750770bcdae199
Author: Kenneth Giusti <email address hidden>
Date: Tue Jul 10 12:03:03 2018 -0400

    Issue blocking ACK for RPC requests from the consumer thread

    The patch for https://review.openstack.org/#/c/436958/ fixed a
    threading problem by moving the ack back to the polling
    thread. However the RPC server expects to catch any failures of the
    ACK and abort the request. This patch adds the ACK error handling
    back to the polling thread.

    This patch is based heavily off the original work done by Mehdi
    Abaakouk (sileht).

    Change-Id: I708c3d6676b974d8daac6817c15f596cdf35817b
    Closes-Bug: #1695746

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 9.0.0

This issue was fixed in the openstack/oslo.messaging 9.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/667080

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/667083

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/rocky)

Reviewed: https://review.opendev.org/667080
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=a0648f070315b359fcf0a408eb316d9ac27adcc0
Submitter: Zuul
Branch: stable/rocky

commit a0648f070315b359fcf0a408eb316d9ac27adcc0
Author: Kenneth Giusti <email address hidden>
Date: Tue Jul 10 12:03:03 2018 -0400

    Issue blocking ACK for RPC requests from the consumer thread

    The patch for https://review.openstack.org/#/c/436958/ fixed a
    threading problem by moving the ack back to the polling
    thread. However the RPC server expects to catch any failures of the
    ACK and abort the request. This patch adds the ACK error handling
    back to the polling thread.

    This patch is based heavily off the original work done by Mehdi
    Abaakouk (sileht).

    Change-Id: I708c3d6676b974d8daac6817c15f596cdf35817b
    Closes-Bug: #1695746
    (cherry picked from commit 26b0be585a0b681d011edcb688750770bcdae199)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/queens)

Reviewed: https://review.opendev.org/667083
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=e8c67609897888e0eff8f0e3ed5a374d604ca710
Submitter: Zuul
Branch: stable/queens

commit e8c67609897888e0eff8f0e3ed5a374d604ca710
Author: Kenneth Giusti <email address hidden>
Date: Tue Jul 10 12:03:03 2018 -0400

    Issue blocking ACK for RPC requests from the consumer thread

    The patch for https://review.openstack.org/#/c/436958/ fixed a
    threading problem by moving the ack back to the polling
    thread. However the RPC server expects to catch any failures of the
    ACK and abort the request. This patch adds the ACK error handling
    back to the polling thread.

    This patch is based heavily off the original work done by Mehdi
    Abaakouk (sileht).

    Change-Id: I708c3d6676b974d8daac6817c15f596cdf35817b
    Closes-Bug: #1695746
    (cherry picked from commit 26b0be585a0b681d011edcb688750770bcdae199)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.35.6

This issue was fixed in the openstack/oslo.messaging 5.35.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 8.1.4

This issue was fixed in the openstack/oslo.messaging 8.1.4 release.
