oslo.messaging

Bug #1695746
Comment #13

Ken Giusti (kgiusti) wrote on 2018-06-20:

Some notes I've made while researching this issue:

1) original issue:
https://review.openstack.org/#/c/436958/
TL;DR - the rabbitmq driver was issuing the message acknowledge call from the wrong thread (the executor worker thread). This patch moves the message ack back to the I/O thread.

2) Bug caused by fix to #1:
https://review.openstack.org/#/c/463673/
TL;DR - the message ack is no longer synchronous with the processing of the RPC request. If the ACK fails, the RPC server is designed to drop the RPC request to avoid duplicating it (the request is left on the queue if the ack fails). See:
https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/rpc/server.py#n156
Since the ack happens in a different thread, the server has no way to know that it failed and that the message should be dropped (the except: clause is never hit).

While https://review.openstack.org/#/c/463673 makes the server wait for the ack to run, it does not re-raise any exceptions the ack may have raised, so duplication may still occur (See #1)
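To illustrate the point above, here is a minimal sketch (not the oslo.messaging code) of the difference between merely waiting for an ack that runs on another thread and actually re-raising its failure in the caller. All names here (AckFailed, failing_ack, run_on_io_thread_event, run_on_io_thread_future, handle_request) are hypothetical stand-ins; the only real API used is Python's threading module and concurrent.futures.Future.

```python
import threading
from concurrent.futures import Future


class AckFailed(Exception):
    """Hypothetical error standing in for a failed broker ack."""


def failing_ack():
    # Stand-in for a driver-level ack that fails (e.g. connection lost).
    raise AckFailed("connection lost")


def run_on_io_thread_event(fn):
    """Wait for fn to finish on another thread, but discard its outcome."""
    done = threading.Event()

    def worker():
        try:
            fn()
        except Exception:
            pass  # the failure never leaves the other thread
        finally:
            done.set()

    threading.Thread(target=worker).start()
    done.wait()  # the caller learns that the ack ran, not whether it worked


def run_on_io_thread_future(fn):
    """Wait for fn to finish on another thread and re-raise its exception."""
    future = Future()

    def worker():
        try:
            future.set_result(fn())
        except Exception as exc:
            future.set_exception(exc)

    threading.Thread(target=worker).start()
    return future.result()  # re-raises the stored exception in the caller


def handle_request(run_ack):
    """Mimic the server: drop the request only if the ack raises."""
    try:
        run_ack(failing_ack)
    except AckFailed:
        return "dropped"
    return "dispatched"
```

With the event-based wait the except clause is never hit and the request is dispatched even though the ack failed; with the future-based wait the same failure reaches the caller and the request is dropped.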

3) Bug caused by fix to #2:
https://bugs.launchpad.net/oslo.messaging/+bug/1734788
Making the ACK blocking had a negative effect on the Notification Listener. Unlike RPC requests, notifications are ACK'ed _after_ they are processed. This is required since the processing returns the ACK/Requeue flag. See:
https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/notify/listener.py#n180

The way the ACKs blocked created a problem for the batched notification listener. A batched notification listener has a longer timeout, so each ACK would 'hang' until the timeout was hit, causing unacceptable delays. The fix was to stop blocking on the ACK:

https://review.openstack.org/#/c/523431/

Not blocking on the ACK was already the behavior of the AMQP 1.0 driver.

Current State
-------------

Currently the rabbitmq driver has corrected the multithreading problem: Ack/Requeue are sent on the proper I/O thread.

However, the RPC server still cannot detect the failure of an ACK and drop the request to avoid duplication. This affects both the rabbitmq and AMQP 1.0 drivers.

Since acks/requeues are no longer blocking, the batched notification listeners are working without slowdown.

Now What?
---------

To fix the RPC Server we should attempt the ack before dispatching the request, specifically while still on the I/O thread. The ack should block and raise an exception if it fails. Possible issue: a blocking ack will slow down the server I/O thread - the performance impact will need quantifying.
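A rough sketch of that ordering, assuming a blocking acknowledge() that raises on failure. FakeRequest, AckError and serve are hypothetical names for illustration only, not the driver's API; the loop in serve plays the role of the I/O thread and the thread pool plays the role of the executor.

```python
from concurrent.futures import ThreadPoolExecutor


class AckError(Exception):
    """Hypothetical error raised when a blocking ack fails."""


class FakeRequest:
    """Hypothetical incoming RPC request with a blocking acknowledge()."""

    def __init__(self, payload, ack_ok=True):
        self.payload = payload
        self.ack_ok = ack_ok

    def acknowledge(self):
        # Assumed to block until the broker confirms, and raise on failure.
        if not self.ack_ok:
            raise AckError("ack failed for %r" % (self.payload,))


def serve(requests, handler):
    """Ack each request on the calling (I/O) thread, then dispatch it.

    Returns the payloads that were dropped because their ack failed.
    """
    dropped = []
    with ThreadPoolExecutor(max_workers=2) as executor:
        for request in requests:
            try:
                request.acknowledge()  # blocking, before any dispatch
            except AckError:
                # The request stays on the queue; drop our copy so it
                # is not processed twice.
                dropped.append(request.payload)
                continue
            executor.submit(handler, request.payload)
    return dropped
```

Because the ack runs inline in the loop, every request waits for the previous ack to complete, which is exactly the I/O thread slowdown that would need measuring.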

The AMQP 1.0 driver does not currently support blocking ACK - that would need to be fixed.

In the case of Notifications the Ack/Requeue will have to be done on the dispatched thread as the listener needs the results from the operation. In the case of a listener, the ack should be async in order not to block batching. This assumes we can ignore ACK/Requeue failures (aside from logging) - which we are already doing.
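One way to picture the notification side, as a sketch only: the dispatched thread processes each message, then hands the resulting ack or requeue to an I/O thread without waiting for it, and failures are logged and otherwise ignored. IOThread, FakeNotification, process_batch and the ACK/REQUEUE flag values are all hypothetical, and the hand-off to a separate I/O thread is my assumption about how "async" would be realized.

```python
import logging
import queue
import threading

LOG = logging.getLogger(__name__)

ACK, REQUEUE = "ack", "requeue"  # hypothetical flag values


class IOThread:
    """Hypothetical I/O thread that runs queued callbacks in order."""

    def __init__(self):
        self._tasks = queue.Queue()
        self.failures = 0
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: stop the loop
                break
            try:
                task()
            except Exception:
                # Failures are only logged and counted, never re-raised.
                self.failures += 1
                LOG.warning("ack/requeue failed; ignoring", exc_info=True)

    def schedule(self, task):
        self._tasks.put(task)  # returns immediately, no blocking

    def stop(self):
        self._tasks.put(None)
        self._thread.join()


class FakeNotification:
    """Hypothetical notification message."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.state = "pending"

    def acknowledge(self):
        if self.fail:
            raise RuntimeError("ack failed")
        self.state = "acked"

    def requeue(self):
        if self.fail:
            raise RuntimeError("requeue failed")
        self.state = "requeued"


def process_batch(batch, endpoint, io_thread):
    """Process each message first, then schedule its ack or requeue."""
    for msg in batch:
        flag = endpoint(msg)  # processing decides ACK vs REQUEUE
        action = msg.requeue if flag == REQUEUE else msg.acknowledge
        io_thread.schedule(action)
```

Since schedule() returns immediately, a slow or failing ack never holds up the rest of the batch.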

Opinions? Any better ideas?
