Thanks for gathering those logs.
This is a bit of a head scratcher.
The commit you identified modified how acknowledgement was handled, and since you're seeing unacked messages on the reply queues, that commit seems suspect. The debug patch I sent you logged how messages were being acked. I expected the log would show that the acks were not getting done - but according to your output oslo.messaging is acking messages:
The server acking a request message:
2018-11-08 09:39:08.651 29550 INFO oslo_messaging.rpc.server [-] Acknowledging RPC Request msg_id=None
2018-11-08 09:39:08.654 29550 INFO oslo_messaging._drivers.amqpdriver [-] Scheduling ACK for msg msg_id=None
2018-11-08 09:39:08.655 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'args': {u'payload': {u'network': {u'id': u'2d0d6769-65d1-40b4-ab47-6694627122d2'}}}, u'version': u'1.0', u'method': u'network_create_end'}
And the client acking the reply:
2018-11-08 09:38:39.652 29550 INFO oslo_messaging._drivers.amqpdriver [-] acknowledging reply msg_id=8c29ac000a10414784334d1ff0838bf3
2018-11-08 09:38:39.652 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'_unique_id': u'2c95234fac1144e58bf5304e34b246b9', u'_msg_id': u'8c29ac000a10414784334d1ff0838bf3', u'result': u'alive', u'failure': None, u'ending': True}
However the following log messages indicate connection failures:
2018-11-08 09:41:09.811 29550 ERROR oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40800: ConnectionForced: Too many heartbeats missed
2018-11-08 09:41:10.988 29550 INFO oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40802.
and
2018-11-08 09:19:51.717 28105 ERROR oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40744: ConnectionForced: Too many heartbeats missed
2018-11-08 09:19:52.909 28105 INFO oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40746.
Both of these issues occur around the time your RPC calls are timing out. That could leave reply messages "stuck" in queues: reply queues are randomly named, and a new one is generated each time the transport is re-established, so a reply sent to the old queue is never consumed.
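To make the failure mode concrete, here's a toy simulation (plain Python, not oslo.messaging code - the broker dict and helper names are made up for illustration). Each (re)connection declares a reply queue with a fresh random name, so a reply produced after a reconnect targets a queue the client no longer consumes:

```python
import uuid

# Toy in-memory broker: queue name -> list of pending messages
broker = {}

def declare_reply_queue():
    """Each (re)connection creates a uniquely named reply queue."""
    name = "reply_" + uuid.uuid4().hex
    broker[name] = []
    return name

# Client connects and issues an RPC call, naming its reply queue
reply_queue = declare_reply_queue()
request = {"method": "ping", "reply_to": reply_queue}

# Heartbeat failure: the client reconnects and gets a *new* reply queue
reply_queue = declare_reply_queue()

# The server finishes the old request and replies to the queue named in it
broker[request["reply_to"]].append({"result": "alive"})

# The reply is stranded: the client only consumes from its new queue
stranded = [q for q, msgs in broker.items() if msgs and q != reply_queue]
print(stranded)  # the old reply queue, holding one unconsumed message
```

This matches the symptom above: unconsumed messages sitting on old reply_* queues after the reconnect.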
As a test I set the heartbeat timeout to 15 seconds and used iptables to drop RPC traffic between the RPC clients/servers and rabbitmq. This naturally resulted in RPC failures and unconsumed messages on queues:
sudo iptables -I INPUT -p tcp --sport 5671 -j DROP
<wait for heartbeat failures>
sudo iptables -D INPUT 1
sudo rabbitmqctl -p test list_queues name messages messages_unacknowledged | grep reply
reply_f4b9a59c84b04857ae8ed992c618eaba 0 0
reply_bef3bbc2619247b0b287d19bfd685de3 0 0
reply_e6151828c288457e9cc01e5e6167e4e3 0 0
reply_d5af26a362104e2f81bbeb3ba06e015a 2 0
reply_42a76b1b2b7544b1bea6252fc150ceba 1 0
reply_8b8f0b4b9b8f4340bb630543d3833fe8 1 0
reply_3ae3dc4a6de047cca2189f43d9233ac2 2 0
reply_6dae46b4f1da42969f42ea38de47101b 0 0
reply_b3527b6109614d279d45a6ab0127aa59 0 0
reply_878d49fd256c4d63aed7146e48f3b078 2 0
reply_c0ca4c34775448c59b57d37b7db65192 2 0
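For reference, the 15-second heartbeat I used for this test maps to the oslo.messaging rabbit driver option below (just a sketch of the relevant stanza; 15 is the test value, not a recommendation):

```ini
[oslo_messaging_rabbit]
# Seconds with no heartbeat activity before the connection is
# considered dead and torn down; 0 disables heartbeats entirely.
heartbeat_timeout_threshold = 15
```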
You said at one point that these issues did not occur when you disabled SSL, is that correct?
I wonder if that is related. Can you enable debug logging for rabbitmq's connection and channel categories and see if anything interesting appears when the heartbeat errors occur?
{log_levels, [{channel, debug}, {connection, debug}, {federation, info}, {mirroring, info}]},
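In case it's useful, that stanza belongs inside the rabbit application section of the classic Erlang-style config file (commonly /etc/rabbitmq/rabbitmq.config, though the path depends on your packaging). A minimal sketch of the whole file:

```erlang
%% Minimal classic rabbitmq.config carrying the log_levels stanza above.
[
  {rabbit, [
    {log_levels, [{channel, debug},
                  {connection, debug},
                  {federation, info},
                  {mirroring, info}]}
  ]}
].
```

Changes to this file take effect after a broker restart.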