Thanks for gathering those logs.
This is a bit of a head scratcher.
The commit you identified modified how acknowledgement was handled, and since you're seeing unacked messages on the reply queues, that commit seems suspect. The debug patch I sent you logged how messages were being acked. I expected the log would show that the acks were not getting done - but according to your output oslo.messaging is acking messages:
The server acking a request message:
2018-11-08 09:39:08.651 29550 INFO oslo_messaging.rpc.server [-] Acknowledging RPC Request msg_id=None
2018-11-08 09:39:08.654 29550 INFO oslo_messaging._drivers.amqpdriver [-] Scheduling ACK for msg msg_id=None
2018-11-08 09:39:08.655 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'args': {u'payload': {u'network': {u'id': u'2d0d6769-65d1-40b4-ab47-6694627122d2'}}}, u'version': u'1.0', u'method': u'network_create_end'}
And the client acking the reply:
2018-11-08 09:38:39.652 29550 INFO oslo_messaging._drivers.amqpdriver [-] acknowledging reply msg_id=8c29ac000a10414784334d1ff0838bf3
2018-11-08 09:38:39.652 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'_unique_id': u'2c95234fac1144e58bf5304e34b246b9', u'_msg_id': u'8c29ac000a10414784334d1ff0838bf3', u'result': u'alive', u'failure': None, u'ending': True}
However the following log messages indicate connection failures:
2018-11-08 09:41:09.811 29550 ERROR oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40800: ConnectionForced: Too many heartbeats missed
2018-11-08 09:41:10.988 29550 INFO oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40802.
and
2018-11-08 09:19:51.717 28105 ERROR oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40744: ConnectionForced: Too many heartbeats missed
2018-11-08 09:19:52.909 28105 INFO oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40746.
Both of these issues occur around the time your RPC calls are timing out. That could leave reply messages "stuck" in queues: reply queues are randomly named, and a new one is generated each time the transport is re-established, so a reply sent to the old queue is never consumed.
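To make the failure mode concrete, here's a toy simulation (plain Python, not oslo.messaging code - the broker dict and helper names are made up for illustration). Each (re)connection declares a reply queue with a fresh random name, so a reply produced after a reconnect targets a queue the client no longer consumes:

```python
import uuid

# Toy in-memory broker: queue name -> list of pending messages
broker = {}

def declare_reply_queue():
    """Each (re)connection creates a uniquely named reply queue."""
    name = "reply_" + uuid.uuid4().hex
    broker[name] = []
    return name

# Client connects and issues an RPC call, naming its reply queue
reply_queue = declare_reply_queue()
request = {"method": "ping", "reply_to": reply_queue}

# Heartbeat failure: the client reconnects and gets a *new* reply queue
reply_queue = declare_reply_queue()

# The server finishes the old request and replies to the queue named in it
broker[request["reply_to"]].append({"result": "alive"})

# The reply is stranded: the client only consumes from its new queue
stranded = [q for q, msgs in broker.items() if msgs and q != reply_queue]
print(stranded)  # the old reply queue, holding one unconsumed message
```

This matches the symptom above: unconsumed messages sitting on old reply_* queues after the reconnect.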
As a test I set the heartbeat timeout to 15 seconds and used iptables to drop RPC traffic between the RPC clients/servers and rabbitmq. This naturally resulted in RPC failures and unconsumed messages on queues:
sudo iptables -I INPUT -p tcp --sport 5671 -j DROP
<wait for heartbeat failures>
sudo iptables -D INPUT 1
sudo rabbitmqctl -p test list_queues name messages messages_unacknowledged | grep reply
reply_f4b9a59c84b04857ae8ed992c618eaba 0 0
reply_bef3bbc2619247b0b287d19bfd685de3 0 0
reply_e6151828c288457e9cc01e5e6167e4e3 0 0
reply_d5af26a362104e2f81bbeb3ba06e015a 2 0
reply_42a76b1b2b7544b1bea6252fc150ceba 1 0
reply_8b8f0b4b9b8f4340bb630543d3833fe8 1 0
reply_3ae3dc4a6de047cca2189f43d9233ac2 2 0
reply_6dae46b4f1da42969f42ea38de47101b 0 0
reply_b3527b6109614d279d45a6ab0127aa59 0 0
reply_878d49fd256c4d63aed7146e48f3b078 2 0
reply_c0ca4c34775448c59b57d37b7db65192 2 0
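For reference, the 15-second heartbeat I used for this test maps to the oslo.messaging rabbit driver option below (just a sketch of the relevant stanza; 15 is the test value, not a recommendation):

```ini
[oslo_messaging_rabbit]
# Seconds with no heartbeat activity before the connection is
# considered dead and torn down; 0 disables heartbeats entirely.
heartbeat_timeout_threshold = 15
```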
You said at one point that these issues did not occur when you disabled SSL, is that correct?
I wonder if that is related. Can you enable debug logging for rabbitmq's connection and channel categories and see if anything interesting appears when the heartbeat errors occur?
{log_levels, [{channel, debug}, {connection, debug}, {federation, info}, {mirroring, info}]},
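In case it's useful, that stanza belongs inside the rabbit application section of the classic Erlang-style config file (commonly /etc/rabbitmq/rabbitmq.config, though the path depends on your packaging). A minimal sketch of the whole file:

```erlang
%% Minimal classic rabbitmq.config carrying the log_levels stanza above.
[
  {rabbit, [
    {log_levels, [{channel, debug},
                  {connection, debug},
                  {federation, info},
                  {mirroring, info}]}
  ]}
].
```

Changes to this file take effect after a broker restart.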