Bug #1800957 “Upgrading to pike version causes rabbit timeouts w...” : Bugs : oslo.messaging package : Ubuntu

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-10-31:

#1

I should also note that downgrading back to the ocata versions of everything fixes this.

Sam Morrison (sorrison) on 2018-11-01

description:

updated

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-01:

#2

OK I tried running the newer version of python-amqp with the older version of oslo.messaging and it works fine so I don't suspect python-amqp to be an issue.

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-01:

#3

OK I have narrowed this down to olso.messaging itself.

Going from

ocata: 5.17.1
pike: 5.30.0

Breaks it.

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-01:

#4

Can confirm 5.17.4 works and 5.18.0 doesn't work

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-01:

#5

Running commentary....

So I'd say it's one of these:

https://github.com/openstack/oslo.messaging/commit/7e71ac821f4103032afc8eff8007ad48afe49a2c
https://github.com/openstack/oslo.messaging/commit/5bacea1f42f7f5dc822b5f4f5968a3d8d3361b59

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-01:

#6

and then there was one. It's https://github.com/openstack/oslo.messaging/commit/7e71ac821f4103032afc8eff8007ad48afe49a2c

Ken Giusti (kgiusti) on 2018-11-02

Changed in oslo.messaging:
assignee:	nobody → Ken Giusti (kgiusti)

Revision history for this message

Ken Giusti (kgiusti) wrote on 2018-11-05:

#7

Hi Sam,

Are the SSL connections coming up at all? It sounds like they do come up, then drop at a later time. Is that correct?

There were some changes to the SSL configuration option names between ocata and pike, but pike should still support the old option names as well.

thanks

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-05:

#8

The connections to the rabbit server are fine, I see them. The issue is the messages in the reply queue aren't acked.

The connection is fine and things work until the service does something that needs a reply

Revision history for this message

Ken Giusti (kgiusti) wrote on 2018-11-06:

#9

Gotcha.

I'm trying to create a minimal reproducer for this but I'm not experiencing the ack problem on that bad commit.

Would you be willing to try the following test in your environment?

I've got a sample minimal RPC server and client on my github account:

https://github.com/kgiusti/oslo-messaging-clients/tree/bug/1800957

Can you pull that down and take a look at the README - it explains how to configure and run the clients.

If you can reproduce this can you run the clients with '--debug' specified and send me the traces as well as your rabbitmq, nova and neutron config files?

thanks,

-K

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-07:

#10

OK have given this a go

I set requirements to be what is available in pike

oslo.messaging==5.30.0
oslo.config==4.11.0
eventlet==0.18.4

And settings file has:
-----------------------------
[DEFAULT]
transport_url = rabbit://test:test@<hostname>:5671/test

[oslo_messaging_rabbit]
rabbit_ha_queues=True
amqp_durable_queues=True
ssl_version=TLSv1
ssl=True
heartbeat_timeout_threshold=60
-----------------------------

Debug output at: http://paste.openstack.org/show/734310/

RabbitMQ config at http://paste.openstack.org/show/734311/

Revision history for this message

Ken Giusti (kgiusti) wrote on 2018-11-07:

#11

Ugh, so it looks like the test client worked.

It's easy to miss in the logs, but if you look at the root log messages you'll see:

DEBUG:root:RPC return value={u'args': {u'name': u'echo'}, u'method': u'echo', u'context': {u'application': u'rpc-client', u'time': u'Wed Nov 7 09:55:22 2018'}}

and

10 RPC call(s) complete

There is a buggy debug log message being issued:

DEBUG:oslo.messaging._drivers.impl_rabbit:Timed out waiting for RPC response:
DEBUG:oslo.messaging._drivers.impl_rabbit:Timed out waiting for RPC response: Timeout while waiting on RPC response - topic: "<unknown>", RPC method: "<unknown>" info: "<unknown>"

This is (incorrectly) logged each time the I/O poll loop is exited due to the poll timeout, which it does periodically.

See the way the timeout is handled in 5.30.0 https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?h=5.30.0#n1077

So that's not a real failure, and that simple client seems to work.

Would you be willing to try a debugging patch against oslo.messaging 5.30.0?

http://paste.openstack.org/show/734374/

That will issue an INFO log message each time an RPC request or reply is acked. If you can run this with your reproducer and post the logs from both the rpc server side and the rpc client side I'd very much appreciate that.

thanks

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-07:

#12

OK I have added this onto the client side (neutron-dhcp-agent)

http://paste.openstack.org/show/734375/

Will do the same thing with logs from the server side soon too

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-07:

#13

Sorry my above paste was missing some stuff at the start, here is a complete one http://paste.openstack.org/show/734376/

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-07:

#14

Here's another go with same error http://paste.openstack.org/show/734378/

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-07:

#15

Here is a message timeout with logs from server too

http://paste.openstack.org/show/734379/

Let me know if that's what you need, thanks for you help.

Revision history for this message

Ken Giusti (kgiusti) wrote on 2018-11-08:

#16

Download full text (3.8 KiB)

Thanks for gathering those logs.

This is a bit of a head scratcher.

The commit you identified modified how acknowledgement was handled. And since you're seeing unacked messages on the reply queues that commit seems suspect. The debug patch I sent you logged how messages were being acked. I expected the log would show that the acks were not getting done - but according to your output oslo.messaging is acking messages:

The server acking a request message:

2018-11-08 09:39:08.651 29550 INFO oslo_messaging.rpc.server [-] Acknowledging RPC Request msg_id=None
2018-11-08 09:39:08.654 29550 INFO oslo_messaging._drivers.amqpdriver [-] Scheduling ACK for msg msg_id=None
2018-11-08 09:39:08.655 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'args': {u'payload': {u'network': {u'id': u'2d0d6769-65d1-40b4-ab47-6694627122d2'}}}, u'version': u'1.0', u'method': u'network_create_end'}

And the client acking the reply:

2018-11-08 09:38:39.652 29550 INFO oslo_messaging._drivers.amqpdriver [-] acknowledging reply msg_id=8c29ac000a10414784334d1ff0838bf3
2018-11-08 09:38:39.652 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'_unique_id': u'2c95234fac1144e58bf5304e34b246b9', u'_msg_id': u'8c29ac000a10414784334d1ff0838bf3', u'result': u'alive', u'failure': None, u'ending': True}

However the following log messages indicate connection failures:

2018-11-08 09:41:09.811 29550 ERROR oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40800: ConnectionForced: Too many heartbeats missed
2018-11-08 09:41:10.988 29550 INFO oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40802.

and

2018-11-08 09:19:51.717 28105 ERROR oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40744: ConnectionForced: Too many heartbeats missed
2018-11-08 09:19:52.909 28105 INFO oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40746.

Both of these issues occur around the time your RPC calls are timing out. This could result in reply messages getting "stuck" in queues since reply queues are randomly named and regenerated when the transport is established.

As a test I set the heartbeat timeout to 15sec and used iptables to start dropping rpc traffic between the rpc client/servers and rabbitmq. This naturally resulted in rpc failures and unconsumed messages on queues:

sudo iptables -I INPUT -p tcp --sport 5671 --j DROP
<wait for heartbeat failures>
sudo iptables -D INPUT 1
sudo rabbitmqctl -p test list_queues name messages messages_unacknowledged | grep reply
reply_f4b9a59c84b04857ae8ed992c618eaba 0 0
reply_bef3bbc2619247b0b287d19bfd685de3 0 0
reply_e6151828c288457e9cc01e5e6167e4e3 0 0
reply_d5af26a362104e2f...

Thanks for gathering those logs.

This is a bit of a head scratcher.

The commit you identified modified how acknowledgement was handled.  And since you're seeing unacked messages on the reply queues that commit seems suspect.  The debug patch I sent you logged how messages were being acked.  I expected the log would show that the acks were not getting done - but according to your output oslo.messaging is acking messages:

The server acking a request message:

2018-11-08 09:39:08.651 29550 INFO oslo_messaging.rpc.server [-] Acknowledging RPC Request msg_id=None
2018-11-08 09:39:08.654 29550 INFO oslo_messaging._drivers.amqpdriver [-] Scheduling ACK for msg msg_id=None
2018-11-08 09:39:08.655 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'args': {u'payload': {u'network': {u'id': u'2d0d6769-65d1-40b4-ab47-6694627122d2'}}}, u'version': u'1.0', u'method': u'network_create_end'}

And the client acking the reply:

2018-11-08 09:38:39.652 29550 INFO oslo_messaging._drivers.amqpdriver [-] acknowledging reply msg_id=8c29ac000a10414784334d1ff0838bf3
2018-11-08 09:38:39.652 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'_unique_id': u'2c95234fac1144e58bf5304e34b246b9', u'_msg_id': u'8c29ac000a10414784334d1ff0838bf3', u'result': u'alive', u'failure': None, u'ending': True}

However the following log messages indicate connection failures:

2018-11-08 09:41:09.811 29550 ERROR oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40800: ConnectionForced: Too many heartbeats missed
2018-11-08 09:41:10.988 29550 INFO oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40802.

and

2018-11-08 09:19:51.717 28105 ERROR oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40744: ConnectionForced: Too many heartbeats missed
2018-11-08 09:19:52.909 28105 INFO oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40746.

Both of these issues occur around the time your RPC calls are timing out. This could result in reply messages getting "stuck" in queues since reply queues are randomly named and regenerated when the transport is established.

As a test I set the heartbeat timeout to 15sec and used iptables to start dropping rpc traffic between the rpc client/servers and rabbitmq. This naturally resulted in rpc failures and unconsumed messages on queues:

sudo iptables -I INPUT -p tcp --sport 5671 --j DROP
<wait for heartbeat failures>
sudo iptables -D INPUT 1
sudo rabbitmqctl -p test list_queues name messages messages_unacknowledged | grep reply
reply_f4b9a59c84b04857ae8ed992c618eaba	0	0
reply_bef3bbc2619247b0b287d19bfd685de3	0	0
reply_e6151828c288457e9cc01e5e6167e4e3	0	0
reply_d5af26a362104e2f81bbeb3ba06e015a	2	0
reply_42a76b1b2b7544b1bea6252fc150ceba	1	0
reply_8b8f0b4b9b8f4340bb630543d3833fe8	1	0
reply_3ae3dc4a6de047cca2189f43d9233ac2	2	0
reply_6dae46b4f1da42969f42ea38de47101b	0	0
reply_b3527b6109614d279d45a6ab0127aa59	0	0
reply_878d49fd256c4d63aed7146e48f3b078	2	0
reply_c0ca4c34775448c59b57d37b7db65192	2	0

You said at one point that these issues did not occur when you disabled SSL, is that correct?  
I wonder if that is related.  Can you enable debug logging on rabbitmq connection and channel and see if anything interesting appears when the heartbeat errors occur?

{log_levels, [{channel, debug}, {connection, debug}, {federation, info}, {mirroring, info}]},

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-08:

#17

Yes without ssl it works fine.

The patch in question is in 5.18.0 but not 5.17.3

5.17.3 works, 5.18.0 doesn't

If I add the patch to 5.17.3 it breaks, if I remove the patch from 5.18.0 it works so I'm pretty confidant that this is the issue.

We have seen the issue in 3 of our clouds.

1. rabbitmq 3.6.5 (large cloud with ~1000 compute nodes)
2. rabbitmq 3.6.10 (small could with 8 compute nodes)
3. rabbitmq 3.6.10 (small cloud with 6 compute nodes)

Haven't seen it in 1 of our clouds
RabbitMQ 3.6.14 (small cloud with 3 compute nodes)

We upgraded rabbitMQ to 3.6.14 in our large cloud and the rabbit cluster fell over in a giant heap so we had to roll back.

I've upgraded the other small clouds to 3.6.14 and they seem to be ok now so I'm wondering if this is a combination of rabbit version and oslo version.

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-12:

#19

On a possibly related note we see these errors when we upgraded nova to pike (and hence the olso-messaging upgrading too)

2018-11-12 16:00:24.273 1474 ERROR oslo.messaging._drivers.impl_rabbit [-] [b6febed7-8c39-4aaa-a2f3-62f0f1c5ded0] AMQP server on mq35671 is unreachable: <AMQPError: unknown e
rror>. Trying again in 1 seconds. Client port: None: RecoverableConnectionError: <AMQPError: unknown error>
2018-11-12 16:00:24.629 1473 ERROR oslo.messaging._drivers.impl_rabbit [-] [0f300bbe-1eb9-427b-a3b6-bf671d9eab57] AMQP server on mq3:5671 is unreachable: <AMQPError: unknown e
rror>. Trying again in 1 seconds. Client port: None: RecoverableConnectionError: <AMQPError: unknown error>
2018-11-12 16:00:25.341 1474 INFO oslo.messaging._drivers.impl_rabbit [-] [b6febed7-8c39-4aaa-a2f3-62f0f1c5ded0] Reconnected to AMQP server on mq3:5671 via [amqp] client with
port 54216.
2018-11-12 16:00:25.728 1473 INFO oslo.messaging._drivers.impl_rabbit [-] [0f300bbe-1eb9-427b-a3b6-bf671d9eab57] Reconnected to AMQP server on mq3:5671 via [amqp] client with
port 54230.
2018-11-12 16:06:25.342 1474 ERROR oslo.messaging._drivers.impl_rabbit [-] [b6febed7-8c39-4aaa-a2f3-62f0f1c5ded0] AMQP server on mq3:5671 is unreachable: <AMQPError: unknown e
rror>. Trying again in 1 seconds. Client port: 55978: RecoverableConnectionError: <AMQPError: unknown error>
2018-11-12 16:06:26.412 1474 INFO oslo.messaging._drivers.impl_rabbit [-] [b6febed7-8c39-4aaa-a2f3-62f0f1c5ded0] Reconnected to AMQP server on mq3:5671 via [amqp] client with
port 56036.

These fill up the nove-conductor logs and only started happening after the upgrade. It still seems to be working ok though.

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-12:

#20

And in RabbitMQ logs we start to see the following errors after upgrading too:

{handshake_timeout,frame_header}

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-12:

#21

OK I have started again and done some more thorough testing, here's what I've found.

WORKS:
oslo.messaging==5.17.4
amqp==1.4.9
kombu==3.0.37

BROKEN:
oslo.messaging==5.17.4
amqp==2.1.4
kombu==4.0.2+really4.0.2+dfsg-2ubuntu1~cloud0 (ubuntu version with weird version number)

WORKS:
oslo.messaging==5.30.0 (with 5bacea1f42f7f5dc822b5f4f5968a3d8d3361b59 reverted)
amqp==1.4.9
kombu==3.0.37

So I think it's pretty conclusive that oslo.messaging is ok and that the issue lies with amqp or kombu.

I think in my previous testing I wasn't cleaning up my environment completely in between runs and so amqp and kombu were at different versions depending on if I was upgrading or downgrading.

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-13:

#22

For reference we have packaged up oslo.messaging stable/pike branch with one patch to revert the use kombu 4 library.

Code is https://github.com/NeCTAR-RC/oslo.messaging/tree/nectar/pike

Initial tests look good.

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-15:

#23

So with nova we have upgraded our conductors to pike and they now are getting these errors

2018-11-14 12:03:30.896 2249 ERROR oslo.messaging._drivers.impl_rabbit [-] [2003dd6e-6f51-4528-bad7-21c21ee00f42] AMQP server on mq1-np.os.melbourne.rc.nectar.org.au:5671 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds. Client port: None: RecoverableConnectionError: <AMQPError: unknown error>
2018-11-14 12:03:31.955 2249 INFO oslo.messaging._drivers.impl_rabbit [-] [2003dd6e-6f51-4528-bad7-21c21ee00f42] Reconnected to AMQP server on mq1-np.os.melbourne.rc.nectar.org.au:5671 via [amqp] client with port 57736.

So I think the root cause of these issues is the <AMQPError: unknown error> error that we see when using pike clients as opposed to ocata.

This causes MessagingTimeout errors on the compute nodes which are still running ocata. So I suspect messagetime out issues are just a symptom.

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-15:

#24

Should mention that changing conductor to not use ssl to connect to rabbit stopped getting these amqp reconnect errors

Revision history for this message

Gökhan (skylightcoder) wrote on 2018-11-15:

#25

Hi Sam, I have got timeout errors at nova-compute side on my cloud. I used your oslo.messaging patch and changed amqp version to amqp==1.4.9 and kombu version to kombu==3.0.37 versions, it worked well. Now I don't get any timeout errors at nova-compute side. But I am using different library versions at nova-conductor side and it is working well. at nova-conductor side, I am using amqp (2.2.1) , kombu (4.1.0) and oslo.messaging (5.30.2)

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-15:

#26

@skylightcoder are you using SSL and also what version of RabbitMQ are you using?

Good to hear it's not only me having issues!

Revision history for this message

Gökhan (skylightcoder) wrote on 2018-11-16:

#27

Yes I am using SSL and Rabbitmq version is 3.6.9. This my rabbitmqctl status output : http://paste.openstack.org/show/734900/

Revision history for this message

Ken Giusti (kgiusti) wrote on 2018-11-21:

#28

Hi Sam and Gokhan,

Sorry I haven't made any progress on this issue, however I do believe you've identified the problem as kombu/amqp related.

Specifically if you take a look at the _current_ pike release you'll see that oslo.messaging explicitly avoids kombu 4.0.2 and amqp 2.1.4:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/requirements.txt?h=5.30.6#n27

which are the versions of the libs you've found to be broken.

The correct versions of those libs for pike should be amqp==2.2.1 and kombu==4.1.0 (at least that's what the upper constraints for pike have:

https://git.openstack.org/cgit/openstack/requirements/tree/upper-constraints.txt?h=stable/pike#n411
https://git.openstack.org/cgit/openstack/requirements/tree/upper-constraints.txt?h=stable/pike#n194

As far as those error messages: AMQPError is the generic base for exceptions thrown from the amqp library. Unfortunately there's not enough detail in the exception to understand what actually is failing. However where that exception is logged is a connection error handler that causes oslo.messaging to re-connect, which is the best that oslo.messaging can do at that point.

What results do you get when moving to the latest release of Pike (5.30.6) using amqp==2.2.1 and kombu==4.1.0?

Changed in oslo.messaging:
status:	New → Incomplete

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-21:

#29

Hmm ok I'll look at using the newer versions. These are the versions that ship with the Ubuntu Pike Cloud Archive so anyone using this version of openstack would be affected.

Revision history for this message

Ken Giusti (kgiusti) wrote on 2018-11-26:

#30

Let us know if moving to the newer versions solve your SSL problems. If that is the case I'll open a bug on Ubuntu cloud archive referencing this issue.

https://wiki.ubuntu.com/OpenStack/CloudArchive#Reporting_Bugs

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-26:

#31

We upgraded nova to pike last week and we have experienced similar issues but they seem slightly different and the downgrading of amqp and kombu hasn't helped in this case. We had to roll back to nova ocata and have spent the last few day trying to figure out what the problem is. Still unknown at this stage.

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-11-28:

#32

OK I've done some more thorough testing with our nova install in production

Working
oslo.messaging 5.17.1 (with patch[1] to make it work in pike)
amqp 1.4.9
kombu 3.0.33

Not working
oslo.messaging 5.30.6 (with patch[2] to make it work with kombu < 4)
amqp 1.4.9
kombu 3.0.33

Also Not working
oslo.messaging 5.30.6
amqp 2.2.2
kombu 4.1.0

Also not working (ubuntu pike cloud archive packages)
oslo.messaging 5.30.0
amqp 2.1.4
kombu 4.0.2

So to me it looks like an issue in oslo.messaging and makes me start to suspect https://github.com/openstack/oslo.messaging/commit/7e71ac821f4103032afc8eff8007ad48afe49a2c again.

[1] https://github.com/NeCTAR-RC/oslo.messaging/commit/0ae7609453c0070f1b980eaa927955aa88b319b2
[2] https://github.com/NeCTAR-RC/oslo.messaging/commit/0dccfb698d6ed020ba2853611341789d450069cf

Revision history for this message

Sam Morrison (sorrison) wrote on 2018-12-07:

#33

Looks like this could also be an issue with the queens versions too. We upgraded ceilometer from ocata to queens and seeing similar issues with messages not being ack'd. We have downgraded back to ocata for now.

Revision history for this message

Antonio Ojea (aojea) wrote on 2019-01-09:

#34

We have the same issues with nova-compute, some messages get stuck in the queue when using SSL.
We've identified that this problem happen periodically, always in the same queue and with large messages (> 100k characters)

We can observe that suddenly the queue starts to grow

> Every 2.0s: sudo rabbitmqctl list_queues messages consumers name message_bytes messages_unacknowledged > messages_ready head_message_timestamp consumer_utilisation memory state| grep reply >

> 4 1 reply_7271709f3e8b4a51a6e63a647ffd6698 625435 4 0 1.0 36984 > running

then nova-compute starts to log RPC timeout errors like this:

> 2019-01-09 16:56:02.934 2456 ERROR oslo_service.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID e6b115486c244118b1ce4a3438c8fe84

and sometime later the connection is recreated because of heartbeats missed

> 2019-01-09 16:56:22.882 2456 ERROR oslo.messaging._drivers.impl_rabbit [-] [71579db1-c4d8-4e39-9cf1-e547d74fc350] AMQP server on ardana-cp1-c1-m1-mgmt:5671 is unreachable: Too many heartbeats missed. Trying again in 5 seconds. Client port: 50634: ConnectionForced: Too many heartbeats missed

Then we start over the same cycle, messages piling up -> RPC timeout -> heartbeat missed

We tried different combinations, but no luck:

Also Not working
oslo.messaging 5.30.2
amqp 2.1.3
kombu 4.0.1

Also Not working
oslo.messaging 5.30.2
amqp 2.2.1
kombu 4.1.0

Also Not working
oslo.messaging 5.30.6
amqp 2.2.1
kombu 4.1.0

Antonio Ojea (aojea) on 2019-01-10

Changed in oslo.messaging:
status:	Incomplete → Confirmed

Revision history for this message

Antonio Ojea (aojea) wrote on 2019-01-11:

#35

Download full text (3.8 KiB)

@kgiusti I think that I'm able to reproduce the issue only with the tools/simulator.py script that's in the oslo.messaging repository.

These are the steps that I followed

* Install a rabbitmq server with ssl enable

docker run -d --rm --name rabbitssl -p 5672:5672 -p 5671:5671 -p 61613:61613 -p 61614:61614 -p 15672:15672 mehdijrgr/docker-rabbitmq-ssl

* Open a shell on the container to configure a test user ...
$ docker exec -it rabbitssl bash

root@ef4c4bce08a9:/#
root@ef4c4bce08a9:/# rabbitmqctl add_user test test
Creating user "test"
root@ef4c4bce08a9:/# rabbitmqctl set_user_tags test administrator
Setting tags for user "test" to [administrator]
root@ef4c4bce08a9:/# rabbitmqctl set_permissions -p / test ".*" ".*" ".*"
Setting permissions for user "test" in vhost "/"

* ... and start monitoring the queues

root@ef4c4bce08a9:/# watch -c "rabbitmqctl list_queues name messages_unacknowledged"

* Install oslo-messaging out of the container

$ git clone git://git.openstack.org/openstack/oslo.messaging
$ cd oslo.messaging/
$ git checkout stable/pike
$ virtualenv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ python setup.py install

* configure oslo.messaging.conf to use the SSL connection (I had to generate a config file to be able to use SSL)

$ oslo-config-generator --namespace oslo.messaging > oslo.messaging.conf

[DEFAULT]
transport_url = rabbit://test:test@localhost:5671
[oslo_messaging_rabbit]
ssl = True

* Modify the simulator.py to generate a large message with a fixed size

diff --git a/tools/simulator.py b/tools/simulator.py
index f952df0a..8de50ac8 100755
--- a/tools/simulator.py
+++ b/tools/simulator.py
@@ -472,9 +472,8 @@ def generate_messages(messages_count):
     LOG.info("Generating %d random messages", messages_count)
     generator = init_random_generator()
     for i in six.moves.range(messages_count):
- length = generator()
- msg = ''.join(random.choice(
- string.ascii_lowercase) for x in six.moves.range(length))
+ length = 2000000
+ msg = "A"*length
         MESSAGES.append(msg)

LOG.info("Messages has been prepared")

* without SSL it works without errors

$ python tools/simulator.py --url rabbit://localhost:5672 rpc-client -p 1 -w 1 -m 200 --timeout 10
$ python tools/simulator.py --url rabbit://localhost:5672 rpc-server

2019-01-11 13:47:58,728 INFO root =================================== summary ===================================
2019-01-11 13:47:58,728 INFO root client: duration: 223.15 count: 200 (0.9 msg/sec) bytes: 400000000 (1792476 bps)
2019-01-11 13:47:58,729 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-11 13:47:58,729 INFO root round-trip: duration: 222.20 count: 200 (0.9 msg/sec) bytes: 400000000 (1800187 bps) latency: 0.113 min: 0.045 max: 1.492

* start a test using SSL

$ python tools/simulator.py --config-file oslo.messaging.conf rpc-client -p 1 -w 1 -m 200 --timeout 10
$ python tools/simulator.py --config-file oslo.messaging.conf rpc-server

With SSL one of the queues starts to have unacknowledged messages, not always in the same queue (sometimes profiler_topic.profiler_server, others...

Here's a description of my environment and the results of a few test runs:

I am running rabbitmq in the container as described in your previous
post.

I am running the clients on my fedora 28 x86_64 machine:

$ python --version
Python 2.7.15

$ uname -a
Linux localhost.localdomain 4.19.13-200.fc28.x86_64 #1 SMP Sat Dec 29 23:10:35 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ openssl version
OpenSSL 1.1.0i-fips  14 Aug 2018

In my local oslo.messaging repo I've checked out stable/pike and
modified the simulator.py as you've described:

$ git diff
diff --git a/tools/simulator.py b/tools/simulator.py
index f952df0a..871ea33c 100755
--- a/tools/simulator.py
+++ b/tools/simulator.py
@@ -472,10 +472,12 @@ def generate_messages(messages_count):
     LOG.info("Generating %d random messages", messages_count)
     generator = init_random_generator()
     for i in six.moves.range(messages_count):
-        length = generator()
-        msg = ''.join(random.choice(
-                      string.ascii_lowercase) for x in six.moves.range(length))
-        MESSAGES.append(msg)
+        # length = generator()
+        # msg = ''.join(random.choice(
+        #               string.ascii_lowercase) for x in
+        #               six.moves.range(length))
+        length = 2000000
+        MESSAGES.append("A"*length)
 
     LOG.info("Messages has been prepared")

I've created the SSL configuration file:

$ cat test.conf
[DEFAULT]
transport_url = rabbit://test:test@localhost:5671
[oslo_messaging_rabbit]
ssl = True

And installed oslo.messaging + deps in a virtualenv:

$ pip list | grep oslo
oslo.concurrency   3.29.0     
oslo.config        6.7.0      
oslo.context       2.22.0     
oslo.i18n          3.23.0     
oslo.log           3.42.2     
oslo.messaging     5.30.7.dev1
oslo.middleware    3.37.0     
oslo.serialization 2.28.1     
oslo.service       1.34.0     
oslo.utils         3.39.1

I'm running the SSL test clients like this:
$ python simulator.py --config-file test.conf rpc-server &
$ python simulator.py --config-file test.conf rpc-client -p 1 -w 1 -m 200 --timeout 10

Verified all traffic passed via TLS1.2 (wireshark dump)

Output of three runs (no unacked msgs seen):

2019-01-14 11:28:50,453 INFO root =================================== summary ===================================
2019-01-14 11:28:50,453 INFO root client: duration: 254.24 count: 200 (0.8 msg/sec) bytes: 400000000 (1573303 bps)
2019-01-14 11:28:50,453 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-14 11:28:50,453 INFO root round-trip: duration: 254.24 count: 200 (0.8 msg/sec) bytes: 400000000 (1573303 bps) latency: 0.271 min: 0.192 max: 1.335

2019-01-14 11:41:10,798 INFO root =================================== summary ===================================
2019-01-14 11:41:10,799 INFO root client: duration: 259.28 count: 200 (0.8 msg/sec) bytes: 400000000 (1542724 bps)
2019-01-14 11:41:10,799 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-14 11:41:10,800 INFO root round-trip: duration: 258.28 count: 200 (0.8 msg/sec) bytes: 400000000 (1548706 bps) latency: 0.293 min: 0.190 max: 1.374

2019-01-14 11:41:22,700 INFO root =================================== summary ===================================
2019-01-14 11:41:22,701 INFO root server: duration: 258.31 count: 200 (0.8 msg/sec) bytes: 400000000 (1548510 bps) latency: 0.145 min: 0.091 max: 1.250

A third run under heavy load: one message gets dropped (call times out), but no unacked msgs on queues:

2019-01-14 12:06:32,619 INFO root =================================== summary ===================================
2019-01-14 12:06:32,620 INFO root client: duration: 831.16 count: 200 (0.2 msg/sec) bytes: 400000000 (481258 bps)
2019-01-14 12:06:32,620 INFO root error: duration: 1.00 count: 1 (1.0 msg/sec) bytes: 2000000 (2000000 bps)
2019-01-14 12:06:32,620 INFO root round-trip: duration: 830.15 count: 199 (0.2 msg/sec) bytes: 398000000 (479429 bps) latency: 0.293 min: 0.186 max: 2.352

2019-01-14 12:07:26,186 INFO root =================================== summary ===================================
2019-01-14 12:07:26,187 INFO root server: duration: 830.23 count: 199 (0.2 msg/sec) bytes: 398000000 (479387 bps) latency: 0.143 min: 0.090 max: 1.168

I'm running the clients in non-ssl mode like this:
$ python simulator.py --url rabbit://localhost:5672 rpc-server &
$ python simulator.py --url rabbit://localhost:5672 rpc-client -p 1 -w 1 -m 200 --timeout 10

Output of two runs

2019-01-14 11:34:14,968 INFO root =================================== summary ===================================
2019-01-14 11:34:14,968 INFO root client: duration: 217.29 count: 200 (0.9 msg/sec) bytes: 400000000 (1840823 bps)
2019-01-14 11:34:14,968 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-14 11:34:14,969 INFO root round-trip: duration: 216.23 count: 200 (0.9 msg/sec) bytes: 400000000 (1849909 bps) latency: 0.084 min: 0.047 max: 0.810

2019-01-14 11:34:52,388 INFO root =================================== summary ===================================
2019-01-14 11:34:52,389 INFO root server: duration: 216.13 count: 200 (0.9 msg/sec) bytes: 400000000 (1850762 bps) latency: 0.046 min: 0.020 max: 0.773

2019-01-14 11:49:33,486 INFO root =================================== summary ===================================
2019-01-14 11:49:33,486 INFO root client: duration: 218.21 count: 200 (0.9 msg/sec) bytes: 400000000 (1833069 bps)
2019-01-14 11:49:33,486 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-14 11:49:33,486 INFO root round-trip: duration: 217.21 count: 200 (0.9 msg/sec) bytes: 400000000 (1841516 bps) latency: 0.086 min: 0.049 max: 1.173

2019-01-14 11:49:45,468 INFO root =================================== summary ===================================
2019-01-14 11:49:45,468 INFO root server: duration: 217.14 count: 200 (0.9 msg/sec) bytes: 400000000 (1842114 bps) latency: 0.051 min: 0.023 max: 1.133

Here's the rabbitmq queue listing after 4 test runs:

Every 2.0s: rabbitmqctl list_queues name messages_u...  Mon Jan 14 16:49:44 2019

Listing queues
reply_10de6283cce047769e2830c487e809d8  0
reply_82c3983ed54241b88f3941e692bfd9aa  0
reply_ea93a026556b409183eda9273f1111ae  0
reply_ef2e491a12774864af8e470b13c40b01  0
profiler_topic_fanout_cb5d5e4903d643659b05ebc96763520f  0
profiler_topic.profiler_server  0
profiler_topic  0

Revision history for this message

Sam Morrison (sorrison) wrote on 2019-01-16:

#40

I haven't been able to replicate the error following the steps above either.
Only difference is that I used our existing rabbit server as opposed to the docker one.

This however looks exactly like the issues we see in production and so I'm glad at least that someone else is experiencing it too in a way.

Revision history for this message

Antonio Ojea (aojea) wrote on 2019-01-19:

#41

Can you try using a higher number of threads in the simulator client?
i.e. -p 10

I think that this problem only happens under heavy load

Revision history for this message

Sam Morrison (sorrison) wrote on 2019-01-20:

#42

OK yes tried with -p 10 and that did the trick, lots of timeouts now

File "tools/simulator.py", line 596, in _rpc_call
    res = client.call({}, remote_method, message=msg)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
    retry=self.retry)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 128, in _send
    retry=retry)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
    call_monitor_timeout, retry=retry)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
    call_monitor_timeout)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 520, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
    'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID 46f86421b6444b52877b2cac8614e561

And Also errors like:
2019-01-21 09:58:12,303 INFO oslo_messaging._drivers.amqpdriver No calling threads waiting for msg_id : c80c02f601b7409c9ec79f724ce8aab1

Revision history for this message

Gabriele Santomaggio (gsantomaggio) wrote on 2019-01-25:

#43

After a thorough analysis, we ended up on the py-amqp socket read function [1]
and more in detail on the SSL Class implementation [2]
Using a "big" buffer the read function did not work correctly, by debbuging this code [3]:
```
frame_type, channel, size = unpack('>BHI', frame_header)
```
we saw data as:

```
frame_type = 125 channel = 1258 size = 152632741
frame_type = 95 channel = 2587 size = 262885920
```
etc...
So, we had a socket overflow reading. It caused a loop client side and problem as:

````
too many heartbeats missed
MessagingTimeout: Timed out waiting for a reply to message ID
```
Server side:

````
frame_header timeout (most likely)
rabbitmq missed heartbeats from client timeout 60s
```

This patch [4] solves the problem. It avoids continuing read data from the socket.

We tested the patch in our systems and seems to work correctly.

-
Gabriele Santomaggio
Cloud Developer @Suse

1 - https://github.com/celery/py-amqp/blob/2.1/amqp/transport.py#L233
2 - https://github.com/celery/py-amqp/blob/2.1/amqp/transport.py#L316
3 - https://github.com/celery/py-amqp/blob/2.1/amqp/transport.py#L233
4 - https://github.com/celery/py-amqp/pull/247

Revision history for this message

Ken Giusti (kgiusti) wrote on 2019-01-25:

#44

Nice work Gabriele!

Sam & Antonio: does the patch Gabriele provided in the above comment work for you both?

Revision history for this message

Antonio Ojea (aojea) wrote on 2019-01-28:

#45

+1

Revision history for this message

Ken Giusti (kgiusti) wrote on 2019-01-28:

#46

I'm now able to reproduce the problem using the procedure described in comment #35. I did have to bump up the thread count and lower the wait time - I'm running the rpc-client like this:

python tools/simulator.py --config-file ./oslo.conf rpc-client -p 10 -w 0.3 -m 100 --timeout 10

I can reproduce this problem across all 2.X releases of the amqp package.

This is bad since during the pike release we moved to kombu >= 4.0 which doesn't appear to work with amqp <= 2.0.

I had to go way back to oslo.messaging 5.17.0 before I found a release that I could run with amqp==1.4.9 and kombu==3.0.37 and not reproduce the problem.

Revision history for this message

Sam Morrison (sorrison) wrote on 2019-01-30:

#47

Great, this is excellent news. After doing some successful testing we have deployed this to prod and seems to be working well. We can move on with our lives once again :-)

Thank you so much for this as this has saved us a lot of pain.

Revision history for this message

Ken Giusti (kgiusti) wrote on 2019-01-31:

#48

That fix for py-amqp (see comment #43) has landed on master today.
We'll need to bump the minimum version of amqp in oslo.messaging's requirements once the fix is released.

thanks everyone!

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-18: Fix proposed to oslo.messaging (master)

#49

Fix proposed to branch: master
Review: https://review.openstack.org/637576

Changed in oslo.messaging:
assignee:	Ken Giusti (kgiusti) → Magnus Bergman (magnusbe)
status:	Confirmed → In Progress

Revision history for this message

Magnus Bergman (magnusbe) wrote on 2019-02-18:

#50

Saw that amqp 2.4.1 has been released and as it contains the fix i took the liberty to send a patch proposal. Would you mind taking a look, Ken?

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-20: Fix merged to oslo.messaging (master)

#51

Reviewed: https://review.openstack.org/637576
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=64f74cd2cc17a9331f943f25c76dde6b4c604247
Submitter: Zuul
Branch: master

commit 64f74cd2cc17a9331f943f25c76dde6b4c604247
Author: mb <email address hidden>
Date: Mon Feb 18 17:16:58 2019 +0100

Bump amqp requirement version to >=2.4.1

    Bumping the version of amqp in requirements.txt and
    lower-constraints.txt from 2.4.0 to 2.4.1 to mitigate issues with
    MessagingTimeouts when using TLS due to a amqp bug that was fixed in
    2.4.1

Change-Id: I0b094f32dec24c70f95ccd509164a14a71fcfc7d
Closes-Bug: #1800957

Changed in oslo.messaging:
status:	In Progress → Fix Released

Revision history for this message

Ken Giusti (kgiusti) wrote on 2019-02-21:

#52

Unfortunately we can't backport this fix to previous stable branches since it is a change to requirements which is technically a feature release.

I'll add release notes to the stable branches to alert people to this issue and recommend bumping amqp to the correct version.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-21: Related fix proposed to oslo.messaging (stable/rocky)

#53

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/638461

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-22: Related fix proposed to oslo.messaging (stable/queens)

#54

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/638733

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-22: Related fix proposed to oslo.messaging (stable/pike)

#55

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/638735

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-25: Related fix merged to oslo.messaging (stable/pike)

#56

Reviewed: https://review.openstack.org/638735
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=3113175443b975d72699b3bbe6fea488c414a3ef
Submitter: Zuul
Branch: stable/pike

commit 3113175443b975d72699b3bbe6fea488c414a3ef
Author: Kenneth Giusti <email address hidden>
Date: Fri Feb 22 12:50:14 2019 -0500

Add release note for amqp library TLS/SSL error

Change-Id: I90c172d673084ff0cb0a3dd0fae21a5d62d35286
Related-Bug: #1800957

tags:

added: in-stable-pike

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-25: Related fix merged to oslo.messaging (stable/rocky)

#57

Reviewed: https://review.openstack.org/638461
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=693cd964e02002ab28366df08c5b75f396198492
Submitter: Zuul
Branch: stable/rocky

commit 693cd964e02002ab28366df08c5b75f396198492
Author: Kenneth Giusti <email address hidden>
Date: Thu Feb 21 11:44:54 2019 -0500

Add release note for amqp library TLS/SSL error

Change-Id: I0b094f32dec24c70f95ccd509164a14a71fcfc7d
Related-Bug: #1800957

tags:

added: in-stable-rocky

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-26: Fix included in openstack/oslo.messaging 9.5.0

#58

This issue was fixed in the openstack/oslo.messaging 9.5.0 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-04: Related fix merged to oslo.messaging (stable/queens)

#59

Reviewed: https://review.openstack.org/638733
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=389dacdc6bda1cad890c786e2bc6f9594e86a191
Submitter: Zuul
Branch: stable/queens

commit 389dacdc6bda1cad890c786e2bc6f9594e86a191
Author: Kenneth Giusti <email address hidden>
Date: Fri Feb 22 12:42:08 2019 -0500

Add release note for amqp library TLS/SSL error

Change-Id: I18281f406538902b6ae983de25be0fd2452ac0d3
Related-Bug: #1800957

tags:

added: in-stable-queens

Revision history for this message

Launchpad Janitor (janitor) wrote on 2019-10-30:

#60

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in oslo.messaging (Ubuntu):
status:	New → Confirmed

Affects		Status	Importance	Assigned to	Milestone
	oslo.messaging	Fix Released	High	Magnus Bergman
	oslo.messaging (Ubuntu)	Confirmed	Undecided	Unassigned

Ubuntu
oslo.messaging package

Upgrading to pike version causes rabbit timeouts with ssl

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntuoslo.messaging package

Upgrading to pike version causes rabbit timeouts with ssl

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
oslo.messaging package