Upgrading to pike version causes rabbit timeouts with ssl

Bug #1800957 reported by Sam Morrison on 2018-10-31
50
This bug affects 9 people
Affects Status Importance Assigned to Milestone
oslo.messaging
High
Magnus Bergman
oslo.messaging (Ubuntu)
Undecided
Unassigned

Bug Description

We have discovered an issue when upgrading our clouds from ocata to pike.

oslo.messaging versions
ocata: 5.17.1
pike: 5.30.0

python-amqp versions
ocata: 1.4.9
pike: 2.1.4

On upgrading to pike we get several issues with neutron-dhcp-agent and nova-compute.

The error we see is:

2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent [req-79e8c605-055e-4354-b749-7dd7baabf864 - - - - -] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID ae039d1695984addbfaaef032ce4fda3
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/dhcp/agent.py", line 740, in _report_state
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent ctx, self.agent_state, True)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 92, in report_state
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent return method(context, 'report_state', **kwargs)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent retry=self.retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 123, in _send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent timeout=timeout, retry=retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent retry=retry)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent result = self._waiter.wait(msg_id, timeout)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent message = self.waiters.get(msg_id, timeout=timeout)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent 'to message ID %s' % msg_id)
2018-11-01 10:05:48.580 7908 ERROR neutron.agent.dhcp.agent MessagingTimeout: Timed out waiting for a reply to message ID ae039d1695984addbfaaef032ce4fda3

Steps to reproduce are:

Start neutron-dhcp-agent with no networks being hosted on it.
agent reporting is fine, I have manually pdb'd this and triggered the agent report hundreds of times every 1-2 seconds and neutron-server always responds in ~1 second.

Now schedule a network onto the agent
Now the agent sync times out.

I can see the reply queue in rabbit and it starts to fill up with unacked messages and the agent starts to produce the stack trace above consistently.

Removing the network and restarting the agent gets the agent reporting normally again.

Now if I do the same thing except don't use the rabbit ssl port and setting everything works flawlessly.

We also see this behaviour with nova-compute. Something happens and then all messages get stuck in unack and timeouts appear in the log.

I suspect this could be more to do with the python-amqp version but I'm not certain.
We've tried with the SSL in rabbitmq and used versions 3.6.5 and 3.6.10, we've also tried using an F5 LB in front to offload SSL to that but to no avail.

Sam Morrison (sorrison) wrote :

I should also note that downgrading back to the ocata versions of everything fixes this.

Sam Morrison (sorrison) on 2018-11-01
description: updated
Sam Morrison (sorrison) wrote :

OK I tried running the newer version of python-amqp with the older version of oslo.messaging and it works fine so I don't suspect python-amqp to be an issue.

Sam Morrison (sorrison) wrote :

OK I have narrowed this down to olso.messaging itself.

Going from

ocata: 5.17.1
pike: 5.30.0

Breaks it.

Sam Morrison (sorrison) wrote :

Can confirm 5.17.4 works and 5.18.0 doesn't work

Ken Giusti (kgiusti) on 2018-11-02
Changed in oslo.messaging:
assignee: nobody → Ken Giusti (kgiusti)
Ken Giusti (kgiusti) wrote :

Hi Sam,

Are the SSL connections coming up at all? It sounds like they do come up, then drop at a later time. Is that correct?

There were some changes to the SSL configuration option names between ocata and pike, but pike should still support the old option names as well.

thanks

Sam Morrison (sorrison) wrote :

The connections to the rabbit server are fine, I see them. The issue is the messages in the reply queue aren't acked.

The connection is fine and things work until the service does something that needs a reply

Ken Giusti (kgiusti) wrote :

Gotcha.

I'm trying to create a minimal reproducer for this but I'm not experiencing the ack problem on that bad commit.

Would you be willing to try the following test in your environment?

I've got a sample minimal RPC server and client on my github account:

https://github.com/kgiusti/oslo-messaging-clients/tree/bug/1800957

Can you pull that down and take a look at the README - it explains how to configure and run the clients.

If you can reproduce this can you run the clients with '--debug' specified and send me the traces as well as your rabbitmq, nova and neutron config files?

thanks,

-K

Sam Morrison (sorrison) wrote :

OK have given this a go

I set requirements to be what is available in pike

oslo.messaging==5.30.0
oslo.config==4.11.0
eventlet==0.18.4

And settings file has:
-----------------------------
[DEFAULT]
transport_url = rabbit://test:test@<hostname>:5671/test

[oslo_messaging_rabbit]
rabbit_ha_queues=True
amqp_durable_queues=True
ssl_version=TLSv1
ssl=True
heartbeat_timeout_threshold=60
-----------------------------

Debug output at: http://paste.openstack.org/show/734310/

RabbitMQ config at http://paste.openstack.org/show/734311/

Ken Giusti (kgiusti) wrote :

Ugh, so it looks like the test client worked.

It's easy to miss in the logs, but if you look at the root log messages you'll see:

DEBUG:root:RPC return value={u'args': {u'name': u'echo'}, u'method': u'echo', u'context': {u'application': u'rpc-client', u'time': u'Wed Nov 7 09:55:22 2018'}}

and

10 RPC call(s) complete

There is a buggy debug log message being issued:

DEBUG:oslo.messaging._drivers.impl_rabbit:Timed out waiting for RPC response:
DEBUG:oslo.messaging._drivers.impl_rabbit:Timed out waiting for RPC response: Timeout while waiting on RPC response - topic: "<unknown>", RPC method: "<unknown>" info: "<unknown>"

This is (incorrectly) logged each time the I/O poll loop is exited due to the poll timeout, which it does periodically.

See the way the timeout is handled in 5.30.0 https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?h=5.30.0#n1077

So that's not a real failure, and that simple client seems to work.

Would you be willing to try a debugging patch against oslo.messaging 5.30.0?

http://paste.openstack.org/show/734374/

That will issue an INFO log message each time an RPC request or reply is acked. If you can run this with your reproducer and post the logs from both the rpc server side and the rpc client side I'd very much appreciate that.

thanks

Sam Morrison (sorrison) wrote :

OK I have added this onto the client side (neutron-dhcp-agent)

http://paste.openstack.org/show/734375/

Will do the same thing with logs from the server side soon too

Sam Morrison (sorrison) wrote :

Sorry my above paste was missing some stuff at the start, here is a complete one http://paste.openstack.org/show/734376/

Sam Morrison (sorrison) wrote :

Here's another go with same error http://paste.openstack.org/show/734378/

Sam Morrison (sorrison) wrote :

Here is a message timeout with logs from server too

http://paste.openstack.org/show/734379/

Let me know if that's what you need, thanks for you help.

Ken Giusti (kgiusti) wrote :
Download full text (3.8 KiB)

Thanks for gathering those logs.

This is a bit of a head scratcher.

The commit you identified modified how acknowledgement was handled. And since you're seeing unacked messages on the reply queues that commit seems suspect. The debug patch I sent you logged how messages were being acked. I expected the log would show that the acks were not getting done - but according to your output oslo.messaging is acking messages:

The server acking a request message:

2018-11-08 09:39:08.651 29550 INFO oslo_messaging.rpc.server [-] Acknowledging RPC Request msg_id=None
2018-11-08 09:39:08.654 29550 INFO oslo_messaging._drivers.amqpdriver [-] Scheduling ACK for msg msg_id=None
2018-11-08 09:39:08.655 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'args': {u'payload': {u'network': {u'id': u'2d0d6769-65d1-40b4-ab47-6694627122d2'}}}, u'version': u'1.0', u'method': u'network_create_end'}

And the client acking the reply:

2018-11-08 09:38:39.652 29550 INFO oslo_messaging._drivers.amqpdriver [-] acknowledging reply msg_id=8c29ac000a10414784334d1ff0838bf3
2018-11-08 09:38:39.652 29550 INFO oslo.messaging._drivers.impl_rabbit [-] RabbitMessage.acknowledge: message {u'_unique_id': u'2c95234fac1144e58bf5304e34b246b9', u'_msg_id': u'8c29ac000a10414784334d1ff0838bf3', u'result': u'alive', u'failure': None, u'ending': True}

However the following log messages indicate connection failures:

2018-11-08 09:41:09.811 29550 ERROR oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40800: ConnectionForced: Too many heartbeats missed
2018-11-08 09:41:10.988 29550 INFO oslo.messaging._drivers.impl_rabbit [-] [38accff0-f364-43b1-a8ea-8411db85803c] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40802.

and

2018-11-08 09:19:51.717 28105 ERROR oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] AMQP server on 115.146.81.133:5671 is unreachable: Too many heartbeats missed. Trying again in 1 seconds. Client port: 40744: ConnectionForced: Too many heartbeats missed
2018-11-08 09:19:52.909 28105 INFO oslo.messaging._drivers.impl_rabbit [-] [1b12584f-4d39-42e5-9ae2-5d4ae71cd2e6] Reconnected to AMQP server on 115.146.81.133:5671 via [amqp] client with port 40746.

Both of these issues occur around the time your RPC calls are timing out. This could result in reply messages getting "stuck" in queues since reply queues are randomly named and regenerated when the transport is established.

As a test I set the heartbeat timeout to 15sec and used iptables to start dropping rpc traffic between the rpc client/servers and rabbitmq. This naturally resulted in rpc failures and unconsumed messages on queues:

sudo iptables -I INPUT -p tcp --sport 5671 --j DROP
<wait for heartbeat failures>
sudo iptables -D INPUT 1
sudo rabbitmqctl -p test list_queues name messages messages_unacknowledged | grep reply
reply_f4b9a59c84b04857ae8ed992c618eaba 0 0
reply_bef3bbc2619247b0b287d19bfd685de3 0 0
reply_e6151828c288457e9cc01e5e6167e4e3 0 0
reply_d5af26a362104e2f...

Read more...

Sam Morrison (sorrison) wrote :

Yes without ssl it works fine.

The patch in question is in 5.18.0 but not 5.17.3

5.17.3 works, 5.18.0 doesn't

If I add the patch to 5.17.3 it breaks, if I remove the patch from 5.18.0 it works so I'm pretty confidant that this is the issue.

We have seen the issue in 3 of our clouds.

1. rabbitmq 3.6.5 (large cloud with ~1000 compute nodes)
2. rabbitmq 3.6.10 (small could with 8 compute nodes)
3. rabbitmq 3.6.10 (small cloud with 6 compute nodes)

Haven't seen it in 1 of our clouds
RabbitMQ 3.6.14 (small cloud with 3 compute nodes)

We upgraded rabbitMQ to 3.6.14 in our large cloud and the rabbit cluster fell over in a giant heap so we had to roll back.

I've upgraded the other small clouds to 3.6.14 and they seem to be ok now so I'm wondering if this is a combination of rabbit version and oslo version.

Sam Morrison (sorrison) wrote :

On a possibly related note we see these errors when we upgraded nova to pike (and hence the olso-messaging upgrading too)

2018-11-12 16:00:24.273 1474 ERROR oslo.messaging._drivers.impl_rabbit [-] [b6febed7-8c39-4aaa-a2f3-62f0f1c5ded0] AMQP server on mq35671 is unreachable: <AMQPError: unknown e
rror>. Trying again in 1 seconds. Client port: None: RecoverableConnectionError: <AMQPError: unknown error>
2018-11-12 16:00:24.629 1473 ERROR oslo.messaging._drivers.impl_rabbit [-] [0f300bbe-1eb9-427b-a3b6-bf671d9eab57] AMQP server on mq3:5671 is unreachable: <AMQPError: unknown e
rror>. Trying again in 1 seconds. Client port: None: RecoverableConnectionError: <AMQPError: unknown error>
2018-11-12 16:00:25.341 1474 INFO oslo.messaging._drivers.impl_rabbit [-] [b6febed7-8c39-4aaa-a2f3-62f0f1c5ded0] Reconnected to AMQP server on mq3:5671 via [amqp] client with
port 54216.
2018-11-12 16:00:25.728 1473 INFO oslo.messaging._drivers.impl_rabbit [-] [0f300bbe-1eb9-427b-a3b6-bf671d9eab57] Reconnected to AMQP server on mq3:5671 via [amqp] client with
port 54230.
2018-11-12 16:06:25.342 1474 ERROR oslo.messaging._drivers.impl_rabbit [-] [b6febed7-8c39-4aaa-a2f3-62f0f1c5ded0] AMQP server on mq3:5671 is unreachable: <AMQPError: unknown e
rror>. Trying again in 1 seconds. Client port: 55978: RecoverableConnectionError: <AMQPError: unknown error>
2018-11-12 16:06:26.412 1474 INFO oslo.messaging._drivers.impl_rabbit [-] [b6febed7-8c39-4aaa-a2f3-62f0f1c5ded0] Reconnected to AMQP server on mq3:5671 via [amqp] client with
port 56036.

These fill up the nove-conductor logs and only started happening after the upgrade. It still seems to be working ok though.

Sam Morrison (sorrison) wrote :

And in RabbitMQ logs we start to see the following errors after upgrading too:

{handshake_timeout,frame_header}

Sam Morrison (sorrison) wrote :

OK I have started again and done some more thorough testing, here's what I've found.

WORKS:
oslo.messaging==5.17.4
amqp==1.4.9
kombu==3.0.37

BROKEN:
oslo.messaging==5.17.4
amqp==2.1.4
kombu==4.0.2+really4.0.2+dfsg-2ubuntu1~cloud0 (ubuntu version with weird version number)

WORKS:
oslo.messaging==5.30.0 (with 5bacea1f42f7f5dc822b5f4f5968a3d8d3361b59 reverted)
amqp==1.4.9
kombu==3.0.37

So I think it's pretty conclusive that oslo.messaging is ok and that the issue lies with amqp or kombu.

I think in my previous testing I wasn't cleaning up my environment completely in between runs and so amqp and kombu were at different versions depending on if I was upgrading or downgrading.

Sam Morrison (sorrison) wrote :

For reference we have packaged up oslo.messaging stable/pike branch with one patch to revert the use kombu 4 library.

Code is https://github.com/NeCTAR-RC/oslo.messaging/tree/nectar/pike

Initial tests look good.

Sam Morrison (sorrison) wrote :

So with nova we have upgraded our conductors to pike and they now are getting these errors

2018-11-14 12:03:30.896 2249 ERROR oslo.messaging._drivers.impl_rabbit [-] [2003dd6e-6f51-4528-bad7-21c21ee00f42] AMQP server on mq1-np.os.melbourne.rc.nectar.org.au:5671 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds. Client port: None: RecoverableConnectionError: <AMQPError: unknown error>
2018-11-14 12:03:31.955 2249 INFO oslo.messaging._drivers.impl_rabbit [-] [2003dd6e-6f51-4528-bad7-21c21ee00f42] Reconnected to AMQP server on mq1-np.os.melbourne.rc.nectar.org.au:5671 via [amqp] client with port 57736.

So I think the root cause of these issues is the <AMQPError: unknown error> error that we see when using pike clients as opposed to ocata.

This causes MessagingTimeout errors on the compute nodes which are still running ocata. So I suspect messagetime out issues are just a symptom.

Sam Morrison (sorrison) wrote :

Should mention that changing conductor to not use ssl to connect to rabbit stopped getting these amqp reconnect errors

Gökhan (skylightcoder) wrote :

Hi Sam, I have got timeout errors at nova-compute side on my cloud. I used your oslo.messaging patch and changed amqp version to amqp==1.4.9 and kombu version to kombu==3.0.37 versions, it worked well. Now I don't get any timeout errors at nova-compute side. But I am using different library versions at nova-conductor side and it is working well. at nova-conductor side, I am using amqp (2.2.1) , kombu (4.1.0) and oslo.messaging (5.30.2)

Sam Morrison (sorrison) wrote :

@skylightcoder are you using SSL and also what version of RabbitMQ are you using?

Good to hear it's not only me having issues!

Gökhan (skylightcoder) wrote :

Yes I am using SSL and Rabbitmq version is 3.6.9. This my rabbitmqctl status output : http://paste.openstack.org/show/734900/

Ken Giusti (kgiusti) wrote :

Hi Sam and Gokhan,

Sorry I haven't made any progress on this issue, however I do believe you've identified the problem as kombu/amqp related.

Specifically if you take a look at the _current_ pike release you'll see that oslo.messaging explicitly avoids kombu 4.0.2 and amqp 2.1.4:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/requirements.txt?h=5.30.6#n27

which are the versions of the libs you've found to be broken.

The correct versions of those libs for pike should be amqp==2.2.1 and kombu==4.1.0 (at least that's what the upper constraints for pike have:

https://git.openstack.org/cgit/openstack/requirements/tree/upper-constraints.txt?h=stable/pike#n411
https://git.openstack.org/cgit/openstack/requirements/tree/upper-constraints.txt?h=stable/pike#n194

As far as those error messages: AMQPError is the generic base for exceptions thrown from the amqp library. Unfortunately there's not enough detail in the exception to understand what actually is failing. However where that exception is logged is a connection error handler that causes oslo.messaging to re-connect, which is the best that oslo.messaging can do at that point.

What results do you get when moving to the latest release of Pike (5.30.6) using amqp==2.2.1 and kombu==4.1.0?

Changed in oslo.messaging:
status: New → Incomplete
Sam Morrison (sorrison) wrote :

Hmm ok I'll look at using the newer versions. These are the versions that ship with the Ubuntu Pike Cloud Archive so anyone using this version of openstack would be affected.

Ken Giusti (kgiusti) wrote :

Let us know if moving to the newer versions solve your SSL problems. If that is the case I'll open a bug on Ubuntu cloud archive referencing this issue.

https://wiki.ubuntu.com/OpenStack/CloudArchive#Reporting_Bugs

Sam Morrison (sorrison) wrote :

We upgraded nova to pike last week and we have experienced similar issues but they seem slightly different and the downgrading of amqp and kombu hasn't helped in this case. We had to roll back to nova ocata and have spent the last few day trying to figure out what the problem is. Still unknown at this stage.

Sam Morrison (sorrison) wrote :

OK I've done some more thorough testing with our nova install in production

Working
oslo.messaging 5.17.1 (with patch[1] to make it work in pike)
amqp 1.4.9
kombu 3.0.33

Not working
oslo.messaging 5.30.6 (with patch[2] to make it work with kombu < 4)
amqp 1.4.9
kombu 3.0.33

Also Not working
oslo.messaging 5.30.6
amqp 2.2.2
kombu 4.1.0

Also not working (ubuntu pike cloud archive packages)
oslo.messaging 5.30.0
amqp 2.1.4
kombu 4.0.2

So to me it looks like an issue in oslo.messaging and makes me start to suspect https://github.com/openstack/oslo.messaging/commit/7e71ac821f4103032afc8eff8007ad48afe49a2c again.

[1] https://github.com/NeCTAR-RC/oslo.messaging/commit/0ae7609453c0070f1b980eaa927955aa88b319b2
[2] https://github.com/NeCTAR-RC/oslo.messaging/commit/0dccfb698d6ed020ba2853611341789d450069cf

Sam Morrison (sorrison) wrote :

Looks like this could also be an issue with the queens versions too. We upgraded ceilometer from ocata to queens and seeing similar issues with messages not being ack'd. We have downgraded back to ocata for now.

Antonio Ojea (itsuugo) wrote :

We have the same issues with nova-compute, some messages get stuck in the queue when using SSL.
We've identified that this problem happen periodically, always in the same queue and with large messages (> 100k characters)

We can observe that suddenly the queue starts to grow

> Every 2.0s: sudo rabbitmqctl list_queues messages consumers name message_bytes messages_unacknowledged > messages_ready head_message_timestamp consumer_utilisation memory state| grep reply >

> 4 1 reply_7271709f3e8b4a51a6e63a647ffd6698 625435 4 0 1.0 36984 > running

then nova-compute starts to log RPC timeout errors like this:

> 2019-01-09 16:56:02.934 2456 ERROR oslo_service.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID e6b115486c244118b1ce4a3438c8fe84

and sometime later the connection is recreated because of heartbeats missed

> 2019-01-09 16:56:22.882 2456 ERROR oslo.messaging._drivers.impl_rabbit [-] [71579db1-c4d8-4e39-9cf1-e547d74fc350] AMQP server on ardana-cp1-c1-m1-mgmt:5671 is unreachable: Too many heartbeats missed. Trying again in 5 seconds. Client port: 50634: ConnectionForced: Too many heartbeats missed

Then we start over the same cycle, messages piling up -> RPC timeout -> heartbeat missed

We tried different combinations, but no luck:

Also Not working
oslo.messaging 5.30.2
amqp 2.1.3
kombu 4.0.1

Also Not working
oslo.messaging 5.30.2
amqp 2.2.1
kombu 4.1.0

Also Not working
oslo.messaging 5.30.6
amqp 2.2.1
kombu 4.1.0

Antonio Ojea (itsuugo) on 2019-01-10
Changed in oslo.messaging:
status: Incomplete → Confirmed
Antonio Ojea (itsuugo) wrote :
Download full text (3.8 KiB)

@kgiusti I think that I'm able to reproduce the issue only with the tools/simulator.py script that's in the oslo.messaging repository.

These are the steps that I followed

* Install a rabbitmq server with ssl enable

docker run -d --rm --name rabbitssl -p 5672:5672 -p 5671:5671 -p 61613:61613 -p 61614:61614 -p 15672:15672 mehdijrgr/docker-rabbitmq-ssl

* Open a shell on the container to configure a test user ...
$ docker exec -it rabbitssl bash

root@ef4c4bce08a9:/#
root@ef4c4bce08a9:/# rabbitmqctl add_user test test
Creating user "test"
root@ef4c4bce08a9:/# rabbitmqctl set_user_tags test administrator
Setting tags for user "test" to [administrator]
root@ef4c4bce08a9:/# rabbitmqctl set_permissions -p / test ".*" ".*" ".*"
Setting permissions for user "test" in vhost "/"

* ... and start monitoring the queues

root@ef4c4bce08a9:/# watch -c "rabbitmqctl list_queues name messages_unacknowledged"

* Install oslo-messaging out of the container

$ git clone git://git.openstack.org/openstack/oslo.messaging
$ cd oslo.messaging/
$ git checkout stable/pike
$ virtualenv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ python setup.py install

* configure oslo.messaging.conf to use the SSL connection (I had to generate a config file to be able to use SSL)

$ oslo-config-generator --namespace oslo.messaging > oslo.messaging.conf

[DEFAULT]
transport_url = rabbit://test:test@localhost:5671
[oslo_messaging_rabbit]
ssl = True

* Modify the simulator.py to generate a large message with a fixed size

diff --git a/tools/simulator.py b/tools/simulator.py
index f952df0a..8de50ac8 100755
--- a/tools/simulator.py
+++ b/tools/simulator.py
@@ -472,9 +472,8 @@ def generate_messages(messages_count):
     LOG.info("Generating %d random messages", messages_count)
     generator = init_random_generator()
     for i in six.moves.range(messages_count):
- length = generator()
- msg = ''.join(random.choice(
- string.ascii_lowercase) for x in six.moves.range(length))
+ length = 2000000
+ msg = "A"*length
         MESSAGES.append(msg)

     LOG.info("Messages has been prepared")

* without SSL it works without errors

$ python tools/simulator.py --url rabbit://localhost:5672 rpc-client -p 1 -w 1 -m 200 --timeout 10
$ python tools/simulator.py --url rabbit://localhost:5672 rpc-server

2019-01-11 13:47:58,728 INFO root =================================== summary ===================================
2019-01-11 13:47:58,728 INFO root client: duration: 223.15 count: 200 (0.9 msg/sec) bytes: 400000000 (1792476 bps)
2019-01-11 13:47:58,729 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-11 13:47:58,729 INFO root round-trip: duration: 222.20 count: 200 (0.9 msg/sec) bytes: 400000000 (1800187 bps) latency: 0.113 min: 0.045 max: 1.492

* start a test using SSL

$ python tools/simulator.py --config-file oslo.messaging.conf rpc-client -p 1 -w 1 -m 200 --timeout 10
$ python tools/simulator.py --config-file oslo.messaging.conf rpc-server

With SSL one of the queues starts to have unacknowledged messages, not always in the same queue (sometimes profiler_topic.profiler_server, others...

Read more...

Ken Giusti (kgiusti) wrote :

Thanks for doing this @itsuugo I appreciate it.

I've followed the steps above and still don't hit the unack problem.

I'm beginning to suspect it may be related to the environment.

Since we both used the same container-based rabbitmq server I think it's safe to say the problem isn't on the rabbitmq side.

My environment: I'm using fedora 28 x86_64 (+ all current patches).

I'm running this test under python 2.7.15.

I've got 1.1.0i version of the OpenSSL libraries installed.

I've attached the output of pip list (run in the same environment as the simulator for the test).

What's your environment like?

Antonio Ojea (itsuugo) wrote :

@kgiusti I repeat the test in a fedora 28 x86_64 VM and a hit the same issue, after some time one of the queues starts to have unack messages.

Can you share the output summary of your simulator.py tests with and without SSL, with a msg length = 2000000 and the same parameters I used "rpc-client -p 1 -w 1 -m 200 --timeout 10"?

2019-01-11 13:47:58,728 INFO root =================================== summary ===================================
2019-01-11 13:47:58,728 INFO root client: duration: 223.15 count: 200 (0.9 msg/sec) bytes: 400000000 (1792476 bps)
2019-01-11 13:47:58,729 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-11 13:47:58,729 INFO root round-trip: duration: 222.20 count: 200 (0.9 msg/sec) bytes: 400000000 (1800187 bps) latency: 0.113 min: 0.045 max: 1.492

Ben Nemec (bnemec) on 2019-01-14
Changed in oslo.messaging:
importance: Undecided → High
Ben Nemec (bnemec) wrote :

Unfortunately I wasn't able to reproduce the problem either. With SSL, I get the following summary:

2019-01-14 16:46:44,657 INFO root =================================== summary ===================================
2019-01-14 16:46:44,657 INFO root client: duration: 246.45 count: 200 (0.8 msg/sec) bytes: 400000000 (1623078 bps)
2019-01-14 16:46:44,658 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-14 16:46:44,658 INFO root round-trip: duration: 246.43 count: 200 (0.8 msg/sec) bytes: 400000000 (1623168 bps) latency: 0.233 min: 0.193 max: 1.026

Non-ssl looks like this:

2019-01-14 16:53:15,950 INFO root =================================== summary ===================================
2019-01-14 16:53:15,950 INFO root client: duration: 221.25 count: 200 (0.9 msg/sec) bytes: 400000000 (1807903 bps)
2019-01-14 16:53:15,950 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-14 16:53:15,951 INFO root round-trip: duration: 220.25 count: 200 (0.9 msg/sec) bytes: 400000000 (1816125 bps) latency: 0.104 min: 0.067 max: 0.738

This is also a fedora 28 vm full updated, so not sure what would be different. :-/

Ken Giusti (kgiusti) wrote :
Download full text (6.3 KiB)

Here's a description of my environment and the results of a few test runs:

I am running rabbitmq in the container as described in your previous
post.

I am running the clients on my fedora 28 x86_64 machine:

$ python --version
Python 2.7.15

$ uname -a
Linux localhost.localdomain 4.19.13-200.fc28.x86_64 #1 SMP Sat Dec 29 23:10:35 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ openssl version
OpenSSL 1.1.0i-fips 14 Aug 2018

In my local oslo.messaging repo I've checked out stable/pike and
modified the simulator.py as you've described:

$ git diff
diff --git a/tools/simulator.py b/tools/simulator.py
index f952df0a..871ea33c 100755
--- a/tools/simulator.py
+++ b/tools/simulator.py
@@ -472,10 +472,12 @@ def generate_messages(messages_count):
     LOG.info("Generating %d random messages", messages_count)
     generator = init_random_generator()
     for i in six.moves.range(messages_count):
- length = generator()
- msg = ''.join(random.choice(
- string.ascii_lowercase) for x in six.moves.range(length))
- MESSAGES.append(msg)
+ # length = generator()
+ # msg = ''.join(random.choice(
+ # string.ascii_lowercase) for x in
+ # six.moves.range(length))
+ length = 2000000
+ MESSAGES.append("A"*length)

     LOG.info("Messages has been prepared")

I've created the SSL configuration file:

$ cat test.conf
[DEFAULT]
transport_url = rabbit://test:test@localhost:5671
[oslo_messaging_rabbit]
ssl = True

And installed oslo.messaging + deps in a virtualenv:

$ pip list | grep oslo
oslo.concurrency 3.29.0
oslo.config 6.7.0
oslo.context 2.22.0
oslo.i18n 3.23.0
oslo.log 3.42.2
oslo.messaging 5.30.7.dev1
oslo.middleware 3.37.0
oslo.serialization 2.28.1
oslo.service 1.34.0
oslo.utils 3.39.1

I'm running the SSL test clients like this:
$ python simulator.py --config-file test.conf rpc-server &
$ python simulator.py --config-file test.conf rpc-client -p 1 -w 1 -m 200 --timeout 10

Verified all traffic passed via TLS1.2 (wireshark dump)

Output of three runs (no unacked msgs seen):

2019-01-14 11:28:50,453 INFO root =================================== summary ===================================
2019-01-14 11:28:50,453 INFO root client: duration: 254.24 count: 200 (0.8 msg/sec) bytes: 400000000 (1573303 bps)
2019-01-14 11:28:50,453 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-14 11:28:50,453 INFO root round-trip: duration: 254.24 count: 200 (0.8 msg/sec) bytes: 400000000 (1573303 bps) latency: 0.271 min: 0.192 max: 1.335

2019-01-14 11:41:10,798 INFO root =================================== summary ===================================
2019-01-14 11:41:10,799 INFO root client: duration: 259.28 count: 200 (0.8 msg/sec) bytes: 400000000 (1542724 bps)
2019-01-14 11:41:10,799 INFO root error: duration: 0.00 count: 0 (0.0 msg/sec) bytes: 0 (0 bps)
2019-01-14 11:41:10,800 INFO root round-trip: duration: 258.28 count: 200 (0.8 msg/sec) bytes: 400000000 (1548706 bps) latency: 0.293 min: 0.190 max: 1.374

2019-01-14 11:41:22,700 INFO ...

Read more...

Sam Morrison (sorrison) wrote :

I haven't been able to replicate the error following the steps above either.
Only difference is that I used our existing rabbit server as opposed to the docker one.

This however looks exactly like the issues we see in production and so I'm glad at least that someone else is experiencing it too in a way.

Antonio Ojea (itsuugo) wrote :

Can you try using a higher number of threads in the simulator client?
i.e. -p 10

I think that this problem only happens under heavy load

Sam Morrison (sorrison) wrote :

OK yes tried with -p 10 and that did the trick, lots of timeouts now

 File "tools/simulator.py", line 596, in _rpc_call
    res = client.call({}, remote_method, message=msg)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
    retry=self.retry)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 128, in _send
    retry=retry)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
    call_monitor_timeout, retry=retry)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
    call_monitor_timeout)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 520, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/root/oslo.messaging/.venv/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
    'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID 46f86421b6444b52877b2cac8614e561

And Also errors like:
2019-01-21 09:58:12,303 INFO oslo_messaging._drivers.amqpdriver No calling threads waiting for msg_id : c80c02f601b7409c9ec79f724ce8aab1

After a thorough analysis, we ended up on the py-amqp socket read function [1]
and more in detail on the SSL Class implementation [2]
Using a "big" buffer the read function did not work correctly, by debbuging this code [3]:
```
frame_type, channel, size = unpack('>BHI', frame_header)
```
we saw data as:

```
frame_type = 125 channel = 1258 size = 152632741
frame_type = 95 channel = 2587 size = 262885920
```
etc...
So, we had a socket overflow reading. It caused a loop client side and problem as:

````
too many heartbeats missed
MessagingTimeout: Timed out waiting for a reply to message ID
```
Server side:

````
frame_header timeout (most likely)
rabbitmq missed heartbeats from client timeout 60s
```

This patch [4] solves the problem. It avoids continuing read data from the socket.

We tested the patch in our systems and seems to work correctly.

-
Gabriele Santomaggio
Cloud Developer @Suse

1 - https://github.com/celery/py-amqp/blob/2.1/amqp/transport.py#L233
2 - https://github.com/celery/py-amqp/blob/2.1/amqp/transport.py#L316
3 - https://github.com/celery/py-amqp/blob/2.1/amqp/transport.py#L233
4 - https://github.com/celery/py-amqp/pull/247

Ken Giusti (kgiusti) wrote :

Nice work Gabriele!

Sam & Antonio: does the patch Gabriele provided in the above comment work for you both?

Antonio Ojea (itsuugo) wrote :

+1

Ken Giusti (kgiusti) wrote :

I'm now able to reproduce the problem using the procedure described in comment #35. I did have to bump up the thread count and lower the wait time - I'm running the rpc-client like this:

python tools/simulator.py --config-file ./oslo.conf rpc-client -p 10 -w 0.3 -m 100 --timeout 10

I can reproduce this problem across all 2.X releases of the amqp package.

This is bad since during the pike release we moved to kombu >= 4.0 which doesn't appear to work with amqp <= 2.0.

I had to go way back to oslo.messaging 5.17.0 before I found a release that I could run with amqp==1.4.9 and kombu==3.0.37 and not reproduce the problem.

Sam Morrison (sorrison) wrote :

Great, this is excellent news. After doing some successful testing we have deployed this to prod and seems to be working well. We can move on with our lives once again :-)

Thank you so much for this as this has saved us a lot of pain.

Ken Giusti (kgiusti) wrote :

That fix for py-amqp (see comment #43) has landed on master today.
We'll need to bump the minimum version of amqp in oslo.messaging's requirements once the fix is released.

thanks everyone!

Fix proposed to branch: master
Review: https://review.openstack.org/637576

Changed in oslo.messaging:
assignee: Ken Giusti (kgiusti) → Magnus Bergman (magnusbe)
status: Confirmed → In Progress
Magnus Bergman (magnusbe) wrote :

Saw that amqp 2.4.1 has been released and as it contains the fix i took the liberty to send a patch proposal. Would you mind taking a look, Ken?

Reviewed: https://review.openstack.org/637576
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=64f74cd2cc17a9331f943f25c76dde6b4c604247
Submitter: Zuul
Branch: master

commit 64f74cd2cc17a9331f943f25c76dde6b4c604247
Author: mb <email address hidden>
Date: Mon Feb 18 17:16:58 2019 +0100

    Bump amqp requirement version to >=2.4.1

    Bumping the version of amqp in requirements.txt and
    lower-constraints.txt from 2.4.0 to 2.4.1 to mitigate issues with
    MessagingTimeouts when using TLS due to a amqp bug that was fixed in
    2.4.1

    Change-Id: I0b094f32dec24c70f95ccd509164a14a71fcfc7d
    Closes-Bug: #1800957

Changed in oslo.messaging:
status: In Progress → Fix Released
Ken Giusti (kgiusti) wrote :

Unfortunately we can't backport this fix to previous stable branches since it is a change to requirements which is technically a feature release.

I'll add release notes to the stable branches to alert people to this issue and recommend bumping amqp to the correct version.

Reviewed: https://review.openstack.org/638735
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=3113175443b975d72699b3bbe6fea488c414a3ef
Submitter: Zuul
Branch: stable/pike

commit 3113175443b975d72699b3bbe6fea488c414a3ef
Author: Kenneth Giusti <email address hidden>
Date: Fri Feb 22 12:50:14 2019 -0500

    Add release note for amqp library TLS/SSL error

    Change-Id: I90c172d673084ff0cb0a3dd0fae21a5d62d35286
    Related-Bug: #1800957

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/638461
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=693cd964e02002ab28366df08c5b75f396198492
Submitter: Zuul
Branch: stable/rocky

commit 693cd964e02002ab28366df08c5b75f396198492
Author: Kenneth Giusti <email address hidden>
Date: Thu Feb 21 11:44:54 2019 -0500

    Add release note for amqp library TLS/SSL error

    Change-Id: I0b094f32dec24c70f95ccd509164a14a71fcfc7d
    Related-Bug: #1800957

tags: added: in-stable-rocky

This issue was fixed in the openstack/oslo.messaging 9.5.0 release.

Reviewed: https://review.openstack.org/638733
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=389dacdc6bda1cad890c786e2bc6f9594e86a191
Submitter: Zuul
Branch: stable/queens

commit 389dacdc6bda1cad890c786e2bc6f9594e86a191
Author: Kenneth Giusti <email address hidden>
Date: Fri Feb 22 12:42:08 2019 -0500

    Add release note for amqp library TLS/SSL error

    Change-Id: I18281f406538902b6ae983de25be0fd2452ac0d3
    Related-Bug: #1800957

tags: added: in-stable-queens
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments