Uncontrolled RabbitMQ channel creation after rabbitmq-server failure

Bug #1406629 reported by Bjoern
This bug affects 16 people
Affects             Status        Importance  Assigned to      Milestone
OpenStack-Ansible   Fix Released  High        Kevin Carter
  Icehouse          Fix Released  High        Jesse Pretorius
  Juno              Fix Released  High        Jesse Pretorius
  Trunk             Fix Released  High        Kevin Carter
oslo.messaging      Fix Released  High        Mehdi Abaakouk

Bug Description

We noticed a pretty serious issue in oslo.messaging version 1.5.1: whenever one of several configured rabbitmq servers dies or crashes, we suddenly see an influx of rabbitmq channels per connection. In a normal setup we usually see one rabbitmq channel per connection, but in this situation the number of channels per connection keeps growing, for instance:

# rabbitmqctl list_connections name channels | awk '{ if ($4 > 1) print }'

172.29.238.239:60651 -> 172.29.236.219:5672 453
172.29.238.56:43869 -> 172.29.236.219:5672 452
172.29.238.56:43870 -> 172.29.236.219:5672 452
172.29.239.102:37953 -> 172.29.238.113:5672 454
172.29.239.122:36418 -> 172.29.236.219:5672 452
172.29.239.122:36419 -> 172.29.236.219:5672 60
172.29.239.214:45186 -> 172.29.236.219:5672 453

This output should be empty in a normal situation (the awk filter only prints connections holding more than one channel; the channel count is field 4 because the connection name spans the first three fields).
At this point I also noticed that the amqp connection layer has connected to all remaining rabbitmq servers; TCP sessions are established.
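
For ongoing checks, a small wrapper around the same rabbitmqctl query can flag leaking connections. The following is only a rough sketch, assuming rabbitmqctl is on PATH and the script runs on the broker node:

    # Sketch: report connections holding more than one channel.
    # Assumptions: rabbitmqctl is on PATH and this runs on the broker node.
    import subprocess

    output = subprocess.check_output(
        ['rabbitmqctl', 'list_connections', 'name', 'channels'])

    for line in output.decode('utf-8').splitlines():
        parts = line.split()
        # Data rows look like "<src> -> <dst> <channels>"; skip the
        # "Listing connections ..." header and "...done." footer.
        if len(parts) != 4 or not parts[-1].isdigit():
            continue
        name, channels = ' '.join(parts[:-1]), int(parts[-1])
        if channels > 1:
            print('%s is holding %d channels' % (name, channels))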

Before failure of rabbitmq-server (cluster of 3 nodes):
---------------------
The amqp connection layer has only one TCP connection open to the first configured rabbit_host

 root@compute1_neutron_agents_container-0d5e3efc:~# netstat -nt |grep 5672 | awk '{print $5}' | sort -u
 172.29.236.219:5672

After failure of the first configured rabbitmq-server:
--------------------
The amqp connection layer now has a TCP connection open to each remaining rabbitmq server

 root@compute1_neutron_agents_container-0d5e3efc:~# netstat -nt |grep 5672 | awk '{print $5}' | sort -u
 172.29.236.169:5672
 172.29.238.113:5672

Stack trace:

2014-12-30 17:40:28.197 1574 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.236.219:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2014-12-30 17:40:28.198 1574 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: [Errno 111] ECONNREFUSED
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 436, in _ensured
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit return fun(*args, **kwargs)
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 507, in __call__
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit self.revive(create_channel())
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 242, in channel
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit chan = self.transport.create_channel(self.connection)
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 741, in connection
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit self._connection = self._establish_connection()
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 696, in _establish_connection
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit conn = self.transport.establish_connection()
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 112, in establish_connection
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit conn = self.Connection(**opts)
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/amqp/connection.py", line 165, in __init__
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit self.transport = self.Transport(host, connect_timeout, ssl)
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/amqp/connection.py", line 186, in Transport
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit return create_transport(host, connect_timeout, ssl)
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/amqp/transport.py", line 299, in create_transport
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit return TCPTransport(host, connect_timeout)
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/amqp/transport.py", line 95, in __init__
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit raise socket.error(last_err)
2014-12-30 17:40:28.198 1574 TRACE oslo.messaging._drivers.impl_rabbit error: [Errno 111] ECONNREFUSED

Configured rabbitmq settings (user/password omitted):

rabbit_hosts = 172.29.236.219:5672,172.29.238.113:5672,172.29.236.169:5672
rabbit_port = 5672

Oslo versions:
oslo.config==1.5.0
oslo.db==1.3.0
oslo.i18n==1.1.0
oslo.messaging==1.5.1
oslo.middleware==0.2.0
oslo.rootwrap==1.4.0
oslo.serialization==1.1.0
oslo.utils==1.1.0

RabbitMQ version: 3.4.2-1 (from http://www.rabbitmq.com/debian/)

Currently we can only stop this behavior by restarting all OpenStack services that use rabbitmq.

Tags: in-juno
Revision history for this message
Bjoern (bjoern-t) wrote :

This issue lets the channels grow until the Erlang VM runs out of processes and the whole OpenStack setup comes to a halt, at least from the provisioning perspective.

Ian Cordasco (icordasc)
Changed in oslo.messaging:
status: New → Confirmed
Revision history for this message
Ian Cordasco (icordasc) wrote :

I've also extracted the following traces from the logs: https://gist.github.com/sigmavirus24/88e55420bfc7c9e089e7 (reproduced below). The last one is the same trace Bjoern pasted above. It seems to me that oslo.messaging should be catching some of these exceptions and attempting to do whatever is necessary to clean up the channel it was using for rabbit. Rabbit will not reap channels that are no longer in use or are effectively broken; that means it is the client's responsibility to do so.

There may be a case to be made that Kombu should be doing this, but at this point it would seem that oslo.messaging is well positioned to fix this for OpenStack and then work a patch into Kombu.

2014-12-05 18:29:48.642 1621 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on rabbit_host0:5672
2014-12-05 18:29:48.651 1621 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on rabbit_host0:5672
2014-12-09 17:24:58.445 1621 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: [Errno 104] Connection reset by peer
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit return method()
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 735, in _consume
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit return self.connection.drain_events(timeout=timeout)
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 275, in drain_events
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit return self.transport.drain_events(self.connection, **kwargs)
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 91, in drain_events
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit return connection.drain_events(**kwargs)
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/amqp/connection.py", line 323, in drain_events
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit return amqp_method(channel, args)
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/amqp/connection.py", line 526, in _close
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit self._x_close_ok()
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/amqp/connection.py", line 554, in _x_close_ok
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit self._send_method((10, 51))
2014-12-09 17:24:58.445 1621 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/loca...
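
To illustrate the cleanup responsibility described in the comment above, here is a minimal sketch (not the actual oslo.messaging patch) of explicitly closing a kombu channel once it is no longer usable, so the broker is not left holding it; the broker URL is a placeholder:

    # Minimal sketch of explicit channel cleanup with kombu.
    # The broker URL is a placeholder, not a real deployment value.
    import kombu

    connection = kombu.Connection('amqp://guest:guest@localhost:5672//')
    channel = None
    try:
        channel = connection.channel()
        # ... declare queues, publish or consume on the channel here ...
    finally:
        if channel is not None:
            try:
                channel.close()
            except Exception:
                # Closing an already-broken channel can itself fail; the
                # point is simply not to accumulate channels on the broker.
                pass
        connection.release()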

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/145232

Changed in oslo.messaging:
assignee: nobody → Mehdi Abaakouk (sileht)
status: Confirmed → In Progress
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
importance: Undecided → High
milestone: none → next-kilo
Revision history for this message
Bjoern (bjoern-t) wrote :

Thanks for tackling this issue. I tested oslo.messaging 1.5.1 plus the preview patch from https://review.openstack.org/145232, and the issue worsened: after I patched the impl_rabbit module and restarted the neutron-agent services, for instance, the channels immediately increased without even a rabbitmq failure during that test period. The rabbitmq cluster was intact during this test.

Revision history for this message
Bjoern (bjoern-t) wrote :

I'm currently testing patchset 9; I should have the results pretty soon.

Changed in openstack-ansible:
milestone: none → next
David Wilde (dave-wilde)
Changed in openstack-ansible:
status: New → Confirmed
importance: Undecided → High
status: Confirmed → Triaged
Revision history for this message
Kevin Carter (kevin-carter) wrote :

@Bjoern Teipel

Just following up with you regarding the oslo.messaging issues. I wanted to see whether you've been able to determine if version 1.5.2 Patch-Set 9 is any better when testing the rabbitmq connection issues. I'm also curious whether you've tried any other versions of oslo.messaging; if so, are they better or worse? I ask because within Juno we should be OK to use version "1.4.1" if it improves the situation with rabbit.

Revision history for this message
Bjoern (bjoern-t) wrote :

Kevin,

Yes, Patchset 9 seems to improve the situation. I'm currently working to get this patch tested in a full environment where I can spin up resources. Also, version 1.4.1 did not seem to show this behavior.

Changed in oslo.messaging:
milestone: 1.6.0 → next-kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (master)

Reviewed: https://review.openstack.org/151503
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=227c1029f7febb3527e0729b84bb03a8c9d2d631
Submitter: Jenkins
Branch: master

commit 227c1029f7febb3527e0729b84bb03a8c9d2d631
Author: Kevin Carter <email address hidden>
Date: Thu Jan 29 13:49:48 2015 -0600

    Pinned Oslo Messaging version to version 1.4.1

    This change pins oslo messaging to version 1.4.1 which will satisfy
    both icehouse and juno requirements. This change was in direct
    response to a support request.

    The issue with the latest Oslo Messaging for Juno, Oslo 1.5.1,
    is that there is a potential for it to open too many connections
    to rabbitmq. While this behavior is present in version 1.4.1 as well
    it is less so than in 1.5.1. At this time pinning the version will
    create a better overall deployment experience.

    Change-Id: I37793557d53fc5965547d2b8cc3ca44812def8db
    Partial-Bug: 1406629

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (juno)

Fix proposed to branch: juno
Review: https://review.openstack.org/151611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (icehouse)

Fix proposed to branch: icehouse
Review: https://review.openstack.org/151612

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (juno)

Reviewed: https://review.openstack.org/151611
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=46565e781cbb291f9093101fd944b7d5ecc627b6
Submitter: Jenkins
Branch: juno

commit 46565e781cbb291f9093101fd944b7d5ecc627b6
Author: Kevin Carter <email address hidden>
Date: Thu Jan 29 13:49:48 2015 -0600

    Pinned Oslo Messaging version to version 1.4.1

    This change pins oslo messaging to version 1.4.1 which will satisfy
    both icehouse and juno requirements. This change was in direct
    response to a support request.

    The issue with the latest Oslo Messaging for Juno, Oslo 1.5.1,
    is that there is a potential for it to open too many connections
    to rabbitmq. While this behavior is present in version 1.4.1 as well
    it is less so than in 1.5.1. At this time pinning the version will
    create a better overall deployment experience.

    Change-Id: I37793557d53fc5965547d2b8cc3ca44812def8db
    Partial-Bug: 1406629
    (cherry picked from commit 227c1029f7febb3527e0729b84bb03a8c9d2d631)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (icehouse)

Reviewed: https://review.openstack.org/151612
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=bc4d5e469dfb347fc51304eeab68fd4010ff753a
Submitter: Jenkins
Branch: icehouse

commit bc4d5e469dfb347fc51304eeab68fd4010ff753a
Author: Kevin Carter <email address hidden>
Date: Thu Jan 29 13:49:48 2015 -0600

    Pinned Oslo Messaging version to version 1.4.1

    This change pins oslo messaging to version 1.4.1 which will satisfy
    both icehouse and juno requirements. This change was in direct
    response to a support request.

    The issue with the latest Oslo Messaging for Juno, Oslo 1.5.1,
    is that there is a potential for it to open too many connections
    to rabbitmq. While this behavior is present in version 1.4.1 as well
    it is less so than in 1.5.1. At this time pinning the version will
    create a better overall deployment experience.

    Change-Id: I37793557d53fc5965547d2b8cc3ca44812def8db
    Partial-Bug: 1406629
    (cherry picked from commit 227c1029f7febb3527e0729b84bb03a8c9d2d631)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/145232
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=e7e5506b92606d1ea160da4a22be6bd66353d1cc
Submitter: Jenkins
Branch: master

commit e7e5506b92606d1ea160da4a22be6bd66353d1cc
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Jan 6 15:11:55 2015 +0100

    Ensure kombu channels are closed

    Kombu doesn't always close the channels it creates for us,
    so we must ensure they are closed correctly.

    Closes-bug: #1406629

    Change-Id: I7c1b31c37ac75dd4fded4d86f046c18f9c2dd7b8

Changed in oslo.messaging:
status: In Progress → Fix Committed
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible (juno)

Related fix proposed to branch: juno
Review: https://review.openstack.org/236027

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible (juno)

Reviewed: https://review.openstack.org/236027
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=fe1aedf5aa12995733a58fd5e9fff3b826f44490
Submitter: Jenkins
Branch: juno

commit fe1aedf5aa12995733a58fd5e9fff3b826f44490
Author: Kevin Carter <email address hidden>
Date: Fri Oct 16 12:36:33 2015 -0500

    Backport dist_sort for RabbitMQ host

    This change backports the deterministic sorting from Kilo and
    beyond into the Juno code base. This makes RabbitMQ more fault
    tolerant and balances the load on messaging.

    Change-Id: Ie570228d3d5f6815413aa42819cdf6fc86aa350c
    Related-Bug: #1406629
    Signed-off-by: Kevin Carter <email address hidden>
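
For context, the idea behind this dist_sort backport is that each consumer host derives a deterministic but host-specific ordering of rabbit_hosts, so different hosts prefer different brokers. The sketch below only illustrates that idea; the function name and rotation scheme are illustrative, not the actual Ansible filter:

    # Hypothetical illustration of deterministic, per-host ordering of
    # rabbit_hosts; not the actual dist_sort filter from openstack-ansible.
    import hashlib

    def dist_sort(hosts, consumer_name):
        """Rotate hosts deterministically based on the consumer's name."""
        if not hosts:
            return hosts
        digest = hashlib.md5(consumer_name.encode('utf-8')).hexdigest()
        offset = int(digest, 16) % len(hosts)
        return hosts[offset:] + hosts[:offset]

    rabbit_hosts = ['172.29.236.219:5672',
                    '172.29.238.113:5672',
                    '172.29.236.169:5672']

    # Two consumers get different but stable orderings, spreading the load.
    print(dist_sort(rabbit_hosts, 'compute1_neutron_agents_container'))
    print(dist_sort(rabbit_hosts, 'compute2_neutron_agents_container'))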

tags: added: in-juno