oslo.messaging delays reconnect trying to close old channel

Bug #1689801 reported by Dmitry Mescheryakov
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
Fix Released
Medium
Dmitry Mescheryakov

Bug Description

Version: 9.x

Steps to reproduce:
1. Deploy a MOS env with 3 controllers and 1 compute node
2. Download that file and save it as simulator.py: http://paste.openstack.org/show/608983/
   That is a modified copy of upstream simulator, if you are curious, make a diff against https://github.com/openstack/oslo.messaging/blob/master/tools/simulator.py
3. Go to compute node and apply that patch http://paste.openstack.org/show/608984/ to /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py

4. In that console set the following variable:
   RABBIT_URL=rabbit://<user>:<pass>@<node_1_ip>:5673,<user>:<pass>@<node_2_ip>:5673,<user>:<pass>@<node_3_ip>:5673/
   Populate user, pass and node_x_ip using the following parameters from /etc/nova/nova.conf: rabbit_hosts, rabbit_userid and rabbit_password

5. Open another console to controller, which IP goes first in RABBIT_URL list.
6. Return to console opened in step #3
7. Here run
   python simulator.py --url $RABBIT_URL rpc-client -m 2 -w 10 --is-cast true

   With that command simulator will send 2 messages (-m) with interval between messages 10 seconds (-w) in 'cast' mode, hence you don't need rpc server.

8. Wait for simulator to send the first message. It is done ones the following lines appear in console:
2017-05-10 11:21:53,661 DEBUG oslo_messaging._drivers.amqpdriver CAST unique_id: ...

9. Once you see these lines, quickly (you have 10 seconds to do that) switch to controller console opened in step #5 and here execute
    iptables -I OUTPUT 1 -p tcp --sport 5673 -j DROP

    That will block AMQP traffic to that node.

10. Observe the following lines next:
2017-05-10 11:22:03,677 DEBUG oslo_messaging._drivers.amqpdriver CAST unique_id:
2017-05-10 11:23:03,701 ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 192.168.0.3:5673 is unreachable: ...
...
2017-05-10 11:23:09,735 ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 192.168.0.3:5673 is unreachable: ...
2017-05-10 11:23:48,365 INFO oslo.messaging._drivers.impl_rabbit Reconnected to AMQP server ...
...

Note that 40 seconds pass between last 'server is unreachable' complaint and reconnect. That is an artificial delay caused by bug in code.

To remove iptables rule set in step #9 on controller execute
iptables -D OUTPUT -p tcp --sport 5673 -j DROP

That bug is very similar to https://bugs.launchpad.net/mos/+bug/1688581 and it would be easier to verify them together.

Changed in mos:
importance: Undecided → Medium
status: New → Confirmed
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
milestone: none → 9.x-updates
description: updated
tags: added: area-oslo customer-found
description: updated
description: updated
summary: - oslo.messaging delays reconnection trying to close old channel
+ oslo.messaging delays reconnect trying to close old channel
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The fix is on review here: https://review.fuel-infra.org/#/c/35213/

Changed in mos:
status: Confirmed → In Progress
Changed in mos:
milestone: 9.x-updates → 9.2-mu-3
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/35213
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 9f13af582f9eba68b95e569683d6686469afd79a
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed May 10 13:10:33 2017

Do not close channel belonging to dropped connection

Without the patch the following call of _set_current_channel() will
try to close existing channel. If rabbitmq on the other side of the
channel completely droppes packets, oslo.messaging will hang
closing the channel until the connection dies on oslo.messaging's
side.

Closes-Bug: #1689801
Prod bug: https://mirantis.jira.com/browse/OSCORE-28

Change-Id: I63a1d20f08196fb1c088956633015f224a9dce80

Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (mcp/ocata)

Fix proposed to branch: mcp/ocata
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/35869

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (mcp/newton)

Fix proposed to branch: mcp/newton
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/35873

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/oslo.messaging (9.0/mitaka)

Related fix proposed to branch: 9.0/mitaka
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/35877

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to openstack/oslo.messaging (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/35877
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 895f0bcbf3f612d1fd2bb43d3581e78a1d95df74
Author: Dmitry Mescheryakov <email address hidden>
Date: Thu Jul 6 16:04:27 2017

Clean up declared queues data on channel delete

In https://review.fuel-infra.org/#/c/35213 we delete channel once we
decide to drop current connection. That way we avoid wasting time
on trying to close the channel.

But I forgot to clean up set of declared queues for the channel.
That does not break anything, but as a result that list grows with
each reconnect, leaking memory.

Related-Bug: #1689801
Related prod: https://mirantis.jira.com/browse/OSCORE-28

Change-Id: I2f44f070445369ff9a1895b23ff4fef8326e5a58

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (mcp/1.0/mitaka)

Fix proposed to branch: mcp/1.0/mitaka
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/35941

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/oslo.messaging (mcp/1.0/mitaka)

Related fix proposed to branch: mcp/1.0/mitaka
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/35942

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (mcp/1.0/mitaka)

Reviewed: https://review.fuel-infra.org/35941
Submitter: Pkgs Jenkins <email address hidden>
Branch: mcp/1.0/mitaka

Commit: 439bad4b42aa8093f120f5caa27fd8d41f624efb
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed Jul 12 11:59:12 2017

Do not close channel belonging to dropped connection

Without the patch the following call of _set_current_channel() will
try to close existing channel. If rabbitmq on the other side of the
channel completely droppes packets, oslo.messaging will hang
closing the channel until the connection dies on oslo.messaging's
side.

Closes-Bug: #1689801
Prod bug: https://mirantis.jira.com/browse/OSCORE-28

Change-Id: I63a1d20f08196fb1c088956633015f224a9dce80

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to openstack/oslo.messaging (mcp/1.0/mitaka)

Reviewed: https://review.fuel-infra.org/35942
Submitter: Pkgs Jenkins <email address hidden>
Branch: mcp/1.0/mitaka

Commit: a253ebe185068e4c8a6523c3cf00bfb83c0ef9d0
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed Jul 12 11:59:12 2017

Clean up declared queues data on channel delete

In https://review.fuel-infra.org/#/c/35213 we delete channel once we
decide to drop current connection. That way we avoid wasting time
on trying to close the channel.

But I forgot to clean up set of declared queues for the channel.
That does not break anything, but as a result that list grows with
each reconnect, leaking memory.

Related-Bug: #1689801
Related prod: https://mirantis.jira.com/browse/OSCORE-28

Change-Id: I2f44f070445369ff9a1895b23ff4fef8326e5a58

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (mcp/ocata)

Reviewed: https://review.fuel-infra.org/35869
Submitter: Pkgs Jenkins <email address hidden>
Branch: mcp/ocata

Commit: e22688bd9ce640f4299a6a2c8bf464c41ba826db
Author: Dmitry Mescheryakov <email address hidden>
Date: Thu Jul 6 14:40:29 2017

Do not close channel belonging to dropped connection

Without the patch the following call of _set_current_channel() will
try to close existing channel. If rabbitmq on the other side of the
channel completely droppes packets, oslo.messaging will hang
closing the channel until the connection dies on oslo.messaging's
side.

Closes-Bug: #1689801
Prod bug: https://mirantis.jira.com/browse/OSCORE-28

Change-Id: I63a1d20f08196fb1c088956633015f224a9dce80

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (mcp/newton)

Reviewed: https://review.fuel-infra.org/35873
Submitter: Pkgs Jenkins <email address hidden>
Branch: mcp/newton

Commit: 595da720df8688042822abe35807ab23e51bc917
Author: Dmitry Mescheryakov <email address hidden>
Date: Thu Jul 6 15:09:37 2017

Do not close channel belonging to dropped connection

Without the patch the following call of _set_current_channel() will
try to close existing channel. If rabbitmq on the other side of the
channel completely droppes packets, oslo.messaging will hang
closing the channel until the connection dies on oslo.messaging's
side.

Closes-Bug: #1689801
Prod bug: https://mirantis.jira.com/browse/OSCORE-28

Change-Id: I63a1d20f08196fb1c088956633015f224a9dce80

Revision history for this message
Ilya Bumarskov (ibumarskov) wrote :

Verified on Fuel 9.2-MU3 (snapshots/9.0-2017-10-09-094324)

----------------------------------------------------------
2017-10-13 09:14:27,772 ERROR oslo.messaging._drivers.impl_rabbit AMQP server on 10.109.1.2:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2017-10-13 09:14:28,799 INFO oslo.messaging._drivers.impl_rabbit Reconnected to AMQP server on 10.109.1.4:5673 via [amqp] client
----------------------------------------------------------

Changed in mos:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.