oslo.messaging does not honor rpc timeout and fails late

Bug #1688581 reported by Dmitry Mescheryakov
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
Fix Released
Medium
Dmitry Mescheryakov

Bug Description

Version: 9.x

Steps to reproduce:
1. Deploy a MOS env with 3 controllers and 1 compute node
2. Download that file and save it as simulator.py: http://paste.openstack.org/show/608983/
   That is a modified copy of upstream simulator, if you are curious, make a diff against https://github.com/openstack/oslo.messaging/blob/master/tools/simulator.py
3. Go to compute node and apply that patch http://paste.openstack.org/show/608984/ to /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py

4. In that console set the following variable:
   RABBIT_URL=rabbit://<user>:<pass>@<node_1_ip>:5673,<user>:<pass>@<node_2_ip>:5673,<user>:<pass>@<node_3_ip>:5673/
   Populate user, pass and node_x_ip using the following parameters from /etc/nova/nova.conf: rabbit_hosts, rabbit_userid and rabbit_password

5. Run
   python simulator.py --url $RABBIT_URL rpc-server -w 1

6. Open another console to controller, which IP goes first in RABBIT_URL list.
7. Open yet another console to the compute node and populate RABBIT_URL variable here as well.
8. Here run
   python simulator.py --url $RABBIT_URL rpc-client --timeout 10 -m 2 -w 10

   With that command simulator will send 2 messages (-m) with timeout set to 10 seconds (--timeout) and interval between messages 10 seconds (-w)

9. Wait for simulator to send the first message and receives response from rpc-server. It is done once the following lines appear in console:
2017-05-05 14:05:22,129 DEBUG oslo_messaging._drivers.amqpdriver CALL msg_id: ...
2017-05-05 14:05:23,153 DEBUG oslo_messaging._drivers.amqpdriver received reply msg_id: ...

10. Once you see these lines, quickly (you have 10 seconds to do that) switch to controller console opened in step #6 and here execute
    iptables -I OUTPUT 1 -p tcp --sport 5673 -j DROP

    That will block AMQP traffic to that node.

11. Observe the following lines next:
2017-05-05 14:11:10,076 DEBUG oslo_messaging._drivers.amqpdriver CALL msg_id: ...
2017-05-05 14:12:10,087 DEBUG oslo.messaging._drivers.impl_rabbit Received recoverable error from kombu

What goes next is of no importance. Important here is that it takes oslo.messaging 60 seconds to surrender first attempt and try to reconnect while the timeout is 10 seconds. oslo.messaging should react much faster to the problem.

To remove iptables rule set in step #10 on controller execute
iptables -D OUTPUT -p tcp --sport 5673 -j DROP

That bug is very similar to https://bugs.launchpad.net/mos/+bug/1689801 and it would be easier to verify them together.

tags: added: area-oslo
tags: added: customer-found
Changed in mos:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
milestone: none → 9.x-updates
description: updated
description: updated
description: updated
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The fix is on review here: https://review.fuel-infra.org/#/c/35212/

Changed in mos:
status: Confirmed → In Progress
Changed in mos:
milestone: 9.x-updates → 9.2-mu-3
description: updated
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/35212
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 4d2194e80c17ca5adadabbdceb1ea7171031a6c3
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri May 5 14:52:23 2017

Declare exchange with the call timeout

We missed to do that before and as a result it took oslo.messaing
long time to reconnect.

Closes-Bug: #1688581
Prod bug: https://mirantis.jira.com/browse/OSCORE-27

Change-Id: Ib7eb58871f4da59012a681a90d8d4820c00c7adb

Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (mcp/1.0/mitaka)

Fix proposed to branch: mcp/1.0/mitaka
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/35946

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (mcp/1.0/mitaka)

Reviewed: https://review.fuel-infra.org/35946
Submitter: Pkgs Jenkins <email address hidden>
Branch: mcp/1.0/mitaka

Commit: 438056dd82b7051a5fd20a426f0ef9c4b5388b47
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed Jul 12 13:38:59 2017

Declare exchange with the call timeout

We missed to do that before and as a result it took oslo.messaing
long time to reconnect.

Closes-Bug: #1688581
Prod bug: https://mirantis.jira.com/browse/OSCORE-27

Change-Id: Ib7eb58871f4da59012a681a90d8d4820c00c7adb

Revision history for this message
Ilya Bumarskov (ibumarskov) wrote :

Verified on Fuel 9.2-MU3 (snapshots/9.0-2017-10-09-094324):

-----------------------------------------------------------
2017-10-13 09:08:59,027 DEBUG oslo_messaging._drivers.amqpdriver CALL msg_id: 659767af9d82473facd6139213d41625 size: 205 exchange: openstack topic: profiler_topic.profiler_server
2017-10-13 09:09:09,035 DEBUG oslo.messaging._drivers.impl_rabbit Received recoverable error from kombu:
-----------------------------------------------------------

Changed in mos:
status: Fix Committed → Fix Released
Revision history for this message
Ilya Bumarskov (ibumarskov) wrote :

Verified on Fuel 9.2-MU3 (snapshots/9.0-2017-10-09-094324)

----------------------------------------------------------
2017-10-13 09:08:59,027 DEBUG oslo_messaging._drivers.amqpdriver CALL msg_id: 659767af9d82473facd6139213d41625 size: 205 exchange: openstack topic: profiler_topic.profiler_server
2017-10-13 09:09:09,035 DEBUG oslo.messaging._drivers.impl_rabbit Received recoverable error from kombu:
----------------------------------------------------------

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.