Nonoptimal failover strategy can lead to RPC timeout

Bug #1523865 reported by Dmitry Mescheryakov
Affects                    Status        Importance  Assigned to
Mirantis OpenStack         Fix Released  High        Dmitry Mescheryakov
Mirantis OpenStack 7.0.x   Fix Released  High        Rodion Tikunov
Mirantis OpenStack 8.0.x   Fix Released  High        Dmitry Mescheryakov

Bug Description

Upstream issue: https://bugs.launchpad.net/oslo.messaging/+bug/1519851

We need to fix this in the product because the issue might lead to slow reconnect and, as a result, to random failures like the one described in this bug: https://bugs.launchpad.net/fuel/+bug/1518285 . In that bug it took oslo.messaging in a Neutron process more than a minute to reconnect, resulting in a failed deployment.

Tags: area-oslo
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/14487

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/14487
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: dcb5a1a44951dba3c40d9b5c3ddc31728f789784
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Dec 8 10:49:39 2015

Use round robin failover strategy for Kombu driver

The shuffle strategy we use right now leads to increased reconnection
time and provides no benefit. Sometimes it may even cause RPC
operations to time out, because the strategy provides no guarantee on
how long the reconnection process will take. See the referenced bug
for details.

On the other hand, the round-robin strategy provides the lowest
achievable reconnection time. It also guarantees that if K of N
RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to
successfully reconnect to the RabbitMQ cluster.

With the shuffle strategy, clients connect to random hosts during
failover, so the load is distributed evenly between the alive
RabbitMQ nodes. But since we shuffle the list of hosts before
providing it to Kombu, the load is distributed evenly with the
round-robin strategy as well.

Closes-Bug: #1523865
Change-Id: I9a510c86bd5a6ce8b707734385af1a83de82804e
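
For illustration, here is a minimal Python sketch (not the actual oslo.messaging code) of why round-robin bounds reconnection attempts at N - K + 1 while shuffle does not; the host list and try_connect stub are hypothetical:

    import itertools
    import random

    # Hypothetical cluster: N = 3 hosts, K = 1 alive.
    hosts = ["10.0.0.1:5672", "10.0.0.2:5672", "10.0.0.3:5672"]
    alive = {"10.0.0.3:5672"}

    def try_connect(host):
        # Stand-in for a real AMQP connection attempt.
        return host in alive

    def round_robin_failover(hosts):
        # Fixed cyclic order: a live host is reached in at most
        # N - K + 1 attempts (here 3 - 1 + 1 = 3).
        for attempts, host in enumerate(itertools.cycle(hosts), 1):
            if try_connect(host):
                return host, attempts

    def shuffle_failover(hosts):
        # Random choice with replacement: no upper bound, because
        # the same dead host can be drawn again and again.
        attempts = 0
        while True:
            attempts += 1
            host = random.choice(hosts)
            if try_connect(host):
                return host, attempts

    print(round_robin_failover(hosts))  # at most 3 attempts
    print(shuffle_failover(hosts))      # 9+ attempts are possible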

tags: added: area-oslo
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

OK, it looks like everything works fine; moving to Fix Released.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Rodion Tikunov <email address hidden>
Review: https://review.fuel-infra.org/17554

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/17554
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: b17b7ed8f62c2ba3bc1fae1f20d55e84751affb8
Author: Rodion Tikunov <email address hidden>
Date: Mon Feb 29 09:24:55 2016

Use round robin failover strategy for Kombu driver

The shuffle strategy we use right now leads to increased reconnection
time and provides no benefit. Sometimes it may even cause RPC
operations to time out, because the strategy provides no guarantee on
how long the reconnection process will take. See the referenced bug
for details.

On the other hand, the round-robin strategy provides the lowest
achievable reconnection time. It also guarantees that if K of N
RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to
successfully reconnect to the RabbitMQ cluster.

With the shuffle strategy, clients connect to random hosts during
failover, so the load is distributed evenly between the alive
RabbitMQ nodes. But since we shuffle the list of hosts before
providing it to Kombu, the load is distributed evenly with the
round-robin strategy as well.

Closes-Bug: #1523865
Change-Id: I9a510c86bd5a6ce8b707734385af1a83de82804e
(cherry picked from commit dcb5a1a44951dba3c40d9b5c3ddc31728f789784)
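
For reference, the failover strategy is ultimately a Kombu connection option. A minimal sketch of selecting it explicitly (broker URLs and credentials are placeholders):

    from kombu import Connection

    # Semicolon-separated URLs give Kombu a list of failover brokers.
    # 'round-robin' cycles through them in order; 'shuffle' (the old
    # behaviour) picks a random one on each attempt.
    conn = Connection(
        "amqp://guest:guest@10.0.0.1:5672//;"
        "amqp://guest:guest@10.0.0.2:5672//",
        failover_strategy="round-robin",
    )
    conn.ensure_connection(max_retries=5)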

tags: added: on-verification
Revision history for this message
Alexander Gromov (agromov) wrote :

Verified on MOS 7.0 + mu4

Steps to reproduce:
1. Download the following files:
https://github.com/dmitrymex/example-oslo-messaging/raw/master/example_rpc_server.py
https://raw.githubusercontent.com/dmitrymex/example-oslo-messaging/master/server.conf

2. Replace the parameters in the [oslo_messaging_rabbit] section with the values from /etc/nova/nova.conf.
Note: in the 'rabbit_hosts' option, specify one working address and several invalid ones (for example, 1 correct address and 2 addresses with incorrect ports); see the sketch after these steps.

Example (10.109.11.4:5673 - correct address, 10.109.11.7:5693 and 10.109.11.5:5683 - incorrect addresses):
http://paste.openstack.org/show/507286/

3. Run example_rpc_server.py script:
python example_rpc_server.py --config-file server.conf

4. Check the number of attempts to connect to the server.
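
A minimal sketch of the relevant server.conf sections for step 2, using the addresses from the example above (only the first is reachable; user and password are placeholders to be taken from nova.conf):

    [DEFAULT]
    rpc_backend = rabbit

    [oslo_messaging_rabbit]
    rabbit_hosts = 10.109.11.4:5673,10.109.11.7:5693,10.109.11.5:5683
    rabbit_userid = <user from nova.conf>
    rabbit_password = <password from nova.conf>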

Without updates:
We can get a situation where many attempts are needed to find a working host.
The maximum number of attempts is not bounded.
With 3 controllers, I hit a case where 9 attempts were required:
http://paste.openstack.org/show/507287/

With updates:
Addresses are tried consecutively.
The maximum number of attempts is less than or equal to the controller count + 1:
http://paste.openstack.org/show/507291/

tags: removed: on-verification