Nonoptimal failover strategy can lead to RPC timeout

Bug #1523865 reported by Dmitry Mescheryakov
Affects                    Status        Importance  Assigned to
Mirantis OpenStack         Fix Released  High        Dmitry Mescheryakov
Mirantis OpenStack 7.0.x   Fix Released  High        Rodion Tikunov
Mirantis OpenStack 8.0.x   Fix Released  High        Dmitry Mescheryakov

Bug Description

Upstream issue: https://bugs.launchpad.net/oslo.messaging/+bug/1519851

We need to fix this in the product because the issue might lead to slow reconnect and, as a result, to random failures like the one described in this bug: https://bugs.launchpad.net/fuel/+bug/1518285 . In that bug it took oslo.messaging in a Neutron process more than a minute to reconnect, resulting in a failed deployment.

Tags: area-oslo
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/14487

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/14487
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: dcb5a1a44951dba3c40d9b5c3ddc31728f789784
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Dec 8 10:49:39 2015

Use round robin failover strategy for Kombu driver

The shuffle strategy we use right now leads to increased reconnection
time and provides no benefit. Sometimes it may even cause RPC
operations to time out, because the strategy provides no guarantee on
how long the reconnection process will take. See the referenced bug
for details.

On the other hand, the round-robin strategy provides the lowest
achievable reconnection time. It also guarantees that if K of N
RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to
successfully reconnect to the RabbitMQ cluster.

With the shuffle strategy, clients connect to random hosts during
failover, so the load is distributed evenly between the alive
RabbitMQ nodes. But since we shuffle the list of hosts before
providing it to Kombu, the load is distributed evenly with the
round-robin strategy as well.

Closes-Bug: #1523865
Change-Id: I9a510c86bd5a6ce8b707734385af1a83de82804e
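
For illustration, here is a minimal Python sketch (not the actual oslo.messaging code) of why round-robin bounds reconnection attempts at N - K + 1 while shuffle does not; the host list and try_connect stub are hypothetical:

    import itertools
    import random

    # Hypothetical cluster: N = 3 hosts, K = 1 alive.
    hosts = ["10.0.0.1:5672", "10.0.0.2:5672", "10.0.0.3:5672"]
    alive = {"10.0.0.3:5672"}

    def try_connect(host):
        # Stand-in for a real AMQP connection attempt.
        return host in alive

    def round_robin_failover(hosts):
        # Fixed cyclic order: a live host is reached in at most
        # N - K + 1 attempts (here 3 - 1 + 1 = 3).
        for attempts, host in enumerate(itertools.cycle(hosts), 1):
            if try_connect(host):
                return host, attempts

    def shuffle_failover(hosts):
        # Random choice with replacement: no upper bound, because
        # the same dead host can be drawn again and again.
        attempts = 0
        while True:
            attempts += 1
            host = random.choice(hosts)
            if try_connect(host):
                return host, attempts

    print(round_robin_failover(hosts))  # at most 3 attempts
    print(shuffle_failover(hosts))      # 9+ attempts are possible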

tags: added: area-oslo
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

OK, it looks like everything works fine; moving to Fix Released.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Rodion Tikunov <email address hidden>
Review: https://review.fuel-infra.org/17554

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/17554
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: b17b7ed8f62c2ba3bc1fae1f20d55e84751affb8
Author: Rodion Tikunov <email address hidden>
Date: Mon Feb 29 09:24:55 2016

Use round robin failover strategy for Kombu driver

The shuffle strategy we use right now leads to increased reconnection
time and provides no benefit. Sometimes it may even cause RPC
operations to time out, because the strategy provides no guarantee on
how long the reconnection process will take. See the referenced bug
for details.

On the other hand, the round-robin strategy provides the lowest
achievable reconnection time. It also guarantees that if K of N
RabbitMQ hosts are alive, it will take at most N - K + 1 attempts to
successfully reconnect to the RabbitMQ cluster.

With the shuffle strategy, clients connect to random hosts during
failover, so the load is distributed evenly between the alive
RabbitMQ nodes. But since we shuffle the list of hosts before
providing it to Kombu, the load is distributed evenly with the
round-robin strategy as well.

Closes-Bug: #1523865
Change-Id: I9a510c86bd5a6ce8b707734385af1a83de82804e
(cherry picked from commit dcb5a1a44951dba3c40d9b5c3ddc31728f789784)
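
For reference, the failover strategy is ultimately a Kombu connection option. A minimal sketch of selecting it explicitly (broker URLs and credentials are placeholders):

    from kombu import Connection

    # Semicolon-separated URLs give Kombu a list of failover brokers.
    # 'round-robin' cycles through them in order; 'shuffle' (the old
    # behaviour) picks a random one on each attempt.
    conn = Connection(
        "amqp://guest:guest@10.0.0.1:5672//;"
        "amqp://guest:guest@10.0.0.2:5672//",
        failover_strategy="round-robin",
    )
    conn.ensure_connection(max_retries=5)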

tags: added: on-verification
Revision history for this message
Alexander Gromov (agromov) wrote :

Verified on MOS 7.0 + mu4

Steps to reproduce:
1. Download the following files:
https://github.com/dmitrymex/example-oslo-messaging/raw/master/example_rpc_server.py
https://raw.githubusercontent.com/dmitrymex/example-oslo-messaging/master/server.conf

2. Replace the parameters in the [oslo_messaging_rabbit] section with the values from /etc/nova/nova.conf.
Note: in the 'rabbit_hosts' option, specify one working address and several invalid ones (for example, 1 correct address and 2 addresses with incorrect ports); see the sketch after these steps.

Example (10.109.11.4:5673 - correct address, 10.109.11.7:5693 and 10.109.11.5:5683 - incorrect addresses):
http://paste.openstack.org/show/507286/

3. Run example_rpc_server.py script:
python example_rpc_server.py --config-file server.conf

4. Check the number of attempts to connect to the server.
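
A minimal sketch of the relevant server.conf sections for step 2, using the addresses from the example above (only the first is reachable; user and password are placeholders to be taken from nova.conf):

    [DEFAULT]
    rpc_backend = rabbit

    [oslo_messaging_rabbit]
    rabbit_hosts = 10.109.11.4:5673,10.109.11.7:5693,10.109.11.5:5683
    rabbit_userid = <user from nova.conf>
    rabbit_password = <password from nova.conf>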

Without updates:
We can get a situation where many attempts are needed to find a working host.
The maximum number of attempts is not bounded.
With 3 controllers, I hit a case where 9 attempts were required:
http://paste.openstack.org/show/507287/

With updates:
Addresses are tried consecutively.
The maximum number of attempts is less than or equal to the controller count + 1:
http://paste.openstack.org/show/507291/

tags: removed: on-verification