Nonoptimal failover strategy can lead to RPC timeout

Bug #1519851 reported by Dmitry Mescheryakov
This bug affects 1 person
Affects         Status        Importance  Assigned to          Milestone
oslo.messaging  Fix Released  Undecided   Dmitry Mescheryakov

Bug Description

The 'shuffle' failover strategy that we currently use in Kombu can lead to RPC timeouts. The strategy is set here:
https://github.com/openstack/oslo.messaging/blob/5840ab3340b49a5862551df1209e3e53fc8bc978/oslo_messaging/_drivers/impl_rabbit.py#L459

Each time the current connection drops, the strategy picks a random host from all available hosts and tries to connect to it. The strategy is not 'fair' and might select the same host several times in a row. For example, here it took oslo.messaging 6 attempts to reconnect:
http://paste.openstack.org/show/479759/

As a result, reconnection might take a significant number of attempts. For instance, if 2 of 3 RabbitMQ nodes are down, the probability that a successful reconnect takes at least 12 attempts is (2/3)^11 ~ 1%, because the first 11 attempts must all pick a dead node. Each reconnect attempt takes around 5 seconds, so 12 attempts take a minute or more, which reaches the default RPC timeout. That causes RPC operations to time out.
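
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the description, assuming every attempt picks one host uniformly at random and independently of earlier attempts; the function name prob_at_least is made up for illustration and is not part of oslo.messaging or Kombu.

```python
from fractions import Fraction


def prob_at_least(attempts: int, dead: int, total: int) -> Fraction:
    """Probability that a successful reconnect needs at least `attempts` tries.

    Assumes each try independently picks one of `total` hosts uniformly at
    random and that `dead` of those hosts are down.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    # Needing at least `attempts` tries means the first (attempts - 1)
    # tries all landed on a dead host.
    return Fraction(dead, total) ** (attempts - 1)


p = prob_at_least(12, dead=2, total=3)
print(p, float(p))   # exact fraction and its decimal value, a bit over 1%

# Time spent on the 11 failed attempts alone, at roughly 5 seconds each.
failed_seconds = 11 * 5
print(failed_seconds)
```

With the numbers from the description, the 11 failed attempts already consume about 55 seconds, so the 12th attempt lands at or beyond a 60 second timeout.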

Changed in oslo.messaging:
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/249849

Changed in oslo.messaging:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/249849
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=6ae46796a61fc97467450b5bdd51dc6a0c86f9f4
Submitter: Jenkins
Branch: master

commit 6ae46796a61fc97467450b5bdd51dc6a0c86f9f4
Author: Dmitry Mescheryakov <email address hidden>
Date: Mon Nov 23 17:27:24 2015 +0300

    Use round robin failover strategy for Kombu driver

    Shuffle strategy we use right now leads to increased reconnection time
    and provides no benefit. Sometimes it might lead to RPC operations
    timeout because the strategy provides no guarantee on how long the
    reconnection process will take. See the referenced bug for details.

    On the other side, round-robin strategy provides least achievable
    reconnection time. It also provides guarantee that if K of N RabbitMQ
    hosts are alive, it will take at most N - K + 1 attempts to
    successfully reconnect to RabbitMQ cluster.

    With shuffle strategy during failover clients connect to random hosts
    and so the load is distributed evenly between alive RabbitMQs.
    But since we shuffle list of hosts before providing it to Kombu, load
    will be distributed evenly with round-robin strategy as well.

    DocImpact
    A new configuration option kombu_failover_strategy for Kombu driver is
    added. It determines how the next RabbitMQ node is chosen in case the
    one we are currently connected to becomes unavailable. It takes effect
    only if more than one RabbitMQ node is provided in config. Available
    options are:

     * round-robin: each RabbitMQ host in the list is tried in cycle until
       oslo.messaging successfully connects. Since oslo.messaging
       shuffles list of RabbitMQ hosts, the order of hosts in the cycle
       will be random and will not depend on order provided in config.

     * shuffle: oslo.messaging selects a random host from the list and
       tries to connect to it. If connection fails, oslo.messaging repeats
       attempt to connect to another random host. Oslo.messaging stops
       once it successfully connects to a host. Note that in each
       iteration a host to connect is selected independently of previous
       iterations, i.e. it might happen that oslo.messaging will try to
       connect to the same host several times in a row.

    The option's default value is round-robin. Before the option was
    introduced, the default strategy was shuffle. For the reasoning,
    see the main body of the commit message and the referenced bug.

    Closes-Bug: #1519851
    Change-Id: I9a510c86bd5a6ce8b707734385af1a83de82804e
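
The difference between the two strategies described in the commit message can be illustrated with a small simulation. This is a sketch under stated assumptions, not the oslo.messaging or Kombu implementation: the host names are hypothetical, a connection attempt is modelled as a simple membership test, and the helper functions round_robin_attempts and shuffle_attempts are invented for this example.

```python
import itertools
import random


def round_robin_attempts(hosts, alive, start=0):
    """Count attempts when hosts are tried in cyclic order from `start`."""
    if not set(alive) & set(hosts):
        raise ValueError("at least one host must be alive")
    cycle = itertools.islice(itertools.cycle(hosts), start, None)
    for attempt, host in enumerate(cycle, 1):
        if host in alive:
            return attempt


def shuffle_attempts(hosts, alive, rng):
    """Count attempts when each try picks a host independently at random."""
    if not set(alive) & set(hosts):
        raise ValueError("at least one host must be alive")
    attempt = 0
    while True:
        attempt += 1
        if rng.choice(hosts) in alive:
            return attempt


hosts = ["rabbit1", "rabbit2", "rabbit3"]   # hypothetical host names, N = 3
alive = {"rabbit3"}                         # K = 1 host is still up

# Worst case for round-robin over every possible starting position.
worst_rr = max(round_robin_attempts(hosts, alive, start=s)
               for s in range(len(hosts)))

# Worst case seen for shuffle over many simulated failovers.
rng = random.Random(0)
worst_shuffle = max(shuffle_attempts(hosts, alive, rng) for _ in range(10_000))

print(worst_rr)        # 3, which equals N - K + 1
print(worst_shuffle)   # has no fixed upper bound; depends on the random draws
```

In this model round-robin never needs more than N - K + 1 attempts, while the worst case observed for shuffle keeps growing with the number of simulated failovers.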

Changed in oslo.messaging:
status: In Progress → Fix Committed
Changed in oslo.messaging:
milestone: none → 3.1.0
status: Fix Committed → Fix Released
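
For reference, selecting the strategy in a service configuration file might look like the fragment below. This is a sketch that assumes the option lives in the [oslo_messaging_rabbit] group together with the other Kombu driver options and that the RabbitMQ nodes are listed through rabbit_hosts. The host names are placeholders.

```ini
[oslo_messaging_rabbit]
# Placeholder host names; more than one node is needed for the option to matter.
rabbit_hosts = rabbit1:5672,rabbit2:5672,rabbit3:5672
# round-robin is the default after this change; shuffle restores the old behaviour.
kombu_failover_strategy = round-robin
```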
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (feature/pika)

Fix proposed to branch: feature/pika
Review: https://review.openstack.org/257373

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (feature/pika)

Reviewed: https://review.openstack.org/257373
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=cc0f8cc8a9ff25c9fb081cac5366c12a0c06ec53
Submitter: Jenkins
Branch: feature/pika

commit a5d78891745b6b9e5827271dc305f00acae1392f
Author: OpenStack Proposal Bot <email address hidden>
Date: Fri Dec 11 15:24:05 2015 +0000

    Updated from global requirements

    Change-Id: Ifd78016c067740477a82dbe06d74d5944ba91893

commit 17ccb2306d03a74304c57d31716a54ba2b3b4311
Author: Mehdi Abaakouk <email address hidden>
Date: Fri Dec 11 10:59:54 2015 +0100

    Move to debug a too verbose log

    When a client is gone (died/restart) and some replies cannot be sent because
    the exchange of this client will never come back. We log one message per
    reply every 0.25 messages during 60 seconds. When the only useful log
    is the one where we decide to drop these replies.

    This change moves the less important message to debug level.

    Change-Id: I508787c0db4dcec2c0027b89eb4e65c4f98022b9
    Related-bug: #1524418

commit 46daf858144202a072c4bf8580aeafec11d20e13
Author: Davanum Srinivas <email address hidden>
Date: Fri Dec 11 11:04:13 2015 +0300

    Cleanup parameter docstrings

    Change-Id: I301fdd51446bf0c0a6dd0d05b26da0556db8367d

commit 3ee86964fa460882d8fcac8686edd0e6bfb12008
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Dec 9 19:37:40 2015 +0100

    Revert "default of kombu_missing_consumer_retry_timeout"

    This reverts commit 8c03a6db6c0396099e7425834998da5478a1df7c.

    Closes-bug: #1524418
    Change-Id: I35538a6c15d6402272e4513bc1beaa537b0dd7b9

commit e72599435c59c09277a9da7686b32aa4f9df7ba4
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Dec 9 18:49:19 2015 +0100

    Don't trigger error_callback for known exc

    When AMQPDestinationNotFound is raised, we must not
    call the error_callback method. The exception is logged
    only if needed in upper layer (amqpdriver.py).

    Related-bug: #1524418

    Change-Id: Ic1ddec2d13172532dbaa572d04a4c22c97ac4fe7

commit 185693a6ed57e02b2f94b0fb8f14a91471605969
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Dec 9 11:23:52 2015 +0100

    Improves comment

    Change-Id: Idc8002e6d622435aac48304857985c0f82be3e32

commit 148e8380ce1cc4f60716300b95104aaa2cf8c543
Author: Mehdi Abaakouk <email address hidden>
Date: Fri Dec 4 14:57:03 2015 +0100

    Fix reconnection when heartbeat is missed

    When a heartbeat is missing we call ensure_connection()
    that runs a dummy method to trigger the reconnection
    code in kombu. But also the code is triggered only if the
    channel is None.

    In case of the heartbeat threads we didn't reset the channel
    before reconnecting, so the dummy method doesn't do anything.

    This change sets the channel to None to ensure the connection
    is reestablished before the dummy method is run.

    Also it replaces the dummy method by checking the kombu connection
    object. So we are sure the connection is reestablished.

    Change-Id: I39f8cd23c5a5498e6f4c1aa3236ed27f3b5d7c9a
    Closes-bug: #1493890

commit 05002...

tags: added: in-feature-pika
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/278462

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/liberty)

Reviewed: https://review.openstack.org/278462
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=26c85209af04c73246e1aa695a79bb45793fe6b4
Submitter: Jenkins
Branch: stable/liberty

commit 26c85209af04c73246e1aa695a79bb45793fe6b4
Author: Dmitry Mescheryakov <email address hidden>
Date: Mon Nov 23 17:27:24 2015 +0300

    Use round robin failover strategy for Kombu driver

    Shuffle strategy we use right now leads to increased reconnection time
    and provides no benefit. Sometimes it might lead to RPC operations
    timeout because the strategy provides no guarantee on how long the
    reconnection process will take. See the referenced bug for details.

    On the other side, round-robin strategy provides least achievable
    reconnection time. It also provides guarantee that if K of N RabbitMQ
    hosts are alive, it will take at most N - K + 1 attempts to
    successfully reconnect to RabbitMQ cluster.

    With shuffle strategy during failover clients connect to random hosts
    and so the load is distributed evenly between alive RabbitMQs.
    But since we shuffle list of hosts before providing it to Kombu, load
    will be distributed evenly with round-robin strategy as well.

    DocImpact
    A new configuration option kombu_failover_strategy for Kombu driver is
    added. It determines how the next RabbitMQ node is chosen in case the
    one we are currently connected to becomes unavailable. It takes effect
    only if more than one RabbitMQ node is provided in config. Available
    options are:

     * round-robin: each RabbitMQ host in the list is tried in cycle until
       oslo.messaging successfully connects. Since oslo.messaging
       shuffles list of RabbitMQ hosts, the order of hosts in the cycle
       will be random and will not depend on order provided in config.

     * shuffle: oslo.messaging selects a random host from the list and
       tries to connect to it. If connection fails, oslo.messaging repeats
       attempt to connect to another random host. Oslo.messaging stops
       once it successfully connects to a host. Note that in each
       iteration a host to connect is selected independently of previous
       iterations, i.e. it might happen that oslo.messaging will try to
       connect to the same host several times in a row.

    The option's default value is round-robin. Before the option was
    introduced, the default strategy was shuffle. For the reasoning,
    see the main body of the commit message and the referenced bug.

    Closes-Bug: #1519851
    Change-Id: I9a510c86bd5a6ce8b707734385af1a83de82804e
    (cherry picked from commit 6ae46796a61fc97467450b5bdd51dc6a0c86f9f4)

tags: added: in-stable-liberty