Nonoptimal failover strategy can lead to RPC timeout
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
oslo.messaging | Fix Released | Undecided | Dmitry Mescheryakov | | |
Bug Description
The 'shuffle' failover strategy we currently use in Kombu can lead to RPC timeouts. The strategy is set here:
https:/
Each time the current connection drops, the strategy picks a random host from all available hosts and tries to connect to it. The strategy is not 'fair' and might select the same host several times in a row. For example, here it took oslo.messaging 6 attempts to reconnect:
http://
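The behaviour described above can be sketched in plain Python. This is an illustration only, not Kombu's or oslo.messaging's actual code: `shuffle_like`, `round_robin`, `attempts_until_connected`, and the host names are hypothetical, and the round-robin order is shown purely for contrast with a strategy that never repeats a host before trying the others.

```python
import itertools
import random


def shuffle_like(hosts, rng):
    """Hypothetical model of a 'shuffle' strategy: every attempt picks a
    host uniformly at random, so the same host may be picked repeatedly."""
    hosts = list(hosts)
    while True:
        yield rng.choice(hosts)


def round_robin(hosts):
    """Hypothetical 'fair' strategy: cycle through the hosts in order."""
    return itertools.cycle(list(hosts))


def attempts_until_connected(strategy, alive):
    """Count attempts until the strategy yields a host that is alive."""
    for attempt, host in enumerate(strategy, start=1):
        if host in alive:
            return attempt


# Assumed setup: three brokers, only the last one is reachable.
hosts = ["rabbit-1", "rabbit-2", "rabbit-3"]
alive = {"rabbit-3"}

# Round-robin needs at most len(hosts) attempts.
rr_attempts = attempts_until_connected(round_robin(hosts), alive)

# Random selection has no upper bound; sample many reconnects to see the tail.
rng = random.Random(0)
shuffle_attempts = [
    attempts_until_connected(shuffle_like(hosts, rng), alive)
    for _ in range(10000)
]

print("round-robin attempts:", rr_attempts)
print("worst random-pick case:", max(shuffle_attempts))
```

With a cycling order the number of attempts is bounded by the number of hosts, while random selection with repetition has a long tail, which is the tail this report is concerned with.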
As a result, reconnecting might take a significant number of attempts. For instance, if 2 of 3 RabbitMQ nodes are down, the probability that it takes at least 12 attempts to reconnect successfully is (2/3)^11 ~ 1%, since the first 11 attempts must all pick a dead node. Each reconnect attempt takes around 5 seconds, so 12 attempts take a minute or more, which reaches the default RPC timeout. The pending RPC operations then time out.
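The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes each attempt independently picks one of the 3 hosts uniformly at random; the helper name `prob_at_least` is hypothetical, and the 5-second figure is the rough per-attempt cost quoted in this report.

```python
from fractions import Fraction


def prob_at_least(attempts, down, total):
    """Probability that reconnecting takes at least `attempts` tries,
    i.e. that the first `attempts - 1` picks all hit a dead host."""
    return Fraction(down, total) ** (attempts - 1)


p = prob_at_least(12, down=2, total=3)   # (2/3)^11
seconds_per_attempt = 5                   # rough figure from the report
total_seconds = 12 * seconds_per_attempt  # time spent on 12 attempts

print(f"P(at least 12 attempts) = {p} ~ {float(p):.2%}")
print(f"12 attempts take about {total_seconds} seconds")
```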
Changed in oslo.messaging: | |
assignee: | nobody → Dmitry Mescheryakov (dmitrymex) |
Changed in oslo.messaging: | |
milestone: | none → 3.1.0 |
status: | Fix Committed → Fix Released |
Fix proposed to branch: master
Review: https://review.openstack.org/249849