tl;dr: Adding this config seems to resolve the issue for me:
[oslo_messaging_rabbit]
kombu_reconnect_delay=0.1
long version:
I've been staring at [bdcf915e] off and on for several days, and it looks right to me, in theory. That section of code consists of rather a lot of nested timeouts, and this bug looks like a case of inner-loop timeouts firing before their outer-loop timeouts have a chance to.
In particular, I think the issue is in this scrap of kombu.connection._ensure_connection:
    def on_error(exc, intervals, retries, interval=0):
        round = self.completes_cycle(retries)
        if round:
            interval = next(intervals)
        if errback:
            errback(exc, interval)
        self.maybe_switch_next()  # select next host
        return interval if round else 0
If errback (a callback passed in by the oslo driver) raises an exception 100% of the time (as it seems to post-[bdcf915e]) then maybe_switch_next() is never reached and failover never happens. I can prevent that by ensuring that oslo_messaging_rabbit->kombu_reconnect_delay is less than ACK_REQUEUE_EVERY_SECONDS_MAX (which is now one of our max timeouts thanks to [bdcf915e]).
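To make the control-flow problem concrete, here is a minimal standalone sketch. It does not use kombu or oslo.messaging at all: FakeConnection, raising_errback, quiet_errback and attempt are hypothetical names I made up, and TimeoutError is just an assumed stand-in for whatever the oslo callback actually raises. The only thing it copies from the snippet above is the ordering (errback first, host switch second).

```python
import itertools


class FakeConnection:
    """Hypothetical stand-in for a kombu connection; not kombu's real class."""

    def __init__(self, hosts):
        self._hosts = itertools.cycle(hosts)
        self.host = next(self._hosts)
        self.switches = 0

    def maybe_switch_next(self):
        # Move on to the next configured broker.
        self.host = next(self._hosts)
        self.switches += 1

    def on_error(self, exc, interval, errback=None):
        # Same ordering as the quoted snippet: errback runs before failover.
        if errback:
            errback(exc, interval)
        self.maybe_switch_next()
        return interval


def raising_errback(exc, interval):
    # Assumption: models a callback whose own timeout has already expired.
    raise TimeoutError("errback gave up before failover")


def quiet_errback(exc, interval):
    # Models a callback that returns normally.
    pass


def attempt(conn, errback):
    """Simulate one failed connection attempt and report the selected host."""
    try:
        conn.on_error(ConnectionError("broker down"), 1, errback)
    except TimeoutError:
        pass
    return conn.host
```

With a raising callback the connection stays pinned to the first host no matter how many attempts are made; with a quiet callback it moves to the next one. That matches the symptom, though it says nothing about which timeout is actually winning the race in the real code.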
I'm not 100% convinced that this is the correct fix since it's easy to luck your way out of a timing bug, but it has the advantage of not requiring a package upgrade.
I also note that kombu_reconnect_delay is only used in one section of code, prefaced with:
# TODO(sileht): Check if this is useful since we
# use kombu for HA connection, the interval_step
# should sufficient, because the underlying kombu transport
# connection object freed.
...so maybe we can rip out that code and remove kombu_reconnect_delay entirely (which would also resolve the timeout contention).