Comment 45 for bug 1993149

Andrew Bogott (andrewbogott) wrote :

tl;dr: Adding this config seems to resolve the issue for me:

[oslo_messaging_rabbit]
kombu_reconnect_delay=0.1

long version:

I've been staring at [bdcf915e] off and on for several days, and it looks right to me, in theory. That section of code consists of rather a lot of nested timeouts, and this bug looks like a case of inner-loop timeouts firing before their outer-loop timeouts have a chance to.

In particular, I think the issue is in this scrap of kombu.connection._ensure_connection:

        def on_error(exc, intervals, retries, interval=0):
            round = self.completes_cycle(retries)
            if round:
                interval = next(intervals)
            if errback:
                errback(exc, interval)
            self.maybe_switch_next() # select next host

            return interval if round else 0

If errback (a callback passed in by the oslo driver) throws an exception 100% of the time (as it seems to post-[bdcf915e]), then maybe_switch_next() is never reached and failover never happens. I can prevent that by ensuring that oslo_messaging_rabbit->kombu_reconnect_delay is less than ACK_REQUEUE_EVERY_SECONDS_MAX (which is now one of our max timeouts thanks to [bdcf915e]).
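
To make the ordering problem concrete, here is a toy sketch of what I think is happening. It does not use kombu at all: ToyConnection, raising_errback, quiet_errback and hosts_tried are made-up names for illustration, and the always-raising callback is my assumption about the post-[bdcf915e] behaviour, not something I've confirmed in the driver.

```python
class ToyConnection:
    """Hypothetical stand-in for kombu's Connection; not the real class."""

    def __init__(self, hosts):
        self.hosts = hosts
        self.index = 0

    def maybe_switch_next(self):
        # Rotate to the next host in the failover list.
        self.index = (self.index + 1) % len(self.hosts)

    def on_error(self, exc, errback):
        # Same ordering as the quoted snippet: errback first, then failover.
        if errback:
            errback(exc, 0)
        self.maybe_switch_next()


def raising_errback(exc, interval):
    # Assumed behaviour: the driver's callback raises every time.
    raise TimeoutError("errback gave up")


def quiet_errback(exc, interval):
    # A callback that returns normally, so on_error runs to completion.
    pass


def hosts_tried(errback, attempts=3):
    """Return the host that each attempt was pointed at."""
    conn = ToyConnection(["rabbit1", "rabbit2", "rabbit3"])
    tried = []
    for _ in range(attempts):
        tried.append(conn.hosts[conn.index])
        try:
            conn.on_error(ConnectionError("down"), errback)
        except TimeoutError:
            pass  # the caller retries later, still on the same host
    return tried
```

With raising_errback every attempt stays on rabbit1, because the exception skips maybe_switch_next(); with quiet_errback the attempts rotate through all three hosts.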

I'm not 100% convinced that this is the correct fix, since it's easy to luck your way out of a timing bug, but it has the advantage of not requiring a package upgrade.

I also note that kombu_reconnect_delay is only used in one section of code, prefaced with:

            # TODO(sileht): Check if this is useful since we
            # use kombu for HA connection, the interval_step
            # should sufficient, because the underlying kombu transport
            # connection object freed.

...so maybe we can rip out that code and remove kombu_reconnect_delay entirely (which would also resolve the timeout contention).