RabbitMQ fails to synchronize exchanges under high load

Bug #1789177 reported by Oleg Bondarev on 2018-08-27
This bug affects 1 person
Affects: oslo.messaging
Status: Fix Released
Importance: Undecided
Assigned to: Oleg Bondarev

Bug Description

Input:
 - OpenStack Pike cluster with ~500 nodes
 - DVR enabled in neutron
 - Lots of messages

Scenario: failover of one rabbit node in a cluster

Issue: after the failed rabbit node comes back online, some RPC communication appears broken.
Logs from rabbit:

=ERROR REPORT==== 10-Aug-2018::17:24:37 ===
Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'

Investigation:
After the rabbit node comes back online, it immediately receives many new connections and, for some reason, fails to synchronize exchanges. The cluster had ~1600 exchanges, but on the recovered node the exchange count stays low and does not increase.

Workaround: let the recovered node synchronize all exchanges by blocking new connections with iptables rules for some time (30 sec) after the failed node comes back online.
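
The report does not give the exact iptables rules, so the following is only a rough sketch of what the workaround could look like. It assumes clients connect on AMQP port 5672 (the port seen in the rabbit log above) and that a plain REJECT rule in the INPUT chain is acceptable. The helper name `workaround_commands` is hypothetical. The script only prints the commands and does not run them:

```python
AMQP_PORT = 5672    # client port seen in the rabbit log above
HOLD_SECONDS = 30   # delay mentioned in the workaround


def workaround_commands(port=AMQP_PORT, hold_seconds=HOLD_SECONDS):
    """Build (but do not execute) shell commands for the workaround.

    Assumption: rejecting inbound TCP traffic to the AMQP port on the
    recovered node is enough to keep clients away while exchanges sync.
    """
    rule = ["INPUT", "-p", "tcp", "--dport", str(port), "-j", "REJECT"]
    block = " ".join(["iptables", "-I"] + rule)     # insert the reject rule
    unblock = " ".join(["iptables", "-D"] + rule)   # delete the same rule
    return [block, "sleep %d" % hold_seconds, unblock]


commands = workaround_commands()
for line in commands:
    print(line)
```

The block rule would need to be in place before the rabbit node starts accepting connections. Otherwise clients reconnect before the rule takes effect.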

Proposal: do not create new exchanges for direct messages; publish them through the default exchange instead. This also fixes the issue.

Is there a good reason for creating new exchanges for direct messages?
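
To illustrate why the proposal avoids the not_found error, here is a toy in-memory model of AMQP routing. It is not oslo.messaging or RabbitMQ code. It relies on one AMQP fact: every queue is implicitly bound to the default exchange (empty name) with a routing key equal to the queue name. The `ToyBroker` class and `try_publish` helper are hypothetical. The sketch also assumes the reply queue itself is present on the recovered node and that the routing key equals the queue name:

```python
class ToyBroker:
    """In-memory toy model of AMQP exchange routing (not a real broker)."""

    def __init__(self):
        self.queues = {}     # queue name -> list of delivered messages
        self.exchanges = {}  # exchange name -> {routing key: set of queue names}

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def declare_exchange(self, name):
        self.exchanges.setdefault(name, {})

    def bind(self, queue, exchange, routing_key):
        self.exchanges[exchange].setdefault(routing_key, set()).add(queue)

    def publish(self, exchange, routing_key, body):
        if exchange == "":
            # Default exchange: each queue is implicitly bound under its own name.
            if routing_key in self.queues:
                self.queues[routing_key].append(body)
            return
        if exchange not in self.exchanges:
            # Mirrors the channel exception in the rabbit log above.
            raise LookupError("not_found: no exchange '%s'" % exchange)
        for queue in self.exchanges[exchange].get(routing_key, ()):
            self.queues[queue].append(body)


def try_publish(broker, exchange, routing_key, body):
    """Publish and return 'ok', or the error text if the publish fails."""
    try:
        broker.publish(exchange, routing_key, body)
        return "ok"
    except LookupError as exc:
        return str(exc)


reply_q = "reply_5675d7991b4a4fb7af5d239f4decb19f"

# Recovered node: the queue exists, but the per-reply exchange was never synced.
node = ToyBroker()
node.declare_queue(reply_q)

old_style = try_publish(node, reply_q, reply_q, "reply")  # named exchange
new_style = try_publish(node, "", reply_q, "reply")       # default exchange

print(old_style)
print(new_style)
```

In this model, publishing to the named exchange fails when that exchange is missing, while publishing through the default exchange depends only on the queue existing.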

Changed in oslo.messaging:
assignee: nobody → Oleg Bondarev (obondarev)

Fix proposed to branch: master
Review: https://review.openstack.org/596661

Changed in oslo.messaging:
status: New → In Progress

Reviewed: https://review.openstack.org/596661
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=3a5de89dd686dbd9660f140fdddd9c78b20e1632
Submitter: Zuul
Branch: master

commit 3a5de89dd686dbd9660f140fdddd9c78b20e1632
Author: Oleg Bondarev <email address hidden>
Date: Mon Aug 27 12:18:58 2018 +0400

    Use default exchange for direct messaging

    Lots of exchanges create problems during failover under high
    load. Please see bug report for details.

    This is step 1 in the process: only using default exchange
    when publishing. Consumers will still consume on separate
    exchanges (and on default exchange by default) so this
    should be (and tested to be) a non-breaking and
    upgrade-friendly change.

    Step 2 is to update consumers to only listen on default exchange,
    to happen in T release.

    Change-Id: Id3603f4b7e1274b616d76e1c0c009d2ab7f6efb6
    Closes-Bug: #1789177

Changed in oslo.messaging:
status: In Progress → Fix Released

This issue was fixed in the openstack/oslo.messaging 9.2.0 release.
