RabbitMQ fails to synchronize exchanges under high load (Note for ubuntu: stein, rocky, queens(bionic) changes only fix compatibility with fully patched releases)

Bug #1789177 reported by Oleg Bondarev
Affects                         Status        Importance  Assigned to        Milestone
Ubuntu Cloud Archive            Invalid       Undecided   Unassigned
  Mitaka                        Triaged       Medium      Seyeong Kim
  Queens                        Fix Released  Medium      Seyeong Kim
  Rocky                         Fix Released  Medium      Chris MacNaughton
  Stein                         Fix Released  Medium      Unassigned
  Train                         Fix Released  Undecided   Unassigned
oslo.messaging                  Fix Released  Undecided   Oleg Bondarev
python-oslo.messaging (Ubuntu)  Fix Released  Medium      Unassigned
  Xenial                        Invalid       Medium      Seyeong Kim
  Bionic                        Fix Released  Medium      Seyeong Kim

Bug Description

[Impact]

If there are many exchanges and queues, rabbitmq-server reports errors after a failover saying that exchanges cannot be found.

Affected: Bionic (Queens)
Not affected: Focal

[Test Case]

1. deploy a simple rabbitmq cluster
- https://pastebin.ubuntu.com/p/MR76VbMwY5/
2. juju ssh neutron-gateway/0
- for i in {1..1000}; do systemctl restart neutron-metering-agent; sleep 2; done
3. reproduction is easier with more exchanges, queues, and bindings
- rabbitmq-plugins enable rabbitmq_management
- rabbitmqctl add_user test password
- rabbitmqctl set_user_tags test administrator
- rabbitmqctl set_permissions -p openstack test ".*" ".*" ".*"
- https://pastebin.ubuntu.com/p/brw7rSXD7q/ (save this as create.sh) [1]
- for i in {1..2000}; do ./create.sh test_$i; done

4. restart the rabbitmq-server service, or shut the machine down and power it on, several times.
5. the "exchange not found" error appears in the rabbitmq logs

[1] create.sh (pasting here because pastebins don't last forever)
#!/bin/bash

rabbitmqadmin declare exchange -V openstack name=$1 type=direct -u test -p password
rabbitmqadmin declare queue -V openstack name=$1 durable=false -u test -p password 'arguments={"x-expires":1800000}'
rabbitmqadmin -V openstack declare binding source=$1 destination_type="queue" destination=$1 routing_key="" -u test -p password

[Where problems could occur]
1. Every service which uses oslo.messaging needs to be restarted.
2. Message delivery could be an issue during the transition.

[Others]

Possible Workaround

1. For the "exchange not found" issue:
- create the exchange, queue, and binding for the problematic name shown in the log
- then restart the rabbitmq-server nodes one by one

2. For a queue that crashed and failed to restart:
- delete the specific queue named in the log

// original description

Input:
 - OpenStack Pike cluster with ~500 nodes
 - DVR enabled in neutron
 - Lots of messages

Scenario: failover of one rabbit node in a cluster

Issue: after failed rabbit node gets back online some rpc communications appear broken
Logs from rabbit:

=ERROR REPORT==== 10-Aug-2018::17:24:37 ===
Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'

Investigation:
After the rabbit node gets back online it immediately receives many new connections and, for some reason, fails to synchronize exchanges (the cluster had ~1600 exchanges; on that node the count stays low and does not increase).

Workaround: let the recovered node synchronize all exchanges first by forbidding new connections with iptables rules for some time (30 sec) after the failed node comes back online.

Proposal: do not create new exchanges for direct messages (use the default exchange instead); this also fixes the issue.

Is there a good reason for creating new exchanges for direct messages?
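The proposed change can be sketched as follows. This is a minimal illustration of the addressing difference only, not oslo.messaging's actual API; the function and its name are hypothetical:

```python
def publish_target(queue, use_default_exchange=True):
    """Return the (exchange, routing_key) pair for a direct message.

    Hypothetical sketch of the proposed addressing change: the AMQP
    default exchange has the empty string as its name and routes a
    message to the queue whose name equals the routing key, so it
    never has to be declared and cannot be "not found" after failover.
    """
    if use_default_exchange:
        return ("", queue)    # proposed: publish via the default exchange
    return (queue, queue)     # old: a dedicated per-queue direct exchange

# Publishing a reply the old way requires the 'reply_...' exchange to
# exist on the broker at publish time; the new way only needs the queue.
print(publish_target("reply_5675d7991b4a4fb7af5d239f4decb19f"))
# → ('', 'reply_5675d7991b4a4fb7af5d239f4decb19f')
```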

Changed in oslo.messaging:
assignee: nobody → Oleg Bondarev (obondarev)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/596661

Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/596661
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=3a5de89dd686dbd9660f140fdddd9c78b20e1632
Submitter: Zuul
Branch: master

commit 3a5de89dd686dbd9660f140fdddd9c78b20e1632
Author: Oleg Bondarev <email address hidden>
Date: Mon Aug 27 12:18:58 2018 +0400

    Use default exchange for direct messaging

    Lots of exchanges create problems during failover under high
    load. Please see bug report for details.

    This is step 1 in the process: only using default exchange
    when publishing. Consumers will still consume on separate
    exchanges (and on default exchange by default) so this
    should be (and tested to be) a non-breaking and
    upgrade-friendly change.

    Step 2 is to update consumers to only listen on default exchange,
    to happen in T release.

    Change-Id: Id3603f4b7e1274b616d76e1c0c009d2ab7f6efb6
    Closes-Bug: #1789177

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 9.2.0

This issue was fixed in the openstack/oslo.messaging 9.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.opendev.org/669158

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/669158
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=6fe1aec1c74f112db297cd727d2ea400a292b038
Submitter: Zuul
Branch: master

commit 6fe1aec1c74f112db297cd727d2ea400a292b038
Author: Oleg Bondarev <email address hidden>
Date: Thu Jul 4 16:08:45 2019 +0400

    Use default exchange for direct messaging

    Lots of exchanges create problems during failover under high
    load. Please see bug report for details.

    This is a step 2 patch.

    Step 1 was: only using default exchange
    when publishing.
    Step 2 is to update consumers to only listen on default exchange,
    happening now in T release.

    Change-Id: Ib2ba62a642e6ce45c23568daeef9703a647707f3
    Closes-Bug: #1789177

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 10.0.0

This issue was fixed in the openstack/oslo.messaging 10.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713153

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/rocky)

Reviewed: https://review.opendev.org/713153
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=b67a457e4fa71e9220c149087dce013c7f81144f
Submitter: Zuul
Branch: stable/rocky

commit b67a457e4fa71e9220c149087dce013c7f81144f
Author: Oleg Bondarev <email address hidden>
Date: Mon Aug 27 12:18:58 2018 +0400

    Use default exchange for direct messaging

    Lots of exchanges create problems during failover under high
    load. Please see bug report for details.

    This is step 1 in the process: only using default exchange
    when publishing. Consumers will still consume on separate
    exchanges (and on default exchange by default) so this
    should be (and tested to be) a non-breaking and
    upgrade-friendly change.

    Step 2 is to update consumers to only listen on default exchange,
    to happen in T release.

    Change-Id: Id3603f4b7e1274b616d76e1c0c009d2ab7f6efb6
    Closes-Bug: #1789177
    (cherry picked from commit 3a5de89dd686dbd9660f140fdddd9c78b20e1632)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/728396

Revision history for this message
norman shen (jshen28) wrote : Re: RabbitMQ fails to synchronize exchanges under high load

Hello, I am also hit by this error log. But I see no exchange inconsistency between the rabbit nodes, and I cannot find the reply_** exchange on any rabbit node.

If I read the code correctly, every time this piece of code is called, https://github.com/openstack/oslo.messaging/blob/e44c9883066d9b2d081a594b97aac3d598d491c9/oslo_messaging/_drivers/amqpdriver.py#L323, it makes sure the queue/exchange/bindings are redeclared if gone, but I still cannot find the reply_** queue.

I am now wondering whether the polling thread stops working for some reason, so that the exchange is never redeclared even after rabbit comes back online.

Revision history for this message
norman shen (jshen28) wrote :

The collected metrics are attached.

Revision history for this message
norman shen (jshen28) wrote :

Besides, I wonder whether this patchset can really solve the problem, because I notice that neither the exchange nor the queue can be found in rabbit.

Revision history for this message
zgjun (zgjun) wrote :

I met the same issue. I have merged this patchset; in the same case where the queue can be found in rabbit, the consumer really receives the call message and replies to it, but the publisher waits for the call reply until it times out, without any log output.

So I think we should make sure the reply queue exists (redeclare it if necessary) before sending, or after a wait timeout.

Revision history for this message
norman shen (jshen28) wrote :

Yes, this is also what I think will happen. I think the current patchset does not address the root cause.

I did a little debugging and found that when the problem happens, the connection and channels seem to be fine, but the consumers are gone, and on rabbitmq both the exchanges and queues are gone. And since the channel does not change, the queues and exchanges are never declared again, which is problematic.

I am using rabbitmq 3.7.4 with min-masters enabled; I can trigger the problem quite reliably by using tc qdisc to add delay (100ms, +/- 10ms) on the rabbitmq cluster network interface.

Revision history for this message
norman shen (jshen28) wrote :

Here are some debug logs; note I modified the code a little to print the connection name and socket info for the consume loop used by the reply thread.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/739175

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/739175
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=196fa877a90d7eb0f82ec9e1c194eef3f98fc0b1
Submitter: Zuul
Branch: master

commit 196fa877a90d7eb0f82ec9e1c194eef3f98fc0b1
Author: shenjiatong <email address hidden>
Date: Fri Jul 3 15:51:21 2020 +0800

    Cancel consumer if queue down

    Previously, we have switched to use default exchanges
    to avoid excessive amounts of exchange not found messages.
    But it does not actually solve the problem because
    reply_* queue is already gone and agent will not receive callbacks.

    after some debugging, I found under some circumstances
    seems rabbitmq consumer does not receive basic cancel
    signal when queue is already gone. This might due to
    rabbitmq try to restart consumer when queue is down
    (for example when split brain). In such cases,
    it might be better to fail early.

    by reading the code, seems like x-cancel-on-ha-failover
    is not dedicated to mirror queues only, https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_channel.erl#L1894,
    https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_channel.erl#L1926.

    By failing early, in my own test setup,
    I could solve a certain case of exchange not found problem.

    Change-Id: I2ae53340783e4044dab58035bc0992dc08145b53
    Related-bug: #1789177
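The consumer-side behaviour the commit describes can be sketched like this. It is a hedged illustration of passing the x-cancel-on-ha-failover consumer argument to the broker; the helper function below is hypothetical, not oslo.messaging's code:

```python
def consumer_arguments(cancel_on_failover=True):
    """Build the AMQP arguments dict for a queue consumer.

    With x-cancel-on-ha-failover set, RabbitMQ sends basic.cancel to
    the consumer when the queue it consumes from fails over, so the
    client can fail fast and redeclare instead of silently consuming
    from a queue that no longer exists. (Helper name is hypothetical.)
    """
    args = {}
    if cancel_on_failover:
        args["x-cancel-on-ha-failover"] = True
    return args

print(consumer_arguments())
# → {'x-cancel-on-ha-failover': True}
```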

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/747366

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/747892

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/ussuri)

Reviewed: https://review.opendev.org/747366
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=0a432c7fb107d04f7a41199fe9a8c4fbd344d009
Submitter: Zuul
Branch: stable/ussuri

commit 0a432c7fb107d04f7a41199fe9a8c4fbd344d009
Author: shenjiatong <email address hidden>
Date: Fri Jul 3 15:51:21 2020 +0800

    Cancel consumer if queue down

    Previously, we have switched to use default exchanges
    to avoid excessive amounts of exchange not found messages.
    But it does not actually solve the problem because
    reply_* queue is already gone and agent will not receive callbacks.

    after some debugging, I found under some circumstances
    seems rabbitmq consumer does not receive basic cancel
    signal when queue is already gone. This might due to
    rabbitmq try to restart consumer when queue is down
    (for example when split brain). In such cases,
    it might be better to fail early.

    by reading the code, seems like x-cancel-on-ha-failover
    is not dedicated to mirror queues only, https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_channel.erl#L1894,
    https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_channel.erl#L1926.

    By failing early, in my own test setup,
    I could solve a certain case of exchange not found problem.

    Change-Id: I2ae53340783e4044dab58035bc0992dc08145b53
    Related-bug: #1789177
    (cherry picked from commit 196fa877a90d7eb0f82ec9e1c194eef3f98fc0b1)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/749193

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/749194

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/749196

Seyeong Kim (seyeongkim)
Changed in python-oslo.messaging (Ubuntu):
assignee: nobody → Seyeong Kim (seyeongkim)
Seyeong Kim (seyeongkim)
tags: added: sts
description: updated
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "lp1789177_bionic.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Seyeong Kim (seyeongkim)
description: updated
Revision history for this message
Mathew Hodson (mhodson) wrote :

This was fixed in 12.3.0
---

python-oslo.messaging (12.3.0-0ubuntu1) groovy; urgency=medium

  [ Chris MacNaughton ]
  * New upstream release for OpenStack Victoria.
  * d/control: Align (Build-)Depends with upstream.
  * d/p/no-functional-test.patch: Refreshed.

  [ Corey Bryant ]
  * d/control: Restore min versions of python3-eventlet and python3-tenacity.

 -- Corey Bryant <email address hidden> Thu, 03 Sep 2020 08:59:27 -0400

Changed in python-oslo.messaging (Ubuntu):
status: New → Fix Released
Mathew Hodson (mhodson)
Changed in python-oslo.messaging (Ubuntu):
importance: Undecided → Medium
Changed in python-oslo.messaging (Ubuntu Bionic):
importance: Undecided → Medium
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

I can't reproduce this symptom on Focal, even though it has 12.1.0,
which doesn't include commit 0a432c7fb107d04f7a41199fe9a8c4fbd344d009.

I think xenial needs the fix as well; I can reproduce this on xenial,
and I'm preparing a debdiff for xenial too.

Mathew Hodson (mhodson)
Changed in python-oslo.messaging (Ubuntu Xenial):
importance: Undecided → Medium
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

For Stein and Train,
commit 3a5de89dd686dbd9660f140fdddd9c78b20e1632 is already present,
but 6fe1aec1c74f112db297cd727d2ea400a292b038 is not.

I think we need to fix both releases as well;
one fix alone cannot solve this issue.

Also, Train's functional test has already been removed,
but Stein's has not.

Seyeong Kim (seyeongkim)
Changed in python-oslo.messaging (Ubuntu Xenial):
status: New → In Progress
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in python-oslo.messaging (Ubuntu Bionic):
status: New → In Progress
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in python-oslo.messaging (Ubuntu):
assignee: Seyeong Kim (seyeongkim) → nobody
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote : Please test proposed package

Hello Oleg, or anyone else affected,

Accepted python-oslo.messaging into train-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:train-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-train-needed to verification-train-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-train-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-train-needed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Hello Oleg, or anyone else affected,

Accepted python-oslo.messaging into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-stein-needed
Revision history for this message
Seyeong Kim (seyeongkim) wrote : Re: RabbitMQ fails to synchronize exchanges under high load

Verification for Stein is done.

ii python3-oslo.messaging 9.5.0-0ubuntu1~cloud1

verification steps
1. reproduce this issue
2. update all python3-oslo.messaging in test env
3. restart rabbitmq-server

All channel issues are gone.

tags: added: verification-stein-done
removed: verification-stein-needed
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Verification for Train is done.

ii python3-oslo.messaging 9.7.1-0ubuntu3~cloud1 all oslo messaging library - Python 3.x

verification steps
1. reproduce this issue
2. update all python3-oslo.messaging in test env
3. restart rabbitmq-server

All channel issues are gone.

tags: added: verification-train-done
removed: verification-train-needed
Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Oleg, or anyone else affected,

Accepted python-oslo.messaging into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.messaging/5.35.0-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in python-oslo.messaging (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
tags: added: verification-queens-needed
Seyeong Kim (seyeongkim)
tags: added: verification-done-bionic
removed: verification-needed-bionic
Seyeong Kim (seyeongkim)
tags: added: verification-queens-done
removed: verification-queens-needed
Mathew Hodson (mhodson)
tags: removed: verification-needed
Changed in python-oslo.messaging (Ubuntu Bionic):
status: Fix Committed → Fix Released
tags: added: verification-queens-failed
removed: verification-queens-done
description: updated
Revision history for this message
Edward Hope-Morley (hopem) wrote : Re: RabbitMQ fails to synchronize exchanges under high load

@seyeongkim the problem here is that we can't make a change in a stable release that requires a maintenance window to upgrade, since there will be environments that are not aware of this, e.g. unattended-upgrades, which will break when they upgrade. I think the safest action we can take here is to cancel the xenial-proposed SRU and revert the bionic-updates patch.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This fix is being reverted fully in queens, rocky, and stein [1]. In order to fast-track the revert it was decided with the SRU team to be best to revert all patches. We can then consider adding patch 1 back as a new SRU following the revert and give that more time for testing.

[1] https://bugs.launchpad.net/oslo.messaging/+bug/1914437

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

I tested the following (the same scenario I tested before):

0. deploy the test env
- 5.35.0-0ubuntu1~cloud0
1. upgrade oslo.messaging on n-ovs
- 5.35.0-0ubuntu2~cloud0 (from the queens-staging launchpad)
2. I got errors
3. upgrade it to the new one
- 5.35.0-0ubuntu3~cloud0

It worked fine for me.

I'm trying to reproduce the original issue, as I want to test the 3rd commit only (reproduction takes time).

I remember that the 1st commit alone did not solve the original issue in my test.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've updated the triage status back to New for releases that were reverted.

Changed in python-oslo.messaging (Ubuntu Bionic):
status: Fix Released → New
Changed in cloud-archive:
status: New → Invalid
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Seyeong, if the first patch helps fix this bug, we can still add it for queens->stein. We were considering keeping that patch but had to revert all of the patches in order to expedite the release with minimum regression potential and testing needed.

Changed in python-oslo.messaging (Ubuntu Bionic):
status: New → Triaged
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've uploaded new package versions for this bug to the bionic unapproved queue and the rocky-staging ppa. These package versions only include the single patch to switch to the default exchange when publishing.

We have a scenario where queens and train need to communicate but can't, because train only consumes from the default exchange. These package updates will fix that scenario. This should also be useful in reducing the impact of the bug reported here, where RabbitMQ fails to synchronize exchanges under high load.

Note: The single patch to switch to the default exchange when publishing is already in stein so I'm going to mark stein as Fix Released.

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Oleg, or anyone else affected,

Accepted python-oslo.messaging into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
Revision history for this message
Seyeong Kim (seyeongkim) wrote : Re: RabbitMQ fails to synchronize exchanges under high load

I've confirmed that the 1st patch alone is OK with the steps below:

1. deploy queens
2. patch the neutron nodes' oslo.messaging (1st patch only), except the nova-compute node's oslo.messaging
3. try to create and delete an instance

I also kept restarting cinder-scheduler while blocking one rabbitmq-server with iptables -A INPUT -p tcp --dport 5672 -j DROP.

Eventually I saw no exchange errors for cinder.

I'm going to prepare a debdiff with the 1st and 3rd commits for this patch today.

Thanks.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

ah sorry corey you already uploaded it to bionic as well. thanks

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

1. deploy rocky
2. install the updated oslo.messaging pkg on the nodes below
- neutron-api
- neutron-gateway
- nova-compute
- - restarted openvswitch-agent only
3. try to reproduce with the config below
- created 3000 test queues, exchanges, and bindings
- juju config rabbitmq-server min-cluster-size=1
- juju config rabbitmq-server connection-backlog=200 (to make all rabbitmq-servers restart)
- shut down one of the rabbitmq-server nodes with the MAAS controller
- power it on with the MAAS controller

I am able to see the "channel not found" error for nova, and for neutron-openvswitch-agent on the nova-compute node.
The neutron-openvswitch-agent on the nova-compute node was patched, but rabbitmq-server still shows the channel not found error.

However, I can't launch and delete instances in this environment.

I'm not sure what to conclude from this result.
Also, reproduction itself is quite hard; it took a lot of time to look for regular behavior, and I'm not sure there is any.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

After restarting all rabbitmq-servers, the status is stable.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

On the 2nd try, I also faced the same error with patched components, not only openvswitch-agent.

I'm going to try to reproduce with the 1st and 3rd commits plus the manual configuration (enable_cancel_on_failover).

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Testing the 1st and 3rd commits with the manual configuration enable_cancel_on_failover = True:

I did similar steps to the above with Queens (as I had already made a PPA for this),

and in this case I see different errors; restarting rabbitmq-server cleared the error messages:

=ERROR REPORT==== 24-Feb-2021::08:07:46 ===
Channel error on connection <0.23680.14> (10.0.0.36:50874 -> 10.0.0.22:5672, vhost: 'openstack', user: 'neutron'), channel 1:
{amqp_error,not_found,
            "queue 'q-l3-plugin_fanout_81f1be30ba514e1189e4c08e1d99a7d0' in vhost 'openstack' has crashed and failed to restart",
            'queue.declare'}

=ERROR REPORT==== 24-Feb-2021::08:07:46 ===
Channel error on connection <0.23680.14> (10.0.0.36:50874 -> 10.0.0.22:5672, vhost: 'openstack', user: 'neutron'), channel 1:
{amqp_error,not_found,
            "queue 'q-l3-plugin_fanout_81f1be30ba514e1189e4c08e1d99a7d0' in vhost 'openstack' has crashed and failed to restart",
            'queue.declare'}
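For reference, the manual enable_cancel_on_failover setting mentioned above is an oslo.messaging rabbit driver option set in each service's configuration file, along these lines (a sketch; confirm the option exists in your oslo.messaging release before relying on it):

```ini
[oslo_messaging_rabbit]
# Ask RabbitMQ to cancel consumers when their queue fails over
# (maps to the x-cancel-on-ha-failover consumer argument).
enable_cancel_on_failover = true
```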

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Testing with only the 1st patch didn't work; I saw the same error as in the description of this LP.
Testing with the 1st and 3rd patches plus the manual configuration (enable_cancel_on_failover = True) showed me a different error
(mentioned above).

The different error happens less often than I assumed.

So I think this can be the next action, though it is not perfect:
1. patch the 1st and 3rd commits,
2. patch the charms (to set enable_cancel_on_failover),
3. then handle the different error in a different LP bug (if there is one).
(The queens and bionic patches above contain commits #1 and #3.)

Please give some advice if you have any ideas.

Thanks.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi Seyeong,

Thanks for testing. I was under the impression that the 3rd patch was dependent on the 2nd patch, since they both deal with the consumer side.

What do you think about moving forward with just patch 1? Unfortunately it doesn't reduce the impact of the original bug reported here (based on your testing), but it does fix the incompatibility between queens and train, which otherwise can't communicate because train only consumes from the default exchange.

Thanks,
Corey

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Hello Corey

That makes sense to me as well.

Thanks

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Ok, reviewing the change. But a question needs answering before this is accepted: given comments #80 and #81, does this single patch actually fix the failure to synchronize exchanges under high load? If not, then we need to adjust the description. What does the patch address in its current state?

Revision history for this message
Liam Young (gnuoy) wrote :

I have tested the rocky scenario that was failing for me: Trilio on Train + OpenStack on Rocky. The Trilio functional test to snapshot a server failed without the fix, and passed once python3-oslo.messaging 8.1.0-0ubuntu1~cloud2.2 was installed and services restarted.

tags: added: verification-rocky-done
removed: verification-rocky-needed
Seyeong Kim (seyeongkim)
description: updated
Revision history for this message
Corey Bryant (corey.bryant) wrote :

@Łukasz, it's a little awkward. The single patch does not fix the failure to synchronize exchanges under high load (based on Seyeong's testing) however it does fix compatibility with releases that have been fully patched. I've updated the description, hopefully that helps a bit to clear this up.

summary: - RabbitMQ fails to synchronize exchanges under high load
+ RabbitMQ fails to synchronize exchanges under high load (Note for
+ ubuntu: stein, rocky, queens(bionic) changes only fix compatibility with
+ fully patched releases)
Revision history for this message
Corey Bryant (corey.bryant) wrote : Update Released

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package python-oslo.messaging - 8.1.0-0ubuntu1~cloud2.2
---------------

 python-oslo.messaging (8.1.0-0ubuntu1~cloud2.2) bionic-rocky; urgency=medium
 .
   [Seyeong Kim]
   * Fix RabbitMQ fails to synchronize exchanges under high load (LP: #1789177)
     - d/p/0001-Use-default-exchange-for-direct-messaging.patch

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/787384

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Oleg, or anyone else affected,

Accepted python-oslo.messaging into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.messaging/5.35.0-0ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in python-oslo.messaging (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic
removed: verification-done-bionic
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Oleg, or anyone else affected,

Accepted python-oslo.messaging into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
removed: verification-queens-failed
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Sorry for being late; I'll verify this soon.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Tested the package in queens.

Test steps are below:

1. deploy queens env
2. upgrade python-oslo.messaging on nova-compute/0
3. restart neutron-openvswitch-agent (only)
4. check logs: no errors
5. launch an instance to confirm it works: no errors

ii python-oslo.messaging 5.35.0-0ubuntu4~cloud0 all oslo messaging library - Python 2.x

tags: added: verification-queens-done
removed: verification-queens-needed
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Tested the package in bionic.

Steps are below (the same as for queens above):

1. deploy bionic env
2. upgrade python-oslo.messaging on nova-compute/0
3. restart neutron-openvswitch-agent (only)
4. check logs: no errors
5. launch an instance to confirm it works: no errors

ii python-oslo.messaging 5.35.0-0ubuntu4 all oslo messaging library - Python 2.x

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-oslo.messaging - 5.35.0-0ubuntu4

---------------
python-oslo.messaging (5.35.0-0ubuntu4) bionic; urgency=medium

  [Seyeong Kim]
  * Fix RabbitMQ fails to synchronize exchanges under high load (LP: #1789177)
    - d/p/0001-Use-default-exchange-for-direct-messaging.patch

 -- Corey Bryant <email address hidden> Tue, 23 Feb 2021 09:53:33 -0500

Changed in python-oslo.messaging (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
James Page (james-page) wrote : Update Released

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

tags: added: verification-done
removed: verification-needed
Revision history for this message
James Page (james-page) wrote :

This bug was fixed in the package python-oslo.messaging - 5.35.0-0ubuntu4~cloud0
---------------

 python-oslo.messaging (5.35.0-0ubuntu4~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 python-oslo.messaging (5.35.0-0ubuntu4) bionic; urgency=medium
 .
   [Seyeong Kim]
   * Fix RabbitMQ fails to synchronize exchanges under high load (LP: #1789177)
     - d/p/0001-Use-default-exchange-for-direct-messaging.patch

Changed in python-oslo.messaging (Ubuntu Xenial):
status: In Progress → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/rocky)

Change abandoned by "Hervé Beraud <email address hidden>" on branch: stable/rocky
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/749194
Reason: rocky is now unmaintained

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/pike)

Change abandoned by "Hervé Beraud <email address hidden>" on branch: stable/pike
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/728396
Reason: stable/pike is no longer maintained (https://releases.openstack.org/). Thanks for your understanding

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/queens)

Change abandoned by "Hervé Beraud <email address hidden>" on branch: stable/queens
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/749196
Reason: Queens is no longer maintained (https://releases.openstack.org/).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/stein)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/749193
Committed: https://opendev.org/openstack/oslo.messaging/commit/b2acc6663f6c3f60e07cdeb1eae97fd1210a4d81
Submitter: "Zuul (22348)"
Branch: stable/stein

commit b2acc6663f6c3f60e07cdeb1eae97fd1210a4d81
Author: shenjiatong <email address hidden>
Date: Fri Jul 3 15:51:21 2020 +0800

    Cancel consumer if queue down

    Previously, we switched to using the default exchange
    to avoid excessive amounts of 'exchange not found' messages.
    But that does not actually solve the problem, because
    the reply_* queue is already gone and the agent will not receive callbacks.

    After some debugging, I found that under some circumstances
    the rabbitmq consumer does not receive a basic.cancel
    signal when the queue is already gone. This might be due to
    rabbitmq trying to restart the consumer when the queue is down
    (for example during split brain). In such cases,
    it is better to fail early.

    Reading the code, it seems x-cancel-on-ha-failover
    is not dedicated to mirrored queues only: https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_channel.erl#L1894,
    https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_channel.erl#L1926.

    By failing early, in my own test setup,
    I could solve a certain case of the 'exchange not found' problem.

    Change-Id: I2ae53340783e4044dab58035bc0992dc08145b53
    Related-bug: #1789177
    Depends-On: https://review.opendev.org/#/c/747892/
    (cherry picked from commit 196fa877a90d7eb0f82ec9e1c194eef3f98fc0b1)
    (cherry picked from commit 0a432c7fb107d04f7a41199fe9a8c4fbd344d009)
    (cherry picked from commit 5de11fa752ab8e37b95b1785f4c71210bf473f0c)
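The behaviour described in the commit can be sketched as follows. This is a simplified illustration, not oslo.messaging's actual code: `consumer_arguments` is a hypothetical helper, and the commented pika call assumes a live broker.

```python
# Sketch of the consumer arguments that enable_cancel_on_failover adds.
# With x-cancel-on-ha-failover set, RabbitMQ sends basic.cancel to the
# consumer when its queue fails over, so the client fails early and can
# re-declare, instead of consuming from a queue that no longer exists.
def consumer_arguments(cancel_on_failover: bool) -> dict:
    args = {}
    if cancel_on_failover:
        # Per the commit message above, this argument is honoured for
        # any queue, not only mirrored (HA) ones.
        args["x-cancel-on-ha-failover"] = True
    return args

# Hypothetical usage with pika (requires a running RabbitMQ broker):
#   channel.basic_consume(queue="reply_...",
#                         on_message_callback=on_reply,
#                         arguments=consumer_arguments(True))
```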

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/pike)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/pike
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/728396
Reason: stable/pike has transitioned to End of Life for oslo, open patches need to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/queens)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/749196
Reason: This branch of this project has transitioned to End of Life, open patches need to be abandoned to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/787384
Reason: This branch of this project has transitioned to End of Life, open patches need to be abandoned to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging rocky-eol

This issue was fixed in the openstack/oslo.messaging rocky-eol release.

Displaying the first 40 and last 40 of 104 comments.