oslo_messaging kombu strategy round-robin not working correctly

Bug #2019978 reported by Michal Arbet
This bug affects 1 person
Affects          Status   Importance   Assigned to   Milestone
oslo.messaging   New      Undecided    Unassigned

Bug Description

Hi,

We were running HA tests against our OpenStack cluster to see what happens when one node of a 3-node RabbitMQ cluster is turned off. We found that the default oslo.messaging setting kombu_failover_strategy = round-robin, introduced in https://github.com/openstack/oslo.messaging/commit/6ae46796a61fc97467450b5bdd51dc6a0c86f9f4, is probably not working as expected.

We turned off 10.157.106.71 and the clients didn't reconnect. When I grepped the logs for retry attempts per RabbitMQ server, I found that the clients keep retrying the host that is turned off.

root@controller0:/home/ubuntu# grep -Ri '2023-05-17.*Trying again in' /var/log/kolla | awk '{print $11}' | sort | uniq -c
      5 10.157.106.136:5672
     12 10.157.106.6:5672
  50381 10.157.106.71:5672

root@controller13:/home/ubuntu# grep -Ri '2023-05-17.*Trying again in' /var/log/kolla | awk '{print $11}' | sort | uniq -c
      2 -]
      6 10.157.106.136:5672
      4 10.157.106.6:5672
  41996 10.157.106.71:5672

I also checked TCP SYN via netstat, and every time the client was trying to connect to the RabbitMQ server that was down.
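
The awk field position used in the pipelines above depends on the exact log layout, which is probably why controller13 also reports a stray "-]" entry. A minimal Python sketch of the same tally that matches the host:port pair by pattern instead of by field number is shown below. The function name, the regular expression, and the sample log lines are assumptions made for illustration; the real log format may differ.

```python
import re
from collections import Counter

# Matches an IPv4 host:port pair such as 10.157.106.71:5672.
HOST_PORT = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}:\d+)\b")


def count_retry_targets(lines, marker="Trying again in"):
    """Count how often each host:port appears on reconnect-retry log lines."""
    counts = Counter()
    for line in lines:
        # Only consider lines that report a retry (case-insensitive, like grep -i).
        if marker.lower() not in line.lower():
            continue
        match = HOST_PORT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, not copied from the real logs.
sample = [
    "2023-05-17 10:00:01 ERROR AMQP server on 10.157.106.71:5672 is unreachable. Trying again in 1 seconds.",
    "2023-05-17 10:00:02 ERROR AMQP server on 10.157.106.71:5672 is unreachable. Trying again in 1 seconds.",
    "2023-05-17 10:00:03 ERROR AMQP server on 10.157.106.6:5672 is unreachable. Trying again in 1 seconds.",
    "2023-05-17 10:00:04 INFO Reconnected to AMQP server on 10.157.106.136:5672",
]

for host, n in count_retry_targets(sample).most_common():
    print(f"{n:7d} {host}")
```

To run it against real logs, the lines of the files under /var/log/kolla would be passed in instead of the sample list.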

Michal Arbet (michalarbet) wrote :

Here is a video of cinder-volume, which kept trying to connect to the server that was down. Normally it should try another RabbitMQ server and connect.

How to reproduce: turn off one RabbitMQ server and check the logs.

I was testing on the yoga release.
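
For reference, the behaviour expected from a round-robin strategy can be sketched as follows. This is only an illustration of the expectation described above, not the oslo.messaging or kombu implementation; connect_with_failover and try_connect are hypothetical names, and the "down" set simply simulates the node that was turned off.

```python
from itertools import cycle


def connect_with_failover(hosts, try_connect, max_attempts=None):
    """Try hosts in round-robin order until one accepts the connection.

    try_connect is a hypothetical callable that returns True on success.
    Returns the host that accepted (or None) and the list of hosts tried.
    """
    if max_attempts is None:
        max_attempts = 2 * len(hosts)
    tried = []
    for attempt, host in enumerate(cycle(hosts)):
        if attempt >= max_attempts:
            break
        tried.append(host)
        if try_connect(host):
            return host, tried
    return None, tried


hosts = ["10.157.106.71:5672", "10.157.106.136:5672", "10.157.106.6:5672"]
down = {"10.157.106.71:5672"}  # simulated: the node that was turned off

connected, tried = connect_with_failover(hosts, lambda h: h not in down)
print(connected, tried)
```

With one node down, the second attempt already lands on a healthy node. In the logs above, by contrast, nearly all retries went to the node that was down.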

Andrew Bogott (andrewbogott) wrote :

This seems similar to https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1993149, which is now fixed by https://review.opendev.org/c/openstack/oslo.messaging/+/866617

If you don't want to wait for the backport, you can implement a similar fix in config by setting kombu_reconnect_delay to 0.5.
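
A minimal sketch of that workaround, assuming the affected service reads its RabbitMQ driver options from the [oslo_messaging_rabbit] section of its own configuration file (for example cinder.conf for cinder-volume). Which file this is and how it gets deployed depend on the installation.

```ini
[oslo_messaging_rabbit]
# Workaround value suggested in this comment, to be used until the fix is backported.
kombu_reconnect_delay = 0.5
```

The service would typically need a restart for the new value to take effect.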

Michal Arbet (michalarbet) wrote :

Hi Andrew,

Yes, you are right. We have already confirmed that the mentioned patch fixes
the issue, but the unit tests on yoga are failing. Do you know why?

Thanks
Michal Arbet

On Mon, 22 May 2023 at 15:10, Andrew Bogott <email address hidden> wrote:

> this seems similar to
> https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1993149 which is
> now fixed with https://review.opendev.org/c/openstack/oslo.messaging/+/866617
>
> If you don't want to wait for the backport, you can implement a similar
> fix in config by setting kombu_reconnect_delay to 0.5.

Andrew Bogott (andrewbogott) wrote :

Sorry for the slow response, Michal. My first guess would be that this change hasn't been added to the yoga build (https://review.opendev.org/c/openstack/oslo.messaging/+/866617/4/oslo_messaging/tests/test_config_opts_proxy.py). If it's not that, then I can't guess what's breaking without seeing logs.

Michal Arbet (michalarbet) wrote :

Hi Andrew,

No worries, thanks for your response!

This is already fixed by https://github.com/openstack/oslo.messaging/commit/0602d1a10ac20c48fa35ad711355c79ee5b0ec77. The mentioned tests for yoga were failing because of a different Kombu version, but this is also fixed by https://github.com/openstack/oslo.messaging/commit/a1b36e7823e2764db7914ad9fbcbc0a5dc300205

This was quite a critical bug, but I think it can now be closed. Thank you for your support.
