Can't failover when rabbit_hosts is configured as 3 hosts

Bug #1657444 reported by likun
This bug affects 3 people
Affects                          Status        Importance  Assigned to   Milestone
Ubuntu Cloud Archive             Invalid       Undecided   Unassigned
  Pike                           Fix Released  High        Unassigned
oslo.messaging                   Fix Released  Undecided   Vincent Untz
python-oslo.messaging (Ubuntu)   Invalid       Undecided   Unassigned
  Artful                         Fix Released  High        Felipe Reyes

Bug Description

[Impact]

When the heartbeat connection times out, the timeout is not treated as a recoverable error and no reconnection is attempted by calling ensure_connection(). This leaves the heartbeat thread attempting to reconnect to the same host over and over again.

[Test Case]

* deploy openstack
  bzr branch lp:openstack-charm-testing
  cd openstack-charm-testing
  juju deployer -c default.yaml -d -v artful-pike
  juju add-unit rabbitmq-server
* Force timeout using iptables in a rabbitmq-server node
  sudo iptables -I INPUT -p tcp --dport 5672 -j DROP

Expected result:
once the timeout happens, the heartbeat thread reconnects (picking a new rabbit host if needed).

Actual result:
the heartbeat thread is left in a loop (connect, socket closed, retry, connect...)
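
When reproducing this, it may help to confirm that the iptables DROP rule really makes the broker port time out rather than refuse the connection (a stopped service usually refuses, a dropped packet times out). A minimal sketch of such a check, using only the Python standard library; the function name probe and its return values are made up for illustration:

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt.

    Returns 'open', 'refused', 'timeout' or 'error'.
    This helper is hypothetical and not part of oslo.messaging.
    """
    try:
        # create_connection raises socket.timeout if no answer arrives in time.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # Typical when packets are silently dropped (e.g. iptables -j DROP).
        return "timeout"
    except ConnectionRefusedError:
        # Typical when the host is up but nothing listens on the port.
        return "refused"
    except OSError:
        # Anything else (unreachable network, DNS failure, ...).
        return "error"
```

Calling probe() with the address of the rabbitmq-server unit and port 5672 after adding the DROP rule should, under these assumptions, report a timeout.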

[Regression Potential]

Without this patch, when the heartbeat connection times out it does not attempt to connect to the next configured rabbit host. The risk is that in situations where daemons using this library currently manage to reconnect to the same host (e.g. the host is unreachable for only a few seconds), with this change they will reconnect to the next host instead, so users may see connections flapping between two (or more) rabbit hosts.

[Other Info]
I have a rabbitmq cluster of 3 nodes

root@47704165d2bb:/# rabbitmqctl cluster_status
Cluster status of node rabbit@47704165d2bb ...
[{nodes,[{disc,[rabbit@0482398a286e,rabbit@3709521b608a,
                rabbit@47704165d2bb]}]},
 {running_nodes,[rabbit@0482398a286e,rabbit@3709521b608a,rabbit@47704165d2bb]},
 {cluster_name,<<"rabbit@47704165d2bb">>},
 {partitions,[]},
 {alarms,[{rabbit@0482398a286e,[]},
          {rabbit@3709521b608a,[]},
          {rabbit@47704165d2bb,[]}]}]
root@47704165d2bb:/# rabbitmqctl list_policies
Listing policies ...
/ ha-all all ^ha\\. {"ha-mode":"all"} 0

My oslo.messaging client configuration:
[oslo_messaging_rabbit]
rabbit_hosts=120.0.0.56:5671,120.0.0.57:5671,120.0.0.55:5671
rabbit_userid=cloud
rabbit_password=cloud
rabbit_ha_queues=True
rabbit_retry_interval=1
rabbit_retry_backoff=2
rabbit_max_retries=0
rabbit_durable_queues=False

When I run "service rabbitmq-server stop" on one node to simulate a failure, I get the following error logs, and the consumer can't fail over from the bad node. It keeps reconnecting to the failed node forever instead of trying the other nodes. "kombu_failover_strategy" is left at its default value, "round-robin".

2009-01-13 18:32:42.785 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
2009-01-13 18:32:43.819 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:43.819 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
2009-01-13 18:32:58.874 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
2009-01-13 18:32:59.907 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:59.907 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
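
For comparison, the behaviour expected from a round-robin failover strategy can be sketched as follows. This is only an illustration of the idea, not the kombu or oslo.messaging implementation; parse_rabbit_hosts, connect_with_failover and the try_connect callback are hypothetical names, and IPv6 addresses are not handled.

```python
import itertools


def parse_rabbit_hosts(value, default_port=5672):
    """Split a comma-separated rabbit_hosts value into (host, port) tuples."""
    hosts = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep:
            hosts.append((host, int(port)))
        else:
            # No explicit port given: fall back to the default.
            hosts.append((entry, default_port))
    return hosts


def connect_with_failover(hosts, try_connect, max_attempts):
    """Try hosts in round-robin order and return the first that connects.

    try_connect is a caller-supplied callable returning True on success.
    Returns None if every attempt fails.
    """
    candidates = itertools.cycle(hosts)
    for _ in range(max_attempts):
        host = next(candidates)
        if try_connect(host):
            return host
    return None
```

With the three hosts from the configuration above and the first one down, such a loop would move on to 120.0.0.57:5671 on the second attempt instead of retrying 120.0.0.56:5671 forever.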

Can anyone help me? Thanks!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/519701

Changed in oslo.messaging:
assignee: nobody → Vincent Untz (vuntz)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/519701
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=8bfc3637a25a29583b1e0625c78bf159ac878259
Submitter: Zuul
Branch: master

commit 8bfc3637a25a29583b1e0625c78bf159ac878259
Author: Vincent Untz <email address hidden>
Date: Tue Nov 14 17:53:32 2017 +0100

    Catch socket.timeout when doing heartbeat_check

    heartbeat_check in kombu.connection is not reraising exceptions as
    exceptions.OperationalError, and the socket timeout during the heartbeat
    check is really an issue seen in the field when a node goes down; the
    heartbeat thread just tries again and again to deal with it, without
    success.

    Change-Id: I26dbdb18a7e64946db2cba676764ff2d428c7897
    Closes-Bug: #1657444
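
The idea described in the commit message can be illustrated with a small, self-contained simulation. This is a sketch under stated assumptions, not the actual patch: FakeConnection and heartbeat_once are hypothetical stand-ins, and the real driver code in oslo.messaging is structured differently.

```python
import socket


class FakeConnection:
    """Hypothetical stand-in for a broker connection with several hosts."""

    def __init__(self, hosts, dead):
        self.hosts = list(hosts)
        self.dead = set(dead)
        self.index = 0

    @property
    def current_host(self):
        return self.hosts[self.index]

    def heartbeat_check(self):
        # A host that silently drops packets shows up as a socket timeout.
        if self.current_host in self.dead:
            raise socket.timeout("timed out")

    def ensure_connection(self):
        # In this simulation, reconnecting moves to the next host (round-robin).
        self.index = (self.index + 1) % len(self.hosts)


def heartbeat_once(conn):
    """Run one heartbeat; on timeout, reconnect instead of retrying in place."""
    try:
        conn.heartbeat_check()
        return True
    except socket.timeout:
        # Treat the timeout as recoverable and trigger a reconnection.
        conn.ensure_connection()
        return False
```

In this simulation, a connection whose first host is dead fails one heartbeat, switches to the second host, and succeeds on the next heartbeat.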

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.34.0

This issue was fixed in the openstack/oslo.messaging 5.34.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/522289

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/pike)

Reviewed: https://review.openstack.org/522289
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=ae3de7f37c64933f84cc79cdb408ec56643ba83e
Submitter: Zuul
Branch: stable/pike

commit ae3de7f37c64933f84cc79cdb408ec56643ba83e
Author: Vincent Untz <email address hidden>
Date: Tue Nov 14 17:53:32 2017 +0100

    Catch socket.timeout when doing heartbeat_check

    heartbeat_check in kombu.connection is not reraising exceptions as
    exceptions.OperationalError, and the socket timeout during the heartbeat
    check is really an issue seen in the field when a node goes down; the
    heartbeat thread just tries again and again to deal with it, without
    success.

    Change-Id: I26dbdb18a7e64946db2cba676764ff2d428c7897
    Closes-Bug: #1657444
    (cherry picked from commit 8bfc3637a25a29583b1e0625c78bf159ac878259)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.30.2

This issue was fixed in the openstack/oslo.messaging 5.30.2 release.

Changed in cloud-archive:
status: New → Invalid
Changed in python-oslo.messaging (Ubuntu):
status: New → Invalid
Changed in python-oslo.messaging (Ubuntu Artful):
status: New → Triaged
importance: Undecided → High
Revision history for this message
Felipe Reyes (freyes) wrote :
tags: added: sts
Changed in python-oslo.messaging (Ubuntu Artful):
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Thanks Felipe. I've uploaded the new version of this package to the artful unapproved queue where it is awaiting review by the SRU team.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Felipe, would you be able to fill in the SRU details? [Impact], [Test Case], [Regression Potential] and then we'll need to subscribe the ubuntu-sru team.

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1657444] Re: Can't failover when rabbit_hosts is configured as 3 hosts

On Thu, Jan 11, 2018 at 06:54:44PM -0000, Corey Bryant wrote:
> Felipe, would you be able to fill in the SRU details? [Impact], [Test
> Case], [Regression Potential] and then we'll need to subscribe the
> ubuntu-sru team.

I will; I was just writing those.

Felipe Reyes (freyes)
description: updated
Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello likun, or anyone else affected,

Accepted python-oslo.messaging into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.messaging/5.30.0-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in python-oslo.messaging (Ubuntu Artful):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-artful
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello likun, or anyone else affected,

Accepted python-oslo.messaging into pike-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:pike-proposed
  sudo apt-get update

Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-pike-needed to verification-pike-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-pike-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-pike-needed
Revision history for this message
Felipe Reyes (freyes) wrote :

I verified that the package in artful-proposed correctly recovers from a timeout by connecting to the next configured host. Evidence: https://pastebin.ubuntu.com/26418597/

During my test (launching a VM) I didn't detect any regressions.

tags: added: verification-done verification-done-artful
removed: verification-needed verification-needed-artful
Revision history for this message
Felipe Reyes (freyes) wrote :

Verified the package in the xenial-pike UCA. When tested with nova-compute, it correctly recovers from a connection timing out. Evidence: https://pastebin.ubuntu.com/26437464/

No regressions were detected.

tags: added: verification-pike-done
removed: verification-pike-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-oslo.messaging - 5.30.0-0ubuntu2

---------------
python-oslo.messaging (5.30.0-0ubuntu2) artful; urgency=medium

  [ Corey Bryant ]
  * d/gbp.conf: Create stable/pike branch.

  [ Felipe Reyes ]
  * d/p/catch-socket.timeout-when-doing-heartbeat_check.patch: reconnect on
    timeout (LP: #1657444).

 -- Felipe Reyes <email address hidden> Wed, 10 Jan 2018 16:10:10 -0300

Changed in python-oslo.messaging (Ubuntu Artful):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package python-oslo.messaging - 5.30.0-0ubuntu2~cloud0
---------------

 python-oslo.messaging (5.30.0-0ubuntu2~cloud0) xenial-pike; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 python-oslo.messaging (5.30.0-0ubuntu2) artful; urgency=medium
 .
   [ Corey Bryant ]
   * d/gbp.conf: Create stable/pike branch.
 .
   [ Felipe Reyes ]
   * d/p/catch-socket.timeout-when-doing-heartbeat_check.patch: reconnect on
     timeout (LP: #1657444).
