oslo.messaging

Can't failover when rabbit_hosts is configured as 3 hosts

Bug #1657444 reported by likun on 2017-01-18

This bug affects 3 people

Affects	Status	Importance	Assigned to
Ubuntu Cloud Archive	Invalid	Undecided	Unassigned
  Pike	Fix Released	High	Unassigned
oslo.messaging	Fix Released	Undecided	Vincent Untz
python-oslo.messaging (Ubuntu)	Invalid	Undecided	Unassigned
  Artful	Fix Released	High	Felipe Reyes

Bug Description

[Impact]

When the heartbeat connection times out, the timeout is not treated as a recoverable error, and the driver does not call ensure_connection() to reconnect. This leaves the heartbeat thread retrying the same host over and over again.

[Test Case]

* deploy openstack
  bzr branch lp:openstack-charm-testing
  cd openstack-charm-testing
  juju deployer -c default.yaml -d -v artful-pike
  juju add-unit rabbitmq-server
* Force timeout using iptables in a rabbitmq-server node
  sudo iptables -I INPUT -p tcp --dport 5672 -j DROP

Expected result:
Once the timeout happens, the heartbeat thread reconnects (picking a new rabbit host if needed).

Actual result:
The heartbeat thread is left in a loop (connect, socket closed, retry, connect...).
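
To confirm that the iptables rule in the test case is actually in effect, a TCP probe along these lines may help. It is only a sketch: the function name probe and its return strings are made up for illustration and are not part of oslo.messaging or the charm tooling. A port hit by a DROP rule would be expected to report "timeout" rather than "error", because the packets are silently discarded.

```python
import socket


def probe(host, port, timeout=2.0):
    """Try one TCP connection and classify the outcome.

    Returns "open" if the connection succeeds, "timeout" if no answer
    arrives within `timeout` seconds (what a DROP rule should cause),
    and "error" for any other socket-level failure such as a refusal.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # Must be caught before OSError, which is its base class.
        return "timeout"
    except OSError:
        return "error"
```

For example, probe("<rabbitmq-unit-address>", 5672) can be run from a client node before and after adding the rule; the address placeholder has to be filled in for the deployment at hand.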

[Regression Potential]

Without this patch, the heartbeat thread does not attempt to connect to the next configured rabbit host when the heartbeat connection times out. The risk is that, in situations where daemons using this library currently manage to reconnect to the same host (e.g. the host is unreachable for only a few seconds), with this change they will reconnect to the next host instead, so users may see connections flapping between two (or more) rabbit hosts.

[Other Info]
I have a rabbitmq cluster of 3 nodes

root@47704165d2bb:/# rabbitmqctl cluster_status
Cluster status of node rabbit@47704165d2bb ...
[{nodes,[{disc,[rabbit@0482398a286e,rabbit@3709521b608a,
                rabbit@47704165d2bb]}]},
{running_nodes,[rabbit@0482398a286e,rabbit@3709521b608a,rabbit@47704165d2bb]},
{cluster_name,<<"rabbit@47704165d2bb">>},
{partitions,[]},
{alarms,[{rabbit@0482398a286e,[]},
          {rabbit@3709521b608a,[]},
          {rabbit@47704165d2bb,[]}]}]
root@47704165d2bb:/# rabbitmqctl list_policies
Listing policies ...
/ ha-all all ^ha\\. {"ha-mode":"all"} 0

My oslo.messaging client configuration:
[oslo_messaging_rabbit]
rabbit_hosts=120.0.0.56:5671,120.0.0.57:5671,120.0.0.55:5671
rabbit_userid=cloud
rabbit_password=cloud
rabbit_ha_queues=True
rabbit_retry_interval=1
rabbit_retry_backoff=2
rabbit_max_retries=0
rabbit_durable_queues=False

When I run "service rabbitmq-server stop" on one node to simulate a failure, I get the following error logs, and the consumer can't fail over from the bad node. It keeps reconnecting to the failed node forever instead of trying the other nodes. "kombu_failover_strategy" is left at its default value of "round-robin".

2009-01-13 18:32:42.785 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
2009-01-13 18:32:43.819 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:43.819 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...
2009-01-13 18:32:58.874 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [4e976d46-ceee-4617-b9be-5e4821990738] AMQP server 120.0.0.56:5671 closed the connection. Check login credentials: Socket closed
2009-01-13 18:32:59.907 17 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 120.0.0.56:5671 after None tries: Socket closed
2009-01-13 18:32:59.907 17 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...

Can anyone help? Thanks!
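
For reference, the behaviour one would expect from a "round-robin" strategy over the three hosts configured above can be sketched as follows. This is a standalone illustration, not the kombu or oslo.messaging implementation; parse_hosts is a hypothetical helper, and it does not handle IPv6 literals.

```python
from itertools import cycle


def parse_hosts(value, default_port=5672):
    """Split a comma-separated 'host:port' string into (host, port) tuples."""
    hosts = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if sep:
            hosts.append((host, int(port)))
        else:
            # No port given: fall back to the assumed default.
            hosts.append((item, default_port))
    return hosts


hosts = parse_hosts("120.0.0.56:5671,120.0.0.57:5671,120.0.0.55:5671")

# Round-robin: each failed attempt moves on to the next host and wraps around.
rotation = cycle(hosts)
first_four = [next(rotation) for _ in range(4)]
print(first_four)
```

With three hosts, the fourth attempt wraps back to the first one. In the logs above, by contrast, every attempt targets 120.0.0.56:5671.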

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-11-14: Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/519701

Changed in oslo.messaging:
assignee:	nobody → Vincent Untz (vuntz)
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2017-11-17: Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/519701
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=8bfc3637a25a29583b1e0625c78bf159ac878259
Submitter: Zuul
Branch: master

commit 8bfc3637a25a29583b1e0625c78bf159ac878259
Author: Vincent Untz <email address hidden>
Date: Tue Nov 14 17:53:32 2017 +0100

Catch socket.timeout when doing heartbeat_check

    heartbeat_check in kombu.connection is not reraising exceptions as
    exceptions.OperationalError, and the socket timeout during the heartbeat
    check is really an issue seen in the field when a node goes down; the
    heartbeat thread just tries again and again to deal with it, without
    success.

Change-Id: I26dbdb18a7e64946db2cba676764ff2d428c7897
Closes-Bug: #1657444
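
The idea in the commit message can be illustrated with a small simulation. This is a sketch under stated assumptions, not the merged patch: FakeConnection, RecoverableError and heartbeat_once are hypothetical names, and only heartbeat_check and ensure_connection echo names used in this report. The point is that a socket.timeout raised by the heartbeat check has to be in the set of exceptions that trigger a reconnect, otherwise it escapes and the thread never moves to another host.

```python
import socket


class RecoverableError(Exception):
    """Hypothetical stand-in for errors that already trigger a reconnect."""


class FakeConnection:
    """Hypothetical connection whose host "node1" always times out."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.index = 0

    @property
    def host(self):
        return self.hosts[self.index]

    def heartbeat_check(self):
        if self.host == "node1":
            raise socket.timeout("timed out")

    def ensure_connection(self):
        # Simulated failover: advance to the next host, wrapping around.
        self.index = (self.index + 1) % len(self.hosts)


def heartbeat_once(conn, recoverable=(RecoverableError, socket.timeout)):
    """Run one heartbeat check; reconnect if a recoverable error occurs."""
    try:
        conn.heartbeat_check()
        return True
    except recoverable:
        conn.ensure_connection()
        return False


conn = FakeConnection(["node1", "node2", "node3"])
heartbeat_once(conn)   # times out on node1, fails over
print(conn.host)       # the connection now points at node2
```

If socket.timeout is left out of the recoverable tuple, the exception propagates out of heartbeat_once and the simulated connection stays on node1, which mirrors the loop described in this bug.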

Changed in oslo.messaging:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2017-11-20: Fix included in openstack/oslo.messaging 5.34.0

This issue was fixed in the openstack/oslo.messaging 5.34.0 release.

OpenStack Infra (hudson-openstack) wrote on 2017-11-22: Fix proposed to oslo.messaging (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/522289

OpenStack Infra (hudson-openstack) wrote on 2017-11-28: Fix merged to oslo.messaging (stable/pike)

Reviewed: https://review.openstack.org/522289
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=ae3de7f37c64933f84cc79cdb408ec56643ba83e
Submitter: Zuul
Branch: stable/pike

commit ae3de7f37c64933f84cc79cdb408ec56643ba83e
Author: Vincent Untz <email address hidden>
Date: Tue Nov 14 17:53:32 2017 +0100

Catch socket.timeout when doing heartbeat_check

    Change-Id: I26dbdb18a7e64946db2cba676764ff2d428c7897
    Closes-Bug: #1657444
    (cherry picked from commit 8bfc3637a25a29583b1e0625c78bf159ac878259)

tags:

added: in-stable-pike

OpenStack Infra (hudson-openstack) wrote on 2017-12-11: Fix included in openstack/oslo.messaging 5.30.2

This issue was fixed in the openstack/oslo.messaging 5.30.2 release.

Corey Bryant (corey.bryant) on 2018-01-10

Changed in cloud-archive:
status:	New → Invalid
Changed in python-oslo.messaging (Ubuntu):
status:	New → Invalid
Changed in python-oslo.messaging (Ubuntu Artful):
status:	New → Triaged
importance:	Undecided → High

Felipe Reyes (freyes) wrote on 2018-01-11:

lp1657444_artful.debdiff (3.1 KiB, text/plain)

tags:	added: sts
Changed in python-oslo.messaging (Ubuntu Artful):
assignee:	nobody → Felipe Reyes (freyes)

Corey Bryant (corey.bryant) wrote on 2018-01-11:

Thanks Felipe. I've uploaded the new version of this package to the artful unapproved queue where it is awaiting review by the SRU team.

Corey Bryant (corey.bryant) wrote on 2018-01-11:

Felipe, would you be able to fill in the SRU details? [Impact], [Test Case], [Regression Potential] and then we'll need to subscribe the ubuntu-sru team.

Felipe Reyes (freyes) wrote on 2018-01-11: Re: [Bug 1657444] Re: Can't failover when rabbit_hosts is configured as 3 hosts

#10

On Thu, Jan 11, 2018 at 06:54:44PM -0000, Corey Bryant wrote:
> Felipe, would you be able to fill in the SRU details? [Impact], [Test
> Case], [Regression Potential] and then we'll need to subscribe the
> ubuntu-sru team.

I will; I was just writing those.

Felipe Reyes (freyes) on 2018-01-11

description:

updated

Robie Basak (racb) wrote on 2018-01-17: Please test proposed package

#11

Hello likun, or anyone else affected,

Accepted python-oslo.messaging into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.messaging/5.30.0-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in python-oslo.messaging (Ubuntu Artful):
status:	Triaged → Fix Committed
tags:	added: verification-needed verification-needed-artful

Corey Bryant (corey.bryant) wrote on 2018-01-18:

#12

Hello likun, or anyone else affected,

Accepted python-oslo.messaging into pike-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

sudo add-apt-repository cloud-archive:pike-proposed
sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-pike-needed to verification-pike-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-pike-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags:

added: verification-pike-needed

Felipe Reyes (freyes) wrote on 2018-01-19:

#13

I verified that the package in artful-proposed correctly recovers from a timeout by connecting to the next configured host. Evidence: https://pastebin.ubuntu.com/26418597/

During my test (launching a VM) I didn't detect any regressions.

tags:

added: verification-done verification-done-artful
removed: verification-needed verification-needed-artful

Felipe Reyes (freyes) wrote on 2018-01-22:

#14

Verified the package in the xenial-pike UCA: when tested with nova-compute, it correctly recovers from a connection timing out. Evidence: https://pastebin.ubuntu.com/26437464/

No regressions were detected.

tags:

added: verification-pike-done
removed: verification-pike-needed

Launchpad Janitor (janitor) wrote on 2018-01-25:

#15

This bug was fixed in the package python-oslo.messaging - 5.30.0-0ubuntu2

---------------
python-oslo.messaging (5.30.0-0ubuntu2) artful; urgency=medium

  [ Corey Bryant ]
  * d/gbp.conf: Create stable/pike branch.

  [ Felipe Reyes ]
  * d/p/catch-socket.timeout-when-doing-heartbeat_check.patch: reconnect on
    timeout (LP: #1657444).

 -- Felipe Reyes <email address hidden>  Wed, 10 Jan 2018 16:10:10 -0300

Changed in python-oslo.messaging (Ubuntu Artful):
status:	Fix Committed → Fix Released

Łukasz Zemczak (sil2100) wrote on 2018-01-25: Update Released

#16

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote on 2018-02-01:

#17

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote on 2018-02-01:

#18

This bug was fixed in the package python-oslo.messaging - 5.30.0-0ubuntu2~cloud0
---------------

python-oslo.messaging (5.30.0-0ubuntu2~cloud0) xenial-pike; urgency=medium
.
   * New update for the Ubuntu Cloud Archive.
.
python-oslo.messaging (5.30.0-0ubuntu2) artful; urgency=medium
.
   [ Corey Bryant ]
   * d/gbp.conf: Create stable/pike branch.
.
   [ Felipe Reyes ]
   * d/p/catch-socket.timeout-when-doing-heartbeat_check.patch: reconnect on
     timeout (LP: #1657444).