rabbitmq heartbeat failures don't reset connections

Bug #1436788 reported by James Page
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
oslo.messaging | Fix Released | Medium | Mehdi Abaakouk |
    Nominated for Kilo by Matt Riedemann
oslo.messaging (Ubuntu) | Fix Released | High | Unassigned |
    Vivid | Fix Released | High | Unassigned |

Bug Description

Testing with oslo.messaging 1.8.1 and kilo-3 of OpenStack; three node rabbitmq cluster, native rabbitmq clustering, no haproxy!

Restarting the rabbitmq brokers cleanly works just fine, connections switch to a different broker and all is good in the world.

However, if I yank a broker off the network uncleanly, client connections detect the missed heartbeats but don't reset and switch to a new broker until the system TCP timeout is reached.

See attached log files.

Mehdi Abaakouk (sileht) wrote :

After investigation, this can occur when the broker disappears at the moment an AMQP frame is being written to the socket (such as the heartbeat packet, or when we publish a message).

kombu/py-amqp doesn't allow setting a custom timeout for that write.

upstream bug: https://github.com/celery/kombu/issues/463
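
To illustrate the failure mode described above, here is a minimal sketch using only the Python standard library. It is not the kombu, py-amqp, or oslo.messaging code: `write_with_deadline` is a hypothetical helper, and a local socket pair whose far end never reads stands in for a broker that has dropped off the network. Without a timeout, the large write would block until the operating system gives up; with one, the caller gets control back quickly.

```python
import socket


def write_with_deadline(sock, payload, timeout):
    """Send payload, giving up after `timeout` seconds instead of blocking.

    Hypothetical helper for illustration only. Returns True if the whole
    payload was handed to the kernel, False if the write timed out.
    """
    previous = sock.gettimeout()
    sock.settimeout(timeout)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        return False
    finally:
        # Restore whatever timeout the socket had before the call.
        sock.settimeout(previous)


# A connected pair of local sockets; the "broker" end never reads.
client, broker = socket.socketpair()

# A small frame fits in the kernel send buffer, so it succeeds at once.
small_ok = write_with_deadline(client, b"heartbeat", timeout=0.2)

# A payload far larger than the send buffer cannot drain because nobody
# reads it, so the write times out instead of hanging indefinitely.
large_ok = write_with_deadline(client, b"x" * (16 * 1024 * 1024), timeout=0.2)

print(small_ok, large_ok)
```

After a timed-out `sendall` there is no way to know how much data was sent, so a real client would have to drop the connection and reconnect (possibly to another broker) rather than reuse the socket.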

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: New → Triaged
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
importance: Undecided → High
importance: High → Medium
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/174929
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=287a4f56f45ed9cd40116a9e7b6e529f3382a925
Submitter: Jenkins
Branch: master

commit 287a4f56f45ed9cd40116a9e7b6e529f3382a925
Author: Mehdi Abaakouk <email address hidden>
Date: Fri Apr 17 17:37:34 2015 +0200

    Disable and mark heartbeat as experimental

    Several issues have been discovered since heartbeat was enabled by default,
    notably #1436788, which also needs a fix in the underlying library.
    So, following the discussion here:
    https://bugs.launchpad.net/oslo.messaging/+bug/1436769/comments/10

    We have decided to mark the implementation as experimental and disable it by default.

    Related-bug: #1436788
    Related-bug: #1436769
    Change-Id: Ib7c55977f976bdbbc8df4ad5915e0433cbf84a17

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/kilo)

Related fix proposed to branch: stable/kilo
Review: https://review.openstack.org/177076

OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/kilo)

Reviewed: https://review.openstack.org/177076
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=a8c06abdd6ea78869b8836f7d20bab63b4fe798b
Submitter: Jenkins
Branch: stable/kilo

commit a8c06abdd6ea78869b8836f7d20bab63b4fe798b
Author: Mehdi Abaakouk <email address hidden>
Date: Fri Apr 17 17:37:34 2015 +0200

    Disable and mark heartbeat as experimental

    Several issues have been discovered since heartbeat was enabled by default,
    notably #1436788, which also needs a fix in the underlying library.
    So, following the discussion here:
    https://bugs.launchpad.net/oslo.messaging/+bug/1436769/comments/10

    We have decided to mark the implementation as experimental and disable it by default.

    Related-bug: #1436788
    Related-bug: #1436769
    Change-Id: Ib7c55977f976bdbbc8df4ad5915e0433cbf84a17
    (cherry picked from commit 287a4f56f45ed9cd40116a9e7b6e529f3382a925)

tags: added: in-stable-kilo
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
assignee: nobody → Mehdi Abaakouk (sileht)
milestone: none → next-liberty
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/179356

Changed in oslo.messaging:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/179357

OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/179356
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=0c954cffa2f3710acafa79f01b958a8955823640
Submitter: Jenkins
Branch: master

commit 0c954cffa2f3710acafa79f01b958a8955823640
Author: Mehdi Abaakouk <email address hidden>
Date: Fri May 1 13:27:15 2015 +0200

    Bump kombu and amqp requirements

    We need at least these versions of amqp and kombu to have
    working heartbeat support.

    Related-bug: #1436788
    Closes-bug: #1436769
    Closes-bug: #1408830

    Change-Id: I61440c5ccf2b540fe9a1e868bdcae9f5d2cf8422

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
milestone: next-liberty → none
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
milestone: none → next-liberty
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/179357
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=77f952a1f7d87c35fb5d7eb7e052c5b9def59b22
Submitter: Jenkins
Branch: master

commit 77f952a1f7d87c35fb5d7eb7e052c5b9def59b22
Author: Mehdi Abaakouk <email address hidden>
Date: Fri May 1 13:12:38 2015 +0200

    rabbit: Set timeout on the underlying socket

    There are some cases where the underlying socket can be stuck
    until the system socket timeout is reached, but in oslo.messaging
    we very often know that there is no need to wait forever because
    the upper layer (usually the application) expects a return after
    a certain period.

    So this change sets the timeout on the underlying socket whenever we can
    determine that there is no need to wait any longer.

    Closes-bug: #1436788
    Change-Id: Ie71ab8147c56eaf672585da107bec8b22af9da6c
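
The idea in this commit message, bounding each socket operation by the time the caller is still willing to wait, can be sketched as follows. This is an illustration under stated assumptions, not the code from the change: `socket_timeout` and `recv_before` are hypothetical names, and a local socket pair stands in for the connection to the broker.

```python
import contextlib
import socket
import time


@contextlib.contextmanager
def socket_timeout(sock, timeout):
    """Temporarily apply `timeout` (seconds) to sock (hypothetical helper)."""
    previous = sock.gettimeout()
    sock.settimeout(timeout)
    try:
        yield sock
    finally:
        sock.settimeout(previous)


def recv_before(sock, deadline, nbytes=4096):
    """Receive up to nbytes, or return None once `deadline` has passed.

    `deadline` is a time.monotonic() value supplied by the upper layer.
    """
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        return None
    # Wait no longer than the caller's remaining time budget.
    with socket_timeout(sock, remaining):
        try:
            return sock.recv(nbytes)
        except socket.timeout:
            return None


client, broker = socket.socketpair()
broker.sendall(b"pong")

first = recv_before(client, time.monotonic() + 0.2)   # data already waiting
second = recv_before(client, time.monotonic() + 0.2)  # peer stays silent

print(first, second)
```

Deriving the timeout from a deadline rather than using a fixed value means that several socket operations performed for one request share a single time budget.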

Changed in oslo.messaging:
status: In Progress → Fix Committed
Changed in oslo.messaging:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/188563

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/kilo)

Reviewed: https://review.openstack.org/188563
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=2daf4dccc323c48ec2bc32ec59523fd3c0ec589f
Submitter: Jenkins
Branch: stable/kilo

commit 2daf4dccc323c48ec2bc32ec59523fd3c0ec589f
Author: Mehdi Abaakouk <email address hidden>
Date: Thu Apr 30 16:13:14 2015 +0200

    rabbit: Set timeout on the underlying socket

    --
    NOTE(mriedem): This is two commits squashed to address a problem with
    the rabbitmq heartbeat patch in stable/kilo (since oslo.messaging 1.8.1).
    --

    rabbit: Remove unused stuffs from publisher

    The publisher code is over-engineered: it allows everything to be
    overridden, but this is never used.

    None of the child classes have the same signature, and sometimes
    the constructor uses the same parameter name as the parent class but for
    a different purpose, which makes the code hard to read.

    It was never clear which options are ultimately passed to the queue
    and the exchange in kombu.

    This change removes all of that, and only uses the kombu
    terminology for publisher parameters.

    (cherry picked from commit cca84f66d49114ce4ae85af4bbf03a14bda79121)

    --------------------------------------------

    rabbit: Set timeout on the underlying socket

    There are some cases where the underlying socket can be stuck
    until the system socket timeout is reached, but in oslo.messaging
    we very often know that there is no need to wait forever because
    the upper layer (usually the application) expects a return after
    a certain period.

    So this change sets the timeout on the underlying socket whenever we can
    determine that there is no need to wait any longer.

    Closes-bug: #1436788
    Change-Id: Ie71ab8147c56eaf672585da107bec8b22af9da6c
    (cherry picked from commit 77f952a1f7d87c35fb5d7eb7e052c5b9def59b22)

James Page (james-page)
Changed in oslo.messaging (Ubuntu Wily):
status: New → Fix Released
Changed in oslo.messaging (Ubuntu Vivid):
status: New → Triaged
importance: Undecided → High
Changed in oslo.messaging (Ubuntu Wily):
importance: Undecided → High
Chris J Arges (arges) wrote : Please test proposed package

Hello James, or anyone else affected,

Accepted oslo.messaging into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/oslo.messaging/1.8.3-0ubuntu0.15.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in oslo.messaging (Ubuntu Vivid):
status: Triaged → Fix Committed
tags: added: verification-needed
James Page (james-page)
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package oslo.messaging - 1.8.3-0ubuntu0.15.04.1

---------------
oslo.messaging (1.8.3-0ubuntu0.15.04.1) vivid; urgency=medium

  * New upstream point release (LP: #1467959):
    - RabbitMQ driver:
      + Adding publisher acknowledgements/confirms for better handling
        of messages during broker shutdown/network failure.
      + Ensure consumer connections closed properly (LP: #1458917).
      + Set timeout on the underlying socket (LP: #1436788).
      + Disable and mark heartbeat as experimental (LP: #1436769).
      + Fix ipv6 support.
    - ZeroMQ driver:
      + Don't raise Timeout on no-matchmaker results (LP: #1186310).
      + Fix issue with Redis not deleting expired keys (LP: #1417464).
      + d/p/Fix-changing-keys-during-iteration-in-matchmaker-hea.patch,
        d/p/Add-pluggability-for-matchmakers.patch: Dropped, included
        upstream.

 -- James Page <email address hidden> Tue, 23 Jun 2015 15:28:01 +0100

Changed in oslo.messaging (Ubuntu Vivid):
status: Fix Committed → Fix Released
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for oslo.messaging has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Chris J Arges (arges) wrote : Please test proposed package

Hello James, or anyone else affected,

Accepted oslo.messaging into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/oslo.messaging/1.8.3-0ubuntu0.15.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
James Page (james-page)
tags: added: verification-done
removed: verification-needed
Mathew Hodson (mhodson)
no longer affects: oslo.messaging (Ubuntu Wily)
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 1.8.3

This issue was fixed in the openstack/oslo.messaging 1.8.3 release.
