rabbit/Qpid reconnection retry doesn't work as expected

Bug #1400268 reported by Mehdi Abaakouk
This bug affects 2 people
Affects                  | Status        | Importance | Assigned to    | Milestone
oslo.messaging           | Fix Released  | High       | Mehdi Abaakouk |
oslo.messaging (Ubuntu)  | Fix Released  | Undecided  | Unassigned     |
  Trusty                 | Fix Released  | High       | James Page     |
  Utopic                 | Fix Committed | Undecided  | Unassigned     |

Bug Description

Since oslo.messaging 1.5.0, the executor sets the polling timeout to 1 second:

https://github.com/openstack/oslo.messaging/commit/7bce31a2d14de29f1fc306938c7085f5205c9110

However, since https://github.com/openstack/oslo.messaging/commit/2dd7de989f88c7c095d3c2ef1646d2dec87869a5

the rabbitmq driver's iterconsume always honors the timeout, even when the connection to the broker has been lost.

Because the executor timeout is always 1 second, a reconnection is attempted every 1 second instead of respecting 'rabbit_retry_interval'.

Cheers,
sileht
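
The pacing problem described above can be illustrated with a toy simulation. This is only a sketch, not oslo.messaging code: the function name and its parameters are hypothetical, and it assumes each expired poll immediately leads to a new reconnect attempt.

```python
def reconnect_attempt_times(poll_timeout, retry_interval, outage,
                            honor_retry_interval):
    """Return simulated times (seconds) of reconnect attempts.

    Hypothetical model: the broker is unreachable for `outage` seconds and
    every attempt made during that window fails.
    """
    attempts = []
    now = 0.0
    while now < outage:
        attempts.append(now)
        if honor_retry_interval:
            # Expected behaviour: wait retry_interval between attempts.
            now += retry_interval
        else:
            # Behaviour reported in this bug: the poll timeout expires first,
            # the executor polls again, and a new attempt starts right away.
            now += poll_timeout
    return attempts


# Broker down for 10 seconds, retry interval configured to 5 seconds.
buggy = reconnect_attempt_times(1.0, 5.0, 10.0, honor_retry_interval=False)
fixed = reconnect_attempt_times(1.0, 5.0, 10.0, honor_retry_interval=True)
print(len(buggy), len(fixed))
```

With a 1 second poll timeout the simulated client attempts a reconnect ten times during the outage, against two attempts when the 5 second interval is respected.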

---- ---- ---- ---- ----

[Impact]

 * This patch, along with those from LP #1408370 and LP #1338732, fixes rabbitmq reconnects.

 * We are backporting this to Icehouse since oslo.messaging 1.3.0
   fails to reconnect to Rabbit properly, particularly with nova-compute.

 * This patch, along with its dependencies mentioned above, will ensure that
   multiple reconnect attempts happen by having connections time out and retry.

[Test Case]

 * Start a service that uses oslo.messaging with rabbitmq e.g. nova-compute

 * Stop rabbitmq while running tail -F /var/log/nova/nova-compute.log

 * Observe that the nova-compute AMQP connection times out and that it tries to reconnect

 * Restart rabbitmq

 * Observe that the rabbitmq connection has been re-established

[Regression Potential]

 * None. I have tested in my local cloud environment and it appears to be
   reliable.

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
assignee: nobody → Mehdi Abaakouk (sileht)
milestone: none → next-kilo
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/139980

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/139981

Changed in oslo.messaging:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/139982

OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/139980
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=023b7f44e2ccd77a7e9ee9ee78431a4646c88f13
Submitter: Jenkins
Branch: master

commit 023b7f44e2ccd77a7e9ee9ee78431a4646c88f13
Author: Mehdi Abaakouk <email address hidden>
Date: Mon Dec 8 10:56:52 2014 +0100

    rabbit: more precise iterconsume timeout

    The iterconsume method always set the kombu timeout to 1 second,
    even when the requested timeout was more precise or < 1 second.

    This change fixes that.

    Related bug: #1400268
    Related bug: #1399257

    Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
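
The idea in this commit message can be sketched as follows. This is a minimal illustration, not the merged patch: the helper names and the 1 second cap (`max_poll`) are assumptions made for the example.

```python
def next_poll_timeout(remaining, max_poll=1.0):
    """Pick the timeout for one broker poll.

    `remaining` is the time the caller still allows, or None for no limit.
    The poll never waits longer than the caller asked for.
    """
    if remaining is None:
        return max_poll
    return max(0.0, min(remaining, max_poll))


def split_timeout(requested, max_poll=1.0):
    """List the successive poll timeouts used to cover `requested` seconds."""
    chunks = []
    remaining = requested
    while remaining > 0:
        chunk = next_poll_timeout(remaining, max_poll)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


print(split_timeout(2.5))   # three polls: 1.0, 1.0 and 0.5 seconds
print(split_timeout(0.25))  # one short poll instead of a full second
```

Under the old behaviour described above, a caller asking for 0.25 seconds would still have waited a full second on each poll.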

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/139981
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=43a9dc1de58df6559be02dc9f9ae3f5eeb12cb7a
Submitter: Jenkins
Branch: master

commit 43a9dc1de58df6559be02dc9f9ae3f5eeb12cb7a
Author: Mehdi Abaakouk <email address hidden>
Date: Mon Dec 8 10:28:12 2014 +0100

    qpid: honor iterconsume timeout

    The qpid driver must always honor the timeout passed to the iterconsume
    method; this change fixes that.

    Related bug: #1400268
    Related bug: #1399257

    Change-Id: I8f267fc8b5a7abc852f0caf84d1e7c2c342ba951

Changed in oslo.messaging:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/139982
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=15aa5cbda810ef3f757e9e54280fd8216dc9ef7d
Submitter: Jenkins
Branch: master

commit 15aa5cbda810ef3f757e9e54280fd8216dc9ef7d
Author: Mehdi Abaakouk <email address hidden>
Date: Mon Dec 8 10:52:45 2014 +0100

    The executor doesn't need to set the timeout

    It's up to the driver to set a suitable timeout for polling the broker;
    this one can be different from the one requested by the driver
    caller as long as the caller timeout is respected.

    This change also adds a new driver listener API, to be able
    to stop it cleanly, especially in case of timeout=None.

    Closes bug: #1400268
    Closes bug: #1399257
    Change-Id: I674c0def1efb420c293897d49683593a0b10e291
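
A listener that chooses its own polling interval while still respecting the caller's timeout, and that can be stopped cleanly when `timeout=None`, might look roughly like the sketch below. The class and method names are hypothetical and only mirror the description in the commit message; an in-memory queue stands in for the broker.

```python
import queue
import threading
import time


class SketchListener:
    """Hypothetical listener; the queue plays the role of the broker."""

    def __init__(self, poll_interval=1.0):
        self._incoming = queue.Queue()
        self._stopped = threading.Event()
        # The driver, not the executor, decides how often to poll.
        self._poll_interval = poll_interval

    def deliver(self, message):
        self._incoming.put(message)

    def poll(self, timeout=None):
        """Return a message, or None on caller timeout or after stop()."""
        deadline = None if timeout is None else time.monotonic() + timeout
        while not self._stopped.is_set():
            wait = self._poll_interval
            if deadline is not None:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None
                # Never wait longer than the caller allows.
                wait = min(wait, remaining)
            try:
                return self._incoming.get(timeout=wait)
            except queue.Empty:
                continue
        return None

    def stop(self):
        # Lets a poll(timeout=None) call return on its next wake-up.
        self._stopped.set()
```

Without `stop()`, a `poll(timeout=None)` call in this sketch would loop forever when no message arrives, which is the case the commit message singles out.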

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/143805

Changed in oslo.messaging:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/icehouse)

Related fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/173717

description: updated
Mehdi Abaakouk (sileht)
tags: added: icehouse-backport-potential
Edward Hope-Morley (hopem) wrote :
James Page (james-page)
Changed in oslo.messaging (Ubuntu):
status: New → Fix Released
Changed in oslo.messaging (Ubuntu Trusty):
importance: Undecided → High
status: New → Triaged
James Page (james-page)
Changed in oslo.messaging (Ubuntu Trusty):
status: Triaged → In Progress
assignee: nobody → James Page (james-page)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/juno)

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/143805
Reason: https://review.openstack.org/#/c/161119/

and

https://review.openstack.org/#/c/161120/

have replaced this change.

Chris J Arges (arges) wrote : Please test proposed package

Hello Mehdi, or anyone else affected,

Accepted oslo.messaging into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/oslo.messaging/1.3.0-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in oslo.messaging (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Ante Karamatić (ivoks) wrote :

We used 1.3.0-0ubuntu1.1 and we confirm it solves the problem.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package oslo.messaging - 1.3.0-0ubuntu1.1

---------------
oslo.messaging (1.3.0-0ubuntu1.1) trusty; urgency=medium

  * Backport fixes for reliable AMQP reconnect support, ensuring
    nova-compute instances re-connect and message correctly when
    RabbitMQ message brokers disappear in clustered configurations:
    - d/p/0001-rabbit-more-precise-iterconsume-timeout.patch:
      Improve precision of iterconsume timeouts (LP: #1400268).
    - d/p/0002-rabbit-fix-timeout-timer-when-duration-is-None.patch:
      Fix timeout timer when duration is set to None (LP: #1408370).
    - d/p/0003-Declare-DirectPublisher-exchanges-with-passive-True.patch:
      Ensure that message publishers fail and retry if the consumer has
      not yet declared a receiving queue (LP: #1338732).
 -- Edward Hope-Morley <email address hidden> Thu, 23 Apr 2015 15:56:08 +0100

Changed in oslo.messaging (Ubuntu Trusty):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for oslo.messaging has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/icehouse)

Reviewed: https://review.openstack.org/173717
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=b58180210145e1c804ec496576d6bb2caabc68ef
Submitter: Jenkins
Branch: stable/icehouse

commit b58180210145e1c804ec496576d6bb2caabc68ef
Author: Mehdi Abaakouk <email address hidden>
Date: Mon Dec 8 10:56:52 2014 +0100

    rabbit: more precise iterconsume timeout

    The iterconsume method always set the kombu timeout to 1 second,
    even when the requested timeout was more precise or < 1 second.

    This change fixes that.

    Related bug: #1400268
    Related bug: #1399257
    Related-bug: #1338732

    (cherry picked from commit 023b7f44e2ccd77a7e9ee9ee78431a4646c88f13)

    Conflicts:
     oslo/messaging/_drivers/amqpdriver.py
     oslo/messaging/_drivers/impl_rabbit.py

    Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502

tags: added: in-stable-icehouse
Chris J Arges (arges) wrote : Please test proposed package

Hello Mehdi, or anyone else affected,

Accepted oslo.messaging into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/oslo.messaging/1.4.1-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in oslo.messaging (Ubuntu Utopic):
status: New → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
Jian Wen (wenjianhn)
tags: removed: icehouse-backport-potential