Timed out waiting for a reply via rabbit

Bug #1338732 reported by Noel Burton-Krahn
This bug affects 43 people
Affects                  Status        Importance  Assigned to        Milestone
Ubuntu Cloud Archive     Fix Released  Undecided   Unassigned
oslo.messaging           Fix Released  High        Mehdi Abaakouk
oslo.messaging (Ubuntu)  Fix Released  Undecided   Unassigned
  Trusty                 Fix Released  High        James Page
  Utopic                 Won't Fix     High        Jorge Niedbalski
  Vivid                  Fix Released  Undecided   Unassigned

Bug Description

Icehouse
oslo-messaging 1.3.0 (stable/icehouse branch from github, c21a2c2)
rabbitmq-server 3.1.3

Nova RPC calls often fail after rabbit restarts. I've tracked it down to oslo/rabbit/kombu timing out if it's forced to reconnect to rabbit. The code below times out waiting for a reply if the topic has been used in a previous run. The reply always arrives the first time a topic is used, or if the topic is None. But the second run with the same topic will hang with this error:

MessagingTimeout: Timed out waiting for a reply to message ID ...

This problem seems too basic not to have been caught earlier in oslo, but the program below really does reproduce the same symptoms we see in nova when run against a live rabbit server.

Here's a log from a test run: https://gist.github.com/noelbk/12adcfe188d9f54f971d

#! /usr/bin/python

from oslo.config import cfg
import threading
from oslo import messaging
import logging
import time
log = logging.getLogger(__name__)

class OsloTest():
    def test(self):
        # The code below times out waiting for a reply if the topic
        # has been used in a previous run. The reply always arrives
        # the first time a topic is used, or if the topic is none.
        # But, the second run with the same topic will hang with this
        # error:
        #
        # MessagingTimeout: Timed out waiting for a reply to message ID ...
        #
        topic = 'will_hang_on_second_usage'
        #topic = None # never hangs

        url = "%(proto)s://%(user)s:%(password)s@%(host)s/" % dict(
            proto = 'rabbit',
            host = '1.2.3.4',
            password = 'xxxxxxxx',
            user = 'rabbit-mq-user',
            )
        transport = messaging.get_transport(cfg.CONF, url)
        driver = transport._driver

        target = messaging.Target(topic=topic)
        listener = driver.listen(target)
        ctxt={"context": True}
        timeout = 10

        def send_main():
            log.debug("sending msg")
            reply = driver.send(target,
                                ctxt,
                                {'send': 1},
                                wait_for_reply=True,
                                timeout=timeout)

            # times out if topic was not None and used before
            log.debug("received reply=%r" % (reply,))

        send_thread = threading.Thread(target=send_main)
        send_thread.daemon = True
        send_thread.start()

        msg = listener.poll()
        log.debug("received msg=%r" % (msg,))

        msg.reply({'reply': 1})

        log.debug("sent reply")

        send_thread.join()

if __name__ == '__main__':
    FORMAT = '%(asctime)-15s %(process)5d %(thread)5d %(filename)s %(funcName)s %(message)s'
    logging.basicConfig(level=logging.DEBUG, format=FORMAT)
    OsloTest().test()

---- ---- ---- ---- ----

[Impact]

 * This patch along with those from LP #1400268 and LP #1408370 fixes rabbitmq reconnects

 * We are backporting this to Icehouse since oslo.messaging 1.3.0
   fails to reconnect to Rabbit properly; nova-compute is particularly
   affected.

 * This patch, along with its dependencies mentioned above, ensures that
   multiple reconnect attempts happen by having connections time out and
   retry.

[Test Case]

 * Start a service that uses oslo.messaging with rabbitmq, e.g. nova-compute

 * Stop rabbitmq while tailing /var/log/nova/nova-compute.log (tail -F)

 * Observe that nova-compute's AMQP connection times out and that it keeps
   trying to reconnect

 * Restart rabbitmq

 * Observe that the rabbitmq connection is re-established

[Regression Potential]

 * None expected. I have tested this in my local cloud environment and it
   appears to be reliable.


Revision history for this message
Noel Burton-Krahn (noelbk) wrote :

The code above was missing a call to acknowledge(), so it kept reading stale messages. However, with acknowledge() it still reproduces the original bug; a corrected poll loop is sketched below. Here's how I reproduce the original problem: I verify the cluster is working, then kill rabbit, then try again.

    1. start cluster, create vms, migrate: ok
    2. kill and restart rabbit
    3. migrate vm: timeout
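
For reference, here is what the corrected tail of the reproducer's test() method looks like with the missing acknowledge() call added. This is only a sketch against the same driver API used in the script above, not the oslo.messaging source:

def handle_one_request(listener, log):
    # Poll one message from the driver listener set up in the reproducer,
    # acknowledge it so it is not redelivered as a stale message on the
    # next run, then send the reply the caller is waiting for.
    msg = listener.poll()
    log.debug("received msg=%r" % (msg,))
    msg.acknowledge()   # the call that was missing in the original script
    msg.reply({'reply': 1})
    log.debug("sent reply")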

Here's a trace from https://gist.github.com/noelbk/619426fa88c3bdd0534c taken after rabbit was restarted. This pattern repeats a few times during a migration:

    19:24:56 10.35.0.13 sends _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:24:56 10.35.0.14 receives _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:24:56 10.35.0.14 acknowledges _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:25:56 10.35.0.13 Timed out waiting for a reply to message ID 29c6579c5de24c00b0b0e55579b8e047

I'm instrumenting the rpc calls now to see if they all actually do retry and complete after the timeout errors. I'm trying to get a trace of all the rpc calls to see if they're being acknowledged but not replied to in time.

While digging through oslo.messaging, I noticed that in amqpdriver.py, the incoming queues in ReplyWaiter and AMQPListener are plain arrays, not thread-safe Queues. ReplyWaiter does acquire a lock, but I'm not sure if the plain arrays are always thread-safe. Not sure if this is causing my issue.

Still digging...

Revision history for this message
Noel Burton-Krahn (noelbk) wrote :

Here's a trace of an RPC call that times out after a reconnect to rabbit. Somehow impl_rabbit._consume is failing to consume the RPC reply that was sent earlier. Full log is here: https://gist.github.com/noelbk/468f852c6f2b78883cc9

1. ALICE sends RPC call to BOB

  Jul 10 07:50:49 10.35.0.13 cobalt-compute: ALICE debug __wrapped [Kombu channel:1] _basic_publish(Message({'body': '{"oslo.message": "{\\"_context_roles\\": [], \\"_msg_id\\": \\"8504f806c278472cac67288977887170\\", \\"_context_quota_class\\": null, \\"_context_request_id\\": \\"req-27caa60d-f916-4962-820b-5e9741021f50\\", \\"_context_service_catalog\\": [], \\"args\\": {\\"instance_id\\": 3, \\"teardown\\": false, \\"host\\": \\"10.35.0.13\\"}, \\"_unique_id\\": \\"7604a4c7254a49eba26a78251ebe8fb5\\", \\"_context_user\\": null, \\"_context_user_id\\": null, \\"_context_project_name\\": null, \\"_context_read_deleted\\": \\"no\\", \\"_reply_q\\": \\"reply_a6882628d63d497394316484050dd50b\\", \\"_context_auth_token\\": null, \\"_context_tenant\\": null, \\"_context_instance_lock_checked\\": false, \\"_context_is_admin\\": true, \\"version\\": \\"1.0\\", \\"_context_project_id\\": null, \\"_context_timestamp\\": \\"2014-07-10T07:50:48.043133\\", \\"_context_user_name\\": null, \\"method\\": \\"setup_networks_on_host\\", \\"_context_remote_address\\": null}", "oslo.version": "2.0"}', 'properties': {'priority': 0, 'application_headers': {'ttl': 60000}, 'delivery_mode': 2, 'content_encoding': 'utf-8', 'content_type': 'application/json'}, 'channel': None}), mandatory=False, routing_key='network', immediate=False, exchange='nova')

2. ALICE loses connection to rabbit, delays reconnect for 1 sec

  Jul 10 07:50:49 10.35.0.13 cobalt-compute: ALICE impl_rabbit ensure NBK: ensure reconnecting...
  Jul 10 07:50:49 10.35.0.13 cobalt-compute: ALICE impl_rabbit _connect Reconnecting to AMQP server on 10.35.0.3:5672
  Jul 10 07:50:49 10.35.0.13 2014-07-10 07:50:49.344 9114 140035666425488 oslo.messaging._drivers.impl_rabbit impl_rabbit acknowledge NBK: acknowledge self._raw_message.body=u'{"oslo.message": "{\\"_context_roles\\": [], \\"_msg_id\\": \\"8504f806c278472cac67288977887170\\", \\"_context_quota_class\\": null, \\"_context_request_id\\": \\"req-27caa60d-f916-4962-820b-5e9741021f50\\", \\"_context_service_catalog\\": [], \\"args\\": {\\"instance_id\\": 3, \\"teardown\\": false, \\"host\\": \\"10.35.0.13\\"}, \\"_unique_id\\": \\"7604a4c7254a49eba26a78251ebe8fb5\\", \\"_context_user\\": null, \\"_context_user_id\\": null, \\"_context_project_name\\": null, \\"_context_read_deleted\\": \\"no\\", \\"_reply_q\\": \\"reply_a6882628d63d497394316484050dd50b\\", \\"_context_auth_token\\": null, \\"_context_tenant\\": null, \\"_context_instance_lock_checked\\": false, \\"_context_is_admin\\": true, \\"version\\": \\"1.0\\", \\"_context_project_id\\": null, \\"_context_timestamp\\": \\"2014-07-10T07:50:48.043133\\", \\"_context_user_name\\": null, \\"method\\": \\"setup_networks_on_host\\", \\"_context_remote_address\\": null}", "oslo.version": "2.0"}'
  Jul 10 07:50:49 10.35.0.13 cobalt-compute: ALICE impl_rabbit _connect Delaying reconnect for 1.0 seconds...

3. BOB received ALICE's RPC ...


Revision history for this message
Noel Burton-Krahn (noelbk) wrote :

The problem is that the rabbit queues declared for RPC are not durable, so if the receiver is reconnecting when the sender replies, the reply will be deleted along with the queue. Here's how it goes:

1. Sender sends RPC call
2. Receiver gets call, acknowledges
3. Rabbit dies
4. Sender gets connection dropped error, waits 1 sec to reconnect
5. Receiver connects (successfully), sends reply to reply_queue and disconnects
6. Rabbit deletes the reply queue when the receiver disconnects because it's not durable and there are no other connections
7. Sender reconnects to rabbit
8. Sender waits for reply (that was deleted earlier), times out, fails, does not retry
9. RPC call fails

I've been looking for a way of setting durability on RPC queues in impl_rabbit.py. It supports an "amqp_durable_queues" config option, but only for TopicConsumer objects. The RPC mechanism uses DirectConsumer objects, and there's no way for an RPC client to pass durable=True down to _declare_consumer, so the durable option cannot be set on a queue during an RPC call.
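
For illustration only (this is not the oslo.messaging code, and the broker URL and queue name are made up): in kombu terms, durability and auto-delete are chosen when the reply queue is declared, which is roughly the knob amqp_durable_queues turns for TopicConsumer but which DirectConsumer never exposes. As the next comment shows, flipping these flags alone turned out not to be enough.

import kombu

# Hypothetical reply queue; real ones show up as 'reply_<uuid>' in the logs.
conn = kombu.Connection('amqp://guest:guest@localhost//')  # assumed broker URL
channel = conn.channel()

exchange = kombu.Exchange('reply_abc123', type='direct',
                          durable=True, auto_delete=False)
queue = kombu.Queue('reply_abc123', exchange=exchange,
                    routing_key='reply_abc123',
                    durable=True, auto_delete=False)

# durable=True / auto_delete=False asks rabbit to keep the queue (and any
# replies sitting in it) when the consumer's connection drops.
queue(channel).declare()
conn.release()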

Revision history for this message
Noel Burton-Krahn (noelbk) wrote :

Unfortunately, rabbit still drops the reply even with auto_delete=False, durable=True

https://gist.github.com/noelbk/f8476952809dac8a295c

All the queues and exchanges are created with auto_delete=False, durable=True (search for exchange_declare and queue_declare above), but the client still fails to consume a message the server published before the client reconnected to rabbit.

I do see that the server calls exchange_declare before publishing, but not queue_declare. What does it mean to publish to an exchange when the client will consume from a queue?

no longer affects: oslo
Revision history for this message
Noel Burton-Krahn (noelbk) wrote :

Found the core problem: producers declare the exchange, but not the queue bound to the exchange. It looks like they're assuming the consumer will declare the queue before the producer replies, which is not the case during a reconnect. On reconnect, the consumer sleeps for a second before reconnecting; the producer then declares the exchange without a queue and the message gets published to nowhere. The fix is to declare the queue and exchange in the producer as well as the consumer.
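
A minimal kombu sketch of the direction described above (the names are made up, and this is not the actual patch): the producer declares and binds the reply queue itself before publishing, so the reply has somewhere to land even if the consumer is still sleeping before its reconnect. The fix that eventually merged took a different route, declaring the publisher's exchange with passive=True (see the commit further down).

import kombu

conn = kombu.Connection('amqp://guest:guest@localhost//')  # assumed broker URL
channel = conn.channel()

reply_exchange = kombu.Exchange('reply_abc123', type='direct',
                                durable=False, auto_delete=True)
reply_queue = kombu.Queue('reply_abc123', exchange=reply_exchange,
                          routing_key='reply_abc123',
                          durable=False, auto_delete=True)

producer = kombu.Producer(channel, exchange=reply_exchange,
                          routing_key='reply_abc123')
# declare=[reply_queue] makes kombu declare the exchange, the queue and the
# binding before publishing, instead of publishing into an unbound exchange.
producer.publish({'result': 'reply payload'}, declare=[reply_queue])
conn.release()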

Revision history for this message
Jens-Christian Fischer (jens-christian-fischer) wrote :

We are (most likely) running into this exact problem (our controller node with Rabbit is running on a VM that hiccups at regular intervals, and we see a lot of rabbit reconnects from all over the place). Our cluster is hardly usable because VMs don't get started or deleted consistently.

Can you point me to a patch or tell me which file(s) you are looking at? I would like to try to hot-fix this and see if that solves (some of) our problems.

Revision history for this message
Noel Burton-Krahn (noelbk) wrote : Re: [Bug 1338732] Re: Timed out waiting for a reply via rabbit

Sure, I'm just making a pull request for it now, but here's the patch:
https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e


Changed in oslo.messaging:
status: New → Confirmed
Changed in oslo.messaging:
assignee: nobody → Noel Burton-Krahn (noel-burton-krahn)
status: Confirmed → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A related bug which is part of this issue as well: https://bugs.launchpad.net/oslo.messaging/+bug/1349301

Revision history for this message
Stan Lagun (slagun) wrote :

Please see https://review.openstack.org/#/c/110058/. It looks like it solves this problem, but I'm not sure.

Revision history for this message
Noel Burton-Krahn (noelbk) wrote :

The review above (https://review.openstack.org/#/c/110058/) does not fix this particular problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/123115

Changed in oslo.messaging:
assignee: Noel Burton-Krahn (noel-burton-krahn) → Andrey Pavlov (apavlov-e)
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: In Progress → Incomplete
status: Incomplete → In Progress
Revision history for this message
Mehdi Abaakouk (sileht) wrote :

I can't reproduce this issue; the timeout code path of oslo.messaging has been fixed. Can you try with oslo.messaging 1.5 to see whether this is still valid?

Also, the test code in the bug description is wrong; you need to ack the message to make it work:

    msg = listener.poll()
    msg.acknowledge()

Changed in oslo.messaging:
status: In Progress → Incomplete
Changed in oslo.messaging:
assignee: Andrey Pavlov (apavlov-e) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by Andrey Pavlov (<email address hidden>) on branch: master
Review: https://review.openstack.org/123115

Revision history for this message
Zhen Qin (zqinit) wrote :

To Andrey Pavlov: could you confirm whether you still see this problem in oslo.messaging 1.5? Thanks.

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

Unfortunately I cannot. Our deployment is based on the Icehouse version and is not very big.
Also, I ran into this issue in a very strange case that was resolved by re-configuring other cloud software.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

I believe I have reproduced this in 1.5. It is very easy to reproduce in devstack. Steps:

1. launch an instance:
 nova boot --image cirros-0.3.2-x86_64-uec --flavor 1 foo
2. call nova console-log
 nova console-log foo
3. restart rabbit
 sudo service rabbitmq-server restart
4. wait a couple of minutes for compute to reconnect
5. call nova console-log
 nova console-log foo

It will fail the first time. Sometimes it fails the second time as well. Errors in the log:

2014-12-19 11:29:47.758 ERROR oslo.messaging._drivers.impl_rabbit [req-a69080c0-4d4c-4de4-9226-0cc5d89f98bc demo demo] Failed to publish message to topic 'compute.devstack': [Errno 104] Connection reset by peer
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 436, in _ensured
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit return fun(*args, **kwargs)
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 508, in __call__
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit return fun(*args, channel=channels[0], **kwargs), channels[0]
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 714, in _publish
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, channel, topic=topic, **kwargs)
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 381, in __init__
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit **options)
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 329, in __init__
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 337, in reconnect
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit routing_key=self.routing_key)
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/messaging.py", line 85, in __init__
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit self.revive(self._channel)
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/messaging.py", line 218, in revive
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit self.declare()
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/local/lib/python2.7/dist-packages/kombu/messaging.py", line 105, in declare
2014-12-19 11:29:47.758 TRACE oslo.messaging._drivers.impl_rabbit self.exchange.declare()
2014-12-19 11:29:47.7...

Changed in oslo.messaging:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/143805

Revision history for this message
QingchuanHao (haoqingchuan-28) wrote :

Sorry for disrupting the posting format. The not-reusing-timeout issue, fixed in the commit reviewed above, did indeed solve the hanging problem. But when heartbeats are enabled the problem may still occur: a heartbeat packet satisfies the timeout set by socket.settimeout, and the call then hangs in _next_method in amqp/method_frame.py. Having the publisher declare the queue can help in the case where a message is published before the queue has been declared, which is not guaranteed despite all the ensure() wrappers.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/144498

Changed in oslo.messaging:
assignee: nobody → Vish Ishaya (vishvananda)
status: Confirmed → In Progress
Changed in oslo.messaging:
assignee: Vish Ishaya (vishvananda) → Mehdi Abaakouk (sileht)
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
importance: Undecided → High
milestone: none → next-kilo
Changed in oslo.messaging:
milestone: 1.6.0 → next-kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/109373
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=434b5c8781b36cd65b7b642d723c0502e7093795
Submitter: Jenkins
Branch: master

commit 434b5c8781b36cd65b7b642d723c0502e7093795
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Jul 22 09:42:52 2014 -0700

    Declare DirectPublisher exchanges with passive=True

    If rabbit dies, the consumer can be disconnected before the publisher
    sends, and if the consumer hasn't declared the queue, the publisher
    will send a message to an exchange that's not bound to a queue, and
    the message will be lost. Setting passive=True will cause the
    publisher to fail and retry if the consumer hasn't declared the
    receiving queue yet.

    Co-Authored-By: Noel Burton-Krahn <email address hidden>
    Closes-Bug: #1338732
    Change-Id: I5ba4d311b97236d3a85a9f5badff61f12b08c12d
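
In kombu terms, what the patch relies on looks roughly like the sketch below (the broker URL and exchange name are made up; the real code lives in impl_rabbit's DirectPublisher): a passive declaration only asks the broker whether the exchange already exists, so the publisher gets an error it can retry on instead of silently publishing into an exchange with no bound queue.

import kombu

conn = kombu.Connection('amqp://guest:guest@localhost//')  # assumed broker URL
channel = conn.channel()

# passive=True means "check that this exchange exists" rather than "create it".
reply_exchange = kombu.Exchange('reply_abc123', type='direct',
                                durable=False, auto_delete=True, passive=True)
try:
    reply_exchange(channel).declare()
except Exception as exc:  # typically amqp's NotFound (404) channel error
    # The consumer has not (re)declared its reply exchange yet; fail the
    # publish and let the retry logic try again later.
    print("reply exchange not found, retrying later: %s" % exc)
finally:
    conn.release()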

Changed in oslo.messaging:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/144498
Reason: This bug has been closed by https://review.openstack.org/#/c/109373/

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/161119

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/juno)

Reviewed: https://review.openstack.org/161119
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=858353394dd15101ebcb1c4b39e672bc4aa2bd86
Submitter: Jenkins
Branch: stable/juno

commit 858353394dd15101ebcb1c4b39e672bc4aa2bd86
Author: Jie Li <email address hidden>
Date: Wed Mar 4 15:42:13 2015 +0800

    Fix _poll_connection not timeout issue (1/2)

    _poll_connection could fall into a loop waiting for a reply message if
    rabbit dies and comes back up. This commit sets up an rpc_response_timeout
    timer for one connection polling, so the rpc will finally jump out with a
    timeout exception, which is expected in such a scenario.

    Related bug: #1338732

    The title of the commit in master was:
      rabbit: more precise iterconsume timeout

    but this was changed since it didn't describe the actual change.
    This commit resolved some conflicts due to cherry-pick.

    Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
    (cherry picked from commit 023b7f44e2ccd77a7e9ee9ee78431a4646c88f13)
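
The idea, reduced to a sketch (this is not the backported code; poll_once stands in for the driver's per-connection poll): bound the whole reply-polling loop by rpc_response_timeout so a connection that died and came back cannot make the caller wait forever.

import time

class MessagingTimeout(Exception):
    """Stand-in for oslo.messaging's MessagingTimeout."""

def wait_for_reply(poll_once, rpc_response_timeout=60.0):
    # poll_once(timeout) is an assumed helper that polls the reply connection
    # once and returns a message or None.
    deadline = time.time() + rpc_response_timeout
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            raise MessagingTimeout('Timed out waiting for a reply')
        reply = poll_once(timeout=min(remaining, 1.0))
        if reply is not None:
            return reply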

tags: added: in-stable-juno
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/icehouse)

Related fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/173717

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/173724

James Page (james-page)
Changed in oslo.messaging (Ubuntu Vivid):
status: New → Fix Released
Changed in oslo.messaging (Ubuntu Utopic):
status: New → Fix Released
Changed in oslo.messaging (Ubuntu Trusty):
status: New → Confirmed
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/icehouse)

Change abandoned by Edward Hope-Morley (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/173724
Reason: I am going to resubmit this as part of https://review.openstack.org/#/c/173717/ including the other patch mentioned by Mehdi

description: updated
Changed in oslo.messaging (Ubuntu Trusty):
status: Confirmed → In Progress
Mehdi Abaakouk (sileht)
tags: added: icehouse-backport-potential
Revision history for this message
Edward Hope-Morley (hopem) wrote :
Revision history for this message
David Medberry (med) wrote :

At the top I see "Affects" for T, U, and V. Where do the "Affects" for UCA show up? I.e., where's the patch for Juno on Trusty going to appear? Is it being considered? (Filing this in one place, but of course this affects almost any T/U backport for I/J and now K OpenStack releases.) What's the right way to flag a bug for UCA in general? @dosaboy @jamespage

Revision history for this message
James Page (james-page) wrote :

med

UCA will get auto-fed the result of the SRU process, so it's implicit.

Raising a cloud-archive task so it's trackable.

James Page (james-page)
Changed in oslo.messaging (Ubuntu Trusty):
assignee: nobody → James Page (james-page)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/juno)

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/143805
Reason: https://review.openstack.org/#/c/161119/

and

https://review.openstack.org/#/c/161120/

have replaced this changes.

Revision history for this message
Chris J Arges (arges) wrote : Please test proposed package

Hello Noel, or anyone else affected,

Accepted oslo.messaging into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/oslo.messaging/1.3.0-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in oslo.messaging (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
Ante Karamatić (ivoks) wrote :

We used 1.3.0-0ubuntu1.1 and we confirm it solves the problem.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package oslo.messaging - 1.3.0-0ubuntu1.1

---------------
oslo.messaging (1.3.0-0ubuntu1.1) trusty; urgency=medium

  * Backport fixes for reliable AMQP reconnect support, ensuring
    nova-compute instances re-connect and message correctly when
    RabbitMQ message brokers disappear in clustered configurations:
    - d/p/0001-rabbit-more-precise-iterconsume-timeout.patch:
      Improve precision of iterconsume timeouts (LP: #1400268).
    - d/p/0002-rabbit-fix-timeout-timer-when-duration-is-None.patch:
      Fix timeout timer when duration is set to None (LP: #1408370).
    - d/p/0003-Declare-DirectPublisher-exchanges-with-passive-True.patch:
      Ensure that message publishers fail and retry if the consumer has
      not yet declared a receiving queue (LP: #1338732).
 -- Edward Hope-Morley <email address hidden> Thu, 23 Apr 2015 15:56:08 +0100

Changed in oslo.messaging (Ubuntu Trusty):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for oslo.messaging has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page)
Changed in oslo.messaging (Ubuntu Utopic):
status: Fix Released → New
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/icehouse)

Reviewed: https://review.openstack.org/173717
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=b58180210145e1c804ec496576d6bb2caabc68ef
Submitter: Jenkins
Branch: stable/icehouse

commit b58180210145e1c804ec496576d6bb2caabc68ef
Author: Mehdi Abaakouk <email address hidden>
Date: Mon Dec 8 10:56:52 2014 +0100

    rabbit: more precise iterconsume timeout

    The iterconsume code always set the kombu timeout to 1 second, even
    when the requested timeout was more precise or less than 1 second.

    This change fixes that.

    Related bug: #1400268
    Related bug: #1399257
    Related-bug: #1338732

    (cherry picked from commit 023b7f44e2ccd77a7e9ee9ee78431a4646c88f13)

    Conflicts:
     oslo/messaging/_drivers/amqpdriver.py
     oslo/messaging/_drivers/impl_rabbit.py

    Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
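
Sketched very roughly (this is not the actual patch), the change amounts to passing the caller's remaining timeout, even a sub-second one, down to kombu's drain_events instead of a hard-coded 1 second:

import socket

def drain_once(connection, remaining=None):
    # `connection` is assumed to be a kombu.Connection; before this fix the
    # equivalent loop always waited a fixed 1 second per iteration.
    timeout = 1.0 if remaining is None else min(remaining, 1.0)
    try:
        connection.drain_events(timeout=timeout)
    except socket.timeout:
        # Nothing arrived within the (possibly sub-second) window; the caller
        # decides whether its overall deadline has expired.
        pass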

tags: added: in-stable-icehouse
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/icehouse)

Reviewed: https://review.openstack.org/173724
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=33f3b18539da88ddf984ada4b63e2767d64438f9
Submitter: Jenkins
Branch: stable/icehouse

commit 33f3b18539da88ddf984ada4b63e2767d64438f9
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Jul 22 09:42:52 2014 -0700

    Declare DirectPublisher exchanges with passive=True

    If rabbit dies, the consumer can be disconnected before the publisher
    sends, and if the consumer hasn't declared the queue, the publisher
    will send a message to an exchange that's not bound to a queue, and
    the message will be lost. Setting passive=True will cause the
    publisher to fail and retry if the consumer hasn't declared the
    receiving queue yet.

    Co-Authored-By: Noel Burton-Krahn <email address hidden>
    Closes-Bug: #1338732
    (cherry picked from commit 434b5c8781b36cd65b7b642d723c0502e7093795)

    Conflicts:
     oslo_messaging/_drivers/common.py
     oslo_messaging/tests/test_utils.py

    Change-Id: I5ba4d311b97236d3a85a9f5badff61f12b08c12d

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in oslo.messaging (Ubuntu Utopic):
status: New → Confirmed
Revision history for this message
Viktor Serhieiev (vsergeyev) wrote :

What are the actual steps to reproduce this bug? I want to try this fix - https://review.openstack.org/#/c/110058/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/184473
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=cc4ca1f9efc7f4a935ad31e869e8aae89e98b8fe
Submitter: Jenkins
Branch: master

commit cc4ca1f9efc7f4a935ad31e869e8aae89e98b8fe
Author: Victor Sergeyev <email address hidden>
Date: Wed May 20 15:49:53 2015 +0300

    Add one more functional test for MessagingTimeout

    Add a check that MessagingTimeout is raised on long-running queries
    when the client sends other queries at the same time.

    Added a long_running_task() to TestServerEndpoint and allowed passing a
    message executor into the RpcServerFixture.

    Related bug: #1338732

    Co-Authored-By: Roman Podoliaka <email address hidden>
    Change-Id: Icafb6838e2d9fb76b6d1c202465c09c174a3bed9

Revision history for this message
Chris J Arges (arges) wrote : Please test proposed package

Hello Noel, or anyone else affected,

Accepted oslo.messaging into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/oslo.messaging/1.4.1-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in oslo.messaging (Ubuntu Utopic):
status: Confirmed → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
Revision history for this message
JohnsonYi (yichengli) wrote :

Hello Chris J Arges, I tested your package https://launchpad.net/ubuntu/+source/oslo.messaging/1.4.1-0ubuntu1.1 in my Fuel OpenStack environment, and it didn't work very well.

Please check the information below for details.

[root@mvp-builder ~]# fuel --f
api: '1.0'
astute_sha: 16b252d93be6aaa73030b8100cf8c5ca6a970a91
auth_required: true
build_id: 2014-12-26_14-25-46
build_number: '58'
feature_groups:
- mirantis
- experimental
fuellib_sha: fde8ba5e11a1acaf819d402c645c731af450aff0
fuelmain_sha: 81d38d6f2903b5a8b4bee79ca45a54b76c1361b8
nailgun_sha: 5f91157daa6798ff522ca9f6d34e7e135f150a90
ostf_sha: a9afb68710d809570460c29d6c3293219d3624d4
production: docker
release: '6.0'
release_versions:
  2014.2-6.0:
    VERSION:
      api: '1.0'
      astute_sha: 16b252d93be6aaa73030b8100cf8c5ca6a970a91
      build_id: 2014-12-26_14-25-46
      build_number: '58'
      feature_groups:
      - mirantis
      fuellib_sha: fde8ba5e11a1acaf819d402c645c731af450aff0
      fuelmain_sha: 81d38d6f2903b5a8b4bee79ca45a54b76c1361b8
      nailgun_sha: 5f91157daa6798ff522ca9f6d34e7e135f150a90
      ostf_sha: a9afb68710d809570460c29d6c3293219d3624d4
      production: docker
      release: '6.0'

root@node-3:~# dpkg -l | grep oslo
ii python-oslo.config 1:1.4.0-fuel6.0~mira16 Common code for Openstack Projects (configuration API)
ii python-oslo.db 1.0.1-fuel6.0~mira19 oslo db library
ii python-oslo.i18n 1.0.0-fuel6.0~mira16 Oslo Internationalization Utilities - Python 2.x
ii python-oslo.messaging 1.4.1-fuel6.0~mira18 oslo messaging library
ii python-oslo.rootwrap 1.3.0.0a1-ubuntu2 Fine filtering of shell commands as 'root'
ii python-oslo.serialization 1.0.0-fuel6.0~mira15 utilities for serialization , especially JSON - Python 2.x
ii python-oslo.utils 1.0.0-fuel6.0~mira16 set of utility functions for OpenStack - Python 2.x
ii python-oslo.vmware 0.6.0-fuel6.0~mira16 VMware library for OpenStack projects - Python 2.x

I replaced /usr/share/pyshared/oslo/messaging with the whole folder from oslo.messaging-1.4.1\oslo\messaging, then restarted all 3 controllers. HA didn't work very well: nova service-list failed with an HTTP 503 error, and the Neutron components didn't work very well either.

Please check the error messages below:
<166>Jul 16 14:22:12 node-3 neutron-server 2015-07-16 14:22:12.436 6138 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 127.0.0.1:5673
<166>Jul 16 14:22:12 node-3 neutron-server 2015-07-16 14:22:12.456 6138 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 127.0.0.1:5673
<163>Jul 16 14:22:12 node-3 neutron-server 2015-07-16 14:22:12.544 6061 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'reply_c2d631eb2efa4fc1b88d2d03f90d6bca': [Errno 32] Broken pipe
2015-07-16 14:22:12.544 6061 TRACE oslo.messaging....


Revision history for this message
Shintaro Mizuno (mizuno-shintaro) wrote :

Hi Chris J Arges,

https://launchpad.net/ubuntu/+source/oslo.messaging/1.4.1-0ubuntu1.1
doesn't seem to have https://review.openstack.org/109373 "Declare-DirectPublisher-exchanges-with-passive-True.patch" included in the patch list.
It is included in the 1.3.0-0ubuntu1.1 package.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :
Changed in oslo.messaging (Ubuntu Utopic):
assignee: nobody → Jorge Niedbalski (niedbalski)
status: Fix Committed → In Progress
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Attached is the missing patch against the utopic archive.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Also note that this should be merged into lp:~ubuntu-server-dev/oslo.messaging/juno (nonexistent) as well.

Thanks.

Revision history for this message
George (lmihaiescu) wrote :

Do you know when the missing patch will be merged into python-oslo.messaging 1.4.1-0ubuntu1 (Juno) and an updated version released?

Thank you.

Revision history for this message
Rolf Leggewie (r0lf) wrote :

utopic has seen the end of its life and is no longer receiving any updates. Marking the utopic task for this ticket as "Won't Fix".

Changed in oslo.messaging (Ubuntu Utopic):
status: In Progress → Won't Fix
James Page (james-page)
Changed in cloud-archive:
status: New → Fix Released