oslo.messaging does not redeclare exchange if it is missing

Bug #1609766 reported by Kirill Bespalov
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
oslo.messaging
Fix Released
Undecided
Kirill Bespalov

Bug Description

Steps to reproduce:
1. Create a load on OpenStack, trigger restart of one of RabbitMQ nodes (exact reason unknown).
2. Observe a lot of
operation basic.publish caused a channel exception not_found: "no exchange 'reply_d8786e66456a4660bebb362668a027e4' in vhost '/'"
   entries in RabbitMQ log for various reply queues.

Looking earlier in the RabbitMQ log one can found
2016-08-03T13:33:12.945437+00:00 notice: operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'reply_d8786e66456a4660bebb362668a027e4' in vhost '/' due to timeout"
2016-08-03T13:34:43.006600+00:00 notice: operation queue.bind caused a channel exception not_found: "no exchange 'reply_d8786e66456a4660bebb362668a027e4' in vhost '/'"

The following stacktrace corresponds to the later message in nova-compute.log: http://paste.openstack.org/show/548803/

It seems that during RabbitMQ failover we might end up with declared queue, which is not bound to an exchange (there an exception listed in paste above is thrown). Later oslo.messaging successfully starts consuming from that queue, since the queue exists, but it is useless because it is not bound to an exchange.

You may find the whole logs containing snippets from above attached - nova-compute.log and rabbitmq.log.3.gz

Versions:
oslo.messaging - stable/mitaka
RabbitMQ - 3.6.1
kombu 3.0.32
pyamqp 1.4.8

Revision history for this message
Kirill Bespalov (k-besplv) wrote :
Revision history for this message
Kirill Bespalov (k-besplv) wrote :
Changed in oslo.messaging:
assignee: nobody → Kirill Bespalov (k-besplv)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/351162

Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
shen.zhixing (fooy5460) wrote :

I downloaded kombu-3.0.29 and kombu-4.0.0, there are "self.exchange.declare(nowait)" and "queue_bind" in function "queue.declare()";

So I wonder why this happen, do you know why the exchange is not created ?

2016-08-03 13:34:39.163 1217066 ERROR oslo_messaging._drivers.amqpdriver File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py", line 296, in declare
2016-08-03 13:34:39.163 1217066 ERROR oslo_messaging._drivers.amqpdriver self.queue.declare()

kombu-3.0.29:
class Queue(MaybeChannelBound):
...
    def declare(self, nowait=False):
        """Declares the queue, the exchange and binds the queue to
        the exchange."""
        # - declare main binding.
        if self.exchange:
            self.exchange.declare(nowait)
        self.queue_declare(nowait, passive=False)

        if self.exchange and self.exchange.name:
            self.queue_bind(nowait)

        # - declare extra/multi-bindings.
        for B in self.bindings:
            B.declare(self.channel)
            B.bind(self, nowait=nowait)
        return self.name

kombu-4.0.0:
class Queue(MaybeChannelBound):
    def declare(self, nowait=False):
        """Declares the queue, the exchange and binds the queue to
        the exchange."""
        if not self.no_declare:
            # - declare main binding.
            self._create_exchange(nowait=nowait)
            self._create_queue(nowait=nowait)
            self._create_bindings(nowait=nowait)
        return self.name

Revision history for this message
shen.zhixing (fooy5460) wrote :

I have not find consumer is removed from self._consumers.items() , so I am afraid "consumer.is_declared" is not necessary while they work well before rabbitmq restarting.

if not consumer.is_declared(self):
   consumer.declare(self)

Revision history for this message
Kirill Bespalov (k-besplv) wrote :

Do not forget that exchange declared with flag 'auto-delete'.

The bug appears then one of nodes in cluster is failed and kombu driver in the ensure() method call on_error callback - on_reconnection(new_channel) (see impl_rabbit:750) where is we redeclare all consumers. But this step may failed, because exchange is auto-delete, take a look on consumer declaration process:

1) exchange.declare (already exists, just metadata shared between nodes in cluster)
2) queue.declare (not exist, because master node of the queue failed)
3) so, due to long time that the auto-delete exchange had no binded queues it's just deleted at this step.
4) queue.bind - failed because exchange does not exists.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Note that we observed the issue on RabbitMQ 3.6.1. It might be that it happens only on certain versions. I have added versions of related libs/RabbitMQ to the bug description.

But we should not assume that this is just a bug of RabbitMQ. Since we are working with auto-delete exchanges, IMO scenario described by Kirill is valid. Another scenario leading to the same issue could be binding failure due to a different internal RabbitMQ issue.

description: updated
Revision history for this message
shen.zhixing (fooy5460) wrote :

Thank you for your patience!

I know your modification will take effect.

and I think "if not consumer.is_declared(self)" is TRUE only after "self._set_current_channel(channel)" has been called;

so I think declare consumers after "self._set_current_channel(channel)" below is a good idea.

        try:
            autoretry_method = self.connection.autoretry(
                execute_method, channel=self.channel,
                max_retries=retry,
                errback=on_error,
                interval_start=self.interval_start or 1,
                interval_step=self.interval_stepping,
                interval_max=self.interval_max,
                on_revive=on_reconnection)
            ret, channel = autoretry_method()
            self._set_current_channel(channel)
            return ret

Revision history for this message
Kirill Bespalov (k-besplv) wrote :

> I think "if not consumer.is_declared(self)" is TRUE only after
> "self._set_current_channel(channel)" has been called;
> so I think declare consumers after "self._set_current_channel(channel)" below is a good idea.

We already do consumer re-declaration in cases of a connection failing in the error callback:

def on_reconnection(new_channel):
      self.set_transport_socket_timeout()
      self._set_current_channel(new_channel)

      for consumer in self._consumers:
         consumer.declare(self)

But the problem that this error callback (on_reconnection) may fail due timeouts,
another connection losing, etc. It also may failed in _set_current_channel(channel) if we move the declaration code there, no difference.

So, in order to 100% make sure that we declared all queues/bindings/exchanges before staring consuming I kept the channel in consumer.declare:

def declare(self, conn):
       ...
            self.queue.declare()
       ...

       self._declared_on = conn.channel

And then in connection.consume() I check this point via is_declared()

Revision history for this message
shen.zhixing (fooy5460) wrote :

If "connection" and "channel" are available, but "exchange declare" time out, your modification seems the only way to solve it. It a small probability.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/365959

Changed in oslo.messaging:
assignee: Kirill Bespalov (k-besplv) → Mehdi Abaakouk (sileht)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by Kirill Bespalov (<email address hidden>) on branch: master
Review: https://review.openstack.org/351162

Changed in oslo.messaging:
assignee: Mehdi Abaakouk (sileht) → Kirill Bespalov (k-besplv)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/365959
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=3f4ce9470b3f9e0502f61345a6943adad2dadac9
Submitter: Jenkins
Branch: master

commit 3f4ce9470b3f9e0502f61345a6943adad2dadac9
Author: kbespalov <email address hidden>
Date: Thu Aug 4 15:18:25 2016 +0300

    Fix consuming from unbound reply queue

    Consumer declaration consist of the next steps:
      1) declare an exchange
      2) declare a queue
      3) bind the queue to the exchange

    Due to reply exchanges are auto-delete, at the
    step 3 the exchange can be removed and consumer.declare()
    will raise `queue.bind 404 Exchange not found`.

    So, in this case the queue is exist and AMQPListener
    just call consumer.consume() on the queue and go to
    drain_events() despite on the fact that the queue is
    unbound.

    This change tries to redeclare queue/exchange proactively
    each times channel change and just before consuming messages.

    Co-Authored-By: Mehdi Abaakouk <email address hidden>
    Closes-Bug: #1609766
    Change-Id: Id8b48df3d26675d72955d417ce7622b1e8aa6195

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.11.0

This issue was fixed in the openstack/oslo.messaging 5.11.0 release.

Revision history for this message
David Moreau Simard (dmsimard) wrote :

This looks like a serious issue that was fixed and released in 5.11. Can we backport and release this to newton and mitaka ?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/389014

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/389190

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/newton)

Reviewed: https://review.openstack.org/389014
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=d8d564cd09ab5d4b07b6d9492b9014c024302433
Submitter: Jenkins
Branch: stable/newton

commit d8d564cd09ab5d4b07b6d9492b9014c024302433
Author: kbespalov <email address hidden>
Date: Thu Aug 4 15:18:25 2016 +0300

    Fix consuming from unbound reply queue

    Consumer declaration consist of the next steps:
      1) declare an exchange
      2) declare a queue
      3) bind the queue to the exchange

    Due to reply exchanges are auto-delete, at the
    step 3 the exchange can be removed and consumer.declare()
    will raise `queue.bind 404 Exchange not found`.

    So, in this case the queue is exist and AMQPListener
    just call consumer.consume() on the queue and go to
    drain_events() despite on the fact that the queue is
    unbound.

    This change tries to redeclare queue/exchange proactively
    each times channel change and just before consuming messages.

    Co-Authored-By: Mehdi Abaakouk <email address hidden>
    Closes-Bug: #1609766
    Change-Id: Id8b48df3d26675d72955d417ce7622b1e8aa6195
    (cherry picked from commit 3f4ce9470b3f9e0502f61345a6943adad2dadac9)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/mitaka)

Reviewed: https://review.openstack.org/389190
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=b90580653af33678d9931f33308edee31cf1fe54
Submitter: Jenkins
Branch: stable/mitaka

commit b90580653af33678d9931f33308edee31cf1fe54
Author: kbespalov <email address hidden>
Date: Thu Aug 4 15:18:25 2016 +0300

    Fix consuming from unbound reply queue

    Consumer declaration consist of the next steps:
      1) declare an exchange
      2) declare a queue
      3) bind the queue to the exchange

    Due to reply exchanges are auto-delete, at the
    step 3 the exchange can be removed and consumer.declare()
    will raise `queue.bind 404 Exchange not found`.

    So, in this case the queue is exist and AMQPListener
    just call consumer.consume() on the queue and go to
    drain_events() despite on the fact that the queue is
    unbound.

    This change tries to redeclare queue/exchange proactively
    each times channel change and just before consuming messages.

    Co-Authored-By: Mehdi Abaakouk <email address hidden>
    Closes-Bug: #1609766
    Change-Id: Id8b48df3d26675d72955d417ce7622b1e8aa6195
    (cherry picked from commit 3f4ce9470b3f9e0502f61345a6943adad2dadac9)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.10.1

This issue was fixed in the openstack/oslo.messaging 5.10.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.