AMQP Channel errors in orchestrator logs

Bug #1319451 reported by Aleksey Kasatkin
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Committed
High
Dima Shulyak
5.0.x
Fix Committed
High
Dima Shulyak

Bug Description

ISO #203, VBOX

Thousands of such errors are in orchestrator node logs after master node is deployed:

2014-05-14 14:47:38 CRIT
[18709] Channel error #<AMQ::Protocol::Channel::Close:0x000000018e3a90 @reply_code=406, @reply_text="PRECONDITION_FAILED - cannot redeclare exchange 'naily_service' in vhost '/' with different type, durable, internal or autodelete value", @class_id=40, @method_id=10>
2014-05-14 14:47:38 CRIT
[18709] Channel error #<AMQ::Protocol::Channel::Close:0x00000001915bf8 @reply_code=406, @reply_text="PRECONDITION_FAILED - cannot redeclare exchange 'nailgun' in vhost '/' with different type, durable, internal or autodelete value", @class_id=40, @method_id=10>
2014-05-14 14:47:38 CRIT
[18704] Channel error #<AMQ::Protocol::Channel::Close:0x00000001914a50 @reply_code=406, @reply_text="PRECONDITION_FAILED - cannot redeclare exchange 'naily_service' in vhost '/' with different type, durable, internal or autodelete value", @class_id=40, @method_id=10>
2014-05-14 14:47:38 CRIT
[18704] Channel error #<AMQ::Protocol::Channel::Close:0x00000001916e40 @reply_code=406, @reply_text="PRECONDITION_FAILED - cannot redeclare exchange 'nailgun' in vhost '/' with different type, durable, internal or autodelete value", @class_id=40, @method_id=10>

Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :
Revision history for this message
Aleksey Kasatkin (alekseyk-ru) wrote :

Deployment was started w/o problems though.

Changed in fuel:
importance: Undecided → Medium
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: nobody → Vladimir Sharshov (vsharshov)
Changed in fuel:
milestone: none → 5.0
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

This exchange required for cancel stop/provisioning functionality. Conflict with redeclaration is really problem.

Changed in fuel:
importance: Medium → High
status: New → Triaged
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Folks if you ever reproduce this, please mark that it affects you and ideally provide logs. This issue is hard to reproduce and looks like it may occur really rarely as per conversation with V.Sharshov.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Igor reproduce it today. I discover logs and current RabbitMQ state and found that problem really in 'cannot redeclare exchange', specifically:

- naily_service: "auto_delete" should be true, but false.
- nailgun: "durable" should true, but false.

In fuel-web and in Astute this declaration is correct. And on first time only 'naily_service' was fail, but on second - 2 exchanges.

Changed in fuel:
status: Triaged → Confirmed
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I suggest to create exchanges using curl then we deploy (Nailgun or Astute depends of which is first) as we do for MCollective.
Example: https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/mcollective/manifests/rabbitmq.pp#L104

We need to declarate it as early as possible

It can give us place where this exchanges declared with wrong params.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

It looks like we really have different parameters for exchange and queue creation in nailgun and naily code.

We need to recheck each queue and exchange.

our ruby amqp has durable set as default to 'false'
our python kombu library sets durable to 'true'

also we are creating naily_service queue in naily_service with ( durable= true, exclusive = false , auto_delete = false ) and in astute it is set to false/true/true

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

> we are creating naily_service queue in naily_service with ( durable= true, exclusive = false , auto_delete = false ) and in astute it is set to false/true/true

Yes, but it does not affect. We really do not use naily_service queue instead of naily_service exchange. It was created by Nailgun team when we test stop/reset functionality for tests and stay there. If we delete it nothing should be fail in production functionality.

Let me explain:

    @queue = @channel.queue(Astute.config.broker_queue, :durable => true).bind(@exchange)
    @service_queue = @service_channel.queue("", :exclusive => true, :auto_delete => true).bind(@service_exchange)

As you can see, i do not give any name for @service_queue instead of @queue (Astute.config.broker_queue vs ""). It will create temporary queue for every worker with random name and connect it to exchange. This is need for stop/reset, because message about canceling task should get every worker. It is Publish/Subscribe model (http://www.rabbitmq.com/tutorials/tutorial-three-ruby.html).

For @queue we use Work queues model: http://www.rabbitmq.com/tutorials/tutorial-two-ruby.html where only one of worker should get message from queue.

> our ruby amqp has durable set as default to 'false', our python kombu library sets durable to 'true'

This is good point, thanks.

The main interesting details in RabbitMQ: we always should declarate exchanges and queues. And if declaration is different, we always will get error about it. But in most case logs no errors about it.

naily_service exchange:

Astute:

          @service_exchange = @service_channel.fanout(Astute.config.broker_service_queue, :auto_delete => true)

Nailgun:

    naily_service_exchange = Exchange(
        'naily_service',
        'fanout',
        durable=False,
        auto_delete=True
    )

In last env for this exchange problem was for "auto_delete" params which declarate visibly, equally and should be 'true', but was 'false'.

naily exchange:

Astute:

    @exchange = @channel.topic(Astute.config.broker_exchange, :durable => true)

Nailgun:

   nailgun_exchange = Exchange(
       'nailgun',
       'topic',
       durable=True
   )

In last env for this exchange problem was for "durable" params which declarate visibly, equally and should be 'true', but was 'false'.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Decisions:

- add pre creation of exchanges (me);
- on debug logs for RabbitMQ: who, when and from create exchanges and queues (me);
- off masqueraded for RabbitMQ clients: Nailgun, Astute, Mcollective. (Matt)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/94635

Revision history for this message
Mike Scherbakov (mihgen) wrote :

As it is very hard to reproduce, and was reproduced only twice. So I have to lower the priority for 5.0, thus moving it to 5.1. Matt's patch may add more clarity - who actually created exchanges.

Changed in fuel:
milestone: 5.0 → 5.1
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Looks like many folks hit this bug. We need to fix this ASAP.

Changed in fuel:
importance: High → Critical
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I did not find way how to show in RabbitMQ logs when and from where exchanges and queues declared (and i finally will do it). Only way to debug it to find potencial problem ways is add pre creation of exchanges. But this is still not a fix and this one still very hard to reproduce.

We need more information how to reproduce it or at least increase reproduction rate. And logs from folks also will be very useful to understand what (nailgun or astute) fail in step of exchange declaration.

Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Attached logs with this error.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Test various scenarios using release 5.0 ISO (fresh installation):

1) run nailgun and astute containers in different order;
2) with/without erasing queues and exchanges by hands;
3) using several containers with nailgun and astute at same time;
4) with/without restarting RabbitMQ.

Result: all work stable and no errors about exchanges/queues.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Vladimir, take a look at fuel-5.0-52-2014-06-12_03-00-36.iso
we can't create diagnostic error:
InvalidURL: nonnumeric port: ' (406) PRECONDITION_FAILED - parameters for queue 'naily' in vhost ''

in astute.log:
[430] Exception during worker initialization: #<AMQP::TCPConnectionFailed: Could not estabilish TCP connection to 10.108.20.2:5672>, trace: ["/usr/lib64/ruby/gems/2.1.0/gems/amq-client-0.9.12/lib/amq/client/async/adapters/event_machine.rb:162:in `block in initialize'", "/usr/lib64/ruby/gems/2.1.0/gems/amq-client-0.9.12/lib/amq/client/async/adapter.rb:306:in `call'", "/usr/lib64/ruby/gems/2.1.0/gems/amq-client-0.9.12/lib/amq/client/async/adapter.rb:306:in `tcp_connection_failed'", "/usr/lib64/ruby/gems/2.1.0/gems/amq-client-0.9.12/lib/amq/client/async/adapters/event_machine.rb:290:in `unbind'", "/usr/lib64/ruby/gems/2.1.0/gems/eventmachine-1.0.3/lib/eventmachine.rb:1438:in `event_callback'", "/usr/lib64/ruby/gems/2.1.0/gems/eventmachine-1.0.3/lib/eventmachine.rb:187:in `run_machine'", "/usr/lib64/ruby/gems/2.1.0/gems/eventmachine-1.0.3/lib/eventmachine.rb:187:in `run'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/server/worker.rb:38:in `run'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:343:in `worker_loop!'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:266:in `block (2 levels) in spawn_workers'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:266:in `fork'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:266:in `block in spawn_workers'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:261:in `each'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:261:in `spawn_workers'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:300:in `maintain_worker_count'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:120:in `master_loop!'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:53:in `start'", "/usr/lib64/ruby/gems/2.1.0/gems/raemon-0.3.0/lib/raemon/master.rb:22:in `start'", "/usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/bin/astuted:84:in `<top (required)>'", "-e:1:in `load'", "-e:1:in `<main>'"]

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/103917

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/103917

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Looks like issue isn't reproducible.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Env with issue are provided for Vladimir research.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/104947

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/104947
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=46b5c2957fcbbbd3266708d3304dbada59ad132b
Submitter: Jenkins
Branch: master

commit 46b5c2957fcbbbd3266708d3304dbada59ad132b
Author: Vladimir Sharshov <email address hidden>
Date: Fri Jul 4 22:26:31 2014 +0400

    Auto recovering mechanism for RabbitMQ exchanges and queues

    If for some reason one or several exchanges have different
    type, durable, internal or autodelete values, we try
    to remove it and create again with correct params.

    Change-Id: I130aa29973692b6687f0b1623d438b29224b3553
    Closes-Bug: #1319451

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/105249

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/5.0)

Reviewed: https://review.openstack.org/105249
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=ced046b6ed48d9fc7a7d58e1da5232e6440ad988
Submitter: Jenkins
Branch: stable/5.0

commit ced046b6ed48d9fc7a7d58e1da5232e6440ad988
Author: Vladimir Sharshov <email address hidden>
Date: Fri Jul 4 22:26:31 2014 +0400

    Auto recovering mechanism for RabbitMQ exchanges and queues

    If for some reason one or several exchanges have different
    type, durable, internal or autodelete values, we try
    to remove it and create again with correct params.

    Change-Id: I130aa29973692b6687f0b1623d438b29224b3553
    Closes-Bug: #1319451

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Dmitry Borodaenko faced this issue in nailgun:

I've reproduced it with 5.0.1 ISO #110.

2014-07-09 01:54:10.220 DEBUG [7f042556d700] (logger) Request PUT /api/clusters/1/changes from 172.17.42.1:59202 {}
2014-07-09 01:54:12.001 ERROR [7f042556d700] (logger) Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/web/application.py", line 239, in process
    return self.handle()
  File "/usr/lib/python2.6/site-packages/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/usr/lib/python2.6/site-packages/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/usr/lib/python2.6/site-packages/web/application.py", line 396, in handle_class
    return tocall(*args)
  File "<string>", line 2, in PUT
  File "/usr/lib/python2.6/site-packages/nailgun/api/handlers/base.py", line 95, in content_json
    data = func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/nailgun/api/handlers/base.py", line 367, in PUT
    task = task_manager.execute()
  File "/usr/lib/python2.6/site-packages/nailgun/task/manager.py", line 230, in execute
    rpc.cast('naily', task_messages)
  File "/usr/lib/python2.6/site-packages/nailgun/rpc/__init__.py", line 86, in cast
    declare=[use_queue])
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 168, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 180, in _publish
    [maybe_declare(entity) for entity in declare]
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 111, in maybe_declare
    return maybe_declare(entity, self.channel, retry, **retry_policy)
  File "/usr/lib/python2.6/site-packages/kombu/common.py", line 99, in maybe_declare
    return _maybe_declare(entity)
  File "/usr/lib/python2.6/site-packages/kombu/common.py", line 110, in _maybe_declare
    entity.declare()
  File "/usr/lib/python2.6/site-packages/kombu/entity.py", line 505, in declare
    self.queue_declare(nowait, passive=False)
  File "/usr/lib/python2.6/site-packages/kombu/entity.py", line 531, in queue_declare
    nowait=nowait)
  File "/usr/lib/python2.6/site-packages/amqp/channel.py", line 1258, in queue_declare
    (50, 11), # Channel.queue_declare_ok
  File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 69, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 87, in dispatch_method
    return amqp_method(self, args)
  File "/usr/lib/python2.6/site-packages/amqp/channel.py", line 241, in _close
    reply_code, reply_text, (class_id, method_id), ChannelError,
PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - parameters for queue 'naily' in vhost '/' not equivalent

2014-07-09 01:54:12.005 ERROR [7f042556d700] (logger) Response code '500 Internal Server Error' for PUT /api/clusters/1/changes from 172.17.42.1:59202

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Importan detail:

2014-07-09T00:57:39 info: [421] Worker initialization
2014-07-09T00:57:40 warning: [405] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:40 warning: [413] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:40 warning: [401] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:40 warning: [409] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:40 warning: [397] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:40 warning: [421] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:40 warning: [417] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:45 warning: [405] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:45 warning: [413] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:45 warning: [401] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:45 warning: [409] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:45 warning: [397] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:45 warning: [421] Trying to reconnect to message broker. Retry 5 sec later...
2014-07-09T00:57:45 warning: [417] Trying to reconnect to message broker. Retry 5 sec later...

No Astute recovery mechanism was called. This mean that error at now moment fully affected Nailgun part.

My suggestion: implement same recovery mechanism as in Astute. If we get such error, erase exchanges and queues and try to declarate it again.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

But this way we can lose any unprocessed messages, and potentially task in nailgun will be forever in running state.

I saw someone proposed solution to declare necessery exchanges and queues in puppet and do not redeclare it in astute/nailgun.
Ofcourse we will be bound to use only exchanges with auto_delete=False but that is not big deal

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Maybe remove declaration from astute and use declaration only in nailgun? Is there any issues with such approach?

Also on auto_delete usage - /usr/lib/python2.6/site-packages/amqp/channel.py:616: VDeprecationWarning: The auto_delete flag for exchanges has been deprecated and will be removed

Why do we need it at all ?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/105996

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Matthew Mosesohn (<email address hidden>) on branch: master
Review: https://review.openstack.org/94635
Reason: not needed

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/105996
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=597e33b2877c48f68b2b84c82ac7a0dfccde2c26
Submitter: Jenkins
Branch: master

commit 597e33b2877c48f68b2b84c82ac7a0dfccde2c26
Author: Dima Shulyak <email address hidden>
Date: Thu Jul 10 11:36:37 2014 +0300

    Redeclare existing exchanges in case 406 rabbitmq err

    - Modified rpc cast to ensure publish after exchange/queue is redeclared
    - Modified consumer to ensure recovery after exchange/queue is redeclared

    Change-Id: I9b368ca447ecabefea0d646d689e093238b77c41
    Closes-Bug: #1319451

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/106180

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/5.0)

Reviewed: https://review.openstack.org/106180
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=3089d4d67f9793522e47edbfe733c92103b7ff20
Submitter: Jenkins
Branch: stable/5.0

commit 3089d4d67f9793522e47edbfe733c92103b7ff20
Author: Dima Shulyak <email address hidden>
Date: Thu Jul 10 11:36:37 2014 +0300

    Redeclare existing exchanges in case 406 rabbitmq err

    - Modified rpc cast to ensure publish after exchange/queue is redeclared
    - Modified consumer to ensure recovery after exchange/queue is redeclared

    Change-Id: I9b368ca447ecabefea0d646d689e093238b77c41
    Closes-Bug: #1319451

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.