Cluster setup fails on precise with cloud archives

Bug #1251345 reported by Chris Glass
Affects: MAAS
Status: Invalid
Importance: Critical
Assigned to: Unassigned

Bug Description

Trying to set up a "cluster" MaaS on Precise (using the cloud archives) fails.

Setup:
- 1 VM as a region controller (RC) + cluster controller (CC) on network A
- 1 VM as a CC only on network B
- 1 VM as a Node on network A

Steps to reproduce:

1. Install a Precise MaaS CC+RC on the first VM using the CD installer option ("Create a new maas server on this machine") with ubuntu-12.04.3-server-amd64.iso
2. On first boot, add the cloud archives: sudo apt-get install python-software-properties; sudo add-apt-repository cloud-archive:tools; sudo apt-get update && sudo apt-get dist-upgrade
3. Set up the rest of your MaaS installation (start squid-deb-proxy, create a superuser, configure your cluster from the UI).
4. PXE boot a VM on the same network.
5. Assert that the PXE process works and that the VM is enlisted, commissioned, and shows as "ready" in the UI.
6. On another VM, install a normal Ubuntu server installation from the ubuntu-12.04.3-server-amd64.iso media.
7. On first boot, add the cloud archives as in step 2.
8. Install the maas cluster controller: sudo apt-get install maas-cluster-controller
9. Edit the /etc/maas/*cluster* files and make them reflect the current setup (http://pastebin.ubuntu.com/6420582/ - the URL should point to your MaaS on network A, and the UUID should be the same as that machine too; see the sketch after this list).
10. As suggested on IRC, restart the cluster and pserv services: sudo service maas-cluster-celery restart; sudo service maas-pserv restart
11. Notice that the celery service is not running: sudo service maas-cluster-celery status
12. The new cluster controller does not appear in the RC's UI at all.
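
For reference, a minimal sketch of what the edit in step 9 might look like, assuming the /etc/maas/maas_cluster.conf file and key names from the MAAS 1.4 packaging (the values here are placeholders, not the pastebin's contents):

# /etc/maas/maas_cluster.conf (sketch; placeholder values)
MAAS_URL="http://<maas-on-network-A>/MAAS"   # URL of the RC's MAAS
CLUSTER_UUID="<uuid>"                        # the UUID discussed in step 9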

The /var/log/maas/celery.log file is at http://pastebin.ubuntu.com/6420585/

Both machines have the same version of python-celery installed (http://pastebin.ubuntu.com/6420587/).

Tags: landscape
Julian Edwards (julian-edwards) wrote: Re: [Bug 1251345] [NEW] Cluster setup fails on precise with cloud archives

On Thursday 14 Nov 2013 17:22:43 you wrote:
> 9 Edit the /etc/maas/*cluster* files, and make them reflect the current
> setup (https://pastebin.canonical.com/100455/ - the URL should point to
> your MaaS on network A, and the UUID should be the same as that machine
> too)

Where did you read that the UUID should be the same? Each cluster's UUID
should be, well, unique. The packaging generates one for you so there's no
need to edit it.

The error you're getting is a problem of interaction between celery and
rabbit, and I've never seen this before.

Is the version of rabbitmq-server the same on both machines?

Julian Edwards (julian-edwards) wrote:

For the benefit of those who cannot see the pastebin, the error from celery is:

[2013-11-14 16:55:04,805: ERROR/MainProcess] Unrecoverable error: AMQPChannelException(406, u"PRECONDITION_FAILED - cannot redeclare exchange 'celeryd.pidbox' in vhost '/maas_workers' with different type, durable, internal or autodelete value", (40, 10), 'Channel.exchange_declare')

Changed in maas:
status: New → Incomplete
Chris Glass (tribaal) wrote:

Ok, after further investigation, it seems like this was a user error (me).

This error happens when you put the same UUID in both CC and RC. I suppose I was confused by the UUID thing; I assumed it would work like a Ceph ring identifier.

Marking this as won't fix.

Changed in maas:
status: Incomplete → Invalid
Chris Glass (tribaal) wrote:

I still encounter the error after resetting each cluster to have its own UUID.

The celery daemon (run with sudo service maas-cluster-celery start) does not stay up (failing with the same error as pasted above). The UI on the RC did register the presence of a new cluster, but after acceptance it stays "stuck" with the warning that the new cluster has no boot images.

I will tear down my entire environment and start from scratch, maybe some side effect created this situation.

Changed in maas:
status: Invalid → New
Chris Glass (tribaal) wrote:

(changed the original pastebins to ubuntu.com instead of canonical.com - sorry, force of habit)

description: updated
Gavin Panella (allenap) wrote:

Chris, I think that's worth a bug report. Something is confusing there, and we might be able to address it via documentation or detecting and alerting about such a situation. Do you mind filing that?

Gavin Panella (allenap) wrote:

Can you check the versions of python-celery and python-django-celery on the RC and CC? The following article suggests that different versions of Celery can't coexist peacefully :-/ https://github.com/celery/celery/issues/984
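
For example, a plain dpkg query shows both versions at once:
dpkg-query -W -f='${Package} ${Version}\n' python-celery python-django-celery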

Chris Glass (tribaal) wrote:

I tried to reproduce this error several times with a new environment, and can't seem to trigger it anymore.

For the record, if both UUIDs for the RC and the CC are the same, then "nothing happens" and the UI shows no CC waiting to be enrolled. The Celery log on the CC does not error.

Ok, I'm going to mark this as invalid.

Changed in maas:
status: New → Invalid
Julian Edwards (julian-edwards) wrote:

Ok, thanks for investigating, Chris. Please re-open if this happens again.

Darryl Weaver (dweaver) wrote:

I've just hit this bug.
I have a setup that can be reproduced with the original poster's steps, except step 9.
I did not need to manually configure maas config files and the UUIDs are unique.
Both region and cluster controller were working to provision machines when first installed.
After running for a week or so, and upgrading released packages, I now see an error with the cluster controller.

I am now seeing the error:
AMQPChannelException: (406, u"PRECONDITION_FAILED - cannot redeclare exchange 'celeryd.pidbox' in vhost '/maas_workers' with different type, durable, internal or autodelete value", (40, 10), 'Channel.exchange_declare')

on the cluster controller node and a corresponding error on the maas region controller.
If I try manually restarting the celery service on the cluster controller with:
service maas-cluster-celery start
it just triggers the error again.

I am using precise with the cloud-tools archive enabled on both machines.
i.e.
maas version: 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0
python-celery 2.4.6-1ubuntu0.1

rabbitmq-server is only installed on the region controller:
rabbitmq-server 2.7.1-0ubuntu4

python-django-celery is not installed on either machine.

Changed in maas:
status: Invalid → Confirmed
Darryl Weaver (dweaver) wrote:

I should also mention there is a firewall between networks A and B that was changed recently and could be contributing to the issue; however, I can't yet see how, and the error is recorded on both machines, so communication between them seems to be OK.

Darryl Weaver (dweaver) wrote:

It would seem that the celery cluster controller process on the region controller and the one on the cluster controller cannot operate at the same time. After a restart, whichever connects to rabbitmq first creates the exchange; once the exchange is created by either the cluster or the region controller, the other cannot create it and fails.
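
The broker-side rule can be shown in isolation. A minimal sketch, assuming kombu and a local RabbitMQ with default guest credentials (the exchange name and flags mirror the rabbitmqctl listings below, where auto_delete is the property that differs between the two controllers):

from kombu import Connection, Exchange

with Connection('amqp://guest:guest@localhost//') as conn:
    channel = conn.channel()
    # The first declaration fixes the exchange's properties.
    Exchange('celeryd.pidbox', type='fanout',
             durable=False, auto_delete=False)(channel).declare()
    # Redeclaring with any property changed (here auto_delete) is
    # rejected by the broker with 406 PRECONDITION_FAILED.
    Exchange('celeryd.pidbox', type='fanout',
             durable=False, auto_delete=True)(channel).declare()

Whichever controller declares first wins; the other gets the traceback below.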

You can re-create this by shutting down the maas-cluster-celery process on both region and cluster controller:
service maas-cluster-celery stop

Restarting rabbitmq on the region controller:
/etc/init.d/rabbitmq-server restart

Now start the cluster celery process on one of the controllers:
service maas-cluster-celery start
The first to connect will create the exchange correctly.

Then try starting the process on the other controller:
service maas-cluster-celery start

tail -50 /var/log/maas/celery.log:
[2014-04-15 13:05:47,856: ERROR/MainProcess] Unrecoverable error: AMQPChannelException(406, u"PRECONDITION_FAILED - cannot redeclare exchange 'celeryd.pidbox' in vhost '/maas_workers' with different type, durable, internal or autodelete value", (40, 10), 'Channel.exchange_declare')
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/worker/__init__.py", line 268, in start
    component.start()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 309, in start
    self.reset_connection()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 592, in reset_connection
    self.reset_pidbox_node()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 531, in reset_pidbox_node
    callback=self.on_control)
  File "/usr/lib/python2.7/dist-packages/kombu/pidbox.py", line 62, in listen
    callbacks=[callback or self.handle_message])
  File "/usr/lib/python2.7/dist-packages/kombu/pidbox.py", line 53, in Consumer
    **options)
  File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 257, in __init__
    self.declare()
  File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 267, in declare
    queue.declare()
  File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 380, in declare
    self.exchange.declare(nowait)
  File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 154, in declare
    nowait=nowait)
  File "/usr/lib/python2.7/dist-packages/kombu/syn.py", line 23, in blocking
    return __sync_current(fun, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/kombu/syn.py", line 39, in __blocking__
    return fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 843, in exchange_declare
    (40, 11), # Channel.exchange_declare_ok
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 115, in dispatch_method
    return amqp_method(self, args)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 273, in _close
    (class_id, method_id))
AMQPChannelException: (406, u"PRECONDITION_FAILED - cannot redeclare ...

Gavin Panella (allenap)
Changed in maas:
status: Confirmed → Triaged
importance: Undecided → Critical
Darryl Weaver (dweaver) wrote:

Correction: you also need to stop the maas-region-celery service before restarting rabbitmq-server.

Then there does seem to be a mismatch between the region controller parameters and the cluster controller parameters for the exchange.

On RC:
service maas-cluster-celery stop
service maas-region-celery stop

On CC:
service maas-cluster-celery stop

On RC:
/etc/init.d/rabbitmq-server restart
rabbitmqctl list_exchanges -p /maas_workers name type durable auto_delete internal arguments

Output:
Listing exchanges ...
master direct true false false []
amq.match headers true false false []
amq.headers headers true false false []
af008b67-37c0-40ee-a758-61e46cec716d direct true false false []
amq.rabbitmq.trace topic true false false []
131a5947-ba0e-4e91-b80d-777e5647c909 direct true false false []
amq.topic topic true false false []
amq.direct direct true false false []
amq.fanout fanout true false false []
celery direct true false false []
 direct true false false []
...done.

Now on the RC:
service maas-region-celery start
rabbitmqctl list_exchanges -p /maas_workers name type durable auto_delete internal arguments

Output:
Listing exchanges ...
master direct true false false []
celeryd.pidbox fanout false false false []
amq.match headers true false false []
amq.headers headers true false false []
af008b67-37c0-40ee-a758-61e46cec716d direct true false false []
amq.rabbitmq.trace topic true false false []
131a5947-ba0e-4e91-b80d-777e5647c909 direct true false false []
amq.topic topic true false false []
amq.direct direct true false false []
amq.fanout fanout true false false []
celery direct true false false []
 direct true false false []
...done.

Note the parameters of the newly created celeryd.pidbox exchange (the columns are name, type, durable, auto_delete, internal, arguments):
celeryd.pidbox fanout false false false []
i.e. durable=false and auto_delete=false. Compare with the cluster controller's declaration further below, where auto_delete is true.

Now shut down the celery process and restart rabbitmq:
On RC:
service maas-cluster-celery stop
service maas-region-celery stop
/etc/init.d/rabbitmq-server restart
rabbitmqctl list_exchanges -p /maas_workers name type durable auto_delete internal arguments

Output:
Listing exchanges ...
master direct true false false []
amq.match headers true false false []
amq.headers headers true false false []
af008b67-37c0-40ee-a758-61e46cec716d direct true false false []
amq.rabbitmq.trace topic true false false []
131a5947-ba0e-4e91-b80d-777e5647c909 direct true false false []
amq.topic topic true false false []
amq.direct direct true false false []
amq.fanout fanout true false false []
celery direct true false false []
 direct true false false []
...done.

Note: The exchange has been deleted by the rabbitmq restart.

Now start the CC celery instead.
On CC:
service maas-cluster-celery start

on RC:
rabbitmqctl list_exchanges -p /maas_workers name type durable auto_delete internal arguments

Output:
Listing exchanges ...
master direct true false false []
celeryd.pidbox fanout false true false []
amq.match headers true false false []
amq.headers headers true false false []
af008b67-37c0-40ee-a758-61e46cec716d direct true false false []
amq.rabbitmq.trace topic true false false []
131a5947-ba0e-4e91-b80d-777e5647c909 direct true false false []
amq.topic topic true false false []
amq.direct direct true false false []
am...

Darryl Weaver (dweaver) wrote:

Tried a brand new cluster controller using maas version 1.2+bzr1373+dfsg-0ubuntu1~12.04.5 from precise.
Same problem with all cluster controllers, so the problem seems to be on the region controller side.

Traced the origin of the issue to an apt-get dist-upgrade which upgraded the packages as follows:
Start-Date: 2014-03-29 00:43:39
Commandline: apt-get dist-upgrade
Install: python-jsonschema:amd64 (1.3.0-0ubuntu1~cloud0, automatic), librbd1:amd64 (0.67.4-0ubuntu2.2~cloud0, automatic), libnss3:amd64 (3.15.4-0ubuntu0.12.04.1, automatic), python-d2to1:amd64 (0.2.10-1ubuntu4~cloud0, automatic), libaugeas0:amd64 (0.10.0-0ubuntu4, automatic), python-mock:amd64 (1.0.1-2~cloud0, automatic), augeas-lenses:amd64 (0.10.0-0ubuntu4, automatic), python-oslo.config:amd64 (1.2.1-0ubuntu1~cloud0, automatic), librados2:amd64 (0.67.4-0ubuntu2.2~cloud0, automatic), libnetcf1:amd64 (0.1.9-2ubuntu3.2, automatic), python-bson-ext:amd64 (2.6-1~cloud1, automatic), python-swiftclient:amd64 (1.6.0-0ubuntu1~cloud0, automatic), python-amqp:amd64 (1.0.12-0ubuntu1~cloud0, automatic), libnl-route-3-200:amd64 (3.2.3-2ubuntu2, automatic), python-cinderclient:amd64 (1.0.6-0ubuntu1~cloud0, automatic), python-passlib:amd64 (1.5.3-0ubuntu1, automatic), libboost-thread1.46.1:amd64 (1.46.1-7ubuntu3, automatic), libnspr4:amd64 (4.9.5-0ubuntu0.12.04.2, automatic), python-urllib3:amd64 (1.6-2~ubuntu12.04.1~ppa1, automatic), python-keystoneclient:amd64 (0.3.2-0ubuntu1~cloud0, automatic), libxen-4.3:amd64 (4.3.0-1ubuntu1.1~cloud0, automatic), libaudit0:amd64 (1.7.18-1ubuntu1, automatic), python-requests:amd64 (1.2.3-1~ubuntu12.04.1~ppa1, automatic)
Upgrade: maas-dns:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), maas-common:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), openssh-server:amd64 (5.9p1-5ubuntu1.1, 5.9p1-5ubuntu1.2), glance-common:amd64 (2012.1.3+stable-20130423-74b067df-0ubuntu1, 2013.2.2-0ubuntu1~cloud0), python-maas-client:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), libxenstore3.0:amd64 (4.1.5-0ubuntu0.12.04.3, 4.3.0-1ubuntu1.1~cloud0), python-paramiko:amd64 (1.7.7.1-2ubuntu1, 1.10.1-1~cloud0), python-migrate:amd64 (0.7.2-1ubuntu1, 0.7.2-6~cloud0), libvirt0:amd64 (0.9.8-2ubuntu17.17, 1.1.1-0ubuntu8.5~cloud0), openssh-client:amd64 (5.9p1-5ubuntu1.1, 5.9p1-5ubuntu1.2), python-anyjson:amd64 (0.3.1-1build1, 0.3.3-1~cloud0), python-django-maas:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), python-greenlet:amd64 (0.3.1-1ubuntu5.1, 0.4.1-0ubuntu2~cloud1), python-glance:amd64 (2012.1.3+stable-20130423-74b067df-0ubuntu1, 2013.2.2-0ubuntu1~cloud0), python-sqlalchemy:amd64 (0.7.4-1ubuntu0.1, 0.8.2-1~cloud1), python-boto:amd64 (2.2.2-0ubuntu3, 2.9.6-1~cloud0), python-simplejson:amd64 (2.3.2-1, 3.3.0-2ubuntu2~cloud1), python-bson:amd64 (2.1-1ubuntu0.1, 2.6-1~cloud1), python-webob:amd64 (1.1.1-1ubuntu0, 1.2.3-2ubuntu1~cloud0), maas-region-controller:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), python-prettytable:amd64 (0.5-1ubuntu2, 0.6.1-1ubuntu1~cloud0), python-novaclient:amd64 (2012.1-0ubuntu1, 2.15.0-0ubun...

Darryl Weaver (dweaver) wrote:

There is a package version mismatch between the region controller and the cluster controller due to different repositories being enabled.
On the RC:
The Havana cloud archive is enabled, i.e. /etc/apt/sources.list.d/cloudarchive-havana.list:
deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/havana main
deb-src http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/havana main

This has upgraded the package:
python-kombu:amd64 (1.4.3-1, 2.5.12-0ubuntu2~cloud1)

On the CC:
The havana cloud archive is not enabled, only the cloud-tools archive, which means the older package:
python-kombu:amd64 (1.4.3-1)
is installed.

So enabling the havana cloud archive on the CC (bringing python-kombu up to the same version as on the RC) has fixed the issue:
add-apt-repository cloud-archive:havana
apt-get update
apt-get dist-upgrade

service maas-cluster-celery start
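
To verify that both machines now load the same kombu, a quick check (dpkg -s python-kombu shows the packaged version as well):
python -c 'import kombu; print kombu.__version__'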

Changed in maas:
status: Triaged → Invalid