Cluster setup fails on precise with cloud archives

Bug #1251345 reported by Chris Glass
Affects: MAAS
Status: Invalid
Importance: Critical
Assigned to: Unassigned

Bug Description

Trying to set up a "cluster" MaaS on Precise (using the cloud archives) fails.

Setup:
- 1 VM as a region controller (RC) + cluster controller (CC) on network A
- 1 VM as a CC only on network B
- 1 VM as a Node on network A

Steps to reproduce:

1. Install a Precise MaaS CC+RC on the first VM using the CD installer option ("Create a new maas server on this machine") with ubuntu-12.04.3-server-amd64.iso
2. On first boot, add the cloud archives: sudo apt-get install python-software-properties; sudo add-apt-repository cloud-archive:tools; sudo apt-get update && sudo apt-get dist-upgrade
3. Set up the rest of your MaaS installation (start squid-deb-proxy, create a superuser, configure your cluster from the UI).
4. PXE boot a VM on the same network.
5. Assert that the PXE process works and that the VM is enlisted, commissioned, and shows as "ready" in the UI.
6. On another VM, install a normal Ubuntu server installation from the ubuntu-12.04.3-server-amd64.iso media.
7. On first boot, add the cloud archives as in step 2.
8. Install the maas cluster controller: sudo apt-get install maas-cluster-controller
9. Edit the /etc/maas/*cluster* files and make them reflect the current setup (http://pastebin.ubuntu.com/6420582/ - the URL should point to your MaaS on network A, and the UUID should be the same as that machine too; see the sketch after this list).
10. As suggested on IRC, restart the cluster and pserv services: sudo service maas-cluster-celery restart; sudo service maas-pserv restart
11. Notice that the celery service is not running: sudo service maas-cluster-celery status
12. The new cluster controller does not appear in the RC's UI at all.
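
For reference, a minimal sketch of what the edit in step 9 might look like, assuming the /etc/maas/maas_cluster.conf file and key names from the MAAS 1.4 packaging (the values here are placeholders, not the pastebin's contents):

# /etc/maas/maas_cluster.conf (sketch; placeholder values)
MAAS_URL="http://<maas-on-network-A>/MAAS"   # URL of the RC's MAAS
CLUSTER_UUID="<uuid>"                        # the UUID discussed in step 9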

The /var/log/maas/celery.log file is at http://pastebin.ubuntu.com/6420585/

Both machines have the same version of python-celery installed (http://pastebin.ubuntu.com/6420587/).

Tags: landscape
Julian Edwards (julian-edwards) wrote: Re: [Bug 1251345] [NEW] Cluster setup fails on precise with cloud archives

On Thursday 14 Nov 2013 17:22:43 you wrote:
> 9 Edit the /etc/maas/*cluster* files, and make them reflect the current
> setup (https://pastebin.canonical.com/100455/ - the URL should point to
> your MaaS on network A, and the UUID should be the same as that machine
> too)

Where did you read that the UUID should be the same? Each cluster's UUID
should be, well, unique. The packaging generates one for you so there's no
need to edit it.

The error you're getting is a problem of interaction between celery and
rabbit, and I've never seen this before.

Is the version of rabbitmq-server the same on both machines?

Julian Edwards (julian-edwards) wrote:

For the benefit of those who cannot see the pastebin, the error from celery is:

[2013-11-14 16:55:04,805: ERROR/MainProcess] Unrecoverable error: AMQPChannelException(406, u"PRECONDITION_FAILED - cannot redeclare exchange 'celeryd.pidbox' in vhost '/maas_workers' with different type, durable, internal or autodelete value", (40, 10), 'Channel.exchange_declare')

Changed in maas:
status: New → Incomplete
Chris Glass (tribaal) wrote:

Ok, after further investigation, it seems like this was a user error (me).

This error happens when you put the same UUID in both CC and RC. I suppose I was confused by the UUID thing; I assumed it would work like a Ceph ring identifier.

Marking this as won't fix.

Changed in maas:
status: Incomplete → Invalid
Chris Glass (tribaal) wrote:

I still encounter the error after resetting each cluster to have its own UUID.

The celery daemon (run with sudo service maas-cluster-celery start) does not stay up (failing with the same error as pasted above). The UI on the RC did register the presence of a new cluster, but after acceptance it stays "stuck" with the warning that the new cluster has no boot images.

I will tear down my entire environment and start from scratch, maybe some side effect created this situation.

Changed in maas:
status: Invalid → New
Chris Glass (tribaal) wrote:

(changed the original pastebins to ubuntu.com instead of canonical.com - sorry, force of habit)

description: updated
Gavin Panella (allenap) wrote:

Chris, I think that's worth a bug report. Something is confusing there, and we might be able to address it via documentation or detecting and alerting about such a situation. Do you mind filing that?

Gavin Panella (allenap) wrote:

Can you check the versions of python-celery and python-django-celery on the RC and CC? The following article suggests that different versions of Celery can't coexist peacefully :-/ https://github.com/celery/celery/issues/984
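
For example, a plain dpkg query shows both versions at once:
dpkg-query -W -f='${Package} ${Version}\n' python-celery python-django-celery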

Chris Glass (tribaal) wrote:

I tried to reproduce this error several times with a new environment, and can't seem to trigger it anymore.

For the record, if both UUIDs for the RC and the CC are the same, then "nothing happens" and the UI shows no CC waiting to be enrolled. The Celery log on the CC does not error.

Ok, I'm going to mark this as invalid.

Changed in maas:
status: New → Invalid
Julian Edwards (julian-edwards) wrote:

Ok, thanks for investigating, Chris. Please re-open if this happens again.

Darryl Weaver (dweaver) wrote:

I've just hit this bug.
I have a setup that can be reproduced with the original poster's steps, except step 9.
I did not need to manually configure maas config files and the UUIDs are unique.
Both region and cluster controller were working to provision machines when first installed.
After running for a week or so, and upgrading released packages, I now see an error with the cluster controller.

I am now seeing the error:
AMQPChannelException: (406, u"PRECONDITION_FAILED - cannot redeclare exchange 'celeryd.pidbox' in vhost '/maas_workers' with different type, durable, internal or autodelete value", (40, 10), 'Channel.exchange_declare')

on the cluster controller node and a corresponding error on the maas region controller.
If I try manually restarting the celery service on the cluster controller with:
service maas-cluster-celery start
it just triggers the error again.

I am using precise with the cloud-tools archive enabled on both machines.
i.e.
maas version: 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0
python-celery 2.4.6-1ubuntu0.1

rabbitmq-server is only installed on the region controller:
rabbitmq-server 2.7.1-0ubuntu4

python-django-celery is not installed on either machine.

Changed in maas:
status: Invalid → Confirmed
Darryl Weaver (dweaver) wrote:

I should also mention there is a firewall between networks A and B that was changed recently and could be contributing to the issue; however, I can't yet see how, and the error is recorded on both machines, so communication between them seems to be OK.

Darryl Weaver (dweaver) wrote:

It would seem that the celery cluster controller process on the region controller and the one on the cluster controller cannot operate at the same time. After a restart, whichever connects to rabbitmq first creates the exchange; once the exchange is created by either the cluster or the region controller, the other cannot create it and fails.
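
The broker-side rule can be shown in isolation. A minimal sketch, assuming kombu and a local RabbitMQ with default guest credentials (the exchange name and flags mirror the rabbitmqctl listings below, where auto_delete is the property that differs between the two controllers):

from kombu import Connection, Exchange

with Connection('amqp://guest:guest@localhost//') as conn:
    channel = conn.channel()
    # The first declaration fixes the exchange's properties.
    Exchange('celeryd.pidbox', type='fanout',
             durable=False, auto_delete=False)(channel).declare()
    # Redeclaring with any property changed (here auto_delete) is
    # rejected by the broker with 406 PRECONDITION_FAILED.
    Exchange('celeryd.pidbox', type='fanout',
             durable=False, auto_delete=True)(channel).declare()

Whichever controller declares first wins; the other gets the traceback below.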

You can re-create this by shutting down the maas-cluster-celery process on both region and cluster controller:
service maas-cluster-celery stop

Restarting rabbitmq on the region controller:
/etc/init.d/rabbitmq-server restart

Now start the cluster celery process on one of the controllers:
service maas-cluster-celery start
The first to connect will create the exchange correctly.

Then try starting the process on the other controller:
service maas-cluster-celery start

tail -50 /var/log/maas/celery.log:
[2014-04-15 13:05:47,856: ERROR/MainProcess] Unrecoverable error: AMQPChannelException(406, u"PRECONDITION_FAILED - cannot redeclare exchange 'celeryd.pidbox' in vhost '/maas_workers' with different type, durable, internal or autodelete value", (40, 10), 'Channel.exchange_declare')
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/worker/__init__.py", line 268, in start
    component.start()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 309, in start
    self.reset_connection()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 592, in reset_connection
    self.reset_pidbox_node()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 531, in reset_pidbox_node
    callback=self.on_control)
  File "/usr/lib/python2.7/dist-packages/kombu/pidbox.py", line 62, in listen
    callbacks=[callback or self.handle_message])
  File "/usr/lib/python2.7/dist-packages/kombu/pidbox.py", line 53, in Consumer
    **options)
  File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 257, in __init__
    self.declare()
  File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 267, in declare
    queue.declare()
  File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 380, in declare
    self.exchange.declare(nowait)
  File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 154, in declare
    nowait=nowait)
  File "/usr/lib/python2.7/dist-packages/kombu/syn.py", line 23, in blocking
    return __sync_current(fun, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/kombu/syn.py", line 39, in __blocking__
    return fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 843, in exchange_declare
    (40, 11), # Channel.exchange_declare_ok
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 115, in dispatch_method
    return amqp_method(self, args)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 273, in _close
    (class_id, method_id))
AMQPChannelException: (406, u"PRECONDITION_FAILED - cannot redeclare ...

Gavin Panella (allenap)
Changed in maas:
status: Confirmed → Triaged
importance: Undecided → Critical
Darryl Weaver (dweaver) wrote:

Correction: you also need to stop the maas-region-celery service before restarting rabbitmq-server.

Then there does seem to be a mismatch between the region controller parameters and the cluster controller parameters for the exchange.

On RC:
service maas-cluster-celery stop
service maas-region-celery stop

On CC:
service maas-cluster-celery stop

On RC:
/etc/init.d/rabbitmq-server restart
rabbitmqctl list_exchanges -p /maas_workers name type durable auto_delete internal arguments

Output:
Listing exchanges ...
master direct true false false []
amq.match headers true false false []
amq.headers headers true false false []
af008b67-37c0-40ee-a758-61e46cec716d direct true false false []
amq.rabbitmq.trace topic true false false []
131a5947-ba0e-4e91-b80d-777e5647c909 direct true false false []
amq.topic topic true false false []
amq.direct direct true false false []
amq.fanout fanout true false false []
celery direct true false false []
 direct true false false []
...done.

Now on the RC:
service maas-region-celery start
rabbitmqctl list_exchanges -p /maas_workers name type durable auto_delete internal arguments

Output:
Listing exchanges ...
master direct true false false []
celeryd.pidbox fanout false false false []
amq.match headers true false false []
amq.headers headers true false false []
af008b67-37c0-40ee-a758-61e46cec716d direct true false false []
amq.rabbitmq.trace topic true false false []
131a5947-ba0e-4e91-b80d-777e5647c909 direct true false false []
amq.topic topic true false false []
amq.direct direct true false false []
amq.fanout fanout true false false []
celery direct true false false []
 direct true false false []
...done.

Note the parameters of the newly created celeryd.pidbox exchange (the columns are name, type, durable, auto_delete, internal, arguments):
celeryd.pidbox fanout false false false []
i.e. durable=false and auto_delete=false. Compare with the cluster controller's declaration further below, where auto_delete is true.

Now shut down the celery process and restart rabbitmq:
On RC:
service maas-cluster-celery stop
service maas-region-celery stop
/etc/init.d/rabbitmq-server restart
rabbitmqctl list_exchanges -p /maas_workers name type durable auto_delete internal arguments

Output:
Listing exchanges ...
master direct true false false []
amq.match headers true false false []
amq.headers headers true false false []
af008b67-37c0-40ee-a758-61e46cec716d direct true false false []
amq.rabbitmq.trace topic true false false []
131a5947-ba0e-4e91-b80d-777e5647c909 direct true false false []
amq.topic topic true false false []
amq.direct direct true false false []
amq.fanout fanout true false false []
celery direct true false false []
 direct true false false []
...done.

Note: The exchange has been deleted by the rabbitmq restart.

Now start the CC celery instead.
On CC:
service maas-cluster-celery start

on RC:
rabbitmqctl list_exchanges -p /maas_workers name type durable auto_delete internal arguments

Output:
Listing exchanges ...
master direct true false false []
celeryd.pidbox fanout false true false []
amq.match headers true false false []
amq.headers headers true false false []
af008b67-37c0-40ee-a758-61e46cec716d direct true false false []
amq.rabbitmq.trace topic true false false []
131a5947-ba0e-4e91-b80d-777e5647c909 direct true false false []
amq.topic topic true false false []
amq.direct direct true false false []
am...

Darryl Weaver (dweaver) wrote:

Tried a brand new cluster controller using maas version 1.2+bzr1373+dfsg-0ubuntu1~12.04.5 from precise.
Same problem with all cluster controllers, so the problem seems to be on the region controller side.

Traced the origin of the issue to an apt-get dist-upgrade which upgraded the packages as follows:
Start-Date: 2014-03-29 00:43:39
Commandline: apt-get dist-upgrade
Install: python-jsonschema:amd64 (1.3.0-0ubuntu1~cloud0, automatic), librbd1:amd64 (0.67.4-0ubuntu2.2~cloud0, automatic), libnss3:amd64 (3.15.4-0ubuntu0.12.04.1, automatic), python-d2to1:amd64 (0.2.10-1ubuntu4~cloud0, automatic), libaugeas0:amd64 (0.10.0-0ubuntu4, automatic), python-mock:amd64 (1.0.1-2~cloud0, automatic), augeas-lenses:amd64 (0.10.0-0ubuntu4, automatic), python-oslo.config:amd64 (1.2.1-0ubuntu1~cloud0, automatic), librados2:amd64 (0.67.4-0ubuntu2.2~cloud0, automatic), libnetcf1:amd64 (0.1.9-2ubuntu3.2, automatic), python-bson-ext:amd64 (2.6-1~cloud1, automatic), python-swiftclient:amd64 (1.6.0-0ubuntu1~cloud0, automatic), python-amqp:amd64 (1.0.12-0ubuntu1~cloud0, automatic), libnl-route-3-200:amd64 (3.2.3-2ubuntu2, automatic), python-cinderclient:amd64 (1.0.6-0ubuntu1~cloud0, automatic), python-passlib:amd64 (1.5.3-0ubuntu1, automatic), libboost-thread1.46.1:amd64 (1.46.1-7ubuntu3, automatic), libnspr4:amd64 (4.9.5-0ubuntu0.12.04.2, automatic), python-urllib3:amd64 (1.6-2~ubuntu12.04.1~ppa1, automatic), python-keystoneclient:amd64 (0.3.2-0ubuntu1~cloud0, automatic), libxen-4.3:amd64 (4.3.0-1ubuntu1.1~cloud0, automatic), libaudit0:amd64 (1.7.18-1ubuntu1, automatic), python-requests:amd64 (1.2.3-1~ubuntu12.04.1~ppa1, automatic)
Upgrade: maas-dns:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), maas-common:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), openssh-server:amd64 (5.9p1-5ubuntu1.1, 5.9p1-5ubuntu1.2), glance-common:amd64 (2012.1.3+stable-20130423-74b067df-0ubuntu1, 2013.2.2-0ubuntu1~cloud0), python-maas-client:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), libxenstore3.0:amd64 (4.1.5-0ubuntu0.12.04.3, 4.3.0-1ubuntu1.1~cloud0), python-paramiko:amd64 (1.7.7.1-2ubuntu1, 1.10.1-1~cloud0), python-migrate:amd64 (0.7.2-1ubuntu1, 0.7.2-6~cloud0), libvirt0:amd64 (0.9.8-2ubuntu17.17, 1.1.1-0ubuntu8.5~cloud0), openssh-client:amd64 (5.9p1-5ubuntu1.1, 5.9p1-5ubuntu1.2), python-anyjson:amd64 (0.3.1-1build1, 0.3.3-1~cloud0), python-django-maas:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), python-greenlet:amd64 (0.3.1-1ubuntu5.1, 0.4.1-0ubuntu2~cloud1), python-glance:amd64 (2012.1.3+stable-20130423-74b067df-0ubuntu1, 2013.2.2-0ubuntu1~cloud0), python-sqlalchemy:amd64 (0.7.4-1ubuntu0.1, 0.8.2-1~cloud1), python-boto:amd64 (2.2.2-0ubuntu3, 2.9.6-1~cloud0), python-simplejson:amd64 (2.3.2-1, 3.3.0-2ubuntu2~cloud1), python-bson:amd64 (2.1-1ubuntu0.1, 2.6-1~cloud1), python-webob:amd64 (1.1.1-1ubuntu0, 1.2.3-2ubuntu1~cloud0), maas-region-controller:amd64 (1.4+bzr1693+dfsg-0ubuntu2.2~ctools0, 1.4+bzr1693+dfsg-0ubuntu2.3~ctools0), python-prettytable:amd64 (0.5-1ubuntu2, 0.6.1-1ubuntu1~cloud0), python-novaclient:amd64 (2012.1-0ubuntu1, 2.15.0-0ubun...

Darryl Weaver (dweaver) wrote:

There is a package version mismatch between the region controller and the cluster controller due to different repositories being enabled.
On the RC:
The Havana cloud archive is enabled, i.e. /etc/apt/sources.list.d/cloudarchive-havana.list:
deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/havana main
deb-src http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/havana main

This has upgraded the package:
python-kombu:amd64 (1.4.3-1, 2.5.12-0ubuntu2~cloud1)

On the CC:
The havana cloud archive is not enabled, only the cloud-tools archive, which means the older package:
python-kombu:amd64 (1.4.3-1)
is installed.

So enabling the havana cloud archive on the CC (bringing python-kombu up to the same version as on the RC) has fixed the issue:
add-apt-repository cloud-archive:havana
apt-get update
apt-get dist-upgrade

service maas-cluster-celery start
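
To verify that both machines now load the same kombu, a quick check (dpkg -s python-kombu shows the packaged version as well):
python -c 'import kombu; print kombu.__version__'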

Changed in maas:
status: Triaged → Invalid