Rabbit password is reset on every upgrade which forces lockstep cluster restarts

Bug #1300507 reported by Julian Edwards on 2014-03-31
This bug affects 13 people
Affects: maas (Ubuntu)
Importance: Critical
Assigned to: Greg Lutostanski

Bug Description

Every time the maas region controller package is updated it updates the rabbit password.

This means remote clusters are no longer able to connect to rabbit until they are restarted and re-receive credentials.

The packaging *must not* force lockstep upgrades like this; data centres will want to do rolling upgrades.

Julian Edwards (julian-edwards) wrote :

I consider this a critical bug, but I cannot set the priority on here.

Robie Basak (racb) wrote :

Setting Critical for Julian.

Changed in maas (Ubuntu):
importance: Undecided → Critical
Andres Rodriguez (andreserl) wrote :

@Julian,

As we previously discussed, the region needs to be able to tell the cluster about the updated password so that the cluster keeps making requests to rabbitmq without having to be restarted manually. This needs to be fixed in MAAS core regardless of whether the maas packaging changes the password on every upgrade.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in maas (Ubuntu):
status: New → Confirmed
Julian Edwards (julian-edwards) wrote :

Andres, I think we're going to have to agree to disagree on this one then.

Changing the password during installation is about the worst time it can be done as there is going to be some instability anyway. I do agree that changes need to be conveyed to the clusters, but if the change is made outside of MAAS's control I don't see how it can know and react unless MAAS has a region controller hook for the packaging to call. It will be quite a lot of work to implement this compared to the simple packaging fix that can be made in the short term.

Andreas Hasenack (ahasenack) wrote :

Why is the rabbit password changed anyway?

On Tuesday 22 Jul 2014 10:47:09 you wrote:
> Why is the rabbit password changed anyway?

Absolutely no idea, it seems pointless but Andres wrote that packaging code so he may be able to explain.

Andres Rodriguez (andreserl) wrote :

The MAAS config file gets changed by the packaging because at the time MAAS did not support conf.d/ (and it does not currently support it either). The packaging updates the config file (which it actually shouldn't be doing, but it was the only way of solving the problem).

The problem was that we were providing a new config quite frequently, which meant that a new config file needed to be installed, replacing the older one, and upgrades failed because there was no way to obtain the old password. This is not a simple packaging fix; at least, at the time it wasn't, and it required lots of hacky things (since we were doing things we weren't supposed to do by policy anyway).

Now, as I have expressed before, the Region needs to be able to notify the Clusters about its password changes. It doesn't matter who changes the password here, whether it is the user directly or the packaging, the issue still remains and this should be fixed in MAAS and not just go for a quick fix in packaging. This is a bug in MAAS.

Changed in maas:
status: New → Confirmed
importance: Undecided → Critical
Gavin Panella (allenap) wrote :

RabbitMQ will be going away this cycle, so we should avoid investing a lot of time in an engineering fix for this.

Precisely why I was advocating a quick packaging fix. :)

David Britton (davidpbritton) wrote :

Saw this as well: latest in trusty -> 1.6b6.

tags: added: landscape
tags: added: cloud-installer
David Britton (davidpbritton) wrote :

I just hit it again -- on upgrading from 1.6rc1

Andreas Hasenack (ahasenack) wrote :

Same here on another machine. Rabbit was full of these:
=ERROR REPORT==== 12-Aug-2014::19:18:59 ===
closing AMQP connection <0.4488.0> (10.96.0.10:36895 -> 10.96.0.10:5672):
{handshake_error,starting,0,
                 {amqp_error,access_refused,
                             "AMQPLAIN login refused: user 'maas_workers' - invalid credentials",
                             'connection.start_ok'}}

And celery.log was full of these:
[2014-08-12 19:06:55,739: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@10.96.0.10:5672//maas_workers: [Errno 104] Connection reset by peer.

Greg Lutostanski (lutostag) wrote :

Found where it happens:
debian/maas-region-controller.postinst:95

Should only happen when creating a new user -- will have to find out why the region-controller does not think it has a user already.

Will dive in further.
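
The "only when creating a new user" behaviour described above can be sketched as follows. This is a minimal illustration in Python, not the actual postinst logic (which is a shell script): the function name `ensure_rabbit_user` and the injected `run` callable are hypothetical, and the parsing assumes that `rabbitmqctl list_users` prints one user per line with the name in the first column, plus informational lines such as "Listing users ..." and "...done.".

```python
import secrets
from typing import Callable, List, Optional


def ensure_rabbit_user(run: Callable[[List[str]], str], user: str) -> Optional[str]:
    """Create `user` with a fresh password only if it does not exist yet.

    `run` is a hypothetical helper that executes a command and returns its
    standard output. Returns the new password when the user was created,
    or None when the user already existed and was left untouched.
    """
    listing = run(["rabbitmqctl", "list_users"])
    existing = set()
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank and informational lines (assumed output format).
        if not line or line.startswith("Listing") or line.startswith("..."):
            continue
        existing.add(line.split()[0])

    if user in existing:
        # Upgrade path: keep the current password so that remote clusters
        # holding the old credentials can still connect.
        return None

    password = secrets.token_hex(16)
    run(["rabbitmqctl", "add_user", user, password])
    return password
```

On an upgrade where `maas_workers` already exists, this sketch returns `None` and issues no password change, which is the property a rolling upgrade needs.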

Changed in maas (Ubuntu):
assignee: nobody → Greg Lutostanski (lutostag)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package maas - 1.6.1+bzr2550-0ubuntu1

---------------
maas (1.6.1+bzr2550-0ubuntu1) utopic; urgency=medium

  * New upstream bugfix release:
    - Auto-link node MACs to Networks (LP: #1341619)

  [ Julian Edwards ]
  * debian/maas-region-controller.postinst: Don't restart RabbitMQ on
    upgrades, just ensure it's running. Should prevent a race with the
    cluster celery restarting.
  * debian/rules: Pull upstream branch from the right place.

  [ Andres Rodriguez ]
  * debian/maas-region-controller.postinst: Ensure cluster celery is
    started if it also runs on the region.
 -- Julian Edwards <email address hidden> Thu, 21 Aug 2014 18:38:27 +1000

Changed in maas (Ubuntu):
status: Confirmed → Fix Released
no longer affects: maas
neel (neel-basu-z) wrote :

I am getting an error with Ubuntu 14.04 Server installed from a downloaded ISO.

[2014-09-26 12:37:35,356: ERROR/Beat] beat: Connection error: timed out. Trying again in 32.0 seconds...
[2014-09-26 12:37:35,357: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@192.168.250.140:5672//maas_workers: timed out.
Trying again in 32.00 seconds...

[2014-09-26 12:38:11,403: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@192.168.250.140:5672//maas_workers: timed out.
Trying again in 32.00 seconds...

[2014-09-26 12:38:11,403: ERROR/Beat] beat: Connection error: timed out. Trying again in 32.0 seconds...

It all started once I started downloading pxe boot images with sudo -E maas-import-pxe-files

Is it related to this bug? What is the workaround?

Andreas Hasenack (ahasenack) wrote :

Hi neel,

unlikely. A timeout usually means a network connectivity problem. If it were a password problem, you would get an immediate error and it would say the credentials are incorrect.
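
The distinction above can be turned into a rough log triage helper. The following is a sketch only: the function name `classify_amqp_error` and its return labels are made up for illustration, and the patterns come solely from the log excerpts quoted in this report. A "Connection reset by peer" line is not proof of a password problem, so it is labelled as merely possible.

```python
import re

# Patterns taken from the log excerpts quoted in this bug report.
_CREDENTIALS = re.compile(
    r"access_refused|invalid credentials|Connection reset by peer",
    re.IGNORECASE,
)
_TIMEOUT = re.compile(r"timed out", re.IGNORECASE)


def classify_amqp_error(line: str) -> str:
    """Return a coarse guess at why an AMQP connection attempt failed.

    "possible-credentials": the broker refused or dropped the connection
        immediately, as seen here when the password had been changed.
    "connectivity": the attempt timed out, which usually points to a
        network problem rather than a password problem.
    "unknown": neither pattern matched.
    """
    if _CREDENTIALS.search(line):
        return "possible-credentials"
    if _TIMEOUT.search(line):
        return "connectivity"
    return "unknown"
```

Applied to the celery.log lines in this thread, the 2014-08-12 entries fall under "possible-credentials" and the 2014-09-26 entries under "connectivity".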

Tuomas Heino (iheino+ub) wrote :

FYI neel, the current version in 14.04.1 LTS (1.5.4+bzr2294-0ubuntu1.1) does not seem to include a fix for this. A backport (or SRU?) would be nice to have. Or a ReleaseNotes entry at least.

Mark Shuttleworth (sabdfl) wrote :

We'll SRU 1.7 once it's super-solid!

Mark

tags: added: verification-done