parallel juju deployments race on the same maas

Bug #1314409 reported by Ryan Harper on 2014-04-29
This bug affects 3 people
Affects          Importance   Assigned to
MAAS             Critical     Jeroen T. Vermeulen
  1.5            Critical     Jeroen T. Vermeulen
juju-core        High         Unassigned
maas (Ubuntu)    Undecided    Unassigned
  Trusty         Undecided    Unassigned

Bug Description

[Test Case]
1. Install MAAS
2. Add multiple nodes
3. Deploy to several nodes in parallel.
4. Juju in one of the parallel jobs will fail to obtain a node, even though it seems it has been allocated.

With the fix:
4. Juju never gets the same node in the parallel environments.

Two jenkins-slaves running juju-deployer bundles attempt to create OpenStack deployments against a MAAS environment using the same pool of hardware. One environment (slave4) has a pending machine with no DNS name, and its log file says no instance information is available. Examining MAAS, I can see the node was allocated to a different user and environment.

jenkins@juju-precise-machine-15:~$ juju --version
1.18.1-precise-amd64
jenkins@juju-precise-machine-15:~$ cat /etc/issue
Ubuntu 12.04.4 LTS \n \l

Here's the debug-log tail:

$ juju debug-log -e oil-slave-4
Warning: Permanently added 'suangi.oil' (ECDSA) to the list of known hosts.
machine-6: 2014-04-29 21:03:46 INFO juju runner.go:262 worker: start "machineenvironmentworker"
machine-6: 2014-04-29 21:03:46 INFO juju runner.go:262 worker: start "rsyslog"
machine-6: 2014-04-29 21:03:46 DEBUG juju.worker.logger logger.go:60 logger setup
machine-6: 2014-04-29 21:03:46 DEBUG juju.worker.machineenvironment machineenvironmentworker.go:70 write system files: true
machine-6: 2014-04-29 21:03:46 DEBUG juju.worker.rsyslog worker.go:76 starting rsyslog worker mode 1 for "machine-6" ""
machine-6: 2014-04-29 21:03:46 INFO juju runner.go:262 worker: start "authenticationworker"
machine-6: 2014-04-29 21:03:46 INFO juju.worker.machiner machiner.go:85 setting addresses for machine-6 to ["local-machine:127.0.0.1" "local-cloud:10.245.0.156" "local-machine:::1" "fe80::222:99ff:fee0:337"]
machine-6: 2014-04-29 21:03:46 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
machine-0: 2014-04-29 21:18:02 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "/MAAS/api/1.0/nodes/node-9df8a42a-c4cd-11e3-824b-00163efc5068/": no instances found
machine-0: 2014-04-29 21:18:14 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "/MAAS/api/1.0/nodes/node-9f22e392-c4cd-11e3-824b-00163efc5068/": no instances found

I'll attach the juju status from the two environments.

Ryan Harper (raharper) wrote :
Andres Rodriguez (andreserl) wrote :
Andres Rodriguez (andreserl) wrote :
Andres Rodriguez (andreserl) wrote :
Andres Rodriguez (andreserl) wrote :
Julian Edwards (julian-edwards) wrote :

To summarise the conversation I just had on IRC:

 * juju is using separate maas users for each maas environment
 * the oil2 machine "node-9df8a42a-c4cd-11e3-824b-00163efc5068" is also appearing in oil4's status output as pending

The question is, how can the same node seemingly be appearing on both juju environments? MAAS thinks it's allocated to oil2 (I think?)

Julian Edwards (julian-edwards) wrote :

Can someone please confirm the version of MAAS being used?

Curtis Hovey (sinzui) on 2014-04-30
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: maas-provider
Julian Edwards (julian-edwards) wrote :

I'm doing some debugging by looking in the maas access log.

It looks like a juju environment was being destroyed around this time:

10.245.0.190 - - [29/Apr/2014:16:09:23 +0000] "POST /MAAS/api/1.0/nodes/node-9df8a42a-c4cd-11e3-824b-00163efc5068/?op=release HTTP/1.1" 200 717 "-" "Go 1.1 package http"

and there we can see the node in question getting released.

Shortly after in the log, there's a load of DELETEs being issued to remove the environment files. So far, so good.

The next time this node is mentioned is hours later when I see it in these two entries:

10.245.0.182 - - [29/Apr/2014:20:44:01 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 200 723 "-" "Go 1.1 package http"
10.245.0.188 - - [29/Apr/2014:20:44:01 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 200 723 "-" "Go 1.1 package http"
...
10.245.0.182 - - [29/Apr/2014:20:44:02 +0000] "POST /MAAS/api/1.0/nodes/node-9df8a42a-c4cd-11e3-824b-00163efc5068/?op=start HTTP/1.1" 200 723 "-" "Go 1.1 package http"
10.245.0.188 - - [29/Apr/2014:20:44:02 +0000] "POST /MAAS/api/1.0/nodes/node-9df8a42a-c4cd-11e3-824b-00163efc5068/?op=start HTTP/1.1" 200 723 "-" "Go 1.1 package http"

Two IP addresses are simultaneously trying to start the same node.
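
The same check can be scripted. The following is a minimal sketch, not a MAAS tool: it scans access-log lines in the format quoted above and reports any node that received `op=start` from more than one client IP. The function name `find_contested_starts` and the regular expression are illustrative assumptions.

```python
import re
from collections import defaultdict

# The four access-log lines quoted in the comment above.
LOG_LINES = [
    '10.245.0.182 - - [29/Apr/2014:20:44:01 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 200 723 "-" "Go 1.1 package http"',
    '10.245.0.188 - - [29/Apr/2014:20:44:01 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 200 723 "-" "Go 1.1 package http"',
    '10.245.0.182 - - [29/Apr/2014:20:44:02 +0000] "POST /MAAS/api/1.0/nodes/node-9df8a42a-c4cd-11e3-824b-00163efc5068/?op=start HTTP/1.1" 200 723 "-" "Go 1.1 package http"',
    '10.245.0.188 - - [29/Apr/2014:20:44:02 +0000] "POST /MAAS/api/1.0/nodes/node-9df8a42a-c4cd-11e3-824b-00163efc5068/?op=start HTTP/1.1" 200 723 "-" "Go 1.1 package http"',
]

# Hypothetical pattern: client IP at the start of the line, node id in the URL,
# and only requests whose operation is "start".
START_RE = re.compile(
    r'^(?P<ip>\S+) .*?"POST /MAAS/api/1\.0/nodes/(?P<node>node-[0-9a-f-]+)/\?op=start '
)


def find_contested_starts(lines):
    """Map each node id to the client IPs that started it, keeping only
    nodes that were started by more than one distinct client."""
    clients = defaultdict(set)
    for line in lines:
        match = START_RE.match(line)
        if match:
            clients[match.group("node")].add(match.group("ip"))
    return {node: sorted(ips) for node, ips in clients.items() if len(ips) > 1}


contested = find_contested_starts(LOG_LINES)
for node, ips in contested.items():
    print(node, "started by", ", ".join(ips))
```

The acquire requests do not name a node in the URL, so this sketch can only see the conflict at the `op=start` step.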

Sadly it does look like a bug in MAAS from this evidence.
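
For illustration only, here is a toy model of how two simultaneous acquire requests can end up with the same node. It is not MAAS code: `NodePool`, `acquire_racy` and `acquire_atomic` are hypothetical names, and an in-memory dict stands in for the database. The racy version reads a free node and records the owner in two separate steps; a barrier forces both requests to finish the read before either one writes. The atomic version does both steps under one lock.

```python
import threading


class NodePool:
    """Toy in-memory allocator (hypothetical; stands in for the node table)."""

    def __init__(self, node_ids):
        self.owner = {node_id: None for node_id in node_ids}
        self.lock = threading.Lock()

    def _first_free(self):
        return next((n for n, o in self.owner.items() if o is None), None)

    def acquire_racy(self, user, pause):
        free = self._first_free()   # step 1: check
        pause()                     # another request can run in this window
        if free is not None:
            self.owner[free] = user  # step 2: act on a stale check
        return free

    def acquire_atomic(self, user):
        with self.lock:             # check and act as one indivisible step
            free = self._first_free()
            if free is not None:
                self.owner[free] = user
            return free


def run_two(acquire):
    """Run acquire() for two users in parallel threads and collect results."""
    results = {}

    def worker(user):
        results[user] = acquire(user)

    threads = [threading.Thread(target=worker, args=(u,)) for u in ("oil2", "oil4")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# The barrier makes both threads complete the check before either one writes.
barrier = threading.Barrier(2, timeout=5)
racy_pool = NodePool(["node-a", "node-b"])
racy = run_two(lambda user: racy_pool.acquire_racy(user, barrier.wait))

safe_pool = NodePool(["node-a", "node-b"])
safe = run_two(safe_pool.acquire_atomic)

print("racy:", racy)
print("safe:", safe)
```

In the racy run both users are handed `node-a`, which mirrors the symptom in the log; in the atomic run each user gets a different node. In a real deployment the equivalent guarantee has to come from the database rather than from a process-local lock.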

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
Changed in maas:
assignee: nobody → Jeroen T. Vermeulen (jtv)
status: Triaged → In Progress
Jeroen T. Vermeulen (jtv) wrote :

Django 1.6 supports higher isolation levels! Serialisation errors show up as OperationalError exceptions. These can happen while writing to the database; they're not necessarily limited to the commit.

\o/
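
A minimal sketch of what handling such errors might look like, assuming the whole operation is simply retried when a serialisation error is raised. This is not the MAAS fix, which the thread does not show. The `OperationalError` class below is a stand-in for `django.db.utils.OperationalError` so that the example runs without Django, and `retry_on_serialization_failure` and `acquire_node` are hypothetical names.

```python
import functools


class OperationalError(Exception):
    """Stand-in for django.db.utils.OperationalError (assumption: lets the
    sketch run without Django installed)."""


def retry_on_serialization_failure(attempts=3):
    """Re-run the wrapped function when it raises OperationalError,
    giving up and re-raising after `attempts` tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except OperationalError:
                    if attempt == attempts:
                        raise
        return wrapper
    return decorator


calls = []


@retry_on_serialization_failure(attempts=3)
def acquire_node():
    # Simulated transaction: conflicts with a concurrent writer twice,
    # then succeeds on the third try.
    calls.append(1)
    if len(calls) < 3:
        raise OperationalError("could not serialize access due to concurrent update")
    return "node-a"


result = acquire_node()
print(result, "after", len(calls), "attempts")
```

Because the error can be raised by any write and not only by the commit, the retry has to wrap the entire transaction. Real code would also want to tell serialisation failures (SQLSTATE 40001 in PostgreSQL) apart from other operational errors before retrying.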

Changed in maas:
status: In Progress → Fix Committed
description: updated
Changed in maas:
status: Fix Committed → Fix Released
Chris J Arges (arges) on 2014-05-09
Changed in maas (Ubuntu):
status: New → Fix Released

Hello Ryan, or anyone else affected,

Accepted maas into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/maas/1.5.1+bzr2269-0ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will help us get this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in maas (Ubuntu Trusty):
status: New → Fix Committed
tags: added: verification-needed
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package maas - 1.5.1+bzr2269-0ubuntu0.1

---------------
maas (1.5.1+bzr2269-0ubuntu0.1) trusty; urgency=medium

  * Stable Release Update (LP: #1317601):
    - Hardware Enablement for Cisco B-Series. (LP: #1300476)
    - Allow AMT power type to specify IP Address. (LP: #1308772)
    - Spurious failure when starting and creating lock files. (LP: #1308069)
    - Fix usage of hardware enablement kernels by fixing the preseeds
      (LP: #1310082, LP: #1310076, LP: #1310082)
    - Fix parallel juju deployments. (LP: #1314409)
    - Clear distro_series when stopping node from WebUI (LP: #1316396)
    - Fix click hijacking (LP: #1298784)
    - Fix blocking API client when deleting a resource (LP: #1313556)
    - Do not import Trusty RC images by default (LP: #1311151)
    - debian/control: Add missing dep on python-crochet for
      python-maas-provisioningserver (LP: #1311765)
 -- Andres Rodriguez <email address hidden> Fri, 09 May 2014 22:35:43 -0500

Changed in maas (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for maas has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Curtis Hovey (sinzui) on 2014-10-22
Changed in juju-core:
status: Triaged → Invalid