juju add-unit performance degrades in large environments

Bug #1317909 reported by James Page on 2014-05-09
This bug affects 1 person
  Affects               Status    Importance   Assigned to   Milestone
  juju-core             Triaged   Medium       Unassigned
  juju-core (Ubuntu)    Triaged   Medium       Unassigned

Bug Description

Adding units to a large, complex MAAS environment is extremely slow. For example:

  juju add-unit -n 63 nova-compute-b8

takes several tens of minutes to complete.

Environment has 381 existing service units spread across a number of services with subordinates as well (see status.json).

jujud on machine 0 is spinning at about 200% CPU, with a load average of 2.88, 2.86, 2.71.

Some errors in machine-0.log:

2014-05-09 12:29:39 WARNING juju.provider.maas environ.go:233 picked arbitrary tools &{"1.18.2-trusty-amd64" "https://streams.canonical.com/juju/tools/releases/juju-1.18.2-trusty-amd64.tgz" "1214b581d86b8795f5add552c9023a8ef83751c415da77c6021b79321af16c85" %!q(int64=7382418)}
2014-05-09 12:31:36 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/16 cannot get assigned machine: unit "nova-compute-b8/16" is not assigned to a machine
2014-05-09 12:31:36 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/16 cannot get assigned machine: unit "nova-compute-b8/16" is not assigned to a machine
2014-05-09 12:32:03 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/17 cannot get assigned machine: unit "nova-compute-b8/17" is not assigned to a machine
2014-05-09 12:32:03 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/17 cannot get assigned machine: unit "nova-compute-b8/17" is not assigned to a machine
2014-05-09 12:32:13 WARNING juju.provider.maas environ.go:233 picked arbitrary tools &{"1.18.2-trusty-amd64" "https://streams.canonical.com/juju/tools/releases/juju-1.18.2-trusty-amd64.tgz" "1214b581d86b8795f5add552c9023a8ef83751c415da77c6021b79321af16c85" %!q(int64=7382418)}
2014-05-09 12:32:36 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/18 cannot get assigned machine: unit "nova-compute-b8/18" is not assigned to a machine
2014-05-09 12:32:36 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/18 cannot get assigned machine: unit "nova-compute-b8/18" is not assigned to a machine
2014-05-09 12:33:32 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/20 cannot get assigned machine: unit "nova-compute-b8/20" is not assigned to a machine
2014-05-09 12:33:32 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/20 cannot get assigned machine: unit "nova-compute-b8/20" is not assigned to a machine
2014-05-09 12:34:31 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/22 cannot get assigned machine: unit "nova-compute-b8/22" is not assigned to a machine
2014-05-09 12:34:31 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/22 cannot get assigned machine: unit "nova-compute-b8/22" is not assigned to a machine
2014-05-09 12:34:47 WARNING juju.provider.maas environ.go:233 picked arbitrary tools &{"1.18.2-trusty-amd64" "https://streams.canonical.com/juju/tools/releases/juju-1.18.2-trusty-amd64.tgz" "1214b581d86b8795f5add552c9023a8ef83751c415da77c6021b79321af16c85" %!q(int64=7382418)}
2014-05-09 12:35:07 WARNING juju.provider.maas environ.go:233 picked arbitrary tools &{"1.18.2-trusty-amd64" "https://streams.canonical.com/juju/tools/releases/juju-1.18.2-trusty-amd64.tgz" "1214b581d86b8795f5add552c9023a8ef83751c415da77c6021b79321af16c85" %!q(int64=7382418)}
2014-05-09 12:35:21 WARNING juju.provider.maas environ.go:233 picked arbitrary tools &{"1.18.2-trusty-amd64" "https://streams.canonical.com/juju/tools/releases/juju-1.18.2-trusty-amd64.tgz" "1214b581d86b8795f5add552c9023a8ef83751c415da77c6021b79321af16c85" %!q(int64=7382418)}
2014-05-09 12:38:18 WARNING juju.provider.maas environ.go:233 picked arbitrary tools &{"1.18.2-trusty-amd64" "https://streams.canonical.com/juju/tools/releases/juju-1.18.2-trusty-amd64.tgz" "1214b581d86b8795f5add552c9023a8ef83751c415da77c6021b79321af16c85" %!q(int64=7382418)}
2014-05-09 12:38:24 ERROR juju.state.unit unit.go:523 unit nova-compute-b8/1 cannot get assigned machine: unit "nova-compute-b8/1" is not assigned to a machine
2014-05-09 12:38:45 WARNING juju.provider.maas environ.go:233 picked arbitrary tools &{"1.18.2-trusty-amd64" "https://streams.canonical.com/juju/tools/releases/juju-1.18.2-trusty-amd64.tgz" "1214b581d86b8795f5add552c9023a8ef83751c415da77c6021b79321af16c85" %!q(int64=7382418)}

James Page (james-page) wrote :

I thought this might be a dupe of:
  https://bugs.launchpad.net/juju-core/+bug/1245649

But I think that bug doesn't come into play until a single service has many
thousands of units (around 4,000 in my testing).

I believe the internals of add-unit process each unit one by one, and from
the logs above it looks like none of the information lookups are reused. (It
appears that it queries the provider separately for all the information for
each node, as well as doing a full tools lookup.)

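To make the suspected per-unit cost concrete, here is a minimal Go sketch contrasting a lookup repeated for every unit with one memoized for the whole batch. `lookupTools`, `toolsCache`, `addUnitsNaive` and `addUnitsCached` are hypothetical names invented for illustration, not juju-core APIs, and the sketch assumes the tools result depends only on series and architecture (the log picks the same `1.18.2-trusty-amd64` tools each time).

```go
package main

import (
	"fmt"
	"sync"
)

// lookupCount records how many times the (hypothetical) expensive lookup ran.
var lookupCount int

// lookupTools is a hypothetical stand-in for an expensive tools query
// against the provider or a tools mirror. It is NOT a juju-core API.
func lookupTools(series, arch string) string {
	lookupCount++
	return fmt.Sprintf("1.18.2-%s-%s", series, arch)
}

// toolsCache memoizes lookupTools results for the lifetime of one batch.
type toolsCache struct {
	mu    sync.Mutex
	cache map[string]string
}

func newToolsCache() *toolsCache {
	return &toolsCache{cache: make(map[string]string)}
}

// get returns a cached result, calling lookupTools only on a cache miss.
func (c *toolsCache) get(series, arch string) string {
	key := series + "/" + arch
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.cache[key]; ok {
		return v
	}
	v := lookupTools(series, arch)
	c.cache[key] = v
	return v
}

// addUnitsNaive performs one lookup per unit, mirroring the suspected
// one-by-one behaviour.
func addUnitsNaive(n int) {
	for i := 0; i < n; i++ {
		lookupTools("trusty", "amd64")
	}
}

// addUnitsCached shares one cache across the whole batch.
func addUnitsCached(n int) {
	c := newToolsCache()
	for i := 0; i < n; i++ {
		c.get("trusty", "amd64")
	}
}

func main() {
	lookupCount = 0
	addUnitsNaive(63)
	fmt.Println("naive lookups:", lookupCount) // naive lookups: 63

	lookupCount = 0
	addUnitsCached(63)
	fmt.Println("cached lookups:", lookupCount) // cached lookups: 1
}
```

The mutex is held across the lookup, so concurrent callers wait rather than issuing duplicate queries; that trades some parallelism for fewer provider calls, which is the point in a batch like `-n 63`.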
I thought the tools lookup was actually done on the Provisioner side, which
should run asynchronously from add-unit's own work. (I have seen that
add-unit doesn't return as early as I would expect it to.) I don't know
whether this is just accidental interlocking, with the Provisioner busy
rewriting the same documents that AddUnit is trying to write while it
brings up new instances.
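If the interlocking guess is right, the cost would show up as retries. The Go sketch below models generic optimistic concurrency on a single in-memory document with a revision counter; it is not juju-core's state code, and `doc`, `compareAndSet`, `addUnitWithRetry` and `busyWriter` are hypothetical. The `interfere` callback stands in for another writer (such as the Provisioner) committing between AddUnit's read and its write.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict signals that the document changed since it was read.
var errConflict = errors.New("revision changed since read")

// doc is a hypothetical shared document that two writers both update.
// Rev is bumped on every successful write.
type doc struct {
	Rev   int
	Units int
}

// compareAndSet applies the write only if the revision still matches
// the one the caller read.
func compareAndSet(d *doc, readRev, newUnits int) error {
	if d.Rev != readRev {
		return errConflict
	}
	d.Units = newUnits
	d.Rev++
	return nil
}

// addUnitWithRetry reads, computes and writes; on conflict it re-reads
// and tries again. interfere runs between the read and the write to model
// another writer landing in that window. It returns the attempt count.
func addUnitWithRetry(d *doc, interfere func(*doc)) int {
	attempts := 0
	for {
		attempts++
		readRev, units := d.Rev, d.Units
		interfere(d)
		if err := compareAndSet(d, readRev, units+1); err == nil {
			return attempts
		}
	}
}

// busyWriter returns an interfere func that commits a competing write
// (bumps Rev) on its first k calls and does nothing afterwards.
func busyWriter(k int) func(*doc) {
	calls := 0
	return func(d *doc) {
		if calls < k {
			calls++
			d.Rev++
		}
	}
}

func main() {
	quiet := &doc{}
	fmt.Println("attempts without contention:", addUnitWithRetry(quiet, busyWriter(0))) // 1

	busy := &doc{}
	fmt.Println("attempts with 3 interleaved writes:", addUnitWithRetry(busy, busyWriter(3))) // 4
	fmt.Println("units:", busy.Units)                                                         // units: 1
}
```

Every failed attempt repeats the read and the computation, so a writer that keeps landing in that window multiplies the work done for each added unit.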

All of those errors in the log appear to be on the Agent/Provisioner side,
not directly in AddUnit.

Those specific errors are just the InstancePoller trying to update IP
addresses for various units; no machine has been started for those units
yet, so the address lookup returns an error. Arguably that isn't actually
an *error* yet. I'm guessing something is running "juju status" while the
machines are coming up, which causes us to log messages about not yet
having an instance for the given machines.

I filed https://bugs.launchpad.net/juju-core/+bug/1318148 arguing that this
shouldn't be considered an ERROR: it is expected that for a short period a
unit has no IP address, because its associated machine hasn't been brought
up yet.

So there is probably still a performance bug, in that add-unit does a bit
too much work before returning, but (I believe) it isn't related to the
error messages above.

John
=:->

Changed in juju-core:
importance: Undecided → Medium
status: New → Triaged
tags: added: add-unit performance scalability
James Page (james-page) on 2014-05-11
tags: added: sm15k
Robie Basak (racb) on 2014-05-13
Changed in juju-core (Ubuntu):
status: New → Triaged
importance: Undecided → Medium