listing models is slow

Bug #1677434 reported by Ryan Beisner
This bug affects 5 people
Affects:      Canonical Juju
Status:       Fix Released
Importance:   High
Assigned to:  Menno Finlay-Smits
Milestone:    2.3-alpha1

Bug Description

I've found that my 2.1.2 controller has grown sluggish when querying `juju models` and `juju status`. The controller is an arm64 MAAS machine with 55GB RAM free, and it is otherwise quite snappy at the moment.

ubuntu@node-loewy:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          64392        4931       55911          57        3549       58889
Swap:          8191           0        8191

ubuntu@node-loewy:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             32G     0   32G   0% /dev
tmpfs           6.3G   58M  6.3G   1% /run
/dev/sda2       220G   13G  197G   6% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/sda1       511M  936K  511M   1% /boot/efi
tmpfs           6.3G     0  6.3G   0% /run/user/1000

ubuntu@node-loewy:~$ uname -a
Linux node-loewy 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:31:00 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux

ubuntu@node-loewy:~$ uptime
 02:36:19 up 7 days, 6:35, 1 user, load average: 0.81, 1.76, 2.63

---

jenkins@juju-osci1-machine-6:~$ time juju --version
2.1.2-xenial-amd64

real 0m0.032s
user 0m0.028s
sys 0m0.000s

jenkins@juju-osci1-machine-6:~$ time juju --debug controllers
02:33:09 INFO juju.cmd supercommand.go:63 running juju [2.1.2 gc go1.6.2]
02:33:09 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "--debug", "controllers"}
Use --refresh flag with this command to see the latest information.

Controller          Model  User   Access     Cloud/Region  Models  Machines  HA    Version
ruxton-maas-arm64*  -      admin  superuser  ruxton-maas        2         1  none  2.1.2

02:33:09 INFO cmd supercommand.go:465 command finished

real 0m0.049s
user 0m0.036s
sys 0m0.004s

jenkins@juju-osci1-machine-6:~$ time juju --debug models
02:33:13 INFO juju.cmd supercommand.go:63 running juju [2.1.2 gc go1.6.2]
02:33:13 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "--debug", "models"}
02:33:13 INFO juju.juju api.go:72 connecting to API addresses: [10.245.168.38:17070]
02:33:13 DEBUG juju.api apiclient.go:590 dialing "wss://10.245.168.38:17070/api"
02:33:14 DEBUG juju.api apiclient.go:593 successfully dialed "wss://10.245.168.38:17070/api"
02:33:14 INFO juju.api apiclient.go:521 connection established to "wss://10.245.168.38:17070/api"
02:33:17 DEBUG juju.juju api.go:263 API hostnames unchanged - not resolving
02:33:18 INFO juju.juju api.go:72 connecting to API addresses: [10.245.168.38:17070]
02:33:18 DEBUG juju.api apiclient.go:590 dialing "wss://10.245.168.38:17070/api"
02:33:18 DEBUG juju.api apiclient.go:593 successfully dialed "wss://10.245.168.38:17070/api"
02:33:18 INFO juju.api apiclient.go:521 connection established to "wss://10.245.168.38:17070/api"
02:33:21 DEBUG juju.juju api.go:263 API hostnames unchanged - not resolving
Controller: ruxton-maas-arm64

Model       Cloud/Region  Status     Machines  Cores  Access  Last connection
controller  ruxton-maas   available         1     48  admin   7 seconds ago
default     ruxton-maas   available         0      -  admin   2017-03-22

02:33:27 INFO cmd supercommand.go:465 command finished

real 0m13.983s
user 0m0.068s
sys 0m0.008s

Ryan Beisner (1chb1n)
tags: added: usability
Revision history for this message
Ryan Beisner (1chb1n) wrote :

I rebooted the controller host and `juju models` is now back to an acceptable level of usability. It had only been up for 7 days, and had perhaps a dozen models come and go on it.

jenkins@juju-osci1-machine-6:~$ time juju models
Controller: ruxton-maas-arm64

Model       Cloud/Region  Status     Machines  Cores  Access  Last connection
controller  ruxton-maas   available         1     48  admin   just now
default     ruxton-maas   available         0      -  admin   2017-03-22

real 0m0.723s
user 0m0.068s
sys 0m0.008s

Ryan Beisner (1chb1n)
tags: added: amd64 arm64 ppc64el
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Ryan Beisner,

We are aware of some minor memory leaks that can escalate when a 2.1.x deployment runs untouched for a while. These leaks can impact performance. I think that restarting jujud alone may have been sufficient for juju to behave acceptably for the next couple of days; the fact that rebooting the controller host improved performance seems to point in that direction.

We have addressed all known leaks in 2.2.x. For example, bug #1635311, bug #1671258, bug #1651291, bug #1634328, bug #1587644, bug #1581069, bug #1649719, to name a few :D

Is there any chance you could re-try with Juju 2.2-beta1 or even a daily snap?
We believe you will not get the same results \o/

Changed in juju:
status: New → Incomplete
Revision history for this message
John A Meinel (jameinel) wrote :

While working on fixing bug #1454661, I added some instrumentation.

It seems that running "juju models" is creating state objects for every model and loading all of the presence information.
While gathering debug information, if I run "juju models" in one window while logging is enabled in another, I see:
machine-4: 19:06:12 DEBUG juju.state.presence looked up 440 unknown sequences for "c3525263-1fe3-4556-8b5c-2bbc3beee6f7" (6 unowned) in 13.677754ms (32.2seq/ms) from "presence.beings"

The "440 unknown sequences" is code for "I had to look up the presence information of 440 values in the database that I haven't seen before".

As these are machines that are running "stable", they should have already cached all of the presence information for all of their models.

I can trigger this reliably with 'juju models', but I *cannot* trigger this with 'juju status'. So Status appears to be using StatePool, and is thus able to serve information from what it has already cached in memory, while 'juju models' is reloading that information from mongo.
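
To illustrate the caching pattern described above, here is a minimal, hypothetical Go sketch of a state pool keyed by model UUID. It is not Juju's actual juju/state code (the names statepool, ModelState, Pool and Get are all illustrative); it only shows the idea that pooled per-model state pays the database cost, including warming the presence cache, once, whereas constructing fresh state objects per call re-reads everything from mongo.

package statepool

import "sync"

// ModelState stands in for a per-model state object whose presence cache
// is expensive to warm from the database.
type ModelState struct {
    UUID          string
    presenceCache map[string]bool // hypothetical cache of known presence sequences
}

// Pool caches ModelState instances by model UUID.
type Pool struct {
    mu     sync.Mutex
    open   func(uuid string) *ModelState // loads state (and presence) from the DB
    states map[string]*ModelState
}

func NewPool(open func(uuid string) *ModelState) *Pool {
    return &Pool{open: open, states: make(map[string]*ModelState)}
}

// Get returns the cached state for a model, paying the load cost only on
// first use -- the behaviour 'juju status' benefits from, and which a
// per-call construction of state objects misses.
func (p *Pool) Get(uuid string) *ModelState {
    p.mu.Lock()
    defer p.mu.Unlock()
    if st, ok := p.states[uuid]; ok {
        return st
    }
    st := p.open(uuid)
    p.states[uuid] = st
    return st
}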

Changed in juju:
importance: Undecided → High
status: Incomplete → Triaged
Ryan Beisner (1chb1n)
tags: added: adrastea
Xav Paice (xavpaice)
tags: added: canonical-bootstack
Revision history for this message
Xav Paice (xavpaice) wrote :

I'm currently working with a 290-node OpenStack cloud (Juju 2.1.2, MAAS 2.2, Xenial & Ocata), and a 'juju status' takes at least 40 minutes.

I've run mgopurge offline using v1.5, and have a cron job running every 2 hours to prune.

We've restarted all 3 state servers multiple times, and there is no difference in performance.

The state servers generally have a load average >50, with 16 cores in each VM (3 controller VMs).

Running 'juju status -m controller' takes between 3 and 60 seconds to run, depending on when the last restart was, but 'juju status' for the main model (running OpenStack) takes 40+ minutes regardless.

We're trying to add cores and refresh some of the controller VMs, but we really need more advice on how we can help the juju team diagnose this - what logs are helpful, etc.

Revision history for this message
Tim Penhey (thumper) wrote :

Due to the size of the deployment, we *think* that what you are hitting are the presence-database issues that we have resolved for 2.2.

What we are looking to do is to backport the fixes from 2.2 into the 2.1 branch and provide an agent that bootstack can try, to see if it improves performance.

If it does improve things as we expect, we can look to provide a 2.1.4 release early next week.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1677434] Re: listing models is slow

For debugging, we could use the existing support for "cpuprofile" that is
in the 2.1 agents, and then issue a 'juju status' while that is running.

Another alternative would be to look at "mongotop", which can tell us
where we are spending time in the database.

If they ran a full purge with mgopurge 1.5, then we would have already
dropped the presence database, causing it to reinitialize from scratch; the
presence database is the primary issue for the bugs fixed in 2.2. So I
wouldn't start with a backport, because I would expect it not to be sufficient.

I've run test clouds that were much larger than 290 units and 'juju status'
did not take 40min to complete. The worst case was maybe around 2min, even
when I had 300 units of 'peer-xplod' (which is constantly triggering charm
hooks as fast as it can), or 2000 units of 'ubuntu' (which doesn't trigger
lots of hooks, but does have far more agents).

We really need to get profile information to understand why it would be
40min. 40s I could definitely see, and have seen. 40min feels more like
we're doing something specifically stupid, like polling OpenStack for
information about each machine and getting 'rate limit exceeded' from
OpenStack, causing us to back off, etc. Note, we should *not* be contacting
the provider for any information during 'juju status', as it is intended to
all be from-db cached information.

John
=:->
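
For context on the "cpuprofile" suggestion above: Go programs capture CPU profiles via the standard runtime/pprof package, and the resulting file is analysed with 'go tool pprof'. The sketch below only illustrates that general mechanism; it is not the juju agent's actual profiling hook, and busyWork is a placeholder for whatever is being profiled.

package main

import (
    "log"
    "os"
    "runtime/pprof"
    "time"
)

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Sample the CPU while the interesting work (e.g. serving a status
    // request) happens, then stop and inspect with 'go tool pprof'.
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    busyWork(10 * time.Second)
}

// busyWork stands in for the code under investigation.
func busyWork(d time.Duration) {
    deadline := time.Now().Add(d)
    for time.Now().Before(deadline) {
    }
}

On the database side, mongotop (run against the controller's mongod) periodically prints how much time is spent reading and writing each collection, which is the kind of "where are we spending time in the database" data John mentions.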

Revision history for this message
Xav Paice (xavpaice) wrote :

John, how can we get you the information needed to progress with this? Specifically, you mention getting profile information - if there's anything we can do to gather that information, please let us know (xavpaice on IRC).

Revision history for this message
Jill Rouleau (jillrouleau) wrote :

We're still seeing controller performance issues in a cloud deployed with 2.1 and recently upgraded to 2.2.1.1. Is there profiling or other data we can collect? We have a few different affected environments we can readily gather data from.

Revision history for this message
Tim Penhey (thumper) wrote :

@Jill, I don't think there is much else you can provide. If you have a sizeable 2.1.2 model, it is going to have a lot of unnecessary load from things that we have fixed for 2.2.2. If at all possible, it would be good to get the system upgraded to that release.

The debug output showed that the CLI was making two connections to the controller in order to execute. The first 8s or so were mostly TLS handshaking around the API connections; we should be able to cut that in half. Second, we need to work out what the remote side is doing to list the models. With 2.1.2, the database is going to be very busy doing unnecessary work that will slow everything down.
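
As a rough way to see how much of that time is pure connection setup, one could time a single TLS dial against the controller's API port. This is only a sketch: the address is taken from the debug output earlier in this bug, and certificate verification is skipped purely for measurement because the controller uses its own CA.

package main

import (
    "crypto/tls"
    "fmt"
    "log"
    "time"
)

func main() {
    addr := "10.245.168.38:17070" // controller API address from the report above

    start := time.Now()
    // tls.Dial completes the TCP connect and the TLS handshake before returning.
    conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    fmt.Printf("TCP connect + TLS handshake took %v\n", time.Since(start))
}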

Revision history for this message
Tim Penhey (thumper) wrote :

This was fixed during the facade rework in 2.3.

Changed in juju:
assignee: nobody → Menno Finlay-Smits (menno.smits)
milestone: none → 2.3-alpha1
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released