Juju status slow on large model

Bug #1865172 reported by Tim Penhey
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Tim Penhey
Milestone: 2.7.5

Bug Description

The code that fetches the network interfaces does a database query to load the space name for every interface.

Line 620 in apiserver/facades/client/client/status.go

Revision history for this message
John A Meinel (jameinel) wrote :

The code already does model.AllSubnets() at the beginning, but as it iterates all of the machine interfaces it then calls interface.subnet.SpaceName(), which does a DB query to resolve the SpaceID on the subnet doc into the SpaceName in the space doc.
We could do the same caching of subnet CIDR to SpaceName. This is dangerous in the long term if we ever want to support multiple Networks inside a model, but it would solve the immediate problem. (There is already a State.AllSpaces() that would let us load it in a single pass.)
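
A minimal sketch of that caching, using illustrative stand-in types rather than the real state API:

    package main

    import "fmt"

    // Illustrative stand-ins for the state documents; the real Juju types differ.
    type space struct{ id, name string }
    type nic struct{ subnetSpaceID string }

    // buildSpaceNames turns one bulk read of spaces (e.g. State.AllSpaces) into an
    // in-memory lookup, so the per-interface loop never touches the DB again.
    func buildSpaceNames(spaces []space) map[string]string {
        names := make(map[string]string, len(spaces))
        for _, sp := range spaces {
            names[sp.id] = sp.name
        }
        return names
    }

    func main() {
        spaces := []space{{id: "0", name: "alpha"}, {id: "1", name: "db"}}
        nics := []nic{{subnetSpaceID: "1"}, {subnetSpaceID: "0"}}

        names := buildSpaceNames(spaces) // one query's worth of data, read once

        for _, n := range nics {
            // Map lookup instead of a per-interface SpaceName() query.
            fmt.Println(names[n.subnetSpaceID])
        }
    }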

Looking at it a different way, Status also doesn't filter down to just the machines we care about: if you do 'juju status ubuntu/0' it will read all the machine interfaces on all machines to build the map, and then filter it down to just the instances we care about.

So ideally we would figure out the machines we care about, then use those ids to load all the interfaces we care about, build the set of the subnet ids, use that to load just the subnets we care about, and use that to load just the spaces that we care about.
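
A rough sketch of that narrowing step, again with stand-in types rather than the real state API: filter the interfaces by the requested machines first, so only the subnets and spaces they reference ever get loaded.

    package main

    import "fmt"

    // Illustrative stand-ins; the real state documents and query helpers differ.
    type iface struct{ machineID, subnetID string }

    // subnetIDsFor collects the distinct subnet IDs referenced by the interfaces
    // of only the machines the status call actually asked about.
    func subnetIDsFor(ifaces []iface, wanted map[string]bool) map[string]bool {
        ids := make(map[string]bool)
        for _, i := range ifaces {
            if wanted[i.machineID] {
                ids[i.subnetID] = true
            }
        }
        return ids
    }

    func main() {
        // e.g. 'juju status ubuntu/0' only cares about machine "0".
        wanted := map[string]bool{"0": true}
        ifaces := []iface{
            {machineID: "0", subnetID: "s1"},
            {machineID: "7", subnetID: "s2"},
        }

        // Only "s1" survives, so only its subnet and space need loading.
        fmt.Println(subnetIDsFor(ifaces, wanted))
    }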

Revision history for this message
Joseph Phillips (manadart) wrote :

Looks like this blew out with the space ID changes.

Previously, subnet.SpaceName would access the local doc. Now that it is an ID, we go to Mongo to look up the name.

The Backend interface already implements SpaceLookup, so to prevent this, we just retrieve SpaceInfos before the loop, then look each one up with SpaceInfos.GetByID inside.
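
A rough sketch of that shape, with minimal stand-ins for SpaceInfos and GetByID (the real core/network types carry more fields and differ in detail):

    package main

    import "fmt"

    // Minimal stand-ins for the lookup described above.
    type spaceInfo struct{ id, name string }
    type spaceInfos []spaceInfo

    // GetByID returns the space with the given ID, or nil if it is unknown.
    func (s spaceInfos) GetByID(id string) *spaceInfo {
        for i := range s {
            if s[i].id == id {
                return &s[i]
            }
        }
        return nil
    }

    func main() {
        // Retrieved once, before iterating the machine interfaces.
        infos := spaceInfos{{id: "0", name: "alpha"}, {id: "2", name: "db"}}

        for _, subnetSpaceID := range []string{"2", "0", "2"} {
            // In-memory lookup inside the loop; no per-interface Mongo read.
            if sp := infos.GetByID(subnetSpaceID); sp != nil {
                fmt.Println(sp.name)
            }
        }
    }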

Revision history for this message
John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/11260 possible fix (needs testing)

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1865172] Re: Juju status slow on large model

This wasn't quite sufficient. We have the same problem during Application.EndpointBindings, because it loads all spaces for every application.

John
=:->


Revision history for this message
Richard Harding (rharding) wrote :

From John:

Here's the list of things that are obviously scaling incorrectly:

fetchAllApplicationsAndUnits reads all spaces for each application to map the binding space ID to the space name (cache the space names and pass in the lookup map)

fetchAllApplicationsAndUnits reads the charm for each application one-by-one. (read the charms in bulk, and then use that map lookup)

fetchNetworkInterface could share the spaceIDtoSpaceName map

fetchRelations iterates remoteApplications (status.go:832)

FullStatus -> modelStatus -> model.Config() reads settings{e}

FullStatus -> reads statuses for the model even though we've already read all statuses

makeMachineStatus -> reads instanceData and constraints once per Machine (and instanceData *multiple* times for the same Machine): InstanceNames, HardwareCharacteristics, and CharmProfiles all read instanceData (see the sketch after this list)

processApplication calls Application.Charm which rereads Charm

processUnits
    Unit.publicAddress rereads the Machine object
    Unit.openedPorts rereads openedPorts (we've batch read openedPorts already)
    Unit.AllAddresses reads cloudContainer information per unit

processRelations
    reads relationStatus for each relation
    storage information is a separate API call that can't share the cache
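
To illustrate the per-machine item above (instanceData read multiple times for the same Machine), here is a minimal sketch of reading the document once and letting the accessors share it; the types are stand-ins, not the real state API:

    package main

    import "fmt"

    // Illustrative stand-in for the per-machine provisioning document; the real
    // instanceData doc and machine accessors differ.
    type instanceData struct {
        instanceName  string
        hardware      string
        charmProfiles []string
    }

    // machineStatus carries instanceData fetched once (ideally via one bulk read
    // for all machines), so the accessors below stop re-reading the same doc.
    type machineStatus struct {
        data instanceData
    }

    func (m machineStatus) InstanceName() string    { return m.data.instanceName }
    func (m machineStatus) Hardware() string        { return m.data.hardware }
    func (m machineStatus) CharmProfiles() []string { return m.data.charmProfiles }

    func main() {
        // One read per machine ...
        m := machineStatus{data: instanceData{
            instanceName: "juju-1234-0",
            hardware:     "cores=2 mem=4G",
        }}
        // ... reused by every accessor instead of three separate queries.
        fmt.Println(m.InstanceName(), m.Hardware(), m.CharmProfiles())
    }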

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

One PR has landed already that deals with the per-unit additional queries.

I have another to propose as soon as the 2.7.4 release is out that addresses the per-machine additional queries.

Changed in juju:
status: Triaged → In Progress
milestone: 2.7.4 → 2.7.5
Revision history for this message
Tim Penhey (thumper) wrote :

And this is how I found out that I was still logged in as the bot.

/me sighs.

Revision history for this message
Tim Penhey (thumper) wrote :

While I've not fixed everything, addressing the per-unit and per-machine additional queries should speed this up significantly.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released