Controller not caching agent binaries across models

Bug #1900021 reported by Drew Freiberger on 2020-10-15
This bug affects 3 people
Affects: juju
Importance: Medium
Assigned to: Unassigned

Bug Description

While performing functional testing on many different charms (creating and destroying many models) on an LXD localhost controller, I experienced slow agent binary downloads. The slowness was caused by an incident in the upstream simplestreams source for the juju agent binaries, but it highlighted an agent binary caching issue within the controller. Resolving it would reduce upstream hits for deployments of a new series, or of several different series across multiple models.

My environment is a newly bootstrapped juju 2.8.5 controller using the juju snap; lxd is also installed from a snap. When bootstrapped (juju bootstrap localhost lxd), the default series for the controller is bionic.

Deploying bionic machines in any new model results in machine-0.log on the controller showing "DEBUG juju.apiserver tools.go:140 request for agent binaries: 2.8.5-bionic-amd64", and the machine goes from pending to started almost immediately.

In contrast, if I deploy a new xenial or focal machine with "juju add-machine --series focal" (or xenial), the machine-0.log shows:
DEBUG juju.apiserver tools.go:140 request for agent binaries: 2.8.5-focal-amd64
INFO juju.apiserver tools.go:152 2.8.5-focal-amd64 agent binaries not found locally, fetching
DEBUG juju.apiserver tools.go:140 request for agent binaries: 2.8.5-xenial-amd64
INFO juju.apiserver tools.go:152 2.8.5-xenial-amd64 agent binaries not found locally, fetching
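
To see how often a controller went upstream, the "not found locally, fetching" lines can be tallied per agent binary. This is a minimal sketch that assumes the log format shown above; the function name count_agent_fetches is hypothetical and not part of juju.

```shell
#!/bin/sh
# count_agent_fetches: hypothetical helper; reads a controller machine log on
# stdin and prints how many times each agent binary was fetched from upstream.
count_agent_fetches() {
    awk '/agent binaries not found locally, fetching/ {
             # The version-series-arch token is the field just before "agent binaries".
             for (i = 1; i < NF; i++)
                 if ($i == "agent" && $(i + 1) == "binaries") { count[$(i - 1)]++; break }
         }
         END { for (v in count) print count[v], v }' | sort -k2
}
```

On the controller machine this could be run as "count_agent_fetches < /var/log/juju/machine-0.log", assuming the log lives at that path. A count above 1 for the same binary would suggest that more than one model triggered its own download.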

If I add an additional xenial or focal machine to the same model, the controller finds the agent binary locally and serves it immediately from cache. However, if I create a new model (juju add-model drew2) and then deploy new xenial or focal units, the log again shows "not found locally, fetching".

@jameinel suggested a possible workaround: pre-cache the agents by adding new machines of each required series to the controller model. After I did this, a newly created model "drew3" logged only "request for agent binaries" for the three series and no longer needed to fetch remotely. This relieves pressure on the upstream streams.canonical.com server and results in faster deployment of new units, which helps meet functional test timeouts.
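
The workaround could be scripted roughly as follows. This is only a sketch: precache_agents and the DRY_RUN switch are hypothetical names introduced here, and it assumes the controller model is reachable as "controller" from the current controller.

```shell
#!/bin/sh
# precache_agents: hypothetical helper that adds one machine per series to the
# controller model, so the controller fetches and caches each agent binary.
# Set DRY_RUN=1 to print the commands instead of running them.
precache_agents() {
    for series in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "juju add-machine -m controller --series $series"
        else
            juju add-machine -m controller --series "$series"
        fi
    done
}
```

For example, "precache_agents xenial focal" right after bootstrap, before creating any test models.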

It appears the code copies the agent cache from the controller model to other models, but not from models to other models or back to the controller model. Also of interest: the upstream file is the same for all amd64 agents, https://streams.canonical.com/juju/tools/agent/2.8.5/juju-2.8.5-ubuntu-amd64.tgz, so downloading this same file three times for three series is another data point to consider when addressing this.
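
To illustrate the last point, the sketch below maps a version-series-arch string to its download URL. The helper agent_url is hypothetical, and the URL pattern is generalised from the single URL quoted above, so it is an assumption for other versions or architectures.

```shell
#!/bin/sh
# agent_url: hypothetical helper; builds the download URL for an agent binary
# from a version-series-arch string such as 2.8.5-focal-amd64.
# Assumes a plain x.y.z version with no dash in it.
agent_url() {
    version=${1%%-*}   # text before the first dash, e.g. 2.8.5
    arch=${1##*-}      # text after the last dash, e.g. amd64
    echo "https://streams.canonical.com/juju/tools/agent/$version/juju-$version-ubuntu-$arch.tgz"
}
```

Under that assumption, 2.8.5-bionic-amd64, 2.8.5-xenial-amd64 and 2.8.5-focal-amd64 all resolve to the same file, because the series never appears in the URL. A cache keyed on the URL or on the file checksum, rather than on the series, would then need only one download per version and architecture.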

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 2.8-next