OOM and high load upgrading to 2.9.7

Bug #1936684 reported by John A Meinel
This bug affects 1 person
Affects: juju
Status: Fix Released
Importance: High
Assigned to: Harry Pidcock
Milestone: 2.9.10

Bug Description

On one of the IS controllers, we upgraded the controller to 2.9.7 on Xenial. (We originally had trouble reaching an apt repository, but once that was sorted, the controller moved over just fine.)

Afterward, the model was upgraded to 2.9.7. After that, the system seemed to struggle, hitting very high CPU/memory usage until jujud was OOM-killed. It eventually recovered.

The machine log is available at: https://pastebin.canonical.com/p/ZqpWQYx9r2/

Digging through it, the one bit that seems very surprising is a *lot* of lines with:

2021-07-16 16:04:01 INFO juju.apiserver tools.go:214 2.9.7-ubuntu-amd64 agent binaries not found locally, fetching
...
2021-07-16 16:04:01 INFO juju.apiserver tools.go:258 fetching 2.9.7-ubuntu-amd64 agent binaries from https://streams.canonical.com/juju/tools/agent/2.9.7/juju-2.9.7-ubuntu-amd64.tgz

This occurs about 20 times, for a mix of amd64, arm64, s390x, and ppc64el.

The main concern is that the *same* .tgz shows up multiple times.

After the OOM kill, there isn't another download of amd64, but there are
arm64: 2
ppc64el: 1
s390x: 2

Now those are within a few seconds of each other. So it *might* just be a logging thing. Or it may be a case that if we get 5 clients asking for amd64 at the same time, we start downloading it 5 times simultaneously.

Revision history for this message
John A Meinel (jameinel) wrote :

Note that the VM went unresponsive for a while (before the OOM), so a fair bit of data was lost; the machine would not even respond to regular SSH requests.

Revision history for this message
Tom Haddon (mthaddon) wrote :

subscribed ~field-high

Harry Pidcock (hpidcock)
Changed in juju:
status: Triaged → In Progress
assignee: nobody → Harry Pidcock (hpidcock)
Revision history for this message
Harry Pidcock (hpidcock) wrote :

At first glance it looks like readAndHash reads the entire tarball into a slice, which is part of the problem. The second part is that there is no synchronization on the download, so on a heavily loaded system or a really slow connection, multiple concurrent requests can trigger many simultaneous downloads.

I'm going to investigate further.

Revision history for this message
Ian Booth (wallyworld) wrote (last edit):

There's a generic issue with clients (agents) requesting kvm/lxd container images, agent tarballs, or even charm resources from the controller. Such requests go via the controller, which then acts as a cache for subsequent requests, but there's no mediation, as stated in comment #4. E.g. see this related bug:

https://bugs.launchpad.net/juju/+bug/1905703

I'm sure there's a similar one for lxd images or agent tarballs but can't locate it at the moment.

And separately, the caching is model-specific:

https://bugs.launchpad.net/juju/+bug/1900021

Changed in juju:
milestone: 2.9-next → 2.9.10
Revision history for this message
Harry Pidcock (hpidcock) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released