Canonical Juju

OOM and high load upgrading to 2.9.7

Bug #1936684 reported by John A Meinel on 2021-07-16

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	High	Harry Pidcock	Canonical Juju 2.9.10

Bug Description

On one of the IS controllers, we upgraded the controller to 2.9.7 on Xenial. (We originally had trouble reaching an apt repository, but once that was sorted, the controller moved over just fine.)

Afterward, the model was upgraded to 2.9.7, after which the system seemed to struggle, hitting very high CPU and memory usage until jujud was OOM killed. It eventually recovered.

The machine log is available at: https://pastebin.canonical.com/p/ZqpWQYx9r2/

Digging through it, the one bit that seems very surprising is a *lot* of lines with:

2021-07-16 16:04:01 INFO juju.apiserver tools.go:214 2.9.7-ubuntu-amd64 agent binaries not found locally, fetching
...
2021-07-16 16:04:01 INFO juju.apiserver tools.go:258 fetching 2.9.7-ubuntu-amd64 agent binaries from https://streams.canonical.com/juju/tools/agent/2.9.7/juju-2.9.7-ubuntu-amd64.tgz

This occurs about 20 times, for a mix of amd64, arm64, s390x, and ppc64el.

The main concern is that the *same* .tgz shows up multiple times.

After the OOM kill, there isn't another download of amd64, but there are more downloads for the other architectures:
arm64: 2
ppc64el: 1
s390x: 2

Those are within a few seconds of each other, so it *might* just be a logging artifact. Or it may be that if 5 clients ask for amd64 at the same time, we start downloading it 5 times simultaneously.
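
To make the "5 clients, 5 downloads" scenario concrete, here is a minimal sketch of how concurrent requests for the same agent binary could be coalesced so that only one download runs per key. This is not Juju's code and not the eventual fix; fetchOnce, call, newFetchOnce and Do are hypothetical names, and the fetch callback stands in for whatever actually downloads and stores the .tgz.

```go
package main

import "sync"

// call tracks one download, either in flight or already completed.
// (Hypothetical type, for illustration only.)
type call struct {
	done chan struct{} // closed once fetch has returned
	err  error         // result of fetch, valid after done is closed
}

// fetchOnce ensures at most one fetch per key runs at a time and that a
// successful fetch is not repeated. (Hypothetical type.)
type fetchOnce struct {
	mu    sync.Mutex
	calls map[string]*call
}

func newFetchOnce() *fetchOnce {
	return &fetchOnce{calls: make(map[string]*call)}
}

// Do runs fetch for key unless a fetch for that key is already running or
// has already succeeded, in which case it waits for and returns that result.
func (f *fetchOnce) Do(key string, fetch func() error) error {
	f.mu.Lock()
	if c, ok := f.calls[key]; ok {
		f.mu.Unlock()
		<-c.done // wait for the first caller's download
		return c.err
	}
	c := &call{done: make(chan struct{})}
	f.calls[key] = c
	f.mu.Unlock()

	c.err = fetch()
	if c.err != nil {
		// Forget failed downloads so a later request can retry.
		f.mu.Lock()
		delete(f.calls, key)
		f.mu.Unlock()
	}
	close(c.done)
	return c.err
}
```

With something like this, five agents asking for "2.9.7-ubuntu-amd64" at once would trigger a single fetch, while a request for arm64 would still start its own. A real implementation would also need to handle a panicking fetch and bound the size of the map.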

John A Meinel (jameinel) wrote on 2021-07-16:

Seems to correspond with this timeframe:
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=scalingstack-bos01-unspecified-scalingstack-bos01&var-host=All&var-node=All&from=1626433329650&to=1626456529650

John A Meinel (jameinel) wrote on 2021-07-16:

Note that the VM went unresponsive for a while (before the OOM), so there is a fair bit of data lost. (the machine would not even respond to regular SSH requests)

Tom Haddon (mthaddon) wrote on 2021-07-23:

subscribed ~field-high

Harry Pidcock (hpidcock) on 2021-07-26

Changed in juju:
status:	Triaged → In Progress
assignee:	nobody → Harry Pidcock (hpidcock)

Harry Pidcock (hpidcock) wrote on 2021-07-26:

At first glance, it looks like readAndHash reads the entire tar into a slice, which is part of the problem. The second part is that there is no synchronization on the download, so on a heavily loaded system or a really slow connection, multiple requests can cause many downloads.

I'm going to investigate further.
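
For illustration, the first point (avoiding the in-memory slice) could look roughly like the sketch below, which hashes the data while streaming it to its destination. This is not the body of readAndHash and not the eventual fix; hashWhileCopying and digestOf are hypothetical names, and SHA-256 is an assumption about the hash in use.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"strings"
)

// hashWhileCopying streams src into dst and computes a SHA-256 digest of the
// bytes as they pass through, so memory use stays at io.Copy's buffer size
// rather than the size of the whole tarball. (Hypothetical helper.)
func hashWhileCopying(dst io.Writer, src io.Reader) (int64, string, error) {
	h := sha256.New()
	// TeeReader writes everything read from src into the hash.
	n, err := io.Copy(dst, io.TeeReader(src, h))
	if err != nil {
		return n, "", err
	}
	return n, hex.EncodeToString(h.Sum(nil)), nil
}

// digestOf is a small usage example: it hashes an in-memory string and
// discards the copied bytes.
func digestOf(s string) (int64, string, error) {
	return hashWhileCopying(io.Discard, strings.NewReader(s))
}
```

In practice dst would be a file or the controller's binary storage and src the HTTP response body, so the tarball would never need to be held in memory in full.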

Ian Booth (wallyworld) wrote on 2021-07-26 (last edit on 2021-07-26):

There's a generic issue with clients (agents) requesting from the controller the delivery of kvm/lxd container images, agent tarballs, or even charm resources. Such requests go via the controller, which then acts as a cache for subsequent requests, but there's no mediation, as stated in comment #4. E.g. see this related bug:

https://bugs.launchpad.net/juju/+bug/1905703

I'm sure there's a similar one for lxd images or agent tarballs but can't locate it at the moment.

And then, the caching is model-specific:

https://bugs.launchpad.net/juju/+bug/1900021

Changed in juju:
milestone:	2.9-next → 2.9.10

Harry Pidcock (hpidcock) wrote on 2021-07-30:

https://github.com/juju/juju/pull/13198

Changed in juju:
status:	In Progress → Fix Committed

Canonical Juju QA Bot (juju-qa-bot) on 2021-08-03

Changed in juju:
status:	Fix Committed → Fix Released
