OOM and high load upgrading to 2.9.7
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Harry Pidcock |
Bug Description
On one of the IS controllers, we upgraded the controller to 2.9.7 on Xenial. (We originally had trouble reaching an apt repository, but once that was sorted, the controller upgrade went through fine.)
Afterward, the model was also upgraded to 2.9.7, after which the system seemed to struggle, hitting very high CPU/memory usage until jujud was OOM killed. It eventually recovered.
The machine log is available at: https:/
Digging through it, the one bit that seems very surprising is a *lot* of lines with:
2021-07-16 16:04:01 INFO juju.apiserver tools.go:214 2.9.7-ubuntu-amd64 agent binaries not found locally, fetching
...
2021-07-16 16:04:01 INFO juju.apiserver tools.go:258 fetching 2.9.7-ubuntu-amd64 agent binaries from https:/
This occurs about 20 times, for a mix of amd64, arm64, s390x, and ppc64el.
The main concern is that the *same* .tgz shows up multiple times.
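For reference, a minimal sketch of how one might tally these fetch lines per architecture from the machine log. The log path and the matched substring are assumptions based on the excerpt above, not a verified recipe:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical machine log path; adjust for the machine in question.
	f, err := os.Open("/var/log/juju/machine-0.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// Match the "fetching" lines emitted by juju.apiserver tools.go.
		if strings.Contains(line, "agent binaries not found locally, fetching") {
			for _, arch := range []string{"amd64", "arm64", "s390x", "ppc64el"} {
				if strings.Contains(line, arch) {
					counts[arch]++
				}
			}
		}
	}
	fmt.Println(counts)
}
```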
After the OOM kill, there isn't another download of amd64, but there are additional downloads for:
arm64: 2
ppc64el: 1
s390x: 2
Now those are within a few seconds of each other. So it *might* just be a logging thing. Or it may be that if we get 5 clients asking for amd64 at the same time, we start downloading it 5 times simultaneously instead of sharing one download.
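If the latter is the case, the usual fix is to collapse concurrent requests for the same key into a single in-flight download. Below is a minimal Go sketch of that pattern using golang.org/x/sync/singleflight; fetchAgentBinaries and the keying scheme are hypothetical illustrations, not Juju's actual code:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchAgentBinaries is a hypothetical stand-in for the real download.
func fetchAgentBinaries(version string) ([]byte, error) {
	fmt.Println("downloading", version)
	return []byte("tarball-for-" + version), nil
}

// getAgentBinaries collapses concurrent requests for the same
// version/arch key into a single download.
func getAgentBinaries(version string) ([]byte, error) {
	v, err, _ := group.Do(version, func() (interface{}, error) {
		return fetchAgentBinaries(version)
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}

func main() {
	var wg sync.WaitGroup
	// Five concurrent clients asking for the same amd64 tarball
	// should trigger only one "downloading" line.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			getAgentBinaries("2.9.7-ubuntu-amd64")
		}()
	}
	wg.Wait()
}
```

Every Do call with the same key while a download is in flight shares the single result, so 5 simultaneous amd64 requests become one download rather than five.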
Changed in juju:
status: Triaged → In Progress
assignee: nobody → Harry Pidcock (hpidcock)

Changed in juju:
status: Fix Committed → Fix Released
Seems to correspond with this timeframe: https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=scalingstack-bos01-unspecified-scalingstack-bos01&var-host=All&var-node=All&from=1626433329650&to=1626456529650