Controller logs collections growing unbounded
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Fix Released | High | Tim Penhey | |
Bug Description
Over the 2018 year-end break, we had a couple of incidents with our large production Juju 2 controller (running 2.4.7) where the logs collections for a handful of models grew to an extremely large size. Times and sizes for the largest offender (timestamps manually adjusted from NZDT to UTC, hopefully correctly!):
2018-12-25T01:59:34 <pjdc> logs.logs.
2018-12-26T21:19:14 <pjdc> logs.logs.
After each incident I simply dropped the offending collection, so as you can see above, it regrew from zero documents to that size in a pretty short time.
(Values obtained with https:/
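The tool link above got truncated, but roughly equivalent numbers can be pulled straight from the mongo shell. A minimal sketch, assuming the per-model log collections live in the logs database, as the logs.logs.* namespaces above suggest:

// Sketch only: print document count and data size for each collection
// in the logs database.
var logsdb = db.getSiblingDB("logs");
logsdb.getCollectionNames().forEach(function (name) {
    var s = logsdb.getCollection(name).stats();
    print(name + ": " + s.count + " docs, " + s.size + " bytes");
});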
Ordinarily /var/lib/juju/db on this controller is about 26G; while this was happening it would reach 39G, which is a lot of pressure on a 50G machine, and it took a bunch of work to resync and compact the mongodb secondaries after dropping the collections on the primary. (This may not have been strictly necessary; during the second incident one of the secondaries released the space on its own while I was resyncing the other, but that may just be a quirk of how its data happened to be laid out at the time, and probably isn't something to rely on. Anyway.)
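For completeness, the drop itself was nothing fancy; a hypothetical sketch, with the collection name standing in for the affected model's UUID:

// Hypothetical cleanup; "logs.<model-uuid>" is a placeholder, not a
// real collection name.
db.getSiblingDB("logs").getCollection("logs.<model-uuid>").drop();

Resyncing a secondary is the standard procedure of stopping it, clearing its data directory, and letting it initial-sync from the other members; whether a subsequent compact actually returns disk space to the OS depends on the storage engine, which would be consistent with the inconsistent behaviour described above.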
Eventually I noticed that these models all had one or more units whose disks were completely full, each of which had one or more agents showing as "failed" in juju status. Once I recovered these units, their models' logs collections stopped growing so heinously.
I've reproduced a similar condition with a test controller, and it seems that when a unit's disk is full, it logs this sort of thing every 3 seconds (the first query below just establishes the timeframe; a cleaned-up reconstruction of the truncated commands follows the transcript):
juju:PRIMARY> db.getCollectio
{
"_id" : ObjectId(
"t" : NumberLong(
"n" : "machine-0",
"r" : "2.4.7",
"m" : "juju.cmd",
"l" : "supercommand.
"v" : 3,
"x" : "running jujud [2.4.7 gc go1.11.2]"
}
juju:PRIMARY> db.getCollectio
{ "_id" : ObjectId(
{ "_id" : ObjectId(
{ "_id" : ObjectId(
{ "_id" : ObjectId(
{ "_id" : ObjectId(
{ "_id" : ObjectId(
juju:PRIMARY> db.getCollectio
46619
juju:PRIMARY>
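(The shell commands above are truncated in this report; a plausible reconstruction, with the collection name as a placeholder, is:)

// Plausible reconstruction of the truncated queries (placeholder
// collection name; "t" is the timestamp field in the sample document).
var c = db.getSiblingDB("logs").getCollection("logs.<model-uuid>");
c.findOne();                        // one document, to establish the timeframe
c.find().sort({ t: -1 }).limit(6);  // the most recent few entries
c.count();                          // total documents accumulated so far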
I've checked the controller's logs, and the log-pruner was apparently failing repeatedly during the window above, possibly the entire time. For example, following the first incident I dropped the offending collections at approximately 02:01, but:
2018-12-25 01:51:40 ERROR juju.worker.
2018-12-25 01:57:56 ERROR juju.worker.
2018-12-25 02:14:06 ERROR juju.worker.
2018-12-25 02:21:42 ERROR juju.worker.
2018-12-25 02:52:37 ERROR juju.worker.
2018-12-25 03:07:48 ERROR juju.worker.
2018-12-25 03:23:32 ERROR juju.worker.
2018-12-25 03:33:15 ERROR juju.worker.
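Incidentally, a quick way to confirm this kind of runaway growth while it's happening is to sample the count twice; a rough sketch (placeholder collection name again):

// Rough growth-rate check: count, wait a minute, count again.
var c = db.getSiblingDB("logs").getCollection("logs.<model-uuid>");
var before = c.count();
sleep(60 * 1000);  // mongo shell sleep() takes milliseconds
print((c.count() - before) + " new log documents in the last minute");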
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.5.1

Changed in juju:
milestone: 2.5.1 → 2.5.2

Changed in juju:
milestone: 2.5.2 → 2.5.3

Changed in juju:
milestone: 2.5.3 → 2.5.4

Changed in juju:
milestone: 2.5.4 → 2.5.5

summary:
- failed units can cause models' logs collections to grow excessively
+ Controller logs collections growing unbounded

Changed in juju:
milestone: 2.5.6 → 2.5.8

Changed in juju:
milestone: 2.5.8 → 2.5.9

Changed in juju:
milestone: 2.6.7 → 2.6.8
Since reporting this bug we've run into it a few more times, and we've been forced to resize our Juju 2 controllers to 100G disks so that we have headroom to react before an outage.