Controller logs collections growing unbounded

Bug #1811147 reported by Paul Collins
This bug affects 3 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Tim Penhey

Bug Description

Over the 2018 year-end break, we had a couple of incidents with our large production Juju 2 controller (running 2.4.7) where the logs collections for a handful of models grew to an extremely large size. Times and sizes for the largest offender (timestamps manually adjusted from NZDT to UTC, hopefully correctly!):

2018-12-25T01:59:34 <pjdc> logs.logs.ef6d8809-0b33-4461-845f-ddee6ca668dd: 14.1 GB (4.8 GB)
2018-12-26T21:19:14 <pjdc> logs.logs.ef6d8809-0b33-4461-845f-ddee6ca668dd: 25.2 GB (5.4 GB)

After each incident I simply dropped the offending collection, so as you can see here it regrew from 0 documents to that size in a pretty short time.

(Values obtained with https://gist.github.com/joeyAghion/6511184#gistcomment-1644191)
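
For reference, a minimal sketch in the same spirit as that gist (not the exact script), run from the mongo shell against the "logs" database; it uses collStats' "size" (uncompressed data) and "storageSize" (on disk) fields:

use logs
var sizes = db.getCollectionNames().map(function (name) {
    var s = db.getCollection(name).stats();
    return { name: name, dataGB: s.size / 1e9, diskGB: s.storageSize / 1e9 };
});
// sort largest-first and print as "<collection>: <data> GB (<on-disk> GB)"
sizes.sort(function (a, b) { return b.dataGB - a.dataGB; });
sizes.slice(0, 10).forEach(function (c) {
    print(c.name + ": " + c.dataGB.toFixed(1) + " GB (" + c.diskGB.toFixed(1) + " GB)");
});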

Ordinarily /var/lib/juju/db on this controller is about 26G or so; while this was happening, it would reach 39G, which is a lot of pressure on a 50G machine, and required a bunch of work to resync and compact the mongodb secondaries after dropping the collections on the primary. (This may not have strictly been necessary; during the second incident one of the secondaries released the space while I was resyncing the other, but perhaps this is a quirk of how its data happened to be laid out at the time and is not reliable. Anyway.)
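
(For reference, a hedged sketch of what the per-collection compaction step can look like, run against the "logs" database on a secondary; the collection name is taken from the example above. Note that compact can block operations on the node while it runs, so take the secondary out of rotation first.)

// compact reclaims disk space for a single collection on the node it is run on
db.runCommand({ compact: "logs.ef6d8809-0b33-4461-845f-ddee6ca668dd" })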

Eventually I noticed that these models all had one or more units whose disks were completely full, each of which had one or more agents showing as "failed" in juju status. Once I recovered these units, their models' logs collections stopped growing so heinously.

I've reproduced a similar condition with a test controller, and it seems that when a unit's disk is full, it logs this sort of thing (first query to establish timeframe) every 3 seconds:

juju:PRIMARY> db.getCollection('logs.b878b48e-50be-476d-8659-0acc985f0261').find().limit(1).pretty()
{
        "_id" : ObjectId("5c33d0d9cc76f834a99676f0"),
        "t" : NumberLong("1546899670082887937"),
        "n" : "machine-0",
        "r" : "2.4.7",
        "m" : "juju.cmd",
        "l" : "supercommand.go:56",
        "v" : 3,
        "x" : "running jujud [2.4.7 gc go1.11.2]"
}
juju:PRIMARY> db.getCollection('logs.b878b48e-50be-476d-8659-0acc985f0261').find().skip(db.getCollection('logs.b878b48e-50be-476d-8659-0acc985f0261').count() - 6)
{ "_id" : ObjectId("5c365f0dcc76f834a998f384"), "t" : NumberLong("1547067147390773457"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.agent.tools", "l" : "symlinks.go:20", "v" : 3, "x" : "ensure jujuc symlinks in /var/lib/juju/tools/unit-ubuntu-0" }
{ "_id" : ObjectId("5c365f0dcc76f834a998f385"), "t" : NumberLong("1547067147391013866"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.agent.tools", "l" : "symlinks.go:40", "v" : 3, "x" : "was a symlink, now looking at /var/lib/juju/tools/2.4.7-bionic-amd64" }
{ "_id" : ObjectId("5c365f0dcc76f834a998f386"), "t" : NumberLong("1547067147414189930"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.worker.dependency", "l" : "engine.go:632", "v" : 5, "x" : "\"uniter\" manifold worker returned unexpected error: failed to initialize uniter for \"unit-ubuntu-0\": creating juju run listener: mkdir /var/lib/juju/agents/unit-ubuntu-0/299050935: no space left on device" }
{ "_id" : ObjectId("5c365f10cc76f834a998f387"), "t" : NumberLong("1547067150248296983"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.agent.tools", "l" : "symlinks.go:20", "v" : 3, "x" : "ensure jujuc symlinks in /var/lib/juju/tools/unit-ubuntu-0" }
{ "_id" : ObjectId("5c365f10cc76f834a998f388"), "t" : NumberLong("1547067150248494302"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.agent.tools", "l" : "symlinks.go:40", "v" : 3, "x" : "was a symlink, now looking at /var/lib/juju/tools/2.4.7-bionic-amd64" }
{ "_id" : ObjectId("5c365f10cc76f834a998f389"), "t" : NumberLong("1547067150272000356"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.worker.dependency", "l" : "engine.go:632", "v" : 5, "x" : "\"uniter\" manifold worker returned unexpected error: failed to initialize uniter for \"unit-ubuntu-0\": creating juju run listener: mkdir /var/lib/juju/agents/unit-ubuntu-0/025847210: no space left on device" }
juju:PRIMARY> db.getCollection('logs.b878b48e-50be-476d-8659-0acc985f0261').find({l:"engine.go:632"}).count()
46619
juju:PRIMARY>

I've checked the controller's logs, and the log-pruner was apparently failing frequently during the window above, possibly the entire time. For example, following the first incident I dropped the offending collections at approximately 02:01, but:

2018-12-25 01:51:40 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: failed to prune logs by time: read tcp 10.25.2.109:44628->10.25.2.111:37017: i/o timeout
2018-12-25 01:57:56 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: failed to prune logs by time: read tcp 10.25.2.109:52512->10.25.2.111:37017: i/o timeout
2018-12-25 02:14:06 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning failed: read tcp 10.25.2.109:44048->10.25.2.111:37017: i/o timeout
2018-12-25 02:21:42 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: failed to prune logs by time: read tcp 10.25.2.109:54450->10.25.2.111:37017: i/o timeout
2018-12-25 02:52:37 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning timestamp query failed: read tcp 10.25.2.109:54042->10.25.2.111:37017: i/o timeout
2018-12-25 03:07:48 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning timestamp query failed: read tcp 10.25.2.109:44590->10.25.2.111:37017: i/o timeout
2018-12-25 03:23:32 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning timestamp query failed: read tcp 10.25.2.109:55908->10.25.2.111:37017: i/o timeout
2018-12-25 03:33:15 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning timestamp query failed: read tcp 10.25.2.109:54778->10.25.2.111:37017: i/o timeout

Tim Penhey (thumper)
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.5.1
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.5.1 → 2.5.2
Changed in juju:
milestone: 2.5.2 → 2.5.3
Changed in juju:
milestone: 2.5.3 → 2.5.4
Changed in juju:
milestone: 2.5.4 → 2.5.5
Paul Collins (pjdc) wrote :

Since reporting this bug we've run into this a few more times and have been forced to resize our Juju 2 controllers to 100G disks to avoid outages occurring before we have time to react.

Jamon Camisso (jamon) wrote :

Here are two example messages from an outage this weekend where units ran out of disk, and their agent logs filled the controller's disks and broke the HA cluster:

juju:PRIMARY> db["logs.0db6f159-e42d-4a6e-8ece-206ecce03fca"].findOne()
{
        "_id" : ObjectId("5ccb5e345f5ce8159000f8d3"),
        "t" : NumberLong("1556831796258846439"),
        "n" : "unit-nrpe-0",
        "r" : "2.4.7",
        "m" : "juju.worker.dependency",
        "l" : "engine.go:632",
        "v" : 5,
        "x" : "\"metric-sender\" manifold worker returned unexpected error: mkdir /var/lib/juju/agents/unit-nrpe-0/782349613: no space left on device"
}

juju:PRIMARY> db['logs.466165d9-6c80-4833-835f-8fee6e1f32d2'].findOne()
{
        "_id" : ObjectId("5cca89ba5f5ce81590f2394c"),
        "t" : NumberLong("1556777402406471200"),
        "n" : "unit-charmstore-0",
        "r" : "2.4.7",
        "m" : "juju.worker.dependency",
        "l" : "engine.go:632",
        "v" : 5,
        "x" : "\"api-address-updater\" manifold worker returned unexpected error: error setting addresses: cannot write agent configuration: cannot write \"/var/lib/juju/agents/unit-charmstore-0/agent.conf\" contents: write /var/lib/juju/agents/unit-charmstore-0/agent.conf903797376: no space left on device"
}

It would be helpful if repeated 'no space left on device' messages incremented a counter, or somehow didn't repeatedly write the same/similar data until the state of the unit changes.
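
As a rough way to quantify that repetition, an aggregation grouping identical messages per entity works (sketched below against the first collection above); allowDiskUse keeps the $group stage from hitting the aggregation memory limit on a large collection.

// top 5 most-repeated (entity, message) pairs in the model's log collection
db["logs.0db6f159-e42d-4a6e-8ece-206ecce03fca"].aggregate([
    { $group: { _id: { entity: "$n", message: "$x" }, count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: 5 }
], { allowDiskUse: true })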

Tim Penhey (thumper) wrote : Re: [Bug 1811147] Re: failed units can cause models' logs collections to grow excessively

Related to this, I think, is backing off a failed worker rather than just
restarting it every three seconds.

I feel like we should have some initial period during which worker
failures are backed off, perhaps something in the order of thirty
seconds. That is, if the worker failed within the first 30 seconds of
running, then instead of restarting it in 3 seconds we apply the
exponential backoff that we already have for workers that fail to start
at all.

This would help reduce the log messages, as there would be one every
five minutes rather than one every three seconds.

We could also look at compressing duplicate rows: if we see the same
error message as the last one, keep a counter, and emit the count when
the first different message arrives.

This wouldn't help all situations, but it would help many.


Joel Sing (jsing)
summary: - failed units can cause models' logs collections to grow excessively
+ Controller logs collections growing unbounded
Joel Sing (jsing) wrote :

There are multiple issues at hand here:

1. Any Juju agent can cause the controller logs collections to grow unbounded and run the controller(s) out of disk space.

2. A Juju agent spews logs when it encounters an ENOSPC situation.

3. Juju worker failures result in a crash loop with no backoff - this exacerbates (2), which results in an out-of-disk-space instance causing (1).

I've retitled this bug to specifically target the unbounded growth of the logs collections on the controllers (which is presumably due to the log pruner failing or failing to keep up with the rate of growth). This one is also the most critical from an operational perspective.
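
One hedged way to check whether the pruner is keeping up is to count documents older than the retention window. The sketch below assumes a 72-hour window, which may not match a given controller's max-logs-age setting, and uses a collection name from the examples above; the "t" field is a Unix timestamp in nanoseconds.

// documents that the time-based pruner should already have removed
var cutoffNs = (Date.now() - 72 * 60 * 60 * 1000) * 1e6;
db.getCollection("logs.0db6f159-e42d-4a6e-8ece-206ecce03fca").find({ t: { $lt: cutoffNs } }).count()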

I've retitled https://bugs.launchpad.net/juju/+bug/1827664 to target the fact that Juju agents spew logs on ENOSPC. A separate bug probably should be raised for (3).

Changed in juju:
milestone: 2.5.6 → 2.5.8
Changed in juju:
milestone: 2.5.8 → 2.5.9
Paul Collins (pjdc) wrote :

We've just run into this again on our 2.5.1 controller.

The "logs" database is ~23G, with the largest collections as follows:

logs.logs.0db6f159-e42d-4a6e-8ece-206ecce03fca: 31.1 GB (6.1 GB)
logs.logs.b7c04276-02fc-4a42-8da7-e6172bfcec90: 20.8 GB (4.2 GB)
logs.logs.e7b80f5e-0aa2-491d-8690-76ba0154e2e2: 20.4 GB (3.2 GB)
logs.logs.9231858f-69dd-47f1-8928-fdee501405f2: 19.7 GB (3.1 GB)
logs.logs.f8bcb962-cb4a-4e98-82d0-e0c883358190: 10.7 GB (1.6 GB)

Tim Penhey (thumper) wrote :

The fix was actually released in 2.6.5, but I don't want to open that milestone just to assign this bug.

Changed in juju:
milestone: 2.5.9 → 2.6.7
status: Triaged → Fix Released
assignee: nobody → Tim Penhey (thumper)
Tim Penhey (thumper) wrote :

To address @jsing's commentary above:

1. Any Juju agent can cause the controller logs collections to grow unbounded and run the controller(s) out of disk space.

 - This is addressed in 2.6.5 with the introduction of capped log collections per model (a way to check this on the controller DB is sketched after this list).

2. A Juju agent spews logs when it encounters an ENOSPC situation.

 - The metric sender writes an error when it isn't able to create a directory on startup. Not quite a spew.

3. Juju worker failures result in a crash loop with no backoff - this exacerbates (2), which results in an out-of-disk-space instance causing (1).

 - However, the backoff code had a bug which would cause it not to back off at all, which does exacerbate (2) above. This is also fixed in the 2.6 branch and in 2.5.9.
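
(If the per-model cap is implemented as an actual MongoDB capped collection, which is an assumption here since it could instead be enforced by the pruner, it can be verified from the mongo shell via collection stats; the collection name below is illustrative.)

var s = db.getCollection("logs.b878b48e-50be-476d-8659-0acc985f0261").stats();
// capped collections report capped: true plus a maxSize in bytes
print("capped: " + s.capped + (s.capped ? ", maxSize: " + s.maxSize + " bytes" : ""));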

Changed in juju:
milestone: 2.6.7 → 2.6.8
Anastasia (anastasia-macmood) wrote :

The 2.6.7 release never went out. This fix was instead released in 2.6.8 - re-targeting this report.
