Controller logs collections growing unbounded

Bug #1811147 reported by Paul Collins on 2019-01-09
This bug affects 2 people
Affects: juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: 2.5.9

Bug Description

Over the 2018 year-end break, we had a couple of incidents with our large production Juju 2 controller (running 2.4.7) where the logs collections for a handful of models grew to an extremely large size. Times and sizes for the largest offender (timestamps manually adjusted from NZDT to UTC, hopefully correctly!):

2018-12-25T01:59:34 <pjdc> logs.logs.ef6d8809-0b33-4461-845f-ddee6ca668dd: 14.1 GB (4.8 GB)
2018-12-26T21:19:14 <pjdc> logs.logs.ef6d8809-0b33-4461-845f-ddee6ca668dd: 25.2 GB (5.4 GB)

After each incident I simply dropped the offending collection, so as you can see here it regrew from 0 documents to that size in a pretty short time.

(Values obtained with https://gist.github.com/joeyAghion/6511184#gistcomment-1644191)

Ordinarily /var/lib/juju/db on this controller is about 26G or so; while this was happening, it would reach 39G, which is a lot of pressure on a 50G machine, and required a bunch of work to resync and compact the mongodb secondaries after dropping the collections on the primary. (This may not have strictly been necessary; during the second incident one of the secondaries released the space while I was resyncing the other, but perhaps this is a quirk of how its data happened to be laid out at the time and is not reliable. Anyway.)

Eventually I noticed that these models all had one or more units whose disks were completely full, each of which had one or more agents showing as "failed" in juju status. Once I recovered these units, their models' logs collections stopped growing so heinously.

I've reproduced a similar condition with a test controller, and it seems that when a unit's disk is full, it logs this sort of thing every 3 seconds (the first query below is just to establish the timeframe):

juju:PRIMARY> db.getCollection('logs.b878b48e-50be-476d-8659-0acc985f0261').find().limit(1).pretty()
{
        "_id" : ObjectId("5c33d0d9cc76f834a99676f0"),
        "t" : NumberLong("1546899670082887937"),
        "n" : "machine-0",
        "r" : "2.4.7",
        "m" : "juju.cmd",
        "l" : "supercommand.go:56",
        "v" : 3,
        "x" : "running jujud [2.4.7 gc go1.11.2]"
}
juju:PRIMARY> db.getCollection('logs.b878b48e-50be-476d-8659-0acc985f0261').find().skip(db.getCollection('logs.b878b48e-50be-476d-8659-0acc985f0261').count() - 6)
{ "_id" : ObjectId("5c365f0dcc76f834a998f384"), "t" : NumberLong("1547067147390773457"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.agent.tools", "l" : "symlinks.go:20", "v" : 3, "x" : "ensure jujuc symlinks in /var/lib/juju/tools/unit-ubuntu-0" }
{ "_id" : ObjectId("5c365f0dcc76f834a998f385"), "t" : NumberLong("1547067147391013866"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.agent.tools", "l" : "symlinks.go:40", "v" : 3, "x" : "was a symlink, now looking at /var/lib/juju/tools/2.4.7-bionic-amd64" }
{ "_id" : ObjectId("5c365f0dcc76f834a998f386"), "t" : NumberLong("1547067147414189930"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.worker.dependency", "l" : "engine.go:632", "v" : 5, "x" : "\"uniter\" manifold worker returned unexpected error: failed to initialize uniter for \"unit-ubuntu-0\": creating juju run listener: mkdir /var/lib/juju/agents/unit-ubuntu-0/299050935: no space left on device" }
{ "_id" : ObjectId("5c365f10cc76f834a998f387"), "t" : NumberLong("1547067150248296983"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.agent.tools", "l" : "symlinks.go:20", "v" : 3, "x" : "ensure jujuc symlinks in /var/lib/juju/tools/unit-ubuntu-0" }
{ "_id" : ObjectId("5c365f10cc76f834a998f388"), "t" : NumberLong("1547067150248494302"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.agent.tools", "l" : "symlinks.go:40", "v" : 3, "x" : "was a symlink, now looking at /var/lib/juju/tools/2.4.7-bionic-amd64" }
{ "_id" : ObjectId("5c365f10cc76f834a998f389"), "t" : NumberLong("1547067150272000356"), "n" : "unit-ubuntu-0", "r" : "2.4.7", "m" : "juju.worker.dependency", "l" : "engine.go:632", "v" : 5, "x" : "\"uniter\" manifold worker returned unexpected error: failed to initialize uniter for \"unit-ubuntu-0\": creating juju run listener: mkdir /var/lib/juju/agents/unit-ubuntu-0/025847210: no space left on device" }
juju:PRIMARY> db.getCollection('logs.b878b48e-50be-476d-8659-0acc985f0261').find({l:"engine.go:632"}).count()
46619
juju:PRIMARY>

I've checked the controller's logs, and the log-pruner was apparently failing a lot during the window above, possibly the entire time. For example, following the first incident I dropped the offending collections at approximately 02:01, but:

2018-12-25 01:51:40 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: failed to prune logs by time: read tcp 10.25.2.109:44628->10.25.2.111:37017: i/o timeout
2018-12-25 01:57:56 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: failed to prune logs by time: read tcp 10.25.2.109:52512->10.25.2.111:37017: i/o timeout
2018-12-25 02:14:06 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning failed: read tcp 10.25.2.109:44048->10.25.2.111:37017: i/o timeout
2018-12-25 02:21:42 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: failed to prune logs by time: read tcp 10.25.2.109:54450->10.25.2.111:37017: i/o timeout
2018-12-25 02:52:37 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning timestamp query failed: read tcp 10.25.2.109:54042->10.25.2.111:37017: i/o timeout
2018-12-25 03:07:48 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning timestamp query failed: read tcp 10.25.2.109:44590->10.25.2.111:37017: i/o timeout
2018-12-25 03:23:32 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning timestamp query failed: read tcp 10.25.2.109:55908->10.25.2.111:37017: i/o timeout
2018-12-25 03:33:15 ERROR juju.worker.dependency engine.go:632 "log-pruner" manifold worker returned unexpected error: log pruning timestamp query failed: read tcp 10.25.2.109:54778->10.25.2.111:37017: i/o timeout

Tim Penhey (thumper) on 2019-01-10
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.5.1
Ian Booth (wallyworld) on 2019-01-28
Changed in juju:
milestone: 2.5.1 → 2.5.2
Changed in juju:
milestone: 2.5.2 → 2.5.3
Changed in juju:
milestone: 2.5.3 → 2.5.4
Changed in juju:
milestone: 2.5.4 → 2.5.5
Paul Collins (pjdc) wrote :

Since reporting this bug we've run into this a few more times and have been forced to resize our Juju 2 controllers to 100G disk to avoid outages before we have time to react.

Jamon Camisso (jamon) wrote :

Here are two example messages from an outage this weekend where units ran out of disk, and their agent logs filled the controller's disks and broke the HA cluster:

juju:PRIMARY> db["logs.0db6f159-e42d-4a6e-8ece-206ecce03fca"].findOne()
{
        "_id" : ObjectId("5ccb5e345f5ce8159000f8d3"),
        "t" : NumberLong("1556831796258846439"),
        "n" : "unit-nrpe-0",
        "r" : "2.4.7",
        "m" : "juju.worker.dependency",
        "l" : "engine.go:632",
        "v" : 5,
        "x" : "\"metric-sender\" manifold worker returned unexpected error: mkdir /var/lib/juju/agents/unit-nrpe-0/782349613: no space left on device"
}

juju:PRIMARY> db['logs.466165d9-6c80-4833-835f-8fee6e1f32d2'].findOne()
{
        "_id" : ObjectId("5cca89ba5f5ce81590f2394c"),
        "t" : NumberLong("1556777402406471200"),
        "n" : "unit-charmstore-0",
        "r" : "2.4.7",
        "m" : "juju.worker.dependency",
        "l" : "engine.go:632",
        "v" : 5,
        "x" : "\"api-address-updater\" manifold worker returned unexpected error: error setting addresses: cannot write agent configuration: cannot write \"/var/lib/juju/agents/unit-charmstore-0/agent.conf\" contents: write /var/lib/juju/agents/unit-charmstore-0/agent.conf903797376: no space left on device"
}

It would be helpful if repeated 'no space left on device' messages incremented a counter, or somehow didn't repeatedly write the same/similar data until the state of the unit changes.

Related to this, I think, is backing off a failed worker rather than just restarting it every three seconds.

I feel like we should have some initial window during which worker failures are backed off, perhaps something on the order of thirty seconds: if the worker failed within the first 30 seconds of running, then instead of restarting it in 3 seconds, we would apply the exponential backoff that we already have for workers that fail to start at all.

This would help reduce the log messages, as there would eventually be one every five minutes rather than one every three seconds.
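
For illustration, here is a minimal, self-contained sketch of the kind of restart policy being proposed. This is not Juju's actual dependency-engine code; nextRestartDelay and the constants initialDelay, backoffWindow and maxDelay are names and values made up for the example.

// Illustrative sketch only: not Juju's worker-restart code. The names
// and constants below are assumptions chosen to match the suggestion
// above (3s base delay, 30s "healthy" window, 5 minute cap).
package main

import (
        "fmt"
        "time"
)

const (
        initialDelay  = 3 * time.Second  // current fixed restart delay
        backoffWindow = 30 * time.Second // failures this soon after start trigger backoff
        maxDelay      = 5 * time.Minute  // cap: roughly one restart (and log line) per five minutes
)

// nextRestartDelay returns how long to wait before restarting a worker,
// given how long it ran before failing and the delay used last time.
func nextRestartDelay(ranFor, lastDelay time.Duration) time.Duration {
        if ranFor >= backoffWindow {
                // Worker ran long enough to be considered healthy: reset.
                return initialDelay
        }
        // Worker failed shortly after starting: back off exponentially.
        delay := lastDelay * 2
        if delay < initialDelay {
                delay = initialDelay
        }
        if delay > maxDelay {
                delay = maxDelay
        }
        return delay
}

func main() {
        // Simulate a worker that crashes immediately every time it starts.
        delay := initialDelay
        for i := 0; i < 10; i++ {
                fmt.Printf("restart %d after %s\n", i+1, delay)
                delay = nextRestartDelay(0, delay)
        }
}

With those numbers the delay doubles from 3 seconds until it hits the 5 minute cap, which is where the "one message every five minutes" figure above comes from.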

We could also look at compressing duplicate rows: if we see the same error message as the last one, increment a counter, and emit the count when we get the first message that is different.

This wouldn't help all situations, but would help many.
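
For illustration, a minimal sketch of that "compress duplicate rows" idea. Again, this is not Juju's actual logging code; dedupWriter and its sink callback are hypothetical names. Note also that the real messages above differ in the random mkdir suffix, so a production version would probably need to compare a normalised form of the message rather than require exact equality.

// Illustrative sketch only: not Juju's log writer. Demonstrates counting
// consecutive identical messages and emitting the count once a different
// message arrives.
package main

import "fmt"

type dedupWriter struct {
        last  string
        count int
        sink  func(msg string) // where de-duplicated messages actually go
}

// Write emits a message unless it repeats the previous one, in which case
// it only increments a counter; the count is flushed as a summary line as
// soon as a different message arrives.
func (w *dedupWriter) Write(msg string) {
        if msg == w.last {
                w.count++
                return
        }
        w.flush()
        w.last = msg
        w.sink(msg)
}

func (w *dedupWriter) flush() {
        if w.count > 0 {
                w.sink(fmt.Sprintf("last message repeated %d times", w.count))
        }
        w.count = 0
}

func main() {
        w := &dedupWriter{sink: func(m string) { fmt.Println(m) }}
        w.Write(`"uniter" manifold worker returned unexpected error: no space left on device`)
        w.Write(`"uniter" manifold worker returned unexpected error: no space left on device`)
        w.Write(`"uniter" manifold worker returned unexpected error: no space left on device`)
        w.Write("some different message")
        w.flush() // emit any trailing repeat count
}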

Joel Sing (jsing) on 2019-05-06
summary: - failed units can cause models' logs collections to grow excessively
+ Controller logs collections growing unbounded
Joel Sing (jsing) wrote :

There are multiple issues at hand here:

1. Any Juju agent can cause the controller logs collections to grow unbounded and run the controller(s) out of disk space.

2. A Juju agent spews logs when it encounters an ENOSPC situation.

3. Juju worker failures result in a crash loop with no backoff; this exacerbates (2), which results in an out-of-disk-space instance causing (1).

I've retitled this bug to specifically target the unbounded growth of the logs collections on the controllers (which is presumably due to the log pruner failing or failing to keep up with the rate of growth). This one is also the most critical from an operational perspective.

I've retitled https://bugs.launchpad.net/juju/+bug/1827664 to target the fact that Juju agents spew logs on ENOSPC. A separate bug probably should be raised for (3).

Changed in juju:
milestone: 2.5.6 → 2.5.8
Changed in juju:
milestone: 2.5.8 → 2.5.9