Canonical Juju

Bug #1811147
Comment #3

Tim Penhey (thumper) wrote on 2019-05-06: Re: [Bug 1811147] Re: failed units can cause models' logs collections to grow excessively

Related to this, I think, is backing off a failed worker rather than just
restarting it every three seconds.

I feel like we should have some initial period during which worker
failures are backed off, perhaps something on the order of thirty seconds.
That is, if the worker fails within its first 30 seconds of running, then
instead of restarting it in 3 seconds we apply the exponential backoff
that we already have for workers that fail to start at all.

This would help reduce the log messages, as there would be one every
five minutes rather than one every three seconds.
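
A minimal sketch of the restart policy described above, written in Python purely for illustration (Juju itself is written in Go, and this is not its code). The function name `restart_delay` and all parameters are hypothetical. The 3 second base delay and 30 second threshold come from the comment; the doubling factor is an assumption, and the 300 second cap is an assumption chosen to match "one every five minutes".

```python
def restart_delay(ran_for, consecutive_failures,
                  base_delay=3.0, min_uptime=30.0,
                  max_delay=300.0, factor=2.0):
    """Return how many seconds to wait before restarting a failed worker.

    ran_for: seconds the worker ran before it failed.
    consecutive_failures: number of early failures immediately before
        this one (0 for the first early failure).
    """
    # A worker that survived past the initial period keeps the
    # existing behaviour: restart after the short fixed delay.
    if ran_for >= min_uptime:
        return base_delay
    # A worker that died early is backed off exponentially, capped
    # at max_delay (assumed cap, hypothetical parameter).
    delay = base_delay * factor ** consecutive_failures
    return min(delay, max_delay)
```

With these assumed values, a worker that keeps dying immediately would be restarted after 3, 6, 12, 24, ... seconds, settling at one attempt (and so roughly one log message) every 300 seconds. A caller would reset `consecutive_failures` to zero once the worker has run for longer than `min_uptime`.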

We could also look at compressing duplicate rows: if we see the same
error message as the last one, increment a counter instead of writing it
again, and emit the count when we get the first message that is different.
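
The duplicate-compression idea could look roughly like the following Python sketch. The class name `DedupLog`, its `emit` callback, and the wording of the summary line are all hypothetical; nothing here reflects an existing Juju API. It treats two messages as duplicates only when they are exactly equal, which is an assumption.

```python
class DedupLog:
    """Suppress consecutive identical messages and report a count instead."""

    def __init__(self, emit):
        self._emit = emit      # callable that actually writes one row
        self._last = None      # most recently written message
        self._repeats = 0      # suppressed repeats of self._last

    def write(self, message):
        if message == self._last:
            # Same as the last message: count it, write nothing.
            self._repeats += 1
            return
        # First message that differs: report the count, then write it.
        self.flush()
        self._emit(message)
        self._last = message

    def flush(self):
        """Emit the pending repeat count, if any."""
        if self._repeats:
            self._emit(f"last message repeated {self._repeats} times")
            self._repeats = 0
```

Usage, with a plain list standing in for the logs collection:

```python
rows = []
log = DedupLog(rows.append)
for msg in ["no space left on device"] * 3 + ["worker started"]:
    log.write(msg)
# rows now holds the first error, one summary row, and the new message.
```

Exact matching only collapses messages that are identical character for character, so errors whose text varies between attempts would still be written individually.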

This wouldn't help all situations, but would help many.

On 6/05/19 2:10 PM, Jamon Camisso wrote:
> Here are two example messages from an outage this weekend where units
> ran out of disk, and their agent logs filled the controller's disks and
> broke the HA cluster:
>
> juju:PRIMARY> db["logs.0db6f159-e42d-4a6e-8ece-206ecce03fca"].findOne()
> {
> "_id" : ObjectId("5ccb5e345f5ce8159000f8d3"),
> "t" : NumberLong("1556831796258846439"),
> "n" : "unit-nrpe-0",
> "r" : "2.4.7",
> "m" : "juju.worker.dependency",
> "l" : "engine.go:632",
> "v" : 5,
> "x" : "\"metric-sender\" manifold worker returned unexpected error: mkdir /var/lib/juju/agents/unit-nrpe-0/782349613: no space left on device"
> }
>
> juju:PRIMARY> db['logs.466165d9-6c80-4833-835f-8fee6e1f32d2'].findOne()
> {
> "_id" : ObjectId("5cca89ba5f5ce81590f2394c"),
> "t" : NumberLong("1556777402406471200"),
> "n" : "unit-charmstore-0",
> "r" : "2.4.7",
> "m" : "juju.worker.dependency",
> "l" : "engine.go:632",
> "v" : 5,
> "x" : "\"api-address-updater\" manifold worker returned unexpected error: error setting addresses: cannot write agent configuration: cannot write \"/var/lib/juju/agents/unit-charmstore-0/agent.conf\" contents: write /var/lib/juju/agents/unit-charmstore-0/agent.conf903797376: no space left on device"
> }
>
> It would be helpful if repeated 'no space left on device' messages
> increment a counter, or somehow don't repeatedly write the same/similar
> data until the state of the unit changes.
>
