Juju agents spew logs on ENOSPC

Bug #1827664 reported by Laurent Sesquès
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Tim Penhey
2.5             Fix Released  Critical    Tim Penhey
2.6             Fix Released  Critical    Tim Penhey

Bug Description

When a unit has run out of space, it will spam its controllers with messages such as:
0db6f159-e42d-4a6e-8ece-206ecce03fca: unit-postgresql-0 2019-05-03 13:31:34 ERROR juju.worker.dependency engine.go:632 "metric-sender" manifold worker returned unexpected error: mkdir /var/lib/juju/agents/unit-postgresql-0/174896867: no space left on device
This leads to MongoDB on the controller using large amounts of disk space, and eventually running out if nothing is done.
I resolved such a situation by freeing space on the two units whose / was full, but the controllers shouldn't be vulnerable to this.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1827664] [NEW] Controller disk space running out due to units having no space left

Controllers do rotate their log files, and we prune the log database to keep it around 4GB in size. It is possible that the rate of messages is higher than the rate at which we prune them, but there is some protection from it. We also rate-limit log messages, though this may not be aggressive enough.
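
(Purely as an illustration of the kind of rate limiting meant here, not Juju's actual implementation: a sender-side limiter built on golang.org/x/time/rate, with made-up limit and burst values, that drops messages over budget instead of forwarding them.)

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Hypothetical budget: 1 message/second sustained, bursts of up to 10.
	limiter := rate.NewLimiter(rate.Limit(1), 10)

	send := func(msg string) {
		if !limiter.Allow() {
			// Over budget: drop the message rather than forward it
			// to the controller.
			return
		}
		fmt.Println("forwarding:", msg)
	}

	// Simulate a fast-failing worker spamming error messages.
	for i := 0; i < 50; i++ {
		send(fmt.Sprintf("manifold worker returned unexpected error %d", i))
		time.Sleep(50 * time.Millisecond)
	}
}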

Revision history for this message
Joel Sing (jsing) wrote : Re: Controller disk space running out due to units having no space left

If there is protection, it is not working - see https://bugs.launchpad.net/juju/+bug/1811147

Joel Sing (jsing)
summary: - Controller disk space running out due to units having no space left
+ Juju agents spew logs on ENOSPC
Revision history for this message
Joel Sing (jsing) wrote :

I've retitled this bug to specifically target the fact that Juju agents spew logs on ENOSPC. The unbounded growth of controller logs collections is targeted in https://bugs.launchpad.net/juju/+bug/1811147.

Revision history for this message
Laurent Sesquès (sajoupa) wrote :

Forgot to mention: this happened on a controller running 2.5.1.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1827664] Re: Juju agents spew logs on ENOSPC

@thumper
I know we did some work to have automatic exponential backoff when workers are bouncing in their loop. Is the issue that this only triggers during the Start of the worker, and not if it successfully started but then fails again quickly afterwards?
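
(For context, a minimal sketch, with made-up names and numbers rather than the real engine code, of how start-only backoff behaves: the delay grows with consecutive start failures but is reset as soon as a start succeeds, so a worker that starts fine and then dies immediately restarts at the minimum delay every time.)

package main

import (
	"fmt"
	"math"
	"time"
)

// restartDelay grows exponentially with consecutive *start* failures.
func restartDelay(startAttempts int, base time.Duration, factor float64) time.Duration {
	return time.Duration(float64(base) * math.Pow(factor, float64(startAttempts-1)))
}

func main() {
	const base = 3 * time.Second // hypothetical minimum delay
	const factor = 1.5           // hypothetical backoff factor

	startAttempts := 0
	for i := 0; i < 5; i++ {
		startAttempts++
		fmt.Printf("restart %d: wait %s\n", i+1, restartDelay(startAttempts, base, factor))
		// The start itself succeeds, so the counter is reset...
		startAttempts = 0
		// ...and when the worker dies a moment later, the next delay
		// is back at the minimum: the backoff never accumulates.
	}
}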

Revision history for this message
John A Meinel (jameinel) wrote :

(Or is it that our change didn't make it into 2.5.1, and they simply weren't running that code.)

Revision history for this message
Joel Sing (jsing) wrote :

This is a snippet of logs spewing from an agent:

https://pastebin.canonical.com/p/YQCddh3v4Y/

A large portion of them are from the metric-sender.

Revision history for this message
Joel Sing (jsing) wrote :

This is easy enough to reproduce - on an instance:

 sudo dd if=/dev/zero of=/var/fillmeup bs=1M

Then make a config change to ensure the agents are trying to do something, at which point logsink.log on the controller shows a stream of logs landing from the agents.

It is worth noting that on a 2.5.2 model, the rate of logs seems to be around five logs every three seconds - presumably this can be increased with more subordinates on a host.

The logs above are from a model that is still on 2.4.7, hence the restart backoff work may not be deployed there.

Revision history for this message
John A Meinel (jameinel) wrote :

So I set up an lxd container with limited disk space with:

$ lxc storage create small zfs size=2GB
$ lxc launch juju/bionic/amd64 biotest --storage=small
$ lxc exec biotest bash
$$ su -l ubuntu
$$ ssh-import-id jameinel
$$ exit

$ lxc list
$ juju add-machine ssh:ubuntu@${IP}
$ cd $GOPATH/src/github.com/juju/juju/acceptance_tests/repository/trusty
$ juju deploy ./dummy-source --to 0

$ lxc exec biotest bash
$$ su -l ubuntu
$$ dd if=/dev/urandom of=$HOME/fillmeup bs=1M

Note that I had to use /dev/urandom instead of /dev/zero because I was using ZFS, which (presumably thanks to compression) happily allowed me to create a 14GB file full of zeros on a 2GB partition.

Once the disk was full, I could see these failures on the controller:
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:47 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-source-0
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:47 INFO juju.agent.tools symlinks.go:40 was a symlink, now looking at /var/lib/juju/tools/2.5.5.1-bionic-amd64
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:47 ERROR juju.worker.dependency engine.go:636 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-dummy-source-0": creating juju run listener: mkdir /var/lib/juju/agents/unit-dummy-source-0/477599416: no space left on device
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:50 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-source-0
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:50 INFO juju.agent.tools symlinks.go:40 was a symlink, now looking at /var/lib/juju/tools/2.5.5.1-bionic-amd64
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:50 ERROR juju.worker.dependency engine.go:636 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-dummy-source-0": creating juju run listener: mkdir /var/lib/juju/agents/unit-dummy-source-0/879169719: no space left on device
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:52 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-source-0
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:52 INFO juju.agent.tools symlinks.go:40 was a symlink, now looking at /var/lib/juju/tools/2.5.5.1-bionic-amd64
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:52 ERROR juju.worker.dependency engine.go:636 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-dummy-source-0": creating juju run listener: mkdir /var/lib/juju/agents/unit-dummy-source-0/761994922: no space left on device
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:56 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-source-0
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:56 INFO juju.agent.tools symlinks.go:40 was a symlink, now looking at /var/lib/juju/tools/2.5.5.1-bionic-amd64
45ba77ba-73bc-...

Revision history for this message
John A Meinel (jameinel) wrote :

I think I worked out why it is triggering every 3s. Specifically the delay line is:
 delay = time.Duration(float64(delay) * math.Pow(engine.config.BackoffFactor, float64(info.startAttempts-1)))

Note that it is tracking startAttempts; however, startAttempts resets to 0 whenever gotStarted ends up getting called:
func (engine *Engine) gotStarted(name string, worker worker.Worker, resourceLog []resourceAccess) {
	...
	default:
		// It's fine to use this worker; update info and copy back.
		engine.config.Logger.Debugf("%q manifold worker started", name)
		info.worker = worker
		info.starting = false
		info.startCount++
		// Reset the start attempts after a successful start.
		info.startAttempts = 0
	...

So we probably need to track failing workers slightly differently, treating a failure shortly after a successful start similarly to a failure at start.
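
A rough sketch of that idea, with made-up names and values rather than the actual engine or juju/worker code: keep growing the backoff whenever the worker failed within some window of being started, and only reset the counter once it has run for a while.

package main

import (
	"fmt"
	"math"
	"time"
)

// workerInfo is a stand-in for the engine's per-worker bookkeeping.
type workerInfo struct {
	startAttempts int
	startedAt     time.Time
}

// nextDelay treats a failure shortly after a successful start like a
// start failure: the attempt counter is only reset if the worker ran
// for at least resetAfter before dying.
func (info *workerInfo) nextDelay(base time.Duration, factor float64, resetAfter time.Duration) time.Duration {
	if !info.startedAt.IsZero() && time.Since(info.startedAt) >= resetAfter {
		info.startAttempts = 0
	}
	info.startAttempts++
	return time.Duration(float64(base) * math.Pow(factor, float64(info.startAttempts-1)))
}

func main() {
	info := &workerInfo{}
	// Simulate a worker that starts successfully but dies immediately,
	// five times in a row: the delay now keeps growing.
	for i := 0; i < 5; i++ {
		info.startedAt = time.Now() // worker "started"
		fmt.Printf("restart %d: wait %s\n", i+1, info.nextDelay(3*time.Second, 1.5, time.Minute))
	}
}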

Revision history for this message
John A Meinel (jameinel) wrote :

A potential fix in the Worker package:
https://github.com/juju/worker/pull/11

That turns any error into an exponential backoff if the worker has been running for less than BackoffResetTime.

Tim Penhey (thumper)
Changed in juju:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Tim Penhey (thumper)
milestone: none → 2.6.4
milestone: 2.6.4 → 2.7-beta1
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Landed in develop too as part of the bigger merge.

Changed in juju:
status: In Progress → Fix Committed
Tim Penhey (thumper)
Changed in juju:
status: Fix Committed → Fix Released