Juju agents spew logs on ENOSPC

Bug #1827664 reported by Laurent Sesquès
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Tim Penhey
2.5             Fix Released  Critical    Tim Penhey
2.6             Fix Released  Critical    Tim Penhey

Bug Description

When a unit has run out of space, it will spam its controllers with messages such as:
0db6f159-e42d-4a6e-8ece-206ecce03fca: unit-postgresql-0 2019-05-03 13:31:34 ERROR juju.worker.dependency engine.go:632 "metric-sender" manifold worker returned unexpected error: mkdir /var/lib/juju/agents/unit-postgresql-0/174896867: no space left on device
This leads to MongoDB on the controller using large amounts of disk space, and eventually running out if nothing is done.
I resolved such a situation by freeing space on the two units whose / was full, but the controllers shouldn't be vulnerable to this.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1827664] [NEW] Controller disk space running out due to units having no space left

Controllers do rotate their log files, and we prune the log database to keep it around 4GB in size. It is possible that the rate of messages is higher than the rate at which we prune them, but there is some protection from it. We also rate-limit log messages, though this may not be aggressive enough.
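
(Purely as an illustration of the kind of rate limiting meant here, not Juju's actual implementation: a sender-side limiter built on golang.org/x/time/rate, with made-up limit and burst values, that drops messages over budget instead of forwarding them.)

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Hypothetical budget: 1 message/second sustained, bursts of up to 10.
	limiter := rate.NewLimiter(rate.Limit(1), 10)

	send := func(msg string) {
		if !limiter.Allow() {
			// Over budget: drop the message rather than forward it
			// to the controller.
			return
		}
		fmt.Println("forwarding:", msg)
	}

	// Simulate a fast-failing worker spamming error messages.
	for i := 0; i < 50; i++ {
		send(fmt.Sprintf("manifold worker returned unexpected error %d", i))
		time.Sleep(50 * time.Millisecond)
	}
}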

Revision history for this message
Joel Sing (jsing) wrote : Re: Controller disk space running out due to units having no space left

If there is protection, it is not working - see https://bugs.launchpad.net/juju/+bug/1811147

Joel Sing (jsing)
summary: - Controller disk space running out due to units having no space left
+ Juju agents spew logs on ENOSPC
Revision history for this message
Joel Sing (jsing) wrote :

I've retitled this bug to specifically target the fact that Juju agents spew logs on ENOSPC. The unbounded growth of controller logs collections is targeted in https://bugs.launchpad.net/juju/+bug/1811147.

Revision history for this message
Laurent Sesquès (sajoupa) wrote :

Forgot to mention: this happened on a controller running 2.5.1.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1827664] Re: Juju agents spew logs on ENOSPC

@thumper
I know we did some work to have automatic exponential backoff when workers are bouncing in their loop. Is the issue that this only triggers during the Start of the worker, and not if it successfully started but then fails again quickly afterwards?
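
(For context, a minimal sketch, with made-up names and numbers rather than the real engine code, of how start-only backoff behaves: the delay grows with consecutive start failures but is reset as soon as a start succeeds, so a worker that starts fine and then dies immediately restarts at the minimum delay every time.)

package main

import (
	"fmt"
	"math"
	"time"
)

// restartDelay grows exponentially with consecutive *start* failures.
func restartDelay(startAttempts int, base time.Duration, factor float64) time.Duration {
	return time.Duration(float64(base) * math.Pow(factor, float64(startAttempts-1)))
}

func main() {
	const base = 3 * time.Second // hypothetical minimum delay
	const factor = 1.5           // hypothetical backoff factor

	startAttempts := 0
	for i := 0; i < 5; i++ {
		startAttempts++
		fmt.Printf("restart %d: wait %s\n", i+1, restartDelay(startAttempts, base, factor))
		// The start itself succeeds, so the counter is reset...
		startAttempts = 0
		// ...and when the worker dies a moment later, the next delay
		// is back at the minimum: the backoff never accumulates.
	}
}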

Revision history for this message
John A Meinel (jameinel) wrote :

(Or is it that our change didn't make it into 2.5.1, and they simply weren't running that code.)

Revision history for this message
Joel Sing (jsing) wrote :

This is a snippet of logs spewing from an agent:

https://pastebin.canonical.com/p/YQCddh3v4Y/

A large portion of them are from the metric-sender.

Revision history for this message
Joel Sing (jsing) wrote :

This is easy enough to reproduce - on an instance:

 sudo dd if=/dev/zero of=/var/fillmeup bs=1M

Then make a config change to ensure the agents are trying to do something, at which point logsink.log on the controller shows a stream of logs landing from the agents.

It is worth noting that on a 2.5.2 model, the rate of logs seems to be around five logs every three seconds - presumably this can be increased with more subordinates on a host.

The logs above are from a model that is still on 2.4.7, hence the restart backoff work may not be deployed there.

Revision history for this message
John A Meinel (jameinel) wrote :

So I set up an lxd container with limited disk space with:

$ lxc storage create small zfs size=2GB
$ lxc launch juju/bionic/amd64 biotest --storage=small
$ lxc exec biotest bash
$$ su -l ubuntu
$$ ssh-import-id jameinel
$$ exit

$ lxc list
$ juju add-machine ssh:ubuntu@${IP}
$ cd $GOPATH/src/github.com/juju/juju/acceptance_tests/repository/trusty
$ juju deploy ./dummy-source --to 0

$ lxc exec biotest bash
$$ su -l ubuntu
$$ dd if=/dev/urandom of=$HOME/fillmeup bs=1M

Note that I had to use /dev/urandom instead of /dev/zero because I was using ZFS, which (presumably thanks to compression) happily allowed me to create a 14GB file full of zeros on a 2GB partition.

Once the disk was full, I could see these failures on the controller:
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:47 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-source-0
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:47 INFO juju.agent.tools symlinks.go:40 was a symlink, now looking at /var/lib/juju/tools/2.5.5.1-bionic-amd64
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:47 ERROR juju.worker.dependency engine.go:636 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-dummy-source-0": creating juju run listener: mkdir /var/lib/juju/agents/unit-dummy-source-0/477599416: no space left on device
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:50 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-source-0
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:50 INFO juju.agent.tools symlinks.go:40 was a symlink, now looking at /var/lib/juju/tools/2.5.5.1-bionic-amd64
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:50 ERROR juju.worker.dependency engine.go:636 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-dummy-source-0": creating juju run listener: mkdir /var/lib/juju/agents/unit-dummy-source-0/879169719: no space left on device
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:52 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-source-0
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:52 INFO juju.agent.tools symlinks.go:40 was a symlink, now looking at /var/lib/juju/tools/2.5.5.1-bionic-amd64
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:52 ERROR juju.worker.dependency engine.go:636 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-dummy-source-0": creating juju run listener: mkdir /var/lib/juju/agents/unit-dummy-source-0/761994922: no space left on device
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:56 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-source-0
45ba77ba-73bc-45c2-827a-a3fffde9d2b1: unit-dummy-source-0 2019-05-09 12:10:56 INFO juju.agent.tools symlinks.go:40 was a symlink, now looking at /var/lib/juju/tools/2.5.5.1-bionic-amd64
45ba77ba-73bc-...

Revision history for this message
John A Meinel (jameinel) wrote :

I think I worked out why it is triggering every 3s. Specifically the delay line is:
 delay = time.Duration(float64(delay) * math.Pow(engine.config.BackoffFactor, float64(info.startAttempts-1)))

Note that it is tracking startAttempts; however, startAttempts resets to 0 whenever gotStarted ends up getting called:
func (engine *Engine) gotStarted(name string, worker worker.Worker, resourceLog []resourceAccess) {
	...
	default:
		// It's fine to use this worker; update info and copy back.
		engine.config.Logger.Debugf("%q manifold worker started", name)
		info.worker = worker
		info.starting = false
		info.startCount++
		// Reset the start attempts after a successful start.
		info.startAttempts = 0
	...

So we probably need to track failing workers slightly differently, treating a failure shortly after a successful start similarly to a failure at start.
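
A rough sketch of that idea, with made-up names and values rather than the actual engine or juju/worker code: keep growing the backoff whenever the worker failed within some window of being started, and only reset the counter once it has run for a while.

package main

import (
	"fmt"
	"math"
	"time"
)

// workerInfo is a stand-in for the engine's per-worker bookkeeping.
type workerInfo struct {
	startAttempts int
	startedAt     time.Time
}

// nextDelay treats a failure shortly after a successful start like a
// start failure: the attempt counter is only reset if the worker ran
// for at least resetAfter before dying.
func (info *workerInfo) nextDelay(base time.Duration, factor float64, resetAfter time.Duration) time.Duration {
	if !info.startedAt.IsZero() && time.Since(info.startedAt) >= resetAfter {
		info.startAttempts = 0
	}
	info.startAttempts++
	return time.Duration(float64(base) * math.Pow(factor, float64(info.startAttempts-1)))
}

func main() {
	info := &workerInfo{}
	// Simulate a worker that starts successfully but dies immediately,
	// five times in a row: the delay now keeps growing.
	for i := 0; i < 5; i++ {
		info.startedAt = time.Now() // worker "started"
		fmt.Printf("restart %d: wait %s\n", i+1, info.nextDelay(3*time.Second, 1.5, time.Minute))
	}
}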

Revision history for this message
John A Meinel (jameinel) wrote :

A potential fix in the Worker package:
https://github.com/juju/worker/pull/11

That turns any error into an exponential backoff if the worker has been running for less than BackoffResetTime.

Tim Penhey (thumper)
Changed in juju:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Tim Penhey (thumper)
milestone: none → 2.6.4
milestone: 2.6.4 → 2.7-beta1
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Landed in develop too as part of the bigger merge.

Changed in juju:
status: In Progress → Fix Committed
Tim Penhey (thumper)
Changed in juju:
status: Fix Committed → Fix Released