memory leak in agent log rotation

Bug #1865901 reported by Zachary Zehring
This bug affects 4 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Tim Penhey
Milestone: 2.7.5

Bug Description

We have experienced this issue many times: we have an application unit deployed on a KVM with ~4G of memory, along with several subordinate units. Over time, the jujud agent services increase their memory footprint, eventually growing large enough to set off our alerting (~95% memory usage). When this occurs, we need to ssh in and restart these services, which clears the alert and sometimes frees up more than 50% of the used memory.

Ideally, it would be a good quality-of-life improvement if jujud agents didn't just keep growing without any cap or threshold. Also, I've heard there have been efforts to consolidate unit agents into a single jujud unit agent service. I would imagine this would help a lot with the memory footprint as well and make things a bit easier to administer.

Revision history for this message
Zachary Zehring (zzehring) wrote :

I'm not sure which logs would be helpful for looking into this issue. Are there any pertinent logs that should be uploaded?

Revision history for this message
Richard Harding (rharding) wrote :

Some things to collect:

juju version
deployed applications
goroutines for the jujud (https://discourse.jujucharms.com/t/agent-introspection-juju-goroutines/118)

and any logs that indicate a failure getting logs out of the agent to the controller.

In theory this is some sort of memory leak, but I don't recall anything recent on the application end. If there is in fact a leak, I'm not sure combining the unit agents would fix it: it's not that the agents are duplicating memory so much as that something is growing and forcing you to restart to reset the leaky behaviour.
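
For reference, the first two items can be collected with the standard Juju CLI (a sketch; run against the affected model):

$ juju version
$ juju status --format=yaml > juju-status.yaml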

Revision history for this message
Tim Penhey (thumper) wrote : Re: [Bug 1865901] Re: Over time, jujud agents consume too much memory

We do need to know which versions of Juju are involved. We have definitely had leaks that we have fixed recently.

Revision history for this message
Tim Penhey (thumper) wrote : Re: Over time, jujud agents consume too much memory

If these are unit agents, then there are definitely some useful logs you could get us, particularly if you are seeing the same unit agents running out of memory.

All Juju agents have built-in access to get the current goroutines and heap profiles.

Memory leaks in Juju tend to be one of two things: goroutine leaks, or lost references to blocks of memory. The best way to find them is to look at goroutine and heap profiles over time.

To gather these, ssh into the machine and run the following, using unit-telegraf-0 as an example:

$ juju_goroutines unit-telegraf-0 > goroutines.log
$ juju_heap_profile unit-telegraf-0 > heap-profile.log

Ideally, include the date/time in the filenames so we know which sample is earlier, and take the samples at least an hour apart, as that should make any leak more obvious.
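
For example, a timestamped pair of samples could be captured like this (a sketch using the same unit-telegraf-0 example; repeat an hour or more later):

$ juju_goroutines unit-telegraf-0 > goroutines-$(date +%Y%m%dT%H%M).log
$ juju_heap_profile unit-telegraf-0 > heap-profile-$(date +%Y%m%dT%H%M).log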

Changed in juju:
status: New → Incomplete
Revision history for this message
James Hebden (ec0) wrote :

The environment is running Juju 2.7.4 on Bionic. I've taken a goroutine dump of the unit agent (landscape-server, in this case) in a shared screen session with thumper, and it looks like there may be a potential leak around lumberjack which should be looked into:

ubuntu@landscape-2:~$ juju_goroutines unit-landscape-server-4 >goroutines.log^C
ubuntu@landscape-2:~$ head goroutines.log
goroutine profile: total 6796
6463 @ 0x43159b 0x431643 0x408cee 0x408a1b 0xae7d38 0x45fa61
# 0xae7d37 github.com/juju/juju/vendor/gopkg.in/natefinch/lumberjack%2ev2.(*Logger).millRun+0x57 /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/natefinch/lumberjack.v2/lumberjack.go:379

49 @ 0x43159b 0x441606 0xa69301 0x45fa61
# 0xa69300 github.com/juju/juju/vendor/gopkg.in/juju/worker.v1/catacomb.(*Catacomb).add.func2+0xe0 /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/juju/worker.v1/catacomb/catacomb.go:182

38 @ 0x43159b 0x431643 0x408cee 0x4089db 0xa6038a 0xa6cacd 0xa691c7 0x45fa61
# 0xa60389 github.com/juju/juju/vendor/gopkg.in/tomb%2ev2.(*Tomb).Wait+0x49 /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/tomb.v2/tomb.go:116
# 0xa6cacc github.com/juju/juju/api/watcher.(*commonWatcher).Wait+0x2c /workspace/_build/src/github.com/juju/juju/api/watcher/watcher.go:138

Revision history for this message
Tim Penhey (thumper) wrote :

This definitely looks like the leak:

6463 @ 0x43159b 0x431643 0x408cee 0x408a1b 0xae7d38 0x45fa61
# 0xae7d37 github.com/juju/juju/vendor/gopkg.in/natefinch/lumberjack%2ev2.(*Logger).millRun+0x57 /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/natefinch/lumberjack.v2/lumberjack.go:379
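
For context, here is a minimal Go sketch of the pattern the trace suggests (an assumption about how the leak arises, not the actual Juju code path): each lumberjack.Logger starts a millRun goroutine on its first write, and Close does not stop it, so repeatedly creating loggers leaves goroutines parked in millRun.

package main

import (
	"fmt"
	"runtime"

	"gopkg.in/natefinch/lumberjack.v2"
)

func main() {
	for i := 0; i < 5; i++ {
		l := &lumberjack.Logger{Filename: "/tmp/leak-demo.log"}
		l.Write([]byte("hello\n")) // first Write lazily starts the millRun goroutine
		l.Close()                  // Close releases the file but does not stop millRun
		fmt.Println("goroutines:", runtime.NumGoroutine())
	}
}

If that assumption holds, each iteration reports one more goroutine than the last, which matches the thousands of goroutines accumulated in millRun in the dump above.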

Changed in juju:
status: Incomplete → Triaged
importance: Undecided → High
milestone: none → 2.8-beta1
tags: added: logging memory-leak
summary: - Over time, jujud agents consume too much memory
+ memory leak in agent log rotation
Revision history for this message
Haw Loeung (hloeung) wrote :
Tim Penhey (thumper)
Changed in juju:
milestone: 2.8-beta1 → 2.7.5
assignee: nobody → Tim Penhey (thumper)
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released