juju2 eating CPU, units in error

Bug #1635311 reported by Junien F
This bug affects 10 people
Affects         Status        Importance  Assigned to     Milestone
Canonical Juju  Fix Released  High        Andrew Wilkins
2.0             Fix Released  Critical    Tim Penhey
2.1             Fix Released  Critical    Andrew Wilkins

Bug Description

Hi,

As requested, opening a followup of LP#1587644

I installed an environment with juju 2.0 last Friday, and today jujud-machine-0 is going nuts, consuming between 150% and 200% CPU on a VM with 2 vCPUs.

The principal charm on the server (yes, we deployed charms on the controller) and a subordinate are in error, with a "resolver loop error" message, which could be related to #1604915.

I haven't restarted anything, so if you need more info, please do ask.

Thanks!

Logs are at https://private-fileshare.canonical.com/~axino/lp1587644/2/

Junien F (axino)
summary: - juju2 eating CPU, units in error
+ [azure] juju2 eating CPU, units in error
Junien F (axino)
tags: added: canonical-is
Revision history for this message
Junien F (axino) wrote : Re: [azure] juju2 eating CPU, units in error

I uploaded the machine-0 log and some other files, as requested out of band.

Note that the CPU usage is back to normal after a final spike (which also ate up nearly all the memory). I put some graphs at the URL above.

Thank you

Changed in juju:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Andrew Wilkins (axwalk)
Changed in juju:
status: Triaged → In Progress
assignee: Alexis Bruemmer (alexis-bruemmer) → Andrew Wilkins (axwalk)
Paul Gear (paulgear)
description: updated
Revision history for this message
Junien F (axino) wrote :

Removed the Azure tag: this is also happening on an AWS environment in us-east-2, which is not taking any traffic yet, so all units are mostly idle. I can provide pprof traces if needed.

This is extremely painful for us, because it causes units to be marked as lost or failed in juju status, and we alert on that.

Thanks

summary: - [azure] juju2 eating CPU, units in error
+ juju2 eating CPU, units in error
Revision history for this message
Uros Jovanovic (uros-jovanovic) wrote :

It seems that jujud goes haywire and starts issuing a large number of requests to mongo. Restarting mongo helped in this AWS case. This certainly needs more investigation, as it makes long-lived controllers unusable.

Note: I've had long-lived juju 2.0 controllers on GCE running just fine for weeks now.

Changed in juju:
importance: High → Critical
Revision history for this message
Junien F (axino) wrote :

I captured an occurrence with the logging level set to TRACE on the controller; I can share the logs if you need them.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

@axino please do

Revision history for this message
Junien F (axino) wrote :

In https://private-fileshare.canonical.com/~axino/lp1587644/3/ you'll find a tar file with:

  - pprof data I gathered when jujud went nuts on Oct 27 at around 09:30. TRACE mode wasn't enabled at that time.

  - machine-0.log that I just retrieved, containing DEBUG information for the event above, but also TRACE information for later events. You can find the approximate time of these events by looking at the mongodb_requests file, as the incidents are followed (or preceded?) by a burst of mongodb requests.

Available on IRC if you need more info :)

Thank you!

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Thanks, Junien. The log messages & mongodb_requests look like they might be pointing to the transaction pruning code. It's not clear to me whether the influx of queries is the problem, or just that they've triggered the pruner to run and the pruner is then causing the pain. I'll investigate further and let you know what I find.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Nothing definitive on the transaction pruning thing, but looks like it's a red herring so far.

The issue appears to have manifested in my environment, so I'm digging into it now. I won't have much time this weekend, but I'll leave it running and inspect further on Monday.

So far, it looks like the CPU usage is directly caused by Mongo chewing up all of the memory. kswapd is taking a lot of CPU, as well as jujud. Jujud is mostly unresponsive as a result, but I did manage to take both CPU and heap profiles. The profiles point at mgo connection setup taking the majority of both CPU and heap in the process, which suggests that the process is spinning trying to get connections. Most likely Mongo is unresponsive because it's swapping.
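
For reference, and purely as a generic illustration rather than a description of how jujud actually wires up its introspection (the listen address below is an arbitrary example), a Go process can serve the standard runtime profiles over HTTP via net/http/pprof; the CPU, heap and goroutine profiles discussed in this thread are the ones exposed by these endpoints:

    // Generic sketch: expose Go's built-in profiling endpoints over HTTP.
    // The listen address is an arbitrary example, not jujud's actual setup.
    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        // Profiles can then be fetched with, for example:
        //   go tool pprof http://localhost:6060/debug/pprof/profile   (30s CPU profile)
        //   go tool pprof http://localhost:6060/debug/pprof/heap      (heap profile)
        //   curl http://localhost:6060/debug/pprof/goroutine?debug=2  (full goroutine dump)
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }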

The increase in memory is highly correlated with an increase in the number of goroutines and the number of Mongo connections. My guess is that we have a bug in Juju that is causing goroutines and Mongo connections to be leaked under some error condition. Up until a point in time, those numbers are flat; then they just start growing until resource exhaustion.

My jujud is now so unresponsive that I cannot take a goroutine profile, so I've killed it with SIGQUIT to cause it to dump the state of all goroutines in the process. I don't have time to go through it in detail right now, but it looks a lot like there's a bug in the lease-manager restarting code in state. There are 9004 lines of "created by github.com/juju/juju/state/workers.(*RestartWorkers).maintain".
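
To make the suspected failure mode concrete, here is a minimal sketch (not the actual RestartWorkers code, only an illustration of the general shape of such a leak) of a restart loop that starts a replacement goroutine on every error without stopping the previous one, so goroutines, and any connections they hold, grow without bound:

    // Minimal illustrative sketch of the suspected class of bug (not the
    // actual Juju/RestartWorkers code): on every error a replacement worker
    // goroutine is started, but the previous one is never stopped, so
    // goroutines (and the connections they would hold) leak one per restart.
    package main

    import (
        "errors"
        "fmt"
        "runtime"
        "time"
    )

    type worker struct {
        done chan struct{} // in a real agent this would also own a Mongo session
    }

    func startWorker() *worker {
        w := &worker{done: make(chan struct{})}
        go func() {
            <-w.done // blocks forever unless stop() is called
        }()
        return w
    }

    func (w *worker) stop() { close(w.done) }

    func maintain(errs <-chan error) {
        w := startWorker()
        for range errs {
            // BUG: the old worker is replaced without being stopped, leaking
            // its goroutine. The fix is to call w.stop() before restarting.
            w = startWorker()
        }
        w.stop()
    }

    func main() {
        errs := make(chan error)
        go maintain(errs)
        for i := 0; i < 9000; i++ {
            errs <- errors.New("worker failed")
        }
        time.Sleep(100 * time.Millisecond)
        fmt.Println("goroutines:", runtime.NumGoroutine()) // ~9000, one per restart
    }

A SIGQUIT stack dump of a process leaking like this shows the same symptom as above: thousands of identical "created by" lines pointing at the function that spawned the leaked goroutines.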

Revision history for this message
Andrew Wilkins (axwalk) wrote :

I spent some time yesterday testing the logic we have in transaction pruning, and at that time I dismissed it as a non-issue. The code looked dumb, but it seemed that Mongo was coping with it just fine.

Today I did the test again on the Azure instance where the issue was reproduced, instead of on my local system where I tested first. There, Mongo does not like the method used at all. In the logs from Comment #6, there's a log message stating that ~9000 transactions need pruning. Using the method from Juju, removing 10000 docs took about 2 minutes in one test. I wrote some more optimised code which does the same thing ~instantly.

I'm not convinced that this is the root cause of all the problems, but it's almost certainly not helping.
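
To illustrate the kind of difference involved (this is only a sketch of the general idea, not the code from any Juju PR; the dial address, database/collection names and the "completed" selector are placeholders), removing candidate documents one at a time costs a round trip per document, whereas a single server-side RemoveAll does the whole job in one query:

    // Illustrative sketch only, not Juju's actual pruning code: contrast a
    // per-document removal loop with a single bulk RemoveAll. The dial
    // address, database/collection names and the "completed" selector are
    // placeholders.
    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // Placeholder selector for "transactions that are safe to prune".
    var completed = bson.M{"s": bson.M{"$gte": 5}}

    // Slow: fetch each candidate id and issue one Remove per document.
    // Each removal is a separate round trip, which is roughly why pruning
    // ~10000 docs can take minutes on a loaded controller.
    func pruneOneByOne(txns *mgo.Collection) error {
        var doc struct {
            ID bson.ObjectId `bson:"_id"`
        }
        iter := txns.Find(completed).Select(bson.M{"_id": 1}).Iter()
        for iter.Next(&doc) {
            if err := txns.RemoveId(doc.ID); err != nil {
                return err
            }
        }
        return iter.Close()
    }

    // Fast: let the server do the work with a single RemoveAll.
    func pruneBulk(txns *mgo.Collection) error {
        info, err := txns.RemoveAll(completed)
        if err != nil {
            return err
        }
        log.Printf("pruned %d transaction docs", info.Removed)
        return nil
    }

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := pruneBulk(session.DB("juju").C("txns")); err != nil {
            log.Fatal(err)
        }
        _ = pruneOneByOne // kept only for comparison
    }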

Revision history for this message
Anastasia (anastasia-macmood) wrote :

First in a series of PRs to improve this: https://github.com/juju/juju/pull/6515

Revision history for this message
Andrew Wilkins (axwalk) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → In Progress
Revision history for this message
Thomas Cuthbert (tcuthbert) wrote :

This popped up again on Azure eu east.

Here are the traces:

https://private-fileshare.canonical.com/~tcuthbert/lp1635311-2016-11-09.tar.gz

Revision history for this message
Chris Stratford (chris-gondolin) wrote :

This may be pure coincidence, but I'll mention it here just in case...

We've had another of our (now worryingly regular) controller-stops-talking episodes (aws us-east-2 this time). I did the usual fix (log into the controller, restart jujud-machine-0), except that I got the wrong machine (I logged into unit 0 of the juju environment, not the controller environment); even so, after that restart everything started working again.

It may have been luck and the controller just recovered all by itself, but I can't help wondering if the controller just needed a wake-up call which the restart provided somehow.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Chris, this is known, but thank you anyway. Restarting the controller will negate the effects of the bugs I have addressed. I've had a controller with the fixes running in Azure for ~10 days without any problems.

We're looking to release 2.0.2 ASAP (there are a couple of regressions holding things up), which will include the fixes.

Revision history for this message
Junien F (axino) wrote :

For the record, all our models got upgraded to 2.0.2. We'll let you know how it goes.

Revision history for this message
Junien F (axino) wrote :

No alerts so far \o/

Revision history for this message
Andrew Wilkins (axwalk) wrote :

@axino no news is good news?

Revision history for this message
Junien F (axino) wrote :

@axwalk indeed, this behaviour simply doesn't happen with 2.0.2 (except when we hit LP#1643795 heh).

Cheers!

Revision history for this message
John A Meinel (jameinel) wrote :

Can you link the actual fix to mgo/txn? The linked Juju GitHub pull requests don't seem to contain what actually fixed the CPU issues, just the 'integrate the fixes into Juju' work.

Revision history for this message
Anastasia (anastasia-macmood) wrote :
David Britton (dpb)
tags: added: landscape
Revision history for this message
Michael Nelson (michael.nelson) wrote :

I'm seeing this with juju 2.0.2. I don't have access to the controllers, but juju status stops responding. According to webops, jujud is at times consuming 748% CPU (?):

09:09 < Spads> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
09:09 < Spads> 15201 root 20 0 3644884 2.659g 35908 S 747.8 34.1 9:49.31 jujud

while other times the controller is swapping like crazy:

09:10 < Spads> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
09:10 < Spads> 63 root 20 0 0 0 0 R 95.2 0.0 49:13.48 kswapd0
09:10 < Spads> 6322 root 20 0 4574996 3.393g 0 S 27.2 43.5 23:15.13 mongod
09:10 < Spads> 15201 root 20 0 4314648 3.253g 0 S 9.9 41.7 12:45.91 jujud

The end result is that we can't interact at all with a deployment we're trying to set up. This issue was triggered twice today: once without any apparent cause, and once when we restarted the controller agent, as was recommended on IRC as a workaround for https://bugs.launchpad.net/juju/+bug/1643430

Changed in juju:
milestone: none → 2.1-rc1
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking as Won't Fix for the 2.0.x series, as we are not planning further 2.0.x releases.

Changed in juju:
milestone: 2.1-rc1 → 2.2.0-alpha1
importance: Critical → High
Revision history for this message
Anastasia (anastasia-macmood) wrote :

We think that this has been addressed in 2.1 as part of ongoing performance improvement work.
2.1.0 is expected to be released in the second half of February.
I am marking this as Incomplete. Feel free to re-open if you find otherwise.

Changed in juju:
status: In Progress → Incomplete
Revision history for this message
Jacek Nykis (jacekn) wrote :

Sorry, but I don't understand how this bug is "Incomplete". If you are testing a juju version that's meant to fix this problem, I'd say it's "In Progress". Updated.

Jacek Nykis (jacekn)
Changed in juju:
status: Incomplete → In Progress
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Jacek,
It's Incomplete because we would like you to try out 2.1, which includes performance improvements.

2.1-beta5 was released last week; as per the announcement, please follow these instructions to get it:

## How do I get it?

If you are running Ubuntu, you can get Juju from the juju devel ppa:

   sudo add-apt-repository ppa:juju/devel; sudo apt-get update

   sudo apt-get install juju

Or install Juju from the snap store:

   snap install juju --beta --devmode

Revision history for this message
Tom Haddon (mthaddon) wrote :

Hi Anastasia,

Since this is a beta version, is it something that we should be installing on a production environment with a number of production models attached to it? Up to this point we've typically been told by Juju development managers to track the stable version, so I'd like to confirm whether that approach should change.

Revision history for this message
John A Meinel (jameinel) wrote :

My thoughts:

1) It *would* be good feedback if you could try 2.1b5.
2) However, for production, you're right that we generally only track stable releases.
3) For reasons aside from this one, we're likely to need to guarantee upgrades from 2.1b5 to 2.1.0 final. However, there are likely to be rough edges around that sort of upgrade (switching to a devel channel, then back to a stable channel, etc.).
4) It's hard for us to unilaterally know whether we've fixed all the issues, as we don't have the same infrastructure running.
5) Ultimately it depends on whether the fixes that have gone into 2.1 outweigh the costs of going to a beta.

2.1b5 was released last Friday, and 2.1rc1 will likely be this Thursday or Friday, which puts 2.1 final out in another week or two.

We recently underwent staffing changes, which makes it quite hard for us to do things like backporting just these changes to a 2.0 release. If it truly is critical and you can't evaluate 2.1b5, it's obviously possible to task the team differently. But even if we did backport to 2.0.x, it would still likely take at least a week for it to be tested and properly released.

Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Thumper and I figured out today why this was happening in 2.0.2, as it caused an IS outage.

The fix is already in 2.1: https://github.com/juju/juju/pull/6789

We are considering a 2.0.3 release soon to get this fix into the 2.0 series.

Revision history for this message
Anastasia (anastasia-macmood) wrote :
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released