juju2 eating CPU, units in error

Bug #1635311 reported by Junien F
This bug affects 10 people
Affects         Status        Importance  Assigned to     Milestone
Canonical Juju  Fix Released  High        Andrew Wilkins
2.0             Fix Released  Critical    Tim Penhey
2.1             Fix Released  Critical    Andrew Wilkins

Bug Description

Hi,

As requested, opening a followup of LP#1587644

I installed an environment with juju 2.0 last Friday, and today jujud-machine-0 is going nuts, consuming between 150% and 200% CPU on a VM with 2 vCPUs.

The principal charm on the server (yes, we deployed charms on the controller) and a subordinate are in error, with a "resolver loop error" message, which could be related to #1604915.

I haven't restarted anything, so if you need more info, please do ask.

Thanks!

Logs are at https://private-fileshare.canonical.com/~axino/lp1587644/2/

Junien F (axino)
summary: - juju2 eating CPU, units in error
+ [azure] juju2 eating CPU, units in error
Junien F (axino)
tags: added: canonical-is
Revision history for this message
Junien F (axino) wrote : Re: [azure] juju2 eating CPU, units in error

I uploaded the machine-0 log and some other files, as requested out of band.

Note that the CPU usage is back to normal after a final spike (which also ate up nearly all the memory). I put some graphs at the URL above.

Thank you

Changed in juju:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Andrew Wilkins (axwalk)
Changed in juju:
status: Triaged → In Progress
assignee: Alexis Bruemmer (alexis-bruemmer) → Andrew Wilkins (axwalk)
Paul Gear (paulgear)
description: updated
Revision history for this message
Junien F (axino) wrote :

Removed the Azure tag: this is also happening on an AWS environment in us-east-2, which is not taking any traffic yet, so all units are mostly idle. I can provide pprof traces if needed.

This is extremely painful for us, because it causes units to be marked as lost or failed in juju status, and we alert on that.

Thanks

summary: - [azure] juju2 eating CPU, units in error
+ juju2 eating CPU, units in error
Revision history for this message
Uros Jovanovic (uros-jovanovic) wrote :

It seems that jujud goes haywire and starts issuing a large number of requests to mongo. Restarting mongo helped in this AWS case. This certainly needs more investigation, as it makes long-lived controllers unusable.

Note: I've had long-lived juju 2.0 controllers on GCE running just fine for weeks now.

Changed in juju:
importance: High → Critical
Revision history for this message
Junien F (axino) wrote :

I captured an occurrence with the logging level set to TRACE on the controller; I can share the logs if you need them.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

@axino please do

Revision history for this message
Junien F (axino) wrote :

In https://private-fileshare.canonical.com/~axino/lp1587644/3/ you'll find a tar file with:

  - pprof data I gathered when jujud went nuts on Oct 27 at around 09:30. TRACE mode wasn't enabled at that time.

  - machine-0.log that I just retrieved, containing DEBUG information for the event above, but also TRACE information for later events. You can find the approximate time of these events by looking at the mongodb_requests file, as the incidents are followed (or preceded?) by a burst of mongodb requests.

Available on IRC if you need more info :)

Thank you!

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Thanks, Junien. The log messages & mongodb_requests look like they might be pointing to the transaction pruning code. It's not clear to me whether the influx of queries is the problem, or just that they've triggered the pruner to run and the pruner is then causing the pain. I'll investigate further and let you know what I find.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Nothing definitive on the transaction pruning thing, but looks like it's a red herring so far.

The issue appears to have manifested in my environment, so I'm digging into it now. I won't have much time this weekend, but I'll leave it running and inspect further on Monday.

So far, it looks like the CPU usage is directly caused by Mongo chewing up all of the memory. kswapd is taking a lot of CPU, as well as jujud. Jujud is mostly unresponsive as a result, but I did manage to take both CPU and heap profiles. The profiles point at mgo connection setup taking the majority of both CPU and heap in the process, which suggests that the process is spinning trying to get connections. Most likely Mongo is unresponsive because it's swapping.
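
For reference, and purely as a generic illustration rather than a description of how jujud actually wires up its introspection (the listen address below is an arbitrary example), a Go process can serve the standard runtime profiles over HTTP via net/http/pprof; the CPU, heap and goroutine profiles discussed in this thread are the ones exposed by these endpoints:

    // Generic sketch: expose Go's built-in profiling endpoints over HTTP.
    // The listen address is an arbitrary example, not jujud's actual setup.
    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        // Profiles can then be fetched with, for example:
        //   go tool pprof http://localhost:6060/debug/pprof/profile   (30s CPU profile)
        //   go tool pprof http://localhost:6060/debug/pprof/heap      (heap profile)
        //   curl http://localhost:6060/debug/pprof/goroutine?debug=2  (full goroutine dump)
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }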

The increase in memory is highly correlated with an increase in the number of goroutines and the number of Mongo connections. My guess is that we have a bug in Juju that is causing goroutines and Mongo connections to be leaked under some error condition. Up until a point in time, those numbers are flat; then they just start growing until resource exhaustion.

My jujud is now so unresponsive that I cannot take a goroutine profile, so I've killed it with SIGQUIT to cause it to dump the state of all goroutines in the process. I don't have time to go through it in detail right now, but it looks a lot like there's a bug in the lease-manager restarting code in state. There are 9004 lines of "created by github.com/juju/juju/state/workers.(*RestartWorkers).maintain".
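
To make the suspected failure mode concrete, here is a minimal sketch (not the actual RestartWorkers code, only an illustration of the general shape of such a leak) of a restart loop that starts a replacement goroutine on every error without stopping the previous one, so goroutines, and any connections they hold, grow without bound:

    // Minimal illustrative sketch of the suspected class of bug (not the
    // actual Juju/RestartWorkers code): on every error a replacement worker
    // goroutine is started, but the previous one is never stopped, so
    // goroutines (and the connections they would hold) leak one per restart.
    package main

    import (
        "errors"
        "fmt"
        "runtime"
        "time"
    )

    type worker struct {
        done chan struct{} // in a real agent this would also own a Mongo session
    }

    func startWorker() *worker {
        w := &worker{done: make(chan struct{})}
        go func() {
            <-w.done // blocks forever unless stop() is called
        }()
        return w
    }

    func (w *worker) stop() { close(w.done) }

    func maintain(errs <-chan error) {
        w := startWorker()
        for range errs {
            // BUG: the old worker is replaced without being stopped, leaking
            // its goroutine. The fix is to call w.stop() before restarting.
            w = startWorker()
        }
        w.stop()
    }

    func main() {
        errs := make(chan error)
        go maintain(errs)
        for i := 0; i < 9000; i++ {
            errs <- errors.New("worker failed")
        }
        time.Sleep(100 * time.Millisecond)
        fmt.Println("goroutines:", runtime.NumGoroutine()) // ~9000, one per restart
    }

A SIGQUIT stack dump of a process leaking like this shows the same symptom as above: thousands of identical "created by" lines pointing at the function that spawned the leaked goroutines.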

Revision history for this message
Andrew Wilkins (axwalk) wrote :

I spent some time yesterday testing the logic we have in transaction pruning, and at that time I dismissed it as a non-issue. The code looked dumb, but it seemed that Mongo was coping with it just fine.

Today I did the test again on the Azure instance where the issue was reproduced, instead of on my local system where I tested first. There, Mongo does not like the method used at all. In the logs from Comment #6, there's a log message stating that ~9000 transactions need pruning. Using the method from Juju, removing 10000 docs took about 2 minutes in one test. I wrote some more optimised code which does the same thing ~instantly.

I'm not convinced that this is the root cause of all the problems, but it's almost certainly not helping.
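
To illustrate the kind of difference involved (this is only a sketch of the general idea, not the code from any Juju PR; the dial address, database/collection names and the "completed" selector are placeholders), removing candidate documents one at a time costs a round trip per document, whereas a single server-side RemoveAll does the whole job in one query:

    // Illustrative sketch only, not Juju's actual pruning code: contrast a
    // per-document removal loop with a single bulk RemoveAll. The dial
    // address, database/collection names and the "completed" selector are
    // placeholders.
    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // Placeholder selector for "transactions that are safe to prune".
    var completed = bson.M{"s": bson.M{"$gte": 5}}

    // Slow: fetch each candidate id and issue one Remove per document.
    // Each removal is a separate round trip, which is roughly why pruning
    // ~10000 docs can take minutes on a loaded controller.
    func pruneOneByOne(txns *mgo.Collection) error {
        var doc struct {
            ID bson.ObjectId `bson:"_id"`
        }
        iter := txns.Find(completed).Select(bson.M{"_id": 1}).Iter()
        for iter.Next(&doc) {
            if err := txns.RemoveId(doc.ID); err != nil {
                return err
            }
        }
        return iter.Close()
    }

    // Fast: let the server do the work with a single RemoveAll.
    func pruneBulk(txns *mgo.Collection) error {
        info, err := txns.RemoveAll(completed)
        if err != nil {
            return err
        }
        log.Printf("pruned %d transaction docs", info.Removed)
        return nil
    }

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := pruneBulk(session.DB("juju").C("txns")); err != nil {
            log.Fatal(err)
        }
        _ = pruneOneByOne // kept only for comparison
    }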

Revision history for this message
Anastasia (anastasia-macmood) wrote :

First in a series of PRs to improve this: https://github.com/juju/juju/pull/6515

Revision history for this message
Andrew Wilkins (axwalk) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → In Progress
Revision history for this message
Thomas Cuthbert (tcuthbert) wrote :

This popped up again on Azure eu east.

Here are the traces:

https://private-fileshare.canonical.com/~tcuthbert/lp1635311-2016-11-09.tar.gz

Revision history for this message
Chris Stratford (chris-gondolin) wrote :

This may be pure coincidence, but I'll mention it here just in case...

We've had another of our (now worryingly regular) controller-stops-talking episodes (aws us-east-2 this time). I did the usual fix (log into the controller, restart jujud-machine-0), except that I got the wrong machine (I logged into unit 0 of the juju environment, not the controller environment); even so, after that restart everything started working again.

It may have been luck and the controller just recovered all by itself, but I can't help wondering if the controller just needed a wake-up call which the restart provided somehow.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Chris, this is known, but thank you anyway. Restarting the controller will negate the effects of the bugs I have addressed. I've had a controller with the fixes running in Azure for ~10 days without any problems.

We're looking to release 2.0.2 ASAP (there are a couple of regressions holding things up), which will include the fixes.

Revision history for this message
Junien F (axino) wrote :

For the record, all our models got upgraded to 2.0.2. We'll let you know how it goes.

Revision history for this message
Junien F (axino) wrote :

No alerts so far \o/

Revision history for this message
Andrew Wilkins (axwalk) wrote :

@axino no news is good news?

Revision history for this message
Junien F (axino) wrote :

@axwalk indeed, this behaviour simply doesn't happen with 2.0.2 (except when we hit LP#1643795 heh).

Cheers!

Revision history for this message
John A Meinel (jameinel) wrote :

Can you link the actual fix to mgo/txn? The linked Juju GitHub pull requests don't seem to contain what actually fixed the CPU issues, just the 'integrate the fixes into Juju' work.

Revision history for this message
Anastasia (anastasia-macmood) wrote :
David Britton (dpb)
tags: added: landscape
Revision history for this message
Michael Nelson (michael.nelson) wrote :

I'm seeing this with juju 2.0.2. I don't have access to the controllers, but juju status stops responding. According to webops, jujud is at times consuming 748% CPU (?):

09:09 < Spads> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
09:09 < Spads> 15201 root 20 0 3644884 2.659g 35908 S 747.8 34.1 9:49.31 jujud

while other times the controller is swapping like crazy:

09:10 < Spads> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
09:10 < Spads> 63 root 20 0 0 0 0 R 95.2 0.0 49:13.48 kswapd0
09:10 < Spads> 6322 root 20 0 4574996 3.393g 0 S 27.2 43.5 23:15.13 mongod
09:10 < Spads> 15201 root 20 0 4314648 3.253g 0 S 9.9 41.7 12:45.91 jujud

The end result is that we can't interact at all with a deployment we're trying to set up. This issue was triggered twice today: once without any apparent cause, and once when we restarted the controller agent, as was recommended on IRC as a workaround for https://bugs.launchpad.net/juju/+bug/1643430

Changed in juju:
milestone: none → 2.1-rc1
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking as Won't Fix for the 2.0.x series, as we are not planning further 2.0.x releases.

Changed in juju:
milestone: 2.1-rc1 → 2.2.0-alpha1
importance: Critical → High
Revision history for this message
Anastasia (anastasia-macmood) wrote :

We think that this has been addressed in 2.1 as part of ongoing performance improvement work.
2.1.0 is expected to be released in the second half of February.
I am marking this as Incomplete. Feel free to re-open if you find otherwise.

Changed in juju:
status: In Progress → Incomplete
Revision history for this message
Jacek Nykis (jacekn) wrote :

Sorry, but I don't understand how this bug is "Incomplete". If you are testing a juju version that's meant to fix this problem, I'd say it's "In Progress". Updated.

Jacek Nykis (jacekn)
Changed in juju:
status: Incomplete → In Progress
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Jacek,
It's Incomplete because we would like you to try out 2.1, which includes performance improvements.

2.1-beta5 was released last week; as per the announcement, please follow these instructions to get it:

## How do I get it?

If you are running Ubuntu, you can get Juju from the juju devel ppa:

   sudo add-apt-repository ppa:juju/devel; sudo apt-get update

   sudo apt-get install juju

Or install Juju from the snap store:

   snap install juju --beta --devmode

Revision history for this message
Tom Haddon (mthaddon) wrote :

Hi Anastasia,

Since this is a beta version, is it something that we should be installing on a production environment with a number of production models attached to it? Up to this point we've typically been told by Juju development managers to track the stable version, so I'd like to confirm whether that approach should change.

Revision history for this message
John A Meinel (jameinel) wrote :

My thoughts:

1) It *would* be good feedback if you could try 2.1b5.
2) However, for production, you're right that we generally only track stable releases.
3) For reasons aside from this one, we're likely to need to guarantee upgrades from 2.1b5 to 2.1.0 final. However, there are likely to be rough edges around that sort of upgrade (switching to a devel channel, then back to a stable channel, etc.).
4) It's hard for us to unilaterally know whether we've fixed all the issues, as we don't have the same infrastructure running.
5) Ultimately it depends on whether the fixes that have gone into 2.1 outweigh the costs of going to a beta.

2.1b5 was released last Friday, and 2.1rc1 will likely be this Thursday or Friday, which puts 2.1 final out in another week or two.

We recently underwent staffing changes, which makes it quite hard for us to do things like backporting just these changes to a 2.0 release. If it truly is critical and you can't evaluate 2.1b5, it's obviously possible to task the team differently. But even if we did backport to 2.0.x, it would still likely take at least a week for it to be tested and properly released.

Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Thumper and I figured out today why this was happening in 2.0.2, as it caused an IS outage.

The fix is already in 2.1: https://github.com/juju/juju/pull/6789

We are considering a 2.0.3 release soon to get this fix into the 2.0 series.

Revision history for this message
Anastasia (anastasia-macmood) wrote :
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released