Canonical Juju

high load during startup, goes away when controllers get restarted

Bug #1727973 reported by Junien F on 2017-10-27

This bug affects 1 person

Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Tim Penhey   Canonical Juju 2.3.2
2.2             Fix Released  High        Tim Penhey   Canonical Juju 2.2.7

Bug Description

Hi,

Multiple times over the last few months, we've experienced issues with Juju controller startup.

The symptoms are that following a full restart of the controllers (HA with 3), there is a very high load on the mongodb primary (we're talking ~180 here). juju status is super slow, etc.

This load appears to remain high until we restart all the controllers and/or switch the mongodb primary.

We're going to track occurrences of this behaviour here until it is resolved.

We're currently running juju 2.2.4.

Thanks

Junien F (axino) on 2017-10-27

tags:

added: canonical-is

Revision history for this message

Junien F (axino) wrote on 2017-10-27:

Occurrence of 2017-10-26:

Startup (start of high load): around 12:30 UTC
Restart (end of high load): around 14:00 UTC

Following a full mgopurge to fix bug 1727679, the controllers were restarted. The load stayed at ~180 for 1h, after which I tried to kill -STOP / kill -CONT the jujuds, which just made them restart, in turn triggering a mongodb primary failover. The load returned to normal after this failover.

Observations:

* very high activity in txns.log (as expected)
* no Juju internal metrics, sadly. We need to understand why (timeout when trying to fetch them? unavailable? Something to investigate if the problem reappears)
* during the "high load" time:
  - mongodb open cursors stayed very high, at 500. We need to understand why.
  - mongodb repl_apply_batches was at 0. Once again, something that we need to understand.
  - disk throughput was fairly low
  - very high write lock acquire time (~1.5s) for about 10 min, then another 5 min "spike" at ~600 ms 10 min later.
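
For tracking future occurrences, the cursor and replication figures above could be sampled directly from MongoDB rather than from the graphs. The following is a minimal sketch, not the tooling used for this report. It assumes the figures map to the `metrics.cursor.open.total` and `metrics.repl.apply.batches.num` fields of a `serverStatus` document, which may not be exactly what the graphs plotted. The helper names `_dig` and `compare_samples` are hypothetical.

```python
def _dig(doc, path, default=0):
    """Walk a dotted path through nested dicts, returning default if absent."""
    cur = doc
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return default
        cur = cur[key]
    return cur


def compare_samples(prev, curr):
    """Summarise two serverStatus-style documents taken some time apart.

    Assumption: open cursors is a gauge (read from the later sample), while
    the replication batch count is a cumulative counter (so we take the
    difference between the two samples).
    """
    batches_path = "metrics.repl.apply.batches.num"
    return {
        "open_cursors": _dig(curr, "metrics.cursor.open.total"),
        "batches_applied": _dig(curr, batches_path) - _dig(prev, batches_path),
    }


if __name__ == "__main__":
    # Hand-written sample documents; with pymongo, a real one could be
    # obtained from client.admin.command("serverStatus").
    earlier = {"metrics": {"cursor": {"open": {"total": 480}},
                           "repl": {"apply": {"batches": {"num": 10}}}}}
    later = {"metrics": {"cursor": {"open": {"total": 500}},
                         "repl": {"apply": {"batches": {"num": 10}}}}}
    print(compare_samples(earlier, later))
```

Logging this summary at a fixed interval during a controller restart would give a record to attach to the next occurrence.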

Ian Booth (wallyworld) on 2017-11-02

Changed in juju:
milestone:	none → 2.3-beta3
importance:	Undecided → High
status:	New → Triaged

Tim Penhey (thumper) on 2017-11-03

Changed in juju:
assignee:	nobody → John A Meinel (jameinel)

Tim Penhey (thumper) on 2017-11-09

Changed in juju:
assignee:	John A Meinel (jameinel) → nobody
milestone:	2.3-beta3 → none

Tim Penhey (thumper) wrote on 2017-11-09:

This one is actually really hard to reproduce. Can I request that next time an upgrade is needed we can get a Juju dev to watch while it is happening?

Ideally we'd want to get mongotop running during the upgrade to see what is happening there.

John A Meinel (jameinel) wrote on 2017-11-09:

I do believe that this one showed unusually high activity in txns.log during the load time. And that you could see a significant difference in static load on the system once the controller was restarted one more time. (Something like a static 500 active cursors dropping down to 10-20 that only spiked up to around 100 or so.)

Tim Penhey (thumper) wrote on 2017-12-04:

I have a feeling that this is closely related to bug 1733708, in that each HA controller API server very inefficiently has watchers on the txns.log collection that iterate over the transactions once for every model.
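
To illustrate why this pattern gets expensive, here is a toy model in Python. It is not Juju's code and it does not describe the eventual fix. The function names and the shape of the log entries are hypothetical. It only counts how many log entries get read when every model has its own watcher scanning the whole log, compared with one scan per API server that routes entries by model.

```python
from collections import defaultdict


def per_model_watchers(log, models, api_servers):
    """Each API server runs one watcher per model; every watcher scans the
    whole log and keeps only the entries for its own model."""
    reads = 0
    delivered = defaultdict(list)
    for server in range(api_servers):
        for model in models:
            for entry in log:
                reads += 1
                if entry["model"] == model:
                    delivered[(server, model)].append(entry["id"])
    return reads, dict(delivered)


def shared_watcher(log, models, api_servers):
    """Each API server scans the log once and routes entries by model."""
    reads = 0
    delivered = defaultdict(list)
    wanted = set(models)
    for server in range(api_servers):
        for entry in log:
            reads += 1
            if entry["model"] in wanted:
                delivered[(server, entry["model"])].append(entry["id"])
    return reads, dict(delivered)


if __name__ == "__main__":
    models = ["m0", "m1", "m2", "m3"]
    log = [{"id": i, "model": models[i % len(models)]} for i in range(12)]
    print(per_model_watchers(log, models, api_servers=3)[0])
    print(shared_watcher(log, models, api_servers=3)[0])
```

Both functions deliver the same entries to the same (server, model) pairs, but the per-model variant reads the log once per model per API server, so its read count grows in proportion to the number of models.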

Tim Penhey (thumper) wrote on 2017-12-13:

How many models are in the controller?

Paul Collins (pjdc) wrote on 2017-12-13:

176

Tim Penhey (thumper) wrote on 2017-12-13:

Yep, this will be the same problem as the bug referenced above.

Changed in juju:
status:	Triaged → In Progress
assignee:	nobody → Tim Penhey (thumper)
milestone:	none → 2.3.2

Paul Gear (paulgear) wrote on 2018-01-02:

I believe this to still be happening on 2.2.8 when controllers are restarted after an mgopurge. I'll continue to monitor on our main controller cluster.

Paul Gear (paulgear) wrote on 2018-01-02:

I've just performed a full stop, mgopurge, and start on our 3-node 2.2.8 controller and was unable to replicate the issue.

Tim Penhey (thumper) on 2018-01-07

Changed in juju:
status:	In Progress → Fix Committed

Canonical Juju QA Bot (juju-qa-bot) on 2018-01-16

Changed in juju:
status:	Fix Committed → Fix Released
