juju suddenly gets crazy with txns.log

Bug #1692008 reported by Junien F
This bug affects 3 people

Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Hi,

So we have a situation with our production controllers where, every so often, the CPU load just skyrockets for no apparent reason. I _think_ (not entirely sure) that this is correlated with txns.log activity, as shown in https://paste.ubuntu.com/24604389/

juju just starts running a lot of queries like { find: "txns.log", sort: { $natural: -1 }, skip: 0, batchSize: 10 }, which result in COLLSCANs, and which I think could be the reason for the sudden load spike.
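
For illustration, here is a minimal Go sketch of what that operation corresponds to when issued through mgo, the MongoDB driver juju uses. This is not juju's code; the connection address and the "juju" database name are assumptions, and a real controller's mongod would also need TLS and credentials.

    package main

    import (
            "fmt"

            mgo "gopkg.in/mgo.v2"
            "gopkg.in/mgo.v2/bson"
    )

    func main() {
            // Placeholder address: a real controller's mongod uses TLS and authentication.
            session, err := mgo.Dial("localhost:37017")
            if err != nil {
                    panic(err)
            }
            defer session.Close()

            // Reverse natural-order scan of txns.log, 10 documents per batch: the same
            // shape as the operation in the paste above. A $natural sort has no index
            // to use, so mongod reports each of these as a COLLSCAN.
            iter := session.DB("juju").C("txns.log").
                    Find(nil).Sort("-$natural").Batch(10).Iter()

            var doc bson.M
            for iter.Next(&doc) {
                    fmt.Println(doc["_id"])
                    break // juju's watcher stops once it reaches entries it has already seen
            }
            if err := iter.Close(); err != nil {
                    panic(err)
            }
    }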

I had to restart the controllers to get them back to a happy state, which is why the numbers at the end of the paste are low.

Is there an explanation for this behaviour?

This is with juju 2.1.2

Thanks

Tags: canonical-is
Junien F (axino)
tags: added: canonical-is
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1692008] [NEW] juju suddenly gets crazy with txns.log

Lots of usage of txns.log probably indicates lots of State objects being
created. Do you have other logs that look like we have workers restarting
internally?

John
=:->

Changed in juju:
status: New → Incomplete
Revision history for this message
Jamon Camisso (jamon) wrote :

This is happening again; here's another list of txns.log queries: https://pastebin.canonical.com/188958/

In this case I don't see any restarts of jujud or mongo.

Junien F (axino)
Changed in juju:
status: Incomplete → New
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1692008] Re: juju suddenly gets crazy with txns.log

Seeing the activity on txns.log without any associated information about
what juju is doing is not particularly useful. I fully understand that it
is happening, but the "Incomplete" status is about needing something like
debug-log output to know what activity is going on inside juju at the time.

John
=:->

Changed in juju:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Jay Kuri (jk0ne) wrote :

This occurred today, same M.O. as before. A segment of the machine-log is here:

https://paste.ubuntu.com/24636116/

and

https://paste.ubuntu.com/24636208/ (UUIDs have been changed to protect the innocent)

Revision history for this message
Jay Kuri (jk0ne) wrote :

The above messages are repeated for many thousands of lines while this is going on.

Junien F (axino)
Changed in juju:
status: Incomplete → New
Revision history for this message
Junien F (axino) wrote :

We needed to restart the controllers today (because we needed to perform some DB surgery after a model got half-removed), and the load on the MongoDB primary just skyrocketed for ~45 min with no sign of going down (it is generally high after a controller restart like that one, but only for a few minutes).

MongoDB was spending most of its time in txns.log.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

A subsequent restart of the jujud-machine-1 service on the affected controller unit sorted this out rapidly.

Revision history for this message
John A Meinel (jameinel) wrote :

Do you have any controller logs that might indicate what it was doing
during that time? It's possible you would need to increase logging
verbosity first.
"Restarting after DB surgery" doesn't really give us a way to reproduce,
or tell us what was going on inside.

John
=:->

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :

That particular bug is full of the firewaller freaking out:
2017-05-31 12:14:00 ERROR juju.worker.firewaller firewaller.go:514 failed
to lookup "unit-telegraf-4", skipping port change
2017-05-31 12:14:00 ERROR juju.worker.firewaller firewaller.go:514 failed
to lookup "unit-telegraf-4", skipping port change
2017-05-31 12:14:00 ERROR juju.worker.firewaller firewaller.go:514 failed
to lookup "unit-telegraf-4", skipping port change
2017-05-31 12:14:00 ERROR juju.worker.firewaller firewaller.go:514 failed
to lookup "unit-telegraf-4", skipping port change

and things like:
2017-05-31 12:13:57 ERROR juju.worker.dependency engine.go:547 "firewaller"
manifold worker returned unexpected error: security groups matching
"juju-.*-a3397c39-c75d-4922-82a4-69a689b10996-1" not found

I also see these:
2017-05-29 19:59:43 ERROR juju.worker.dependency engine.go:547 "firewaller"
manifold worker returned unexpected error: panic resulted in: runtime
error: invalid memory address or nil pointer dereference
2017-05-29 19:59:43 ERROR juju.worker.dependency engine.go:547
"storage-provisioner" manifold worker returned unexpected error: panic
resulted in: runtime error: invalid memory address or nil pointer
dereference

2017-05-29 19:59:43 ERROR juju.worker.dependency engine.go:547
"compute-provisioner" manifold worker returned unexpected error: failed to
process updated machines: failed to ensure machine dead id:19, details:19:
machine 19 has attachments [volume-2]

2017-05-29 19:59:46 ERROR juju.worker.dependency engine.go:547
"storage-provisioner" manifold worker returned unexpected error: panic
resulted in: runtime error: invalid memory address or nil pointer
dereference

2017-05-29 19:59:54 ERROR juju.worker.dependency engine.go:547
"compute-provisioner" manifold worker returned unexpected error: failed to
process updated machines: failed to ensure machine dead id:14, details:14:
machine 14 has attachments [volume-0]

If I 'sort shared-machine-1.log | uniq -c | sort -n' I see:
  18866 ERROR juju.worker.dependency engine.go:547 "firewaller" manifold
worker returned unexpected error: panic resulted in: runtime error: invalid
memory address or nil pointer dereference
  26713 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher:
Closed explicitly
  26733 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-2", skipping port change
  28068 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-13", skipping port change
  28074 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-11", skipping port change
  32746 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-8", skipping port change
  32753 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-15", skipping port change
  36516 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-6", skipping port change
  36522 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-3", skipping port change
  36522 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-4", skipping port change
  41308 ERROR juju.worker.fi...

Revision history for this message
Junien F (axino) wrote :

The thing is, the controllers are always logging lines like this, even when the load is low.

Changed in juju:
status: New → Triaged
Revision history for this message
Junien F (axino) wrote :

This happened again, with mongo just spending all its time in txns.log.

I realized that the code doing this is actually in juju:
https://github.com/juju/juju/blob/2.1/state/watcher/watcher.go#L360

What could cause this code to be called 300 times in one minute? Running a single one of these takes ~15s, so I'll let you imagine what happens when you run 300 in parallel.
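
For scale, simple arithmetic from the numbers above: 300 calls per minute at ~15s each means roughly 300 × 15 / 60 ≈ 75 of these collection scans running concurrently at any given moment.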

Revision history for this message
John A Meinel (jameinel) wrote :

Each call takes 15s? That is iterating the log in reverse order, batching
them 10 at a time.
Theoretically we're only calling it to get items that we haven't seen
recently. It is possible that we are creating new State objects that
haven't synced from the database at all (which would mean reading the
entire thing). We're supposed to be using a StatePool that holds on to the
state objects.

I think we also already have a bug about "we should be sharing the core
Watcher object between State objects". We currently have 1 state object per
Model, which means we read the table N times for N models.
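
Purely as an illustration of that last point (this is not juju's actual StatePool; every name below is hypothetical): if each new per-model State object starts its own txns.log tailer, the collection gets scanned once per object, whereas a pool keyed by model UUID lets callers asking for the same model reuse one tailer. Sharing a single watcher across all models, as mentioned above, would cut the reads further.

    package pool

    import "sync"

    // txnLogTailer stands in for the component that reverse-scans txns.log
    // (as in the sketch earlier in this bug).
    type txnLogTailer struct {
            modelUUID string
            // would also hold the mgo session and the last-seen txn id
    }

    // StatePool is a hypothetical cache of per-model tailers: callers asking
    // for the same model share one tailer instead of spawning a fresh one
    // that has to re-read the whole log.
    type StatePool struct {
            mu      sync.Mutex
            tailers map[string]*txnLogTailer
    }

    func NewStatePool() *StatePool {
            return &StatePool{tailers: make(map[string]*txnLogTailer)}
    }

    func (p *StatePool) Get(modelUUID string) *txnLogTailer {
            p.mu.Lock()
            defer p.mu.Unlock()
            if t, ok := p.tailers[modelUUID]; ok {
                    return t // reuse: no extra txns.log scan
            }
            t := &txnLogTailer{modelUUID: modelUUID}
            p.tailers[modelUUID] = t
            return t
    }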

Revision history for this message
John A Meinel (jameinel) wrote :

BTW, our experience in other big deployments is that if you can set up
Telegraf and Grafana to monitor the controller it can be much easier to get
visibility as to what is actually going on during something like this.

I don't think we've yet published docs on the website for how to set it up
(some of it is new for 2.2), but it could be quite useful here.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

@axino:

1) Can you please tell us how many documents are in txns.log?

2) Can we see the profiling data where you can see the ~15s query times?

3) Are the controller and models "stable" when this happens? (i.e. nothing being deployed or removed)

4) How many models does this controller have?

5) It would be really useful to see the controller machine logs (all nodes) around this time. Preferably at DEBUG. Is this possible?

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've created a number of tickets related to unnecessary State creation. One or more of these may be related to this issue.

Bug 1698701
Bug 1698702
Bug 1698703

Revision history for this message
Junien F (axino) wrote :

So we're going to upgrade to 2.2 and I'll update this bug with all that if the problem persists.

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Revision history for this message
Junien F (axino) wrote :

For the record, we upgraded to 2.2.2 and it would appear that this specific issue is gone.

Changed in juju:
status: Expired → Fix Released
Revision history for this message
Tim Penhey (thumper) wrote :

Awesome.
