juju suddenly gets crazy with txns.log

Bug #1692008 reported by Junien F
This bug affects 3 people

Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Hi,

So we have a situation with our production controllers where, every so often, the CPU load just skyrockets for no apparent reason. I _think_ (not entirely sure) that this is correlated with txns.log activity, as shown in https://paste.ubuntu.com/24604389/

juju just starts running a lot of queries like { find: "txns.log", sort: { $natural: -1 }, skip: 0, batchSize: 10 }, which result in COLLSCANs, and which I think could be the reason for the sudden load spike.
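
For illustration, here is a minimal Go sketch of what that operation corresponds to when issued through mgo, the MongoDB driver juju uses. This is not juju's code; the connection address and the "juju" database name are assumptions, and a real controller's mongod would also need TLS and credentials.

    package main

    import (
            "fmt"

            mgo "gopkg.in/mgo.v2"
            "gopkg.in/mgo.v2/bson"
    )

    func main() {
            // Placeholder address: a real controller's mongod uses TLS and authentication.
            session, err := mgo.Dial("localhost:37017")
            if err != nil {
                    panic(err)
            }
            defer session.Close()

            // Reverse natural-order scan of txns.log, 10 documents per batch: the same
            // shape as the operation in the paste above. A $natural sort has no index
            // to use, so mongod reports each of these as a COLLSCAN.
            iter := session.DB("juju").C("txns.log").
                    Find(nil).Sort("-$natural").Batch(10).Iter()

            var doc bson.M
            for iter.Next(&doc) {
                    fmt.Println(doc["_id"])
                    break // juju's watcher stops once it reaches entries it has already seen
            }
            if err := iter.Close(); err != nil {
                    panic(err)
            }
    }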

I had to restart the controllers to get them back to a happy state, which is why the numbers at the end of the paste are low.

Is there an explanation for this behaviour?

This is with juju 2.1.2

Thanks

Tags: canonical-is
Junien F (axino)
tags: added: canonical-is
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1692008] [NEW] juju suddenly gets crazy with txns.log

Lots of usage of txns.log probably indicates lots of State objects being
created. Do you have other logs that look like we have workers restarting
internally?

John
=:->

Changed in juju:
status: New → Incomplete
Revision history for this message
Jamon Camisso (jamon) wrote :

This is happening again; here's another list of txns.log queries: https://pastebin.canonical.com/188958/

In this case I don't see any restarts of jujud or mongo.

Junien F (axino)
Changed in juju:
status: Incomplete → New
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1692008] Re: juju suddenly gets crazy with txns.log

Seeing the activity on txns.log without any associated information about
what juju is doing is not particularly useful. I fully understand that it
is happening, but the "Incomplete" status is about needing something like
debug-log output to know what activity is going on inside juju at the time.

John
=:->

Changed in juju:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Jay Kuri (jk0ne) wrote :

This occurred today, same M.O. as before. A segment of the machine-log is here:

https://paste.ubuntu.com/24636116/

and

https://paste.ubuntu.com/24636208/ (UUIDs have been changed to protect the innocent)

Revision history for this message
Jay Kuri (jk0ne) wrote :

The above messages are repeated for many thousands of lines while this is going on.

Junien F (axino)
Changed in juju:
status: Incomplete → New
Revision history for this message
Junien F (axino) wrote :

We needed to restart the controllers today (because we needed to perform some DB surgery after a model got half-removed), and the load on the MongoDB primary just skyrocketed for ~45 min with no sign of going down (it is generally high after a controller restart like that one, but only for a few minutes).

MongoDB was spending most of its time in txns.log.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

A subsequent restart of the jujud-machine-1 service on the affected controller unit sorted this out rapidly.

Revision history for this message
John A Meinel (jameinel) wrote :

Do you have any controller logs that might indicate what it was doing
during that time? It's possible you would need to increase logging
verbosity first.
"Restarting after DB surgery" doesn't really give us a way to reproduce,
or tell us what was going on inside.

John
=:->

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :

That particular bug is full of the firewaller freaking out:
2017-05-31 12:14:00 ERROR juju.worker.firewaller firewaller.go:514 failed
to lookup "unit-telegraf-4", skipping port change
2017-05-31 12:14:00 ERROR juju.worker.firewaller firewaller.go:514 failed
to lookup "unit-telegraf-4", skipping port change
2017-05-31 12:14:00 ERROR juju.worker.firewaller firewaller.go:514 failed
to lookup "unit-telegraf-4", skipping port change
2017-05-31 12:14:00 ERROR juju.worker.firewaller firewaller.go:514 failed
to lookup "unit-telegraf-4", skipping port change

and things like:
2017-05-31 12:13:57 ERROR juju.worker.dependency engine.go:547 "firewaller"
manifold worker returned unexpected error: security groups matching
"juju-.*-a3397c39-c75d-4922-82a4-69a689b10996-1" not found

I also see these:
2017-05-29 19:59:43 ERROR juju.worker.dependency engine.go:547 "firewaller"
manifold worker returned unexpected error: panic resulted in: runtime
error: invalid memory address or nil pointer dereference
2017-05-29 19:59:43 ERROR juju.worker.dependency engine.go:547
"storage-provisioner" manifold worker returned unexpected error: panic
resulted in: runtime error: invalid memory address or nil pointer
dereference

2017-05-29 19:59:43 ERROR juju.worker.dependency engine.go:547
"compute-provisioner" manifold worker returned unexpected error: failed to
process updated machines: failed to ensure machine dead id:19, details:19:
machine 19 has attachments [volume-2]

2017-05-29 19:59:46 ERROR juju.worker.dependency engine.go:547
"storage-provisioner" manifold worker returned unexpected error: panic
resulted in: runtime error: invalid memory address or nil pointer
dereference

2017-05-29 19:59:54 ERROR juju.worker.dependency engine.go:547
"compute-provisioner" manifold worker returned unexpected error: failed to
process updated machines: failed to ensure machine dead id:14, details:14:
machine 14 has attachments [volume-0]

If I 'sort shared-machine-1.log | uniq -c | sort -n' I see:
  18866 ERROR juju.worker.dependency engine.go:547 "firewaller" manifold
worker returned unexpected error: panic resulted in: runtime error: invalid
memory address or nil pointer dereference
  26713 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher:
Closed explicitly
  26733 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-2", skipping port change
  28068 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-13", skipping port change
  28074 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-11", skipping port change
  32746 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-8", skipping port change
  32753 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-15", skipping port change
  36516 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-6", skipping port change
  36522 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-3", skipping port change
  36522 ERROR juju.worker.firewaller firewaller.go:514 failed to lookup
"unit-telegraf-4", skipping port change
  41308 ERROR juju.worker.fi...

Revision history for this message
Junien F (axino) wrote :

The thing is, the controllers are always logging lines like this, even when the load is low.

Changed in juju:
status: New → Triaged
Revision history for this message
Junien F (axino) wrote :

This happened again, with mongo just spending all its time in txns.log.

I realized that the code doing this is actually in juju:
https://github.com/juju/juju/blob/2.1/state/watcher/watcher.go#L360

What could cause this code to be called 300 times in one minute? Running a single one of these takes ~15s, so I'll let you imagine what happens when you run 300 in parallel.
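
For scale, simple arithmetic from the numbers above: 300 calls per minute at ~15s each means roughly 300 × 15 / 60 ≈ 75 of these collection scans running concurrently at any given moment.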

Revision history for this message
John A Meinel (jameinel) wrote :

Each call takes 15s? That is iterating the log in reverse order, batching
them 10 at a time.
Theoretically we're only calling it to get items that we haven't seen
recently. It is possible that we are creating new State objects that
haven't synced from the database at all (which would mean reading the
entire thing). We're supposed to be using a StatePool that holds on to the
state objects.

I think we also already have a bug about "we should be sharing the core
Watcher object between State objects". We currently have 1 state object per
Model, which means we read the table N times for N models.
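
Purely as an illustration of that last point (this is not juju's actual StatePool; every name below is hypothetical): if each new per-model State object starts its own txns.log tailer, the collection gets scanned once per object, whereas a pool keyed by model UUID lets callers asking for the same model reuse one tailer. Sharing a single watcher across all models, as mentioned above, would cut the reads further.

    package pool

    import "sync"

    // txnLogTailer stands in for the component that reverse-scans txns.log
    // (as in the sketch earlier in this bug).
    type txnLogTailer struct {
            modelUUID string
            // would also hold the mgo session and the last-seen txn id
    }

    // StatePool is a hypothetical cache of per-model tailers: callers asking
    // for the same model share one tailer instead of spawning a fresh one
    // that has to re-read the whole log.
    type StatePool struct {
            mu      sync.Mutex
            tailers map[string]*txnLogTailer
    }

    func NewStatePool() *StatePool {
            return &StatePool{tailers: make(map[string]*txnLogTailer)}
    }

    func (p *StatePool) Get(modelUUID string) *txnLogTailer {
            p.mu.Lock()
            defer p.mu.Unlock()
            if t, ok := p.tailers[modelUUID]; ok {
                    return t // reuse: no extra txns.log scan
            }
            t := &txnLogTailer{modelUUID: modelUUID}
            p.tailers[modelUUID] = t
            return t
    }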

Revision history for this message
John A Meinel (jameinel) wrote :

BTW, our experience in other big deployments is that if you can set up
Telegraf and Grafana to monitor the controller it can be much easier to get
visibility as to what is actually going on during something like this.

I don't think we've yet published docs on the website for how to set it up
(some of it is new for 2.2), but it could be quite useful here.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

@axino:

1) Can you please tell us how many documents are in txns.log?

2) Can we see the profiling data where you can see the ~15s query times?

3) Are the controller and models "stable" when this happens? (i.e. nothing being deployed or removed)

4) How many models does this controller have?

5) It would be really useful to see the controller machine logs (all nodes) around this time. Preferably at DEBUG. Is this possible?

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've created a number of tickets related to unnecessary State creation. One or more of these may be related to this issue.

Bug 1698701
Bug 1698702
Bug 1698703

Revision history for this message
Junien F (axino) wrote :

So we're going to upgrade to 2.2 and I'll update this bug with all that if the problem persists.

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Revision history for this message
Junien F (axino) wrote :

For the record, we upgraded to 2.2.2 and it would appear that this specific issue is gone.

Changed in juju:
status: Expired → Fix Released
Revision history for this message
Tim Penhey (thumper) wrote :

Awesome.
