Juju controller needs to be restarted

Bug #1729712 reported by Thomi Richards
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Our production juju controller needed to be restarted today in order to fix an issue where `juju status` was taking in excess of one minute to respond, and almost all other juju commands were similarly slow.

What we found when logging into the controller was that jujud and mongod were each consuming about 50% of the available CPU time.

Restarting the controller fixed the issue, but that is clearly not a good long-term solution.
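
For reference, the check-and-restart on the controller machine looks roughly like the sketch below; the jujud-machine-0 and juju-db service names are an assumption based on a stock 2.x controller, so verify them first.

    # See which processes are using the CPU (in our case jujud and mongod).
    top -b -n 1 | head -n 20

    # Service names below are assumed; list what actually exists first.
    systemctl list-units 'juju*'

    # Restart the Juju agent and the controller's MongoDB.
    sudo systemctl restart juju-db jujud-machine-0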

Attached are the machine-0 logs from the controller for that day. Note that the incident occurred around 2017-11-02 23:35, and there's nothing particularly interesting in the controller log.

Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :
description: updated
Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :

Let me know what other information I can provide. Attached is syslog for the same period.

Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :

Forgot to mention: The controller was running juju version 2.2.5

Revision history for this message
Ian Booth (wallyworld) wrote :

Might be related to bug 1727973?

Revision history for this message
Christopher Lee (veebers) wrote :

I understand that it's common to need to restart this controller after an upgrade. Do you also find you need to restart it at other times?

Tim Penhey (thumper)
Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: controller-load restart
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1729712] Re: Juju controller needs to be restarted

A few things that are notable, nothing that seems like a smoking gun:

2017-11-01 20:39:11 WARNING juju.apiserver log.go:168 http: TLS handshake error from 91.189.90.53:43927: EOF

Is 91.189.90.53 one of the controller IP addresses? Is it remote monitoring?
Having EOF in the middle of doing a TLS handshake certainly sounds like
something that could cause a fair bit of load on us. (We have to do the
work to determine TLS settings, but then throw that all away.) That might
show up as high CPU load. However, it happens about 20 times over the 2 hrs
or so covered by the log, so it doesn't seem specifically relevant.
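
As a rough way to see how often those handshake failures were happening, counting them per hour over the machine-0 log works; the log path below is an assumption (point it at the attached log instead if you're working from the attachment).

    # Count TLS handshake errors per hour (first 13 chars = date + hour).
    grep 'TLS handshake error' /var/log/juju/machine-0.log \
        | cut -c1-13 | sort | uniq -c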

A lot of these messages:

2017-11-02 21:50:36 ERROR juju.state unit.go:339 cannot delete history for unit "u#snapdevicegw-r8eeaa63/4#charm": workload: read tcp 127.0.0.1:33376->127.0.0.1:37017: i/o timeout

Presumably a model or at least a large number of applications was removed,
causing us to try to delete all of the information for those units. But
while cleaning up their information, Mongo load got high enough that it
just rejected any new request.
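
To see how long that window of Mongo refusals lasted, bucketing the deletion timeouts by minute gives a quick picture (same assumed log path as above).

    # Bucket the unit-history deletion timeouts by minute
    # (first 16 chars of each line = date, hour and minute).
    grep 'cannot delete history for unit' /var/log/juju/machine-0.log \
        | grep 'i/o timeout' | cut -c1-16 | sort | uniq -c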

2017-11-02 22:14:26 ERROR juju.state allwatcher.go:399 getting a public address for unit "snapdevicegw-r9e7a9d9/0" failed: "unit snapdevicegw-r9e7a9d9/0 cannot get assigned machine: unit \"snapdevicegw-r9e7a9d9/0\" is not assigned to a machine"

I don't know that this is high load, but having units not assigned to
machines certainly means there was some sort of deployment failure
somewhere.
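
For what it's worth, a quick way to list any principal units that currently have no machine assignment is to query `juju status`; the jq path below assumes the usual JSON layout where each unit entry carries a `machine` field.

    # List principal units that report no assigned machine.
    juju status --format json \
        | jq -r '.applications[] | .units // {} | to_entries[]
                 | select(.value.machine == null) | .key'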

2017-11-02 22:43:12 INFO juju.cmd supercommand.go:63 running jujud [2.2.5 gc go1.8]

juju was restarted.

2017-11-02 22:43:15 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: getsockopt: connection refused

mongo was refusing connections. Was Mongo restarted at the same time as Juju?
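
One way to answer the restart question from the machine itself is to compare when the two services last became active, or to look at the journal around the incident; the jujud-machine-0 / juju-db service names are again an assumption.

    # When did the Juju agent and its MongoDB last (re)start?
    systemctl show -p ActiveEnterTimestamp jujud-machine-0 juju-db

    # Or look at the journal around the time in question.
    journalctl -u jujud-machine-0 -u juju-db \
        --since '2017-11-02 22:30' --until '2017-11-02 23:00'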

2017-11-02 23:02:11 WARNING juju.apiserver log.go:168 http: TLS handshake error from 91.189.90.53:42749: EOF

And yet we're still seeing TLS handshake failures.

Do we have any more syslog data? The syslog uploaded here only goes back to
Nov 2 23:22:04, but the point where things were failing is more around
21:50:36 which is much more likely to have interesting failures.
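
If the live syslog has already rotated past that point, the rotated copies may still cover it; something like this pulls the 21:00-22:59 window for Nov 2 out of whatever is left (stock Ubuntu log paths assumed).

    # Search current and rotated syslogs for Nov 2 between 21:00 and 22:59.
    # zgrep handles both plain and gzip-compressed files.
    zgrep -h 'Nov  2 2[12]:' /var/log/syslog* | sort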

John
=:->

Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :

Hi John,

Some answers inline, below:

On Fri, Nov 3, 2017 at 6:27 PM, John A Meinel <email address hidden>
wrote:

> A few things that are notable, nothing that seems like a smoking gun:
> 2017-11-01 20:39:11 WARNING juju.apiserver log.go:168 http: TLS handshake
> error from 91.189.90.53:43927: EOF
>
> Is 91.189.90.53 one of the controller IP addresses? Is it a remote
> monitoring?
>

91.189.90.53 is wendigo, where the juju client is being run. My assumption
(based on zero knowledge) is that this is an error in the connection
between the juju client and the juju controller, and port 43942 is the
ephemeral TCP port for the communication at the time.

> Having EOF in the middle of doing a TLS handshake certainly sounds like
> something that could cause a fair bit of load on us. (We have to do the
> work to determine TLS settings, but then throw that all away.) That might
> show as high CPU load. However, it happens about 20 times over the 2hrs or
> so from the log, so it doesn't seem specifically relevant.
>
> A lot of these messages:
> 2017-11-02 21:50:36 ERROR juju.state unit.go:339 cannot delete history for
> unit "u#snapdevicegw-r8eeaa63/4#charm": workload: read tcp
> 127.0.0.1:33376->127.0.0.1:37017: i/o timeout
>
> Presumably a model or at least a large number of applications was removed,
> causing us to try to delete all of the information for those units. But
> while cleaning up their information, Mongo load got high enough that it
> just rejected any new request.
>

Ahh, that could well be the cause. Before this started I ran 'juju
remove-application' on two or three applications ('snapdevicegw-r<hash>';
I can't remember exactly which, but it was certainly no more than three).
Each application has four units, and each unit has four subordinates
(canonical-livepatch, landscape, logstash-forwarder and telegraf), so all
up I'd probably asked juju to remove ~60 units in a short space of time.
That seems high, but surely it's still well within juju's capabilities?
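
For the record, the removal pattern described above amounts to something like the rough sketch below (application names are placeholders for the real snapdevicegw-r<hash> ones); waiting for each application to disappear from status before removing the next is one way to spread the cleanup load on the controller's Mongo.

    # 3 applications x 4 principal units x (1 principal + 4 subordinates)
    # = 60 units removed in total.
    for app in snapdevicegw-raaaaaaa snapdevicegw-rbbbbbbb snapdevicegw-rccccccc; do
        juju remove-application "$app"
        # Wait until the application is gone from status before the next removal.
        while juju status --format json \
                | jq -e --arg a "$app" '.applications[$a]' > /dev/null; do
            sleep 30
        done
    done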

>
> 2017-11-02 22:14:26 ERROR juju.state allwatcher.go:399 getting a public
> address for unit "snapdevicegw-r9e7a9d9/0" failed: "unit
> snapdevicegw-r9e7a9d9/0 cannot get assigned machine: unit
> \"snapdevicegw-r9e7a9d9/0\" is not assigned to a machine"
>
> I don't know that this is high load, but having units not assigned to
> machines certainly means there was some sort of deployment failure
> somewhere.
>

...except I suspect this is one of the units just deleted.

> 2017-11-02 22:43:12 INFO juju.cmd supercommand.go:63 running jujud [2.2.5
> gc go1.8]
> juju was restarted
> 2017-11-02 22:43:15 WARNING juju.mongo open.go:159 mongodb connection
> failed, will retry: dial tcp 127.0.0.1:37017: getsockopt: connection
> refused
> mongo was refusing connections. Was Mongo restarted at the same time as
> Juju?
>

Yes, MongoDB was restarted as well.

>
> 2017-11-02 23:02:11 WARNING juju.apiserver log.go:168 http: TLS handshake
> error from 91.189.90.53:42749: EOF
> And yet we're still seeing TLS handshake failures.
>
>
> Do we have any more syslog data? The syslog uploaded here only goes back to
> Nov 2 23:22:04, but the point where things were failing is more around
> 21:50:36 which is much more likely to have interesting failures.


Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot