Juju controller needs to be restarted

Bug #1729712 reported by Thomi Richards
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Our production juju controller needed to be restarted today in order to fix an issue where `juju status` was taking in excess of one minute to respond, and almost all other juju commands were similarly slow.

What we found when logging into the controller was that jujud and mongod were each consuming about 50% of the available CPU time.

Restarting the controller fixed the issue, but that is clearly not a good long-term solution.
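
For reference, the check-and-restart on the controller machine looks roughly like the sketch below; the jujud-machine-0 and juju-db service names are an assumption based on a stock 2.x controller, so verify them first.

    # See which processes are using the CPU (in our case jujud and mongod).
    top -b -n 1 | head -n 20

    # Service names below are assumed; list what actually exists first.
    systemctl list-units 'juju*'

    # Restart the Juju agent and the controller's MongoDB.
    sudo systemctl restart juju-db jujud-machine-0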

Attached are the machine-0 logs from the controller for that day. Note that the incident occurred around 2017-11-02 23:35, and there's nothing particularly interesting in the controller log.

Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :
description: updated
Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :

Let me know what other information I can provide. Attached is syslog for the same period.

Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :

Forgot to mention: The controller was running juju version 2.2.5

Revision history for this message
Ian Booth (wallyworld) wrote :

Might be related to bug 1727973?

Revision history for this message
Christopher Lee (veebers) wrote :

I understand that it's common to need to restart this controller after an upgrade. Do you also find you need to restart it at other times?

Tim Penhey (thumper)
Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: controller-load restart
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1729712] Re: Juju controller needs to be restarted

A few things that are notable, nothing that seems like a smoking gun:

2017-11-01 20:39:11 WARNING juju.apiserver log.go:168 http: TLS handshake error from 91.189.90.53:43927: EOF

Is 91.189.90.53 one of the controller IP addresses? Is it remote monitoring?
Having EOF in the middle of doing a TLS handshake certainly sounds like
something that could cause a fair bit of load on us. (We have to do the
work to determine TLS settings, but then throw that all away.) That might
show up as high CPU load. However, it happens about 20 times over the 2 hrs
or so covered by the log, so it doesn't seem specifically relevant.
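
As a rough way to see how often those handshake failures were happening, counting them per hour over the machine-0 log works; the log path below is an assumption (point it at the attached log instead if you're working from the attachment).

    # Count TLS handshake errors per hour (first 13 chars = date + hour).
    grep 'TLS handshake error' /var/log/juju/machine-0.log \
        | cut -c1-13 | sort | uniq -c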

A lot of these messages:

2017-11-02 21:50:36 ERROR juju.state unit.go:339 cannot delete history for unit "u#snapdevicegw-r8eeaa63/4#charm": workload: read tcp 127.0.0.1:33376->127.0.0.1:37017: i/o timeout

Presumably a model or at least a large number of applications was removed,
causing us to try to delete all of the information for those units. But
while cleaning up their information, Mongo load got high enough that it
just rejected any new request.
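
To see how long that window of Mongo refusals lasted, bucketing the deletion timeouts by minute gives a quick picture (same assumed log path as above).

    # Bucket the unit-history deletion timeouts by minute
    # (first 16 chars of each line = date, hour and minute).
    grep 'cannot delete history for unit' /var/log/juju/machine-0.log \
        | grep 'i/o timeout' | cut -c1-16 | sort | uniq -c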

2017-11-02 22:14:26 ERROR juju.state allwatcher.go:399 getting a public address for unit "snapdevicegw-r9e7a9d9/0" failed: "unit snapdevicegw-r9e7a9d9/0 cannot get assigned machine: unit \"snapdevicegw-r9e7a9d9/0\" is not assigned to a machine"

I don't know that this is high load, but having units not assigned to
machines certainly means there was some sort of deployment failure
somewhere.
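
For what it's worth, a quick way to list any principal units that currently have no machine assignment is to query `juju status`; the jq path below assumes the usual JSON layout where each unit entry carries a `machine` field.

    # List principal units that report no assigned machine.
    juju status --format json \
        | jq -r '.applications[] | .units // {} | to_entries[]
                 | select(.value.machine == null) | .key'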

2017-11-02 22:43:12 INFO juju.cmd supercommand.go:63 running jujud [2.2.5 gc go1.8]

juju was restarted.

2017-11-02 22:43:15 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: getsockopt: connection refused

mongo was refusing connections. Was Mongo restarted at the same time as Juju?
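
One way to answer the restart question from the machine itself is to compare when the two services last became active, or to look at the journal around the incident; the jujud-machine-0 / juju-db service names are again an assumption.

    # When did the Juju agent and its MongoDB last (re)start?
    systemctl show -p ActiveEnterTimestamp jujud-machine-0 juju-db

    # Or look at the journal around the time in question.
    journalctl -u jujud-machine-0 -u juju-db \
        --since '2017-11-02 22:30' --until '2017-11-02 23:00'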

2017-11-02 23:02:11 WARNING juju.apiserver log.go:168 http: TLS handshake error from 91.189.90.53:42749: EOF

And yet we're still seeing TLS handshake failures.

Do we have any more syslog data? The syslog uploaded here only goes back to
Nov 2 23:22:04, but the point where things were failing is more around
21:50:36 which is much more likely to have interesting failures.
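
If the live syslog has already rotated past that point, the rotated copies may still cover it; something like this pulls the 21:00-22:59 window for Nov 2 out of whatever is left (stock Ubuntu log paths assumed).

    # Search current and rotated syslogs for Nov 2 between 21:00 and 22:59.
    # zgrep handles both plain and gzip-compressed files.
    zgrep -h 'Nov  2 2[12]:' /var/log/syslog* | sort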

John
=:->

Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :

Hi John,

Some answers inline, below:

On Fri, Nov 3, 2017 at 6:27 PM, John A Meinel <email address hidden>
wrote:

> A few things that are notable, nothing that seems like a smoking gun:
> 2017-11-01 20:39:11 WARNING juju.apiserver log.go:168 http: TLS handshake
> error from 91.189.90.53:43927: EOF
>
> Is 91.189.90.53 one of the controller IP addresses? Is it a remote
> monitoring?
>

91.189.90.53 is wendigo, where the juju client is being run. My assumption
(based on zero knowledge) is that this is an error in the connection
between the juju client and the juju controller, and port 43942 is the
ephemeral TCP port for the communication at the time.

> Having EOF in the middle of doing a TLS handshake certainly sounds like
> something that could cause a fair bit of load on us. (We have to do the
> work to determine TLS settings, but then throw that all away.) That might
> show as high CPU load. However, it happens about 20 times over the 2hrs or
> so from the log, so it doesn't seem specifically relevant.
>
> A lot of these messages:
> 2017-11-02 21:50:36 ERROR juju.state unit.go:339 cannot delete history for
> unit "u#snapdevicegw-r8eeaa63/4#charm": workload: read tcp
> 127.0.0.1:33376->127.0.0.1:37017: i/o timeout
>
> Presumably a model or at least a large number of applications was removed,
> causing us to try to delete all of the information for those units. But
> while cleaning up their information, Mongo load got high enough that it
> just rejected any new request.
>

Ahh, that could well be the cause. Before this started I ran 'juju
remove-application' on two or three applications ('snapdevicegw-r<hash>';
I can't remember exactly which, but it was certainly no more than three).
Each application has four units, and each unit has four subordinates
(canonical-livepatch, landscape, logstash-forwarder and telegraf), so all
up I'd probably asked juju to remove ~60 units in a short space of time.
That seems high, but surely it's still well within juju's capabilities?
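
For the record, the removal pattern described above amounts to something like the rough sketch below (application names are placeholders for the real snapdevicegw-r<hash> ones); waiting for each application to disappear from status before removing the next is one way to spread the cleanup load on the controller's Mongo.

    # 3 applications x 4 principal units x (1 principal + 4 subordinates)
    # = 60 units removed in total.
    for app in snapdevicegw-raaaaaaa snapdevicegw-rbbbbbbb snapdevicegw-rccccccc; do
        juju remove-application "$app"
        # Wait until the application is gone from status before the next removal.
        while juju status --format json \
                | jq -e --arg a "$app" '.applications[$a]' > /dev/null; do
            sleep 30
        done
    done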

>
> 2017-11-02 22:14:26 ERROR juju.state allwatcher.go:399 getting a public
> address for unit "snapdevicegw-r9e7a9d9/0" failed: "unit
> snapdevicegw-r9e7a9d9/0 cannot get assigned machine: unit
> \"snapdevicegw-r9e7a9d9/0\" is not assigned to a machine"
>
> I don't know that this is high load, but having units not assigned to
> machines certainly means there was some sort of deployment failure
> somewhere.
>

...except I suspect this is one of the units just deleted.

> 2017-11-02 22:43:12 INFO juju.cmd supercommand.go:63 running jujud [2.2.5
> gc go1.8]
> juju was restarted
> 2017-11-02 22:43:15 WARNING juju.mongo open.go:159 mongodb connection
> failed, will retry: dial tcp 127.0.0.1:37017: getsockopt: connection
> refused
> mongo was refusing connections. Was Mongo restarted at the same time as
> Juju?
>

Yes, MongoDB was restarted as well.

>
> 2017-11-02 23:02:11 WARNING juju.apiserver log.go:168 http: TLS handshake
> error from 91.189.90.53:42749: EOF
> And yet we're still seeing TLS handshake failures.
>
>
> Do we have any more syslog data? The syslog uploaded here only goes back to
> Nov 2 23:22:04, but the point where things were failing is more around
> 21:50:36 which is much more likely to have interesting failures.


Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot