Internal controller service restarts on 2.9

Bug #1969345 reported by Barry Price
This bug affects 1 person
Affects: Canonical Juju
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

We may be seeing this more often with 2.9.27, but I've seen it on previous 2.9 releases too, e.g. 2.9.25.

While the jujud process itself doesn't restart, something within it is stopping and restarting the network listener on port 17070 for long enough to trip our monitoring systems.
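
To illustrate what the monitoring sees (this is only a sketch with a placeholder address, not our actual check), even a trivial TCP probe of the API port catches these blips:

  # Placeholder controller address; poll the API port and log any failures.
  CONTROLLER=192.168.16.4
  while true; do
      nc -z -w 2 "$CONTROLLER" 17070 || echo "$(date -u '+%F %T') port 17070 not accepting connections"
      sleep 5
  done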

An example from today, in an Azure-hosted HA controller environment:

$ juju run --application ubuntu "grep 'running jujud\|httpserver worker' /var/log/juju/machine-?.log" | pp
= ubuntu/1 (rc = 0) =

== stdout ==
2022-04-18 04:31:59 INFO juju.worker.httpserver worker.go:315 listening on "[::]:17070"

= ubuntu/2 (rc = 0) =

== stdout ==
2022-04-18 04:25:31 INFO juju.worker.httpserver worker.go:315 listening on "[::]:17070"
2022-04-18 04:26:35 INFO juju.worker.httpserver worker.go:315 listening on "[::]:17070"

= ubuntu/0 (rc = 0) =

== stdout ==
2022-04-18 04:26:28 INFO juju.worker.httpserver worker.go:315 listening on "[::]:17070"
2022-04-18 04:26:33 INFO juju.worker.httpserver worker.go:315 listening on "[::]:17070"

$

Looking into the logs on machine 0 here, we see a lot of this in the seconds before the restart:

ERROR juju.apiserver apiserver.go:1017 error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 192.168.16.4:17070->192.168.0.7:42288: read: connection reset by peer

Followed by this:

2022-04-18 04:25:25 ERROR juju.worker.raft apply.go:147 Raft future error: timed out enqueuing operation
2022-04-18 04:25:25 ERROR juju.worker.dependency engine.go:693 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
2022-04-18 04:25:25 ERROR juju.worker.dependency engine.go:693 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
2022-04-18 04:25:25 ERROR juju.worker.dependency engine.go:693 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
2022-04-18 04:25:25 ERROR juju.worker.dependency engine.go:693 "is-primary-controller-flag" manifold worker returned unexpected error: lease manager stopped
2022-04-18 04:25:25 ERROR juju.worker.dependency engine.go:693 "lease-manager" manifold worker returned unexpected error: timed out enqueuing operation

And then a bunch of these from all the connected agents:

INFO juju.apiserver.connection request_notifier.go:124 agent disconnected

Then:

2022-04-18 04:25:25 INFO juju.cmd.jujud.runner runner.go:386 runner is dying
2022-04-18 04:25:25 INFO juju.cmd.jujud.runner runner.go:587 stopped "0-container-watcher", err: watcher has been stopped (stopped)
2022-04-18 04:25:25 ERROR juju.cmd.jujud.runner runner.go:459 fatal "0-container-watcher": watcher has been stopped (stopped)
2022-04-18 04:25:25 INFO juju.worker.logger logger.go:136 logger worker stopped

Then more connection errors along the lines of:

ERROR juju.worker.dependency engine.go:693 "api-caller" manifold worker returned unexpected error: [005266] "machine-0" cannot open api: unable to connect to API: apiserver shutdown in progress (Service Unavailable)

And finally the listener restarts:

2022-04-18 04:26:28 INFO juju.worker.httpserver worker.go:315 listening on "[::]:17070"

I can obviously supply the full logs and any further data as needed, but I think that's the gist of it.
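
For a rough sense of the window on machine 0: the runner reports dying at 04:25:25 and the listener only comes back at 04:26:28, so the API port was out for about a minute, which is comfortably long enough to trip a probe polling every few seconds. Quick sanity check of that gap from the timestamps above (GNU date):

  $ echo $(( $(date -u -d '2022-04-18 04:26:28' +%s) - $(date -u -d '2022-04-18 04:25:25' +%s) ))
  63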

Tags: canonical-is
Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Haw Loeung (hloeung) wrote :

Seeing this for jaas-azure-westus-002 twice now, at the same time yesterday and today.

I've collected machine agent logs; they're available via the usual place, juju-controller-reports...

Revision history for this message
John A Meinel (jameinel) wrote :

I think all of the things you are seeing (the TCP disconnects and the workers stopping) are a consequence of an earlier error that triggered the shutdown of the API server. It would be good to get logs from earlier to see what the triggering event was.
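
Something along these lines (the path and time window are only illustrative) would pull the lead-up from the machine 0 log; since the timestamps sort lexically, a plain string range works:

  $ awk '$1" "$2 >= "2022-04-18 04:20:00" && $1" "$2 <= "2022-04-18 04:25:25"' /var/log/juju/machine-0.log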

Revision history for this message
Haw Loeung (hloeung) wrote :

https://juju-controller-reports.admin.canonical.com/jaas-azure-westus-002/, as mentioned earlier, should have the requested logs. Otherwise I can make the previous rotated log files available too.

Revision history for this message
Haw Loeung (hloeung) wrote :

Same issue, different JAAS controller today, jaas-azure-centralus-001. I'm in the process of collecting logs; they should be available at https://juju-controller-reports.admin.canonical.com/jaas-azure-centralus-001/ shortly.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.9-next
John A Meinel (jameinel)
tags: added: canonical-is
Revision history for this message
Ian Booth (wallyworld) wrote :

This is likely fixed now. Please re-open if still an issue.

Changed in juju:
status: Triaged → Incomplete
assignee: Joseph Phillips (manadart) → nobody
milestone: 2.9-next → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Revision history for this message
Gheza (gluka) wrote :

Hello,

We also spotted this bug in 2.9.45.
