2.5-rc1 controller unresponsive, fails to restart

Bug #1810313 reported by Paul Gear
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Tim Penhey
2.6
Fix Released
High
Tim Penhey

Bug Description

One of my 2.5-rc1 controllers, after successfully upgrading, has become unresponsive. It appears to be stuck in a restart loop, and is unresponsive to "juju models", "juju status", etc.

Revision history for this message
Paul Gear (paulgear) wrote :

Logs & introspection reports uploaded to https://private-fileshare.canonical.com/~paulgear/lp1810313/

Revision history for this message
Ian Booth (wallyworld) wrote :

An initial look at the logs indicates there's a whole world of hurt occurring with regards to network connectivity in general. After various restarts, the juju controller agent reports errors connecting to mongo, connecting to itself via the api etc. The engine report shows that most/all workers are reporting their connection is lost. The raft worker is likewise affected - it is reporting timeouts trying to process lease claims; these are not fatal (and it retries the request) and are more symptoms of an underlying problem than the root cause.

Given this is running on canonistack, with known issues with being oversubscribed, I wonder if there are any similarly sized 2.4 controllers having similar issues.

Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (7.2 KiB)

So Ian and I spent some time digging through the logs again this morning. I think we've worked out that the issue is likely because of a charm requesting a reboot, and things like raft errors are a red herring.

Specifically, these log lines:
2018-12-21 06:07:48 INFO juju.cmd supercommand.go:57 running jujud [2.5-rc1 gc go1.11.3]
2018-12-21 06:07:48 DEBUG juju.cmd supercommand.go:58 args: []string{"/var/lib/juju/tools/machine-0/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "0", "--debug"}
...
2018-12-21 06:07:50 INFO juju.worker.httpserver worker.go:258 listening on "[::]:17070"
2018-12-21 06:07:51 INFO juju.apiserver.connection request_notifier.go:96 agent login: machine-14 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:51 INFO juju.apiserver.common password.go:105 setting password for "machine-14u
2018-12-21 06:07:51 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-unattended-0 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:51 INFO juju.provider.openstack provider.go:162 opening model "ntp-lcy01"
2018-12-21 06:07:51 INFO juju.apiserver.common password.go:105 setting password for "unit-unattended-0"
...
2018-12-21 06:07:54 INFO juju.worker.apicaller connect.go:158 [bbb742] "machine-0" successfully connected to "localhost:17070"
2018-12-21 06:07:54 INFO juju.provider.openstack provider.go:162 opening model "ntp-lcy01"
2018-12-21 06:07:54 INFO juju.provider.openstack provider.go:162 opening model "canonistack1"
2018-12-21 06:07:54 WARNING juju.environs.config config.go:1570 unknown config field "rsyslog-ca-cert"
2018-12-21 06:07:54 WARNING juju.environs.config config.go:1570 unknown config field "rsyslog-ca-key"
2018-12-21 06:07:54 INFO juju.worker.stateconfigwatcher manifold.go:119 tomb dying
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-unattended-0 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:54 INFO juju.worker.apicaller connect.go:158 [1ec998] "machine-0" successfully connected to "localhost:17070"
...
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-unattended-0 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:54 INFO juju.worker.apicaller connect.go:158 [1ec998] "machine-0" successfully connected to "localhost:17070"
2018-12-21 06:07:54 INFO juju.provider.openstack provider.go:162 opening model "controller"
2018-12-21 06:07:54 INFO juju.worker.authenticationworker worker.go:103 "machine-0" key updater worker started
2018-12-21 06:07:54 INFO juju.worker.upgradeseries worker.go:164 no series upgrade lock present
2018-12-21 06:07:54 ERROR juju.worker.logger logger.go:63 connection is shut down
...
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-ubuntu-lite-0 for 1fc7e89f-9afa-4559-8d15-a03cc9835901
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-landscape-client-5 for 1ec99855-c41e-44c7-8e9c-b51e4cdcb3d3
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-ubuntu-0 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 ...

Read more...

Revision history for this message
Ian Booth (wallyworld) wrote :

Paul confirmed that manually deleting the doc from the "reboot" collection for the controller 0 machine cleared the reboot flag and stopped the reboot loop.

So the cause is a charm deployed to a controller machine requesting a reboot via juju-reboot. The changes to apiserver shutdown in 2.5 mean that the reboot worker never gets to act properly to clear the reboot flag because the api server is already shutdown; the result is an endless loop of apiserver restarts, followed by shutdown due to the reboot flag,

Changed in juju:
milestone: none → 2.5.1
status: New → Triaged
importance: Undecided → Critical
Tim Penhey (thumper)
Changed in juju:
assignee: nobody → Tim Penhey (thumper)
Revision history for this message
Tim Penhey (thumper) wrote :
Revision history for this message
Richard Harding (rharding) wrote :

Moving this down from critical as a workaround has been validated. Thanks for the PR thumper.

Changed in juju:
importance: Critical → High
Tim Penhey (thumper)
Changed in juju:
status: Triaged → In Progress
Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.