HA controllers still connect to each other during startup

Bug #1749383 reported by John A Meinel on 2018-02-14
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
High
Witold Krecicki
2.3
High
Witold Krecicki

Bug Description

So there was a patch from Witold in bug #1697956

However, that just makes us *prefer* localhost, but doesn't *require* localhost.

I'm trying to trace a different bug, but I think I found the ordering issue.

Namely, the api-caller portion of dependency engine is independent of the api-server portion.
Which means that we can start up the api-caller before we have our own api server working. And so even though we would prefer to connect to localhost, that isn't available yet, but instead of blocking/looping waiting for it to become available, we instead just connect to the other controller.

I was tracing an upgrade event, and I think these are the relevant lines:

2018-02-14 07:19:43 INFO juju.worker.upgrader error.go:33 upgraded from 2.2.10.1-xenial-amd64 to 2.3.3.1-xenial-amd64 ("https://10.16.17.50:17070/model/05
ERROR must restart: an agent upgrade is available
2018-02-14 07:19:43 DEBUG cmd supercommand.go:459 error stack:
must restart: an agent upgrade is available
2018-02-14 07:19:43 DEBUG juju.cmd.jujud main.go:187 jujud complete, code 0, err <nil>
2018-02-14 07:19:44 INFO juju.cmd supercommand.go:56 running jujud [2.3.3.1 gc go1.9.4]

We have just finally downloaded the new agent binaries and have restarted.
...
2018-02-14 07:19:44 DEBUG juju.worker runner.go:319 start "engine"
2018-02-14 07:19:44 INFO juju.worker runner.go:477 start "engine"
2018-02-14 07:19:44 DEBUG juju.worker.dependency engine.go:504 "migration-inactive-flag" manifold worker stopped: "api-caller" not running: dependency not
2018-02-14 07:19:44 DEBUG juju.worker.dependency engine.go:504 "api-caller" manifold worker stopped: "agent" not running: dependency not available

we start the dependency engine, but we're not yet ready to dial a controller.
...
2018-02-14 07:19:44 DEBUG juju.worker runner.go:328 "engine" started
2018-02-14 07:19:44 DEBUG juju.worker.dependency engine.go:504 "machine-action-runner" manifold worker stopped: "migration-inactive-flag" not running: dep
2018-02-14 07:19:44 DEBUG juju.worker.dependency engine.go:486 "agent" manifold worker started

engine is ready, agent is ready.
...
2018-02-14 07:19:44 DEBUG juju.worker.dependency engine.go:504 "upgrade-steps-runner" manifold worker stopped: <nil>

not sure why
...
2018-02-14 07:19:44 DEBUG juju.worker.dependency engine.go:504 "upgrader" manifold worker stopped: "api-caller" not running: dependency not available
2018-02-14 07:19:44 DEBUG juju.worker.dependency engine.go:486 "central-hub" manifold worker started
2018-02-14 07:19:44 DEBUG juju.worker.apicaller connect.go:112 connecting with current password
2018-02-14 07:19:44 DEBUG juju.api apiclient.go:695 looked up localhost -> [127.0.0.1]
2018-02-14 07:19:44 DEBUG juju.api apiclient.go:843 successfully dialed "wss://10.16.17.202:17070/model/05780955-0629-4ae8-8bd7-912132a44ae4/api"
2018-02-14 07:19:44 INFO juju.api apiclient.go:597 connection established to "wss://10.16.17.202:17070/model/05780955-0629-4ae8-8bd7-912132a44ae4/api"

we still haven't started the upgrader, because api-caller isn't running yet, but we're starting api-caller now. Note that we successfully look for 127.0.0.1 but end up connecting to someone else.
It isn't until *much* later that we get:
...
2018-02-14 07:19:45 DEBUG juju.worker runner.go:328 "apiserver" started
2018-02-14 07:19:45 INFO juju.apiserver apiserver.go:440 listening on "[::]:17070"

(time wise, its just 1s, but in the log this is about 100 lines later).

In the mean time, we've even gotten:
2018-02-14 07:19:45 ERROR juju.worker.dependency engine.go:551 "is-primary-controller-flag" manifold worker returned unexpected error: permission denied
2018-02-14 07:19:45 DEBUG juju.worker.dependency engine.go:553 stack trace:
permission denied
github.com/juju/juju/api/singular/api.go:76:
github.com/juju/juju/worker/singular/flag.go:141:
github.com/juju/juju/worker/singular/flag.go:63:
github.com/juju/juju/worker/singular/shim.go:31:
github.com/juju/juju/worker/singular/manifold.go:53:

(I don't understand why we get "permission denied" instead of something like "leadership lease already taken")

Witold Krecicki (wpk) on 2018-02-14
Changed in juju:
assignee: nobody → Witold Krecicki (wpk)
Witold Krecicki (wpk) wrote :

The trivial solution would be to only connect to localhost when being controller (technically - when StateServingInfo is present). This, however, ends up in a race condition:
1. Machine-0 is a single controller
2. We do enable-ha -n 2, machine-1 and machine-2 are started.
3. jujuds on machine-1 and machine-2 both connect to jujud on machine-0 (as they don't know yet they're controllers), they get the info that they're controllers.
4. machiner on machine-1 sets the state of the machine to 'Starting'
5. That causes controller on machine-0 to add mongo on machine-1 to replicaset
6. And also that causes update on APIHostPorts on all machines
7. Which causes apicaller on machine-2 to reload. Since now it knows it's a controller it'll try to connect to itself, but it can't since we're not yet listening as state is not running as local mongo is not yet in replicaset. We'll never start machiner to report that we're in Starting state for this to resolve.

John A Meinel (jameinel) wrote :
Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers