HA controllers still connect to each other during startup
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
High
|
Witold Krecicki | ||
2.3 |
Fix Released
|
High
|
Witold Krecicki |
Bug Description
So there was a patch from Witold in bug #1697956
However, that just makes us *prefer* localhost, but doesn't *require* localhost.
I'm trying to trace a different bug, but I think I found the ordering issue.
Namely, the api-caller portion of dependency engine is independent of the api-server portion.
Which means that we can start up the api-caller before we have our own api server working. And so even though we would prefer to connect to localhost, that isn't available yet, but instead of blocking/looping waiting for it to become available, we instead just connect to the other controller.
I was tracing an upgrade event, and I think these are the relevant lines:
2018-02-14 07:19:43 INFO juju.worker.
ERROR must restart: an agent upgrade is available
2018-02-14 07:19:43 DEBUG cmd supercommand.go:459 error stack:
must restart: an agent upgrade is available
2018-02-14 07:19:43 DEBUG juju.cmd.jujud main.go:187 jujud complete, code 0, err <nil>
2018-02-14 07:19:44 INFO juju.cmd supercommand.go:56 running jujud [2.3.3.1 gc go1.9.4]
We have just finally downloaded the new agent binaries and have restarted.
...
2018-02-14 07:19:44 DEBUG juju.worker runner.go:319 start "engine"
2018-02-14 07:19:44 INFO juju.worker runner.go:477 start "engine"
2018-02-14 07:19:44 DEBUG juju.worker.
2018-02-14 07:19:44 DEBUG juju.worker.
we start the dependency engine, but we're not yet ready to dial a controller.
...
2018-02-14 07:19:44 DEBUG juju.worker runner.go:328 "engine" started
2018-02-14 07:19:44 DEBUG juju.worker.
2018-02-14 07:19:44 DEBUG juju.worker.
engine is ready, agent is ready.
...
2018-02-14 07:19:44 DEBUG juju.worker.
not sure why
...
2018-02-14 07:19:44 DEBUG juju.worker.
2018-02-14 07:19:44 DEBUG juju.worker.
2018-02-14 07:19:44 DEBUG juju.worker.
2018-02-14 07:19:44 DEBUG juju.api apiclient.go:695 looked up localhost -> [127.0.0.1]
2018-02-14 07:19:44 DEBUG juju.api apiclient.go:843 successfully dialed "wss://
2018-02-14 07:19:44 INFO juju.api apiclient.go:597 connection established to "wss://
we still haven't started the upgrader, because api-caller isn't running yet, but we're starting api-caller now. Note that we successfully look for 127.0.0.1 but end up connecting to someone else.
It isn't until *much* later that we get:
...
2018-02-14 07:19:45 DEBUG juju.worker runner.go:328 "apiserver" started
2018-02-14 07:19:45 INFO juju.apiserver apiserver.go:440 listening on "[::]:17070"
(time wise, its just 1s, but in the log this is about 100 lines later).
In the mean time, we've even gotten:
2018-02-14 07:19:45 ERROR juju.worker.
2018-02-14 07:19:45 DEBUG juju.worker.
permission denied
github.
github.
github.
github.
github.
(I don't understand why we get "permission denied" instead of something like "leadership lease already taken")
Changed in juju: | |
assignee: | nobody → Witold Krecicki (wpk) |
Changed in juju: | |
status: | Fix Committed → Fix Released |
The trivial solution would be to only connect to localhost when being controller (technically - when StateServingInfo is present). This, however, ends up in a race condition:
1. Machine-0 is a single controller
2. We do enable-ha -n 2, machine-1 and machine-2 are started.
3. jujuds on machine-1 and machine-2 both connect to jujud on machine-0 (as they don't know yet they're controllers), they get the info that they're controllers.
4. machiner on machine-1 sets the state of the machine to 'Starting'
5. That causes controller on machine-0 to add mongo on machine-1 to replicaset
6. And also that causes update on APIHostPorts on all machines
7. Which causes apicaller on machine-2 to reload. Since now it knows it's a controller it'll try to connect to itself, but it can't since we're not yet listening as state is not running as local mongo is not yet in replicaset. We'll never start machiner to report that we're in Starting state for this to resolve.