2.9rc2: juju enable-ha times out with new nodes failing to connect to juju api
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Ian Booth |
2.8 | Fix Released | High | Ian Booth |
Bug Description
On 2.9-rc2, enable-ha failed, with the two secondary nodes failing to become ready. They installed and booted quickly, but the machine log repeats this:
/home/jenkins/
2020-11-09 15:59:18 INFO juju.state.
2020-11-09 15:59:18 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.
2020-11-09 15:59:19 ERROR juju.worker.
2020-11-09 15:59:33 DEBUG juju.mongo open.go:174 dialled mongodb server at "253.14.0.1:37017"
2020-11-09 15:59:48 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.
2020-11-09 15:59:49 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.
2020-11-09 16:00:18 INFO juju.state.
2020-11-09 16:00:18 DEBUG juju.mongo open.go:174 dialled mongodb server at "127.0.0.1:37017"
2020-11-09 16:00:19 ERROR juju.worker.
2020-11-09 16:00:33 DEBUG juju.mongo open.go:174 dialled mongodb server at "253.14.0.1:37017"
2020-11-09 16:00:48 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.
2020-11-09 16:01:03 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.
2020-11-09 16:01:08 DEBUG juju.worker.
2020-11-09 16:01:08 DEBUG juju.api apiclient.go:760 looked up localhost -> [127.0.0.1]
2020-11-09 16:01:08 DEBUG juju.worker.
2020-11-09 16:01:08 DEBUG juju.worker.
2020-11-09 16:01:08 ERROR juju.worker.
2020-11-09 16:01:08 DEBUG juju.worker.
dial tcp 127.0.0.1:17070: connect: connection refused
2020-11-09 16:01:18 INFO juju.state.
2020-11-09 16:01:19 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.
2020-11-09 16:01:19 ERROR juju.worker.
2020-11-09 16:01:33 DEBUG juju.mongo open.go:174 dialled mongodb server at "253.14.0.1:37017"
2020-11-09 16:01:48 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.
2020-11-09 16:01:49 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.
Example run: https:/
description: updated
Changed in juju:
milestone: none → 2.9-rc3
status: New → Triaged
importance: Undecided → High
Changed in juju:
milestone: 2.8.7 → 2.9-rc3
Changed in juju:
status: Fix Committed → Fix Released
Currently, when a controller gathers candidate IP addresses for dialling mongo, it collects the list of all known addresses available on the host (localhost, cloud-local, fan, public, etc.). It passes these to the upstream mgo client library, which tries them all and uses whichever one first establishes a TCP connection. In many cases this works, but I think it's flawed.
When the controller agent on the first new HA controller starts after installing mongo, it successfully dials on two addresses:
2020-11-09 15:44:17 DEBUG juju.cmd.jujud machine.go:1187 mongodb service is installed
2020-11-09 15:44:17 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
2020-11-09 15:44:17 DEBUG juju.mongo open.go:174 dialled mongodb server at "253.14.0.1:37017"
2020-11-09 15:44:17 DEBUG juju.mongo open.go:174 dialled mongodb server at "172.16.0.14:37017"
2020-11-09 15:44:17 INFO juju.state open.go:144 using client-side transactions
The correct address to use is the cloud-local one, 172.16.0.14, not the fan one, 253.14.0.1.
Subsequently, the logs indicate that mongo was dialled on 127.0.0.1 and 253.14.0.1; the correct cloud-local address was never used again. Even though dialling 127.0.0.1 etc. succeeded, trying to use that session failed. There are also attempts to dial 10.244.32.182, which is also incorrect.
The crashdump artefacts don't include this info, but I think the mongo replica set will have been configured with the various 172 addresses, i.e. 172.16.0.14, 172.16.0.175, etc. These are what should be used to dial mongo on each controller, rather than throwing a bunch of addresses at the wall to see what sticks.
On the primary controller, which maintains knowledge of the mongo replica-set configuration, the overall set of addresses is filtered to include only cloud-local ones when setting up the replica set. I would argue we should use this selection filter universally whenever a given controller wants to dial its local mongo instance.
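A minimal sketch of the kind of selection filter proposed above, using the addresses from the log excerpt. This is not Juju's actual filter: `cloudLocalOnly` is a hypothetical helper, and treating 253.0.0.0/8 as the fan range is an assumption based on the 253.14.0.1 address seen in the logs (fan overlay ranges are deployment-specific). Note that a real implementation would also need to distinguish the machine's own cloud-local subnet from other private ranges (the 10.244.x attempts were also RFC 1918 but still wrong).

```go
package main

import (
	"fmt"
	"net"
)

// Assumed fan overlay range, inferred from the 253.14.0.1 address in the logs.
var fanNet = &net.IPNet{IP: net.IPv4(253, 0, 0, 0), Mask: net.CIDRMask(8, 32)}

// cloudLocalOnly keeps private (RFC 1918) unicast addresses and drops
// loopback, fan-overlay and public addresses, mirroring the "cloud local
// only" selection the report proposes reusing when dialling local mongo.
func cloudLocalOnly(addrs []string) []string {
	var out []string
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil || ip.IsLoopback() || fanNet.Contains(ip) {
			continue
		}
		if ip.IsPrivate() { // RFC 1918 / ULA check, Go 1.17+
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// The three addresses the first new controller dialled in the logs.
	addrs := []string{"127.0.0.1", "253.14.0.1", "172.16.0.14"}
	fmt.Println(cloudLocalOnly(addrs)) // prints: [172.16.0.14]
}
```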