2.9rc2: juju enable-ha times out with new nodes failing to connect to juju api

Bug #1903557 reported by Jason Hobbs on 2020-11-09
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
High
Ian Booth
2.8
High
Ian Booth

Bug Description

on 2.9rc2, enable-ha failed with the two secondary nodes failing to become ready. They installed quickly and booted up, but in machine log we have this repeated:

/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/worker/apicaller/manifold.go:97: [49da11] "machine-2" cannot open api
2020-11-09 15:59:18 INFO juju.state.allwatcher allwatcher.go:1791 allwatcher loaded for model "49da1147-a806-4aa2-837f-8ca499125bfc" in 1m0.004401573s
2020-11-09 15:59:18 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.32.182:37017: i/o timeout
2020-11-09 15:59:19 ERROR juju.worker.modelcache worker.go:366 watcher error: error loading entities for model 49da1147-a806-4aa2-837f-8ca499125bfc: loading subsidiary collections: cache instance data: cannot read all instance data: read tcp 253.253.0.1:60772->253.14.0.1:37017: i/o timeout, getting new watcher
2020-11-09 15:59:33 DEBUG juju.mongo open.go:174 dialled mongodb server at "253.14.0.1:37017"
2020-11-09 15:59:48 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.32.182:37017: i/o timeout
2020-11-09 15:59:49 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.32.182:37017: i/o timeout
2020-11-09 16:00:18 INFO juju.state.allwatcher allwatcher.go:1791 allwatcher loaded for model "49da1147-a806-4aa2-837f-8ca499125bfc" in 1m0.005996788s
2020-11-09 16:00:18 DEBUG juju.mongo open.go:174 dialled mongodb server at "127.0.0.1:37017"
2020-11-09 16:00:19 ERROR juju.worker.modelcache worker.go:366 watcher error: error loading entities for model 49da1147-a806-4aa2-837f-8ca499125bfc: loading subsidiary collections: cache instance data: cannot read all instance data: read tcp 253.253.0.1:60768->253.14.0.1:37017: i/o timeout, getting new watcher
2020-11-09 16:00:33 DEBUG juju.mongo open.go:174 dialled mongodb server at "253.14.0.1:37017"
2020-11-09 16:00:48 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.32.182:37017: i/o timeout
2020-11-09 16:01:03 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.32.182:37017: i/o timeout
2020-11-09 16:01:08 DEBUG juju.worker.apicaller connect.go:115 connecting with current password
2020-11-09 16:01:08 DEBUG juju.api apiclient.go:760 looked up localhost -> [127.0.0.1]
2020-11-09 16:01:08 DEBUG juju.worker.apicaller connect.go:155 [49da11] failed to connect
2020-11-09 16:01:08 DEBUG juju.worker.dependency engine.go:598 "api-caller" manifold worker stopped: [49da11] "machine-2" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-11-09 16:01:08 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [49da11] "machine-2" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-11-09 16:01:08 DEBUG juju.worker.dependency engine.go:673 stack trace:
dial tcp 127.0.0.1:17070: connect: connection refused
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/api/apiclient.go:691:
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/api/apiclient.go:1111:
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/api/apiclient.go:1086: unable to connect to API
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/api/apiclient.go:982:
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/api/apiclient.go:637:
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/api/apiclient.go:215:
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/worker/apicaller/connect.go:156:
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/worker/apicaller/connect.go:211:
/home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/worker/apicaller/manifold.go:97: [49da11] "machine-2" cannot open api
2020-11-09 16:01:18 INFO juju.state.allwatcher allwatcher.go:1791 allwatcher loaded for model "49da1147-a806-4aa2-837f-8ca499125bfc" in 1m0.006181644s
2020-11-09 16:01:19 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.32.182:37017: i/o timeout
2020-11-09 16:01:19 ERROR juju.worker.modelcache worker.go:366 watcher error: error loading entities for model 49da1147-a806-4aa2-837f-8ca499125bfc: loading subsidiary collections: cache instance data: cannot read all instance data: read tcp 253.253.0.1:60786->253.14.0.1:37017: i/o timeout, getting new watcher
2020-11-09 16:01:33 DEBUG juju.mongo open.go:174 dialled mongodb server at "253.14.0.1:37017"
2020-11-09 16:01:48 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.32.182:37017: i/o timeout
2020-11-09 16:01:49 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.32.182:37017: i/o timeout

Example run: https://solutions.qa.canonical.com/testruns/testRun/8c61d24a-f199-46f8-ae05-954372e6158e

description: updated
Changed in juju:
milestone: none → 2.9-rc3
status: New → Triaged
importance: Undecided → High
Ian Booth (wallyworld) wrote :

Currently, when a controller gathers potential IP addresses to try to use to dial mongo, it gets the list of all known addresses available on the host, including localhost, cloud local, fan, public etc. It passes these to the upstream mgo client library, which will try them all and use whatever one can establish a TCP connection. In many cases, this works, but I think it's flawed.

When the controller agent on the first new HA controller starts after installing mongo, it is successful in dialling on 2 addresses

2020-11-09 15:44:17 DEBUG juju.cmd.jujud machine.go:1187 mongodb service is installed
2020-11-09 15:44:17 DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
2020-11-09 15:44:17 DEBUG juju.mongo open.go:174 dialled mongodb server at "253.14.0.1:37017"
2020-11-09 15:44:17 DEBUG juju.mongo open.go:174 dialled mongodb server at "172.16.0.14:37017"
2020-11-09 15:44:17 INFO juju.state open.go:144 using client-side transactions

The correct one that should be used is the cloud local one, 172.16.0.14, not the fan one 253.14.0.1.

Subsequently, the logs seem to indicate that mongo was dialled on on 127.0.0.1 and 253.14.0.1. The correct cloud local address was never used again. And even though dialling on 127.0.0.1 etc was successful, trying to use that session failed. And there's also attempts to dial on 10.244.32.182 which is also incorrect.

The crashdump artefacts don't have this info, but I think that the mongo replicaset will have been configured with the various 172 addresses, ie 172.16.0.14, 172.16.0.175 etc. These should be what's used to dial mongo on each controller, rather than throwing a bunch of addresses at the wall to see what sticks.

On the primary controller which maintains knowledge of the mongo replicaset configuration, the overall set of addresses is filtered to only include cloud local ones when setting up the replicaset. I would argue we should use this selection filter universally when a given controller wants to dial its local mongo instance.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Ian Booth (wallyworld)
Ian Booth (wallyworld) wrote :

So far I can't reproduce this on AWS, but I think we can still clean up how mongo addresses are selected.

Ian Booth (wallyworld) wrote :
Changed in juju:
status: In Progress → Fix Committed
milestone: 2.9-rc3 → 2.8.7
John A Meinel (jameinel) on 2020-11-12
Changed in juju:
milestone: 2.8.7 → 2.9-rc3
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers