bootstrap on slow node fails: "ERROR juju.cmd supercommand.go:304 can't dial mongo to initiate replicaset: no reachable servers"

Bug #1320966 reported by dann frazier
Affects: juju-core
Status: Fix Released
Importance: High
Assigned to: Ian Booth

Bug Description

I'm trying to use juju on a simulated system. The simulated environment is much slower than a real VM, and this appears to cause a timeout to expire while bootstrap is trying to connect to the state server:

2014-05-19 17:23:44 DEBUG juju.worker.peergrouper initiate.go:34 Initiating mongo replicaset; dialInfo &mgo.DialInfo{Addrs:[]string{"127.0.0.1:37017"}, Direct:false, Timeout:30000000000, FailFast:false, Database:"", Source:"", Service:"", Mechanism:"", Username:"", Password:"", DialServer:(func(*mgo.ServerAddr) (net.Conn, error))(nil), Dial:(func(net.Addr) (net.Conn, error))(0x6d15f8)}; memberHostport "ip-10-102-153-94.ec2.internal:37017"; user ""; password ""
2014-05-19 17:23:44 DEBUG juju.state open.go:128 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
2014-05-19 17:23:47 DEBUG juju.state open.go:128 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
[...]
2014-05-19 17:24:14 DEBUG juju.state open.go:128 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
2014-05-19 17:24:14 DEBUG juju.state open.go:128 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
2014-05-19 17:24:15 INFO juju.worker.peergrouper .:0 finished MaybeInitiateMongoServer
2014-05-19 17:24:15 ERROR juju.cmd supercommand.go:304 can't dial mongo to initiate replicaset: no reachable servers
2014-05-19 17:24:16 ERROR juju.provider.common bootstrap.go:118 bootstrap failed: subprocess encountered error code 1
Stopping instance...
2014-05-19 17:24:17 INFO juju.cmd cmd.go:113 Bootstrap failed, destroying environment
2014-05-19 17:24:17 INFO juju.provider.common destroy.go:14 destroying environment "amazon"
2014-05-19 17:24:19 ERROR juju.cmd supercommand.go:304 subprocess encountered error code 1

I've tried bumping up the mongoSocketTimeout and defaultDialTimeout constants in src/launchpad.net/juju-core/state/open.go to 1000 * time.Second, but this did not resolve the issue.

Tags: hs-arm64
Curtis Hovey (sinzui) wrote :

Slow provisioners need to extend the timeouts. Has this been tried? My tests of arm64 images were 10x slower than amd64 on ec2, so long timeouts are probably needed.

Environments that need more time to provision an instance can configure 3 options in environments.yaml. MAAS environments often need to set bootstrap-timeout to 1800.

bootstrap-timeout (default: 600s)
bootstrap-retry-delay (default: 5s)
bootstrap-addresses-delay (default: 10s)
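
For illustration, the three options above might be set in environments.yaml roughly as follows. This is only a sketch: the environment name "amazon" and the ec2 type are taken from the log in the bug description, the 1800 value is the one suggested for MAAS, and the other two values simply restate the defaults. All values are in seconds.

```yaml
# Sketch of an environments.yaml entry (assumed layout, illustrative values)
environments:
  amazon:
    type: ec2
    # Total time to wait for the bootstrap instance (default: 600)
    bootstrap-timeout: 1800
    # Pause between connection attempts to the instance (default: 5)
    bootstrap-retry-delay: 5
    # Pause between refreshes of the instance's addresses (default: 10)
    bootstrap-addresses-delay: 10
```

These settings cover provisioning and the initial ssh connection to the instance. The 30 second dial timeout shown in the mgo.DialInfo log line is a separate value.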

Changed in juju-core:
status: New → Incomplete
dann frazier (dannf) wrote :

I'm using a bootstrap-timeout of 9999999. I haven't modified the -delay settings.

dann frazier (dannf) wrote :

I tried raising bootstrap-addresses-delay to 9999999, but this appeared to just delay how long it took to notice the IP of the instance. This hasn't been an issue for me - ec2 returns the IP as quickly as it would for any normal instance.

I also tried raising bootstrap-retry-delay. This seems to just increase the amount of time it takes to retry the initial ssh after the first ssh fails (because the instance isn't up). This also hasn't been a problem for me - it does take a lot of retries before sshd in the simulated system is available, but it does eventually connect and begin the process.

Returning the bug to NEW state since it looks like the cause is still unknown.

Changed in juju-core:
status: Incomplete → New
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.19.3
Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released