juju-core

bootstrap on slow node fails: "ERROR juju.cmd supercommand.go:304 can't dial mongo to initiate replicaset: no reachable servers"

Bug #1320966 reported by dann frazier on 2014-05-19

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
juju-core	Fix Released	High	Ian Booth	juju-core 1.19.3

Bug Description

I'm trying to use juju on a simulated system. The simulated environment is much slower than a real VM, and this appears to cause a timeout to expire while juju is trying to connect to the state server:

2014-05-19 17:23:44 DEBUG juju.worker.peergrouper initiate.go:34 Initiating mongo replicaset; dialInfo &mgo.DialInfo{Addrs:[]string{"127.0.0.1:37017"}, Direct:false, Timeout:30000000000, FailFast:false, Database:"", Source:"", Service:"", Mechanism:"", Username:"", Password:"", DialServer:(func(*mgo.ServerAddr) (net.Conn, error))(nil), Dial:(func(net.Addr) (net.Conn, error))(0x6d15f8)}; memberHostport "ip-10-102-153-94.ec2.internal:37017"; user ""; password ""
2014-05-19 17:23:44 DEBUG juju.state open.go:128 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
2014-05-19 17:23:47 DEBUG juju.state open.go:128 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
[...]
2014-05-19 17:24:14 DEBUG juju.state open.go:128 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
2014-05-19 17:24:14 DEBUG juju.state open.go:128 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
2014-05-19 17:24:15 INFO juju.worker.peergrouper .:0 finished MaybeInitiateMongoServer
2014-05-19 17:24:15 ERROR juju.cmd supercommand.go:304 can't dial mongo to initiate replicaset: no reachable servers
2014-05-19 17:24:16 ERROR juju.provider.common bootstrap.go:118 bootstrap failed: subprocess encountered error code 1
Stopping instance...
2014-05-19 17:24:17 INFO juju.cmd cmd.go:113 Bootstrap failed, destroying environment
2014-05-19 17:24:17 INFO juju.provider.common destroy.go:14 destroying environment "amazon"
2014-05-19 17:24:19 ERROR juju.cmd supercommand.go:304 subprocess encountered error code 1

I've tried bumping up the mongoSocketTimeout and defaultDialTimeout constants in src/launchpad.net/juju-core/state/open.go to 1000 * time.Second, but this did not resolve the issue.

Curtis Hovey (sinzui) wrote on 2014-05-19:

Slow provisioners need to extend the timeouts. Has this been tried? My tests of arm64 images were 10x slower than amd64 on ec2, so long timeouts are probably needed.

Environments that need more time to provision an instance can configure 3 options in environments.yaml. MAAS environments often need to set bootstrap-timeout to 1800.

bootstrap-timeout (default: 600s)
bootstrap-retry-delay (default: 5s)
bootstrap-addresses-delay (default: 10s)
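
As a rough illustration, the three options above might be set in environments.yaml as in the fragment below. This is a sketch under assumptions: the environment name "amazon" is taken from the log in the bug description, the ec2 type is inferred from it, the values are treated as seconds, and other required settings (credentials, region, and so on) are omitted.

```yaml
environments:
  amazon:
    type: ec2
    # Values are assumed to be in seconds; 1800 is the figure
    # mentioned above for MAAS, the other two are the defaults.
    bootstrap-timeout: 1800
    bootstrap-retry-delay: 5
    bootstrap-addresses-delay: 10
```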

Changed in juju-core:
status:	New → Incomplete

dann frazier (dannf) wrote on 2014-05-19:

I'm using a bootstrap-timeout of 9999999. I haven't modified the -delay settings.

dann frazier (dannf) wrote on 2014-05-20:

I tried raising bootstrap-addresses-delay to 9999999, but this appeared only to lengthen the time it took to notice the IP of the instance. This hasn't been an issue for me: ec2 returns the IP as quickly as it would for any normal instance.

I also tried raising bootstrap-retry-delay. This seems only to increase the time between retries of the initial ssh after the first ssh attempt fails (because the instance isn't up). This also hasn't been a problem for me: it does take a lot of retries before sshd in the simulated system is available, but it does eventually connect and begin the process.

Returning the bug to NEW state since it looks like the cause is still unknown.

Changed in juju-core:
status:	Incomplete → New

Curtis Hovey (sinzui) on 2014-05-20

Changed in juju-core:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 1.19.3

Ian Booth (wallyworld) on 2014-05-23

Changed in juju-core:
assignee:	nobody → Ian Booth (wallyworld)
status:	Triaged → In Progress

Ian Booth (wallyworld) on 2014-05-25

Changed in juju-core:
status:	In Progress → Fix Committed

Curtis Hovey (sinzui) on 2014-05-30

Changed in juju-core:
status:	Fix Committed → Fix Released
