Bootstrap agent initialization timeout too small

Bug #1605335 reported by Nicholas Skaggs
This bug affects 1 person
Affects: Canonical Juju
Status: Won't Fix
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

We are unable to set the timeout for how long a bootstrap will wait for agent initialization. Currently it retries 60 times, giving an effective timeout of about a minute. I would like to increase this timeout but am unable to do so. Can we consider making this an option? That said, a saner default may be better.

At the least, it seems a minute may be a little too short. The default timeout for bootstraps is 10 minutes in total; it would be nice if this timeout also applied to agent initialization (though I believe bootstrap is considered complete by then; perhaps it shouldn't be). Manual provisioning on slower hardware shows the agent taking more than a minute to come up, and sometimes juju fails before it responds.

It's worth trying to figure out why it's so slow, but that answer likely lies outside of juju. Juju should try to support it anyway if possible.

In short:

Change the timeout to a saner default than 60 retries.
Make bootstrap-timeout include the agent-initialization piece and/or add a new config for the agent-initialization piece.

-------

Specifically, I'm referring to this wait:

Bootstrapping Juju machine agent
Starting Juju machine agent (jujud-machine-0)
2016-07-21 15:06:07 INFO cmd cmd.go:129 Bootstrap agent installed
2016-07-21 15:06:07 DEBUG juju.juju api.go:246 API hostnames [10.0.2.15:17070] - resolving hostnames
2016-07-21 15:06:07 INFO juju.juju api.go:268 new API addresses to cache [10.0.2.15:17070]
2016-07-21 15:06:07 INFO juju.juju api.go:77 connecting to API addresses: [10.0.2.15:17070]
2016-07-21 15:06:07 INFO juju.api apiclient.go:520 dialing "wss://10.0.2.15:17070/model/14694040-d75a-4a52-895f-9b5864353ee7/api"
2016-07-21 15:06:07 DEBUG juju.api apiclient.go:526 error dialing "wss://10.0.2.15:17070/model/14694040-d75a-4a52-895f-9b5864353ee7/api", will retry: websocket.Dial wss://10.0.2.15:17070/model/14694040-d75a-4a52-895f-9b5864353ee7/api: dial tcp 10.0.2.15:17070: getsockopt: connection refused
....
ERROR unable to contact api server after 61 attempts: upgrade in progress (upgrade in progress)

Revision history for this message
Nate Finch (natefinch) wrote :

Note, this is the attempt strategy defined in WaitForAgentInitialisation in cmd/juju/common/controller.go

It seems like this should use the timeout from the whole bootstrap command, rather than a hard-coded value as it does now. Having a timeout here seems silly if we have one higher up the stack. This code should just take a cancellation channel and retry until told to stop by the top-level timeout.

Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

If folks are on board, I think removing this separate timeout and incorporating this action as part of the overall bootstrap makes sense.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.0
affects: juju-core → juju
Changed in juju:
milestone: 2.0.0 → none
milestone: none → 2.0.0
Changed in juju:
assignee: nobody → Richard Harding (rharding)
Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

Alexis, this was occurring when I was trying to deploy a large bundle on a slow machine. I wasn't trying to troubleshoot why it was so slow to initialize, but rather trying to see if it would eventually come up or not. I've not tried this recently, but I think the changes / cleanup make sense regardless as the timeout should apply to the bootstrap as a whole.

Changed in juju:
milestone: 2.0.0 → 2.1.0
Changed in juju:
assignee: Richard Harding (rharding) → nobody
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Removing 2.1 milestone as we will not be addressing this issue in 2.1.

Changed in juju:
milestone: 2.1.0 → none
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

The overall bootstrap timeout is now 20 minutes by default and can be set to a different value at bootstrap if needed.

Changed in juju:
status: Triaged → Won't Fix