Bootstrap agent initialization timeout too small

Bug #1605335 reported by Nicholas Skaggs on 2016-07-21
This bug affects 1 person
Affects: juju
Importance: High
Assigned to: Unassigned

Bug Description

We are unable to set the timeout for how long a bootstrap will wait for agent initialization. Currently it retries 60 times, giving an effective timeout of about a minute. I would like to increase this timeout but am unable to do so. Can we consider making this an option? That said, a saner default may be better.

At the least, a minute seems a little too short. The default timeout for bootstraps is 10 minutes in total; it would be nice if this timeout also applied to agent initialization (though I believe bootstrap is considered complete by then; perhaps it shouldn't be). Manual provisioning on slower hardware shows the agent taking more than a minute to come up, and sometimes juju fails before it responds.

It's worth trying to figure out why it's so slow, but that answer likely lies outside of juju. Juju should try to support it anyway if possible.

In short:

Change the timeout to a saner default than 60 retries
Make bootstrap-timeout include the agent-initialization piece and/or add a new config for the agent-initialization piece

-------

Specifically, I'm referring to this wait:

Bootstrapping Juju machine agent
Starting Juju machine agent (jujud-machine-0)
2016-07-21 15:06:07 INFO cmd cmd.go:129 Bootstrap agent installed
2016-07-21 15:06:07 DEBUG juju.juju api.go:246 API hostnames [10.0.2.15:17070] - resolving hostnames
2016-07-21 15:06:07 INFO juju.juju api.go:268 new API addresses to cache [10.0.2.15:17070]
2016-07-21 15:06:07 INFO juju.juju api.go:77 connecting to API addresses: [10.0.2.15:17070]
2016-07-21 15:06:07 INFO juju.api apiclient.go:520 dialing "wss://10.0.2.15:17070/model/14694040-d75a-4a52-895f-9b5864353ee7/api"
2016-07-21 15:06:07 DEBUG juju.api apiclient.go:526 error dialing "wss://10.0.2.15:17070/model/14694040-d75a-4a52-895f-9b5864353ee7/api", will retry: websocket.Dial wss://10.0.2.15:17070/model/14694040-d75a-4a52-895f-9b5864353ee7/api: dial tcp 10.0.2.15:17070: getsockopt: connection refused
....
ERROR unable to contact api server after 61 attempts: upgrade in progress (upgrade in progress)

Nate Finch (natefinch) wrote :

Note, this is the attempt strategy defined in WaitForAgentInitialisation in cmd/juju/common/controller.go

It seems like this should use the timeout from the whole bootstrap command, rather than a hard-coded value as it does now. Having a timeout here seems silly if we have one higher up the stack; this code should just take a cancellation channel and retry until told to stop by the top-level timeout.

Nicholas Skaggs (nskaggs) wrote :

If folks are on board, I think removing this separate timeout and incorporating this action as part of the overall bootstrap makes sense.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.0
affects: juju-core → juju
Changed in juju:
milestone: 2.0.0 → none
milestone: none → 2.0.0
Changed in juju:
assignee: nobody → Richard Harding (rharding)
Nicholas Skaggs (nskaggs) wrote :

Alexis, this was occurring when I was trying to deploy a large bundle on a slow machine. I wasn't trying to troubleshoot why it was so slow to initialize, but rather trying to see if it would eventually come up or not. I've not tried this recently, but I think the changes / cleanup make sense regardless as the timeout should apply to the bootstrap as a whole.

Changed in juju:
milestone: 2.0.0 → 2.1.0
Changed in juju:
assignee: Richard Harding (rharding) → nobody
Anastasia (anastasia-macmood) wrote :

Removing 2.1 milestone as we will not be addressing this issue in 2.1.

Changed in juju:
milestone: 2.1.0 → none