agent install dies randomly on Azure

Bug #1533275 reported by Marco Ceppi
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Andrew Wilkins
Milestone:

Bug Description

When deploying a workload, Juju will occasionally fail to install a machine agent despite the instance coming online.

http://paste.ubuntu.com/14478704/ - cloud-init output

root@machine-9:~# curl -sSfw "tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s" --noproxy * --insecure -o /var/lib/juju/tools/1.26-alpha3-trusty-amd64/tools.tar.gz https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64
tools from https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64 downloaded: HTTP 200; time 1.484s; size 20625085 bytes; speed 13900019.000 bytes/s

After some time, while investigating, I was able to download the agent manually. This happens about once in every 4 instances I launch with 1.26-alpha3.

Cheryl Jennings (cherylj) wrote :

Looks like the instance was not able to reach the state server to download the tools:

+ printf Attempt 5 to download tools from %s...\n https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64
Attempt 5 to download tools from https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64...
+ curl -sSfw tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s --noproxy * --insecure -o /var/lib/juju/tools/1.26-alpha3-trusty-amd64/tools.tar.gz https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64
curl: (7) Failed to connect to 10.0.0.4 port 17070: No route to host

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Marco Ceppi (marcoceppi) wrote :

Yes, but my later test showed that it could, which suggests there is some kind of race condition.

Curtis Hovey (sinzui)
tags: added: azure-provider
Changed in juju-core:
importance: High → Critical
Curtis Hovey (sinzui) wrote :

CI sees this with the newer Azure ARM provider but not with the older one. We can see in the Azure portal that a network is not listed in the resource group.

Cheryl Jennings (cherylj) wrote :

We are now hitting this in CI, and it is preventing us from getting a blessed run:
http://reports.vapour.ws/releases/3524/job/azure-arm-deploy/attempt/109

Changed in juju-core:
milestone: none → 2.0-alpha1
tags: added: ci
Curtis Hovey (sinzui) wrote :

Juju CI saw several failures recently. There might be a timing issue in the creation of the network rules for each machine. The azure-arm-deploy job has a 30-minute timeout. The portal showed that machine 0 always got a network, machine 1 sometimes got one, and machine 2 did not. Maybe machine 2 would have gotten a network if the deployment had waited another 15 minutes. The issue went away after a few hours.

Andrew Wilkins (axwalk) wrote :

> We can see in the azure portal that a network is not listed in the resource group.

I find this difficult to believe. I'm pretty sure VMs in Azure *have* to have a NIC. If they didn't, you wouldn't have a public IP and wouldn't have been able to bootstrap.

Are you looking in the "Summary" for the resource group? If so, that will not display all of the resources if there are more than fit on one page. There's a little expander.

Anyway, I don't doubt there's a network issue; I'm just not sure that it's due to a complete lack of a network. I'll see if I can reproduce the issue locally.

Andrew Wilkins (axwalk) wrote :

I haven't yet reproduced the issue, but I have noticed that sometimes when I add machines, the CLI can't connect to the controller for a period of time. Seems likely to be related.

Andrew Wilkins (axwalk) wrote :

Quick update: I have managed to reproduce the issue several times, but I'm still unsure what's going on.

Andrew Wilkins (axwalk) wrote :

Each time I've reproduced the issue, waiting a little while longer has led to the address becoming routable. I guess Azure is still setting up the routing tables underneath us.

Increasing the number of retries should resolve this. At the moment we retry tools downloads up to 5 times, with 15 seconds between attempts. There's really no good reason to limit the number of retries, because the machine will just sit there idle otherwise. I'll just make it unbounded.
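
For illustration, a rough sketch of what an unbounded retry loop in the generated cloud-init script could look like (this is not the exact script Juju emits; $URL and $DEST are placeholders for the real tools URL and target path):

# Sketch only: keep retrying the tools download until it succeeds,
# pausing 15 seconds between attempts, instead of giving up after 5 tries.
attempt=0
while true; do
    attempt=$((attempt + 1))
    printf 'Attempt %d to download tools from %s...\n' "$attempt" "$URL"
    if curl -sSf --noproxy "*" --insecure -o "$DEST" "$URL"; then
        echo "tools downloaded after $attempt attempt(s)"
        break
    fi
    sleep 15
done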

Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → In Progress
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-alpha1 → 2.0-alpha2
Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: 2.0-count
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha2 → none
milestone: none → 2.0-alpha2