agent install dies randomly on Azure

Bug #1533275 reported by Marco Ceppi
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Andrew Wilkins
Milestone:

Bug Description

When deploying a workload, Juju will occasionally fail to install a machine agent despite the instance coming online.

http://paste.ubuntu.com/14478704/ - cloud-init output

root@machine-9:~# curl -sSfw "tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s" --noproxy * --insecure -o /var/lib/juju/tools/1.26-alpha3-trusty-amd64/tools.tar.gz https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64
tools from https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64 downloaded: HTTP 200; time 1.484s; size 20625085 bytes; speed 13900019.000 bytes/s

After some time, while investigating, I was able to download the agent manually. This happens about once in every 4 instances I launch with 1.26-alpha3.

Cheryl Jennings (cherylj) wrote :

Looks like the instance was not able to reach the state server to download the tools:

+ printf Attempt 5 to download tools from %s...\n https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64
Attempt 5 to download tools from https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64...
+ curl -sSfw tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s --noproxy * --insecure -o /var/lib/juju/tools/1.26-alpha3-trusty-amd64/tools.tar.gz https://10.0.0.4:17070/tools/1.26-alpha3-trusty-amd64
curl: (7) Failed to connect to 10.0.0.4 port 17070: No route to host

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Marco Ceppi (marcoceppi) wrote :

Yes, but my later test showed that it could, which suggests there is some kind of race condition.

Curtis Hovey (sinzui)
tags: added: azure-provider
Changed in juju-core:
importance: High → Critical
Curtis Hovey (sinzui) wrote :

CI sees this with the newer Azure ARM provider but not with the older one. We can see in the Azure portal that a network is not listed in the resource group.

Cheryl Jennings (cherylj) wrote :

We are now hitting this in CI, and it is preventing us from getting a blessed run:
http://reports.vapour.ws/releases/3524/job/azure-arm-deploy/attempt/109

Changed in juju-core:
milestone: none → 2.0-alpha1
tags: added: ci
Curtis Hovey (sinzui) wrote :

Juju CI saw several failures recently. There might be a timing issue in the creation of the network rules for each machine. The azure-arm-deploy job has a 30-minute timeout. The portal showed that machine 0 always got a network, machine 1 sometimes got one, and machine 2 did not. Maybe machine 2 would have gotten a network if the deployment had waited another 15 minutes. The issue went away after a few hours.

Andrew Wilkins (axwalk) wrote :

> We can see in the azure portal that a network is not listed in the resource group.

I find this difficult to believe. I'm pretty sure VMs in Azure *have* to have a NIC. If they didn't, you wouldn't have a public IP and wouldn't have been able to bootstrap.

Are you looking in the "Summary" for the resource group? If so, that will not display all of the resources if there are more than fit on one page. There's a little expander.

Anyway, I don't doubt there's a network issue; I'm just not sure that it's due to a complete lack of a network. I'll see if I can reproduce the issue locally.

Andrew Wilkins (axwalk) wrote :

I haven't yet reproduced the issue, but I have noticed that sometimes when I add machines, the CLI can't connect to the controller for a period of time. Seems likely to be related.

Andrew Wilkins (axwalk) wrote :

Quick update: I have managed to reproduce the issue several times, but I'm still unsure what's going on.

Andrew Wilkins (axwalk) wrote :

Each time I've reproduced the issue, waiting a little while longer has led to the address becoming routable. I guess Azure is still setting up the routing tables underneath us.

Increasing the number of retries should resolve this. At the moment we retry tools downloads up to 5 times, with 15 seconds between attempts. There's really no good reason to limit the number of retries, because the machine will just sit there idle otherwise. I'll just make it unbounded.
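
For illustration, a rough sketch of what an unbounded retry loop in the generated cloud-init script could look like (this is not the exact script Juju emits; $URL and $DEST are placeholders for the real tools URL and target path):

# Sketch only: keep retrying the tools download until it succeeds,
# pausing 15 seconds between attempts, instead of giving up after 5 tries.
attempt=0
while true; do
    attempt=$((attempt + 1))
    printf 'Attempt %d to download tools from %s...\n' "$attempt" "$URL"
    if curl -sSf --noproxy "*" --insecure -o "$DEST" "$URL"; then
        echo "tools downloaded after $attempt attempt(s)"
        break
    fi
    sleep 15
done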

Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → In Progress
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-alpha1 → 2.0-alpha2
Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: 2.0-count
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha2 → none
milestone: none → 2.0-alpha2