cloud-init cannot always use private ip address to fetch tools (ec2 provider)

Bug #1566431 reported by Roger Peppe
This bug affects 2 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Eric Snow

Bug Description

In EC2, instances started using different accounts can be network-isolated
from one another. When starting instances in a model that uses different
credentials than the controller, the instances seem to use the private
address to fetch the tools, which fails in this case.

This means that a central use case driving multiple models is currently
broken.

This doesn't seem to apply to all regions - us-east-1 isn't affected
but eu-central-1 is.

Example from /var/log/cloud-init-output.log:

 + mkdir -p /var/lib/juju/tools/1.25.4.1-trusty-amd64
 + echo Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' --noproxy "*" --insecure -o $bin/tools.tar.gz <[https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64]>
 Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' --noproxy "*" --insecure -o $bin/tools.tar.gz <[https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64]>
 + seq 5
 + printf Attempt 1 to download tools from %s...\n https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64
 Attempt 1 to download tools from https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64...
 + curl -sSfw tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s --noproxy * --insecure -o /var/lib/juju/tools/1.25.4.1-trusty-amd64/tools.tar.gz https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64
 curl: (7) Failed to connect to 172.31.14.160 port 17070: No route to host
 tools from https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64 downloaded: HTTP 000; time 2.202s; size 0 bytes; speed 0.000 bytes/s + [ 1 -lt 5 ]
 + echo Download failed..... wait 15s
 Download failed..... wait 15s
 + sleep 15
 + printf Attempt 2 to download tools from %s...\n https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64
 Attempt 2 to download tools from https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64...
 + curl -sSfw tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s --noproxy * --insecure -o /var/lib/juju/tools/1.25.4.1-trusty-amd64/tools.tar.gz https://172.31.14.160:17070/tools/1.25.4.1-trusty-amd64
 curl: (7) Failed to connect to 172.31.14.160 port 17070: No route to host
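
For reference, the controller address in the log above lies in a private range, which fits the "No route to host" failure seen from an isolated network. The following is a minimal Python sketch, using only the standard library and not taken from Juju, that classifies the two kinds of controller address that appear in this report. The helper name is_private is made up for illustration.

```python
import ipaddress


def is_private(addr: str) -> bool:
    """Return True if addr is in a range that Python's ipaddress module
    treats as private (this includes RFC 1918 ranges such as 172.16.0.0/12)."""
    return ipaddress.ip_address(addr).is_private


if __name__ == "__main__":
    # Addresses copied from the logs in this report.
    for addr in ("172.31.14.160", "52.28.11.215"):
        print(addr, "private" if is_private(addr) else "public")
```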

Revision history for this message
Cheryl Jennings (cherylj) wrote :

There was a problem in the joyent provider where hosted models weren't inheriting the region endpoint properly, and were therefore spinning up instances in different regions. The instances couldn't communicate on the private IPs because of this. The fix in that scenario was to force hosted models to use the same region (see bug #1561611).

If we plan to allow users to create models in different regions then yeah, we should find a way to use the public IP to talk to the controller in those cases.

Revision history for this message
Roger Peppe (rogpeppe) wrote :

This is a problem even in the same region when different credentials are used in the model vs the controller. I think that the solution is to return both public and private addresses from the APIInfo call, so the tools download logic will try both before failing.
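
The "try both before failing" idea can be sketched roughly as below. This is an illustrative Python sketch under stated assumptions, not Juju's download logic (which is Go plus generated shell). The names fetch_first_reachable and fake_fetch are hypothetical, and the downloader is passed in as a callable so the fallback order can be exercised without a network.

```python
from typing import Callable, List, Sequence, Tuple


def fetch_first_reachable(
    urls: Sequence[str],
    fetch: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Try each URL in order; return (url, payload) for the first success.

    Raises OSError listing every failure if no URL works.
    """
    failures: List[str] = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as err:  # urllib's URLError/HTTPError are OSError subclasses
            failures.append(f"{url}: {err}")
    raise OSError("all tools URLs failed: " + "; ".join(failures))


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real download (assumption): private addresses are unreachable."""
    if url.startswith("https://172.31."):
        raise OSError("No route to host")
    return b"tools.tar.gz contents"
```

With the private URL listed first and a public one second, the private attempt fails and the public one is returned; if every URL fails, the caller sees a single error that names each attempt.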

Changed in juju-core:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.0-beta4
Changed in juju-core:
assignee: nobody → Eric Snow (ericsnowcurrently)
status: Triaged → In Progress
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

I have not been able to reproduce this. I tried bootstrapping with one set of credentials and then switching to a different set, but the original set is still used. I also tried the same thing using a different Juju user, but the outcome was the same.

Changed in juju-core:
status: In Progress → Incomplete
assignee: Eric Snow (ericsnowcurrently) → Roger Peppe (rogpeppe)
Changed in juju-core:
assignee: Roger Peppe (rogpeppe) → Eric Snow (ericsnowcurrently)
status: Incomplete → In Progress
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Repro:

juju bootstrap ec2-eu aws/eu-central-1 --upload-tools --credential aws:<email address hidden>
juju create-model testing --credential aws:<email address hidden>
juju add-machine
juju ssh 0 "tail -f /var/log/cloud-init-output.log"

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

As Roger pointed out, we should be trying both the private and the public IP addresses for the controller. However, we are only trying the private one, which apparently doesn't work by default across AWS accounts. The code that generates the cloud-init commands for the attempts relies on the list provided to it:

https://github.com/juju/juju/blob/master/cloudconfig/userdatacfg_unix.go#L248

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

In that code path the following happens:

https://github.com/juju/juju/blob/master/apiserver/common/addresses.go#L69
https://github.com/juju/juju/blob/master/network/address.go#L417
https://github.com/juju/juju/blob/master/network/address.go#L471
https://github.com/juju/juju/blob/master/network/address.go#L478

This means that a single address is selected as the "best" one for each controller API host. The algorithm we use is selecting the private address over the public one, assuming that both were stored in state when each controller was provisioned.
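
To make the behaviour concrete, here is a small Python sketch of "one best address per controller, private preferred". It is an assumption-laden illustration and not the code in network/address.go: the scope labels, the preference table and the function name select_one_per_controller are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical scope labels and preference order (lower value wins).
SCOPE_PREFERENCE: Dict[str, int] = {"cloud-local": 0, "public": 1}

HostPort = Tuple[str, int, str]  # (host, port, scope)


def select_one_per_controller(
    controllers: List[List[HostPort]],
) -> List[Optional[HostPort]]:
    """Pick exactly one host/port per controller, preferring cloud-local."""
    chosen: List[Optional[HostPort]] = []
    for host_ports in controllers:
        # min() returns the most preferred entry, or None for an empty list.
        best = min(
            host_ports,
            key=lambda hp: SCOPE_PREFERENCE[hp[2]],
            default=None,
        )
        chosen.append(best)
    return chosen
```

Because only one entry survives per controller, a client that cannot route to the cloud-local address is left with nothing else to try.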

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

So, as Roger suggested, an obvious solution is to use network.SelectInternalHostPorts() instead of SelectInternalHostPort():

https://github.com/juju/juju/blob/master/apiserver/common/addresses.go#L69

I'll write a patch to do so and verify that it works.

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Ah, bestAddressIndexes() limits the returned addresses to the best scope. We need something that prioritizes by scope instead.
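
The difference between "limit to the best scope" and "prioritize by scope" can be illustrated with the Python sketch below. It is not Juju's bestAddressIndexes(); the scope names, the ranking table and both function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical ranking: lower rank is preferred for controller traffic.
SCOPE_RANK: Dict[str, int] = {"cloud-local": 0, "public": 1}

Address = Tuple[str, str]  # (host, scope)


def best_scope_only(addrs: List[Address]) -> List[Address]:
    """Keep only the addresses whose scope has the best rank present."""
    if not addrs:
        return []
    best = min(SCOPE_RANK[scope] for _, scope in addrs)
    return [a for a in addrs if SCOPE_RANK[a[1]] == best]


def prioritized_by_scope(addrs: List[Address]) -> List[Address]:
    """Keep every address, ordered from most to least preferred scope.

    sorted() is stable, so addresses of equal scope keep their input order.
    """
    return sorted(addrs, key=lambda a: SCOPE_RANK[a[1]])
```

The first function drops the public address whenever a cloud-local one exists, while the second keeps it as a fallback behind the cloud-local one.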

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Hmm. Looks like a similar problem is happening in the upgrade worker.

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Gotta run. My next step was to update apiserver/common/tools.go to use a list of tools URLs rather than a single one. This is similar to what we do in the provisioner. It might also make sense to directly use the tools download API rather than the URL provided for each Tools.
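
As a rough illustration of returning a list of tools URLs rather than a single one, the Python sketch below builds one URL per controller address. It is not the code in apiserver/common/tools.go; the function name tools_urls is hypothetical, and the URL layout is copied from the machine log entries in this report.

```python
from typing import List


def tools_urls(host_ports: List[str], model_uuid: str, version: str) -> List[str]:
    """Build one tools URL per controller address, preserving address order."""
    return [
        f"https://{host_port}/model/{model_uuid}/tools/{version}"
        for host_port in host_ports
    ]
```

A caller would then walk the list in order and stop at the first URL that downloads successfully.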

Changed in juju-core:
assignee: Eric Snow (ericsnowcurrently) → nobody
status: In Progress → Triaged
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

FYI, repeated log entries from the machine log:

2016-04-11 23:22:44 INFO juju.worker.upgrader upgrader.go:178 desired tool version: 2.0-beta4.1
2016-04-11 23:22:44 INFO juju.worker.upgrader upgrader.go:199 upgrade requested from 2.0-beta4.3 to 2.0-beta4.1
2016-04-11 23:22:44 INFO juju.worker.upgrader upgrader.go:251 fetching tools from "https://172.31.13.178:17070/model/6babf452-5dee-412f-8e69-3f8a8c316f26/tools/2.0-beta4.1-trusty-amd64"
2016-04-11 23:22:47 ERROR juju.worker.upgrader upgrader.go:222 failed to fetch tools from "https://172.31.13.178:17070/model/6babf452-5dee-412f-8e69-3f8a8c316f26/tools/2.0-beta4.1-trusty-amd64": Get https://172.31.13.178:17070/model/6babf452-5dee-412f-8e69-3f8a8c316f26/tools/2.0-beta4.1-trusty-amd64: dial tcp 172.31.13.178:17070: getsockopt: no route to host

Changed in juju-core:
assignee: nobody → Eric Snow (ericsnowcurrently)
status: Triaged → In Progress
Changed in juju-core:
milestone: 2.0-beta4 → 2.0-rc1
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Getting close:

2016-04-13 19:14:12 INFO juju.worker.upgrader upgrader.go:178 desired tool version: 2.0-beta4.1
2016-04-13 19:14:12 INFO juju.worker.upgrader upgrader.go:199 upgrade requested from 2.0-beta4.2 to 2.0-beta4.1
2016-04-13 19:14:12 INFO juju.worker.upgrader upgrader.go:253 fetching tools from "https://172.31.7.120:17070/model/3190f4ee-98c1-4992-8630-7d31a40d9b8e/tools/2.0-beta4.1-trusty-amd64"
2016-04-13 19:14:15 ERROR juju.worker.upgrader upgrader.go:223 failed to fetch tools from "https://172.31.7.120:17070/model/3190f4ee-98c1-4992-8630-7d31a40d9b8e/tools/2.0-beta4.1-trusty-amd64": Get https://172.31.7.120:17070/model/3190f4ee-98c1-4992-8630-7d31a40d9b8e/tools/2.0-beta4.1-trusty-amd64: dial tcp 172.31.7.120:17070: getsockopt: no route to host
2016-04-13 19:14:15 INFO juju.worker.upgrader upgrader.go:253 fetching tools from "https://52.28.11.215:17070/model/3190f4ee-98c1-4992-8630-7d31a40d9b8e/tools/2.0-beta4.1-trusty-amd64"
2016-04-13 19:14:18 ERROR juju.worker.upgrader upgrader.go:223 failed to fetch tools from "https://52.28.11.215:17070/model/3190f4ee-98c1-4992-8630-7d31a40d9b8e/tools/2.0-beta4.1-trusty-amd64": bad HTTP response: 400 Bad Request

This implies to me that the endpoint isn't exposed on the public address, which would mean tweaking it to make it work. However, that may not be the correct approach. It may be more appropriate to sort out the networking under the hood so that the instances under two different AWS users are on the same cloud-local network, or are at least routable.

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

So it's not an access issue:

machine-0: 2016-04-13 19:22:14 ERROR juju.apiserver tools.go:50 GET(/model/3190f4ee-98c1-4992-8630-7d31a40d9b8e/tools/2.0-beta4.1-trusty-amd64?%3Amodeluuid=3190f4ee-98c1-4992-8630-7d31a40d9b8e&%3Aversion=2.0-beta4.1-trusty-amd64&) failed: error fetching tools: no matching tools available

At this point the latest uploaded tools were 2.0-beta4.3-trusty-amd64. It's not immediately clear to me why it's trying to get the original version (beta4.1 rather than beta4.3).

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Ah, I just needed to run "juju upgrade-juju" for the model after having done so for the controller. At this point I have the fix ready:

http://reviews.vapour.ws/r/4576/

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta5 → none
milestone: none → 2.0-beta5