CloudStack provider cannot determine correct metadata IP with multiple network interfaces

Bug #1839854 reported by Joshua Hügli
This bug affects 2 people
Affects: cloud-init
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

[Problem]
When multiple network interfaces are present in a CloudStack VM, cloud-init randomly chooses the gateway address to fetch the metadata from. This is not a problem when all network interfaces offer metadata. However, if a shared network interface is attached to the VM, the gateway on that interface doesn't serve the metadata. Cloud-init will time out waiting for a response from the gateway and will not apply metadata to the host.

[How to reproduce]
- Create VM with 1x Isolated and 1x Shared Network
- Ensure cloud-init is installed in the VM and CloudStack is configured as a metadata provider
- Boot VM

[Expected result]
- VM should boot and apply metadata from CloudStack

[Observed result]
- cloud-init sometimes chooses the wrong metadata server IP
- cloud-init delays startup waiting for a response
- metadata isn't applied
- cloud-init service fails

[Notes]
I noticed that in "cloudinit/sources/DataSourceCloudStack.py" get_vr_address() the dhcp lease option is preferred over the default gateway. Wouldn't it be smarter to just always use "get_default_gateway()"?
Until recently we used cloud-init 0.7.5, but since NetworkManager lease support was introduced we have been running into this problem. (https://github.com/cloud-init/cloud-init/commit/33816e96d8981918f734dab3ee1a967bce85451a#diff-5bc9de2bb7889d66205845400c7cf99bR182)
Up to that point, cloud-init had always used the default_gateway method.
CentOS 7 has only recently updated cloud-init in its repos, so we were stuck on this old version for a long time.

Maybe it would be nice to have a configuration option to choose between the methods manually?
It would also be good if, when the chosen server does not respond, cloud-init fell back to the next available DHCP lease.
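
The fallback idea could look roughly like the sketch below. This is not cloud-init code: `first_responding_server`, `tcp_probe`, and the candidate list are hypothetical names used for illustration, and the probe is assumed to be a plain TCP connect on port 80 with a short timeout.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(address: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to address:port succeeds in time."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def first_responding_server(
    candidates: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first candidate for which probe() succeeds, else None."""
    for address in candidates:
        if probe(address):
            return address
    return None
```

With candidates taken from every DHCP lease rather than only the latest one, a gateway without metadata would cost one short timeout instead of failing the whole run.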

[Attachment]
We added some files for debugging as a tar.gz.

Joshua Hügli (joschi36) wrote :
Dan Watkins (oddbloke) wrote :

Hi Joshua,

Thanks for using cloud-init, and for the detailed bug report!

> When multiple network interfaces are present in a CloudStack VM, cloud-init
> randomly chooses the gateway address to fetch the metadata from.

The "random"ness here isn't contained within cloud-init per se. It will
deterministically select the most recent DHCP lease and use that to access
metadata. The problem is that the order in which the DHCP leases complete is
not stable, so cloud-init's assumption here is mistaken.
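
A toy illustration of that point, not cloud-init's actual lease parsing: if the rule is "take the last lease that completed", the answer depends entirely on which interface finishes DHCP last. The lease tuples, interface names, and addresses are made up, and the sketch assumes the metadata address is taken from the lease's DHCP server address.

```python
from typing import List, Optional, Tuple

# (interface, DHCP server address); list order reflects when DHCP completed.
Lease = Tuple[str, str]


def latest_lease_server(leases: List[Lease]) -> Optional[str]:
    """Return the DHCP server address of the most recently completed lease."""
    if not leases:
        return None
    return leases[-1][1]


isolated = ("eth0", "10.1.1.1")    # hypothetical router that serves metadata
shared = ("eth1", "192.168.0.1")   # hypothetical gateway without metadata

# Same two leases, different completion order, different answer.
boot_a = latest_lease_server([shared, isolated])
boot_b = latest_lease_server([isolated, shared])
```

Here `boot_a` picks the isolated network's router and `boot_b` picks the shared network's gateway, although the selection logic itself is deterministic.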

> I noticed that in "cloudinit/sources/DataSourceCloudStack.py"
> get_vr_address() the dhcp lease option is preferred over the default gateway.
> Wouldn't it be smarter to just always use "get_default_gateway()"?

Using the DHCP leases is the documented way of finding the metadata server in
the CloudStack docs[0]. I don't know for sure, but I believe the
get_default_gateway() path is there for cases where DHCP has failed (or,
perhaps, for static network configuration?). My intuition is that the first
interface to successfully DHCP would get the default route; do you send some
DHCP route configuration to make the default gateway stable?
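
For reference, a default gateway can be read from a Linux routing table in the /proc/net/route format, where addresses are little-endian hex. The following is a sketch under that assumption and is not presented as the body of get_default_gateway(); the function name and the sample table are illustrative.

```python
import socket
import struct
from typing import Optional


def default_gateway_from_route_table(text: str) -> Optional[str]:
    """Return the gateway of the first default route (destination 0.0.0.0)."""
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        destination, gateway = fields[1], fields[2]
        if destination == "00000000":
            # The hex value is little-endian, so pack it as "<L" before
            # converting to dotted-quad notation.
            return socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
    return None


# Made-up sample in the /proc/net/route column layout (trailing columns omitted).
SAMPLE = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\n"
    "eth0\t00000000\t0101A8C0\t0003\t0\t0\t0\t00000000\n"
    "eth0\t0001A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\n"
)
```

Only the first default route is returned, so with several interfaces the result still depends on how the routes were installed.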

> Maybe it would be nice to have a configuration option to choose between the
> methods manually?

This sounds feasible, as a datasource configuration option that you could bake
into your images/templates. I'd prefer for us to work out if there's a way we
can get this right without requiring configuration, though, because otherwise
new CloudStack operators (or image/template builders) have to discover this
option themselves when they're running in an environment that requires it. Do
you know if there's a way that we can tell which DHCP lease is the right one to
use?

(Marking this Incomplete for now, please move it back to New when you respond!)

Thanks!

Dan

[0] http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/4.8/virtual_machines/user-data.html

Changed in cloud-init:
status: New → Incomplete
Joshua Hügli (joschi36) wrote :

Hello Dan

We found another, better-looking way to determine the data-server address.
This method uses a DNS entry called data-server.

I've created a merge request which adds this as the preferred method to determine the address.
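
As a rough sketch of the DNS-based lookup (not the code in the merge request): resolve the name data-server and use the first address returned. The function name `resolve_data_server`, the port 80, and the injectable `resolver` parameter are assumptions made for illustration and testing.

```python
import socket
from typing import Callable, Optional


def resolve_data_server(
    hostname: str = "data-server",
    resolver: Callable = socket.getaddrinfo,
) -> Optional[str]:
    """Return the first address the hostname resolves to, or None."""
    try:
        results = resolver(hostname, 80)
    except socket.gaierror:
        # Name does not resolve: the caller can fall back to another method.
        return None
    if not results:
        return None
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return results[0][4][0]
```

A caller could try this first and fall back to the DHCP lease or default gateway when it returns None.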

Changed in cloud-init:
status: Incomplete → New
Ryan Harper (raharper)
Changed in cloud-init:
status: New → In Progress
Ryan Harper (raharper)
Changed in cloud-init:
status: In Progress → Fix Committed
Brett Holman (holmanb)
Changed in cloud-init:
status: Fix Committed → Fix Released
James Falcon (falcojr) wrote :