Cloud-init should not setup ephemeral ipv4 if apply_network_config is False for OpenStack

Bug #1821102 reported by Andy Botting on 2019-03-20
This bug affects 2 people
Affects: cloud-init
Importance: High
Assigned to: Unassigned

Bug Description

As fixed in bug #1749717, cloud-init will attempt to configure an ephemeral ipv4 address on the first interface to fetch OpenStack (and probably others) networking config via a metadata URL.

There are a couple of issues with this implementation that affect our OpenStack cloud.

Access to our metadata server on 169.254.169.254 is provided by an additional route delivered by DHCP, which cloud-init's dhcp.py does not configure (that is probably another bug).

Also, we needed to bump up the timeouts for accessing our metadata, as we're a largeish cloud and the defaults were way too low. We actually copied the timeout/retry values from the Ec2 Datasource.

So the result is that users are left waiting for the cloud-init-local stage to time out (2 minutes in our config), because the additional route to the metadata server isn't configured.

I believe a simple fix for this situation would be to skip the ephemeral ipv4 setup if the datasource config has apply_network_config: False.

Ryan Harper (raharper) wrote :

Thank you for filing a bug.

Would you be able to provide the output from 'cloud-init collect-logs' and attach the tarball?

Changed in cloud-init:
importance: Undecided → High
status: New → Incomplete
Andy Botting (andybotting) wrote :

Logs attached.

Thanks

Ryan Harper (raharper) on 2019-03-25
Changed in cloud-init:
status: Incomplete → New
Ryan Harper (raharper) wrote :

Thanks for the logs.

The dhcp response currently results in the following setup:

Received dhcp lease on eth0 for 130.56.248.107/255.255.240.0
Attempting setup of ephemeral network on eth0 with 130.56.248.107/20 brd 130.56.255.255
Running command ['ip', '-family', 'inet', 'addr', 'add', '130.56.248.107/20', 'broadcast', '130.56.255.255', 'dev', 'eth0']
Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up']
Running command ['ip', 'route', 'show', '0.0.0.0/0']
Running command ['ip', '-4', 'route', 'add', '130.56.240.1', 'dev', 'eth0', 'src', '130.56.248.107']
Running command ['ip', '-4', 'route', 'add', 'default', 'via', '130.56.240.1', 'dev', 'eth0']

Note, cloud-init is running dhclient; what additional route is not being applied?

If the additional route is not provided, why would the second datasource crawl in init-net stage succeed?

Wouldn't increasing the timeouts for the initial crawl (or fixing the missing static route) in init-local suffice?

Changed in cloud-init:
status: New → Incomplete
Andy Botting (andybotting) wrote :

It never works in the first stage because the static route to
169.254.169.254 isn't set up. This is delivered by dhcp, which isn't
specifically handled by cloud-init. I verified this by looking at the
output of dhclient.

It works later because the interface is brought up properly by the system
between the first and second stages. We use systemd networkd here with a
configuration to simply setup ipv4 by dhcp on all eth* interfaces, which
does correctly apply the route.

I can provide more debugging around the interfaces/routes if you like.

Ryan Harper (raharper) wrote :

On Thu, Apr 4, 2019 at 3:36 PM Andy Botting <email address hidden> wrote:

> It never works in the first stage because the static route to
> 169.254.169.254 isn't set up. This is delivered by dhcp, which isn't
> specifically handled by cloud-init. I verified this by looking at the
> output of dhclient.
>

Can you provide the full DHCP response/lease file? We do run
dhclient with -sf /bin/true to avoid modifying the root filesystem
before we've parsed the network configuration.

However, if there's an optional route present in the response, then
maybe cloud-init should extract that and apply the route (which
is what /sbin/dhclient-script is doing).

> It works later because the interface is brought up properly by the system
> between the first and second stages. We use systemd networkd here with a
> configuration to simply setup ipv4 by dhcp on all eth* interfaces, which
> does correctly apply the route.
>

Is this baked in networkd configuration a workaround or do you have a
custom image?

>
> I can provide more debugging around the interfaces/routes if you like.
>

Thanks; additional info on the contents of the DHCP server response,
and `ip route show` output would be helpful here.

Andy Botting (andybotting) wrote :

Hi Ryan,

> Can you provide the full DHCP response/lease file? We do run
> dhclient with -sf /bin/true to avoid modifying the root filesystem
> before we've parsed the network configuration.

Absolutely, here it is.

From systemd:
# cat /run/systemd/netif/leases/2
# This is private data. Do not parse.
ADDRESS=130.56.248.145
NETMASK=255.255.240.0
ROUTER=130.56.240.1
SERVER_ADDRESS=130.56.248.255
NEXT_SERVER=130.56.248.255
BROADCAST=130.56.255.255
MTU=9000
T1=907200
T2=1587600
LIFETIME=1814400
DNS=150.203.1.10 8.8.8.8
DOMAINNAME=openstacklocal
HOSTNAME=host-130-56-248-145
ROUTES=169.254.169.254/32,130.56.248.255 0.0.0.0/0,130.56.240.1
CLIENTID=ffb55e67ff00020000ab11982e99e5ba937fb2

Simulating cloud-init with ./dhclient -1 -v -lf dhcp.leases -pf dhclient.pid eth0 -sf /bin/true:
lease {
  interface "eth0";
  fixed-address 130.56.248.145;
  option subnet-mask 255.255.240.0;
  option routers 130.56.240.1;
  option dhcp-lease-time 1814400;
  option dhcp-message-type 5;
  option domain-name-servers 150.203.1.10,8.8.8.8;
  option dhcp-server-identifier 130.56.248.255;
  option interface-mtu 9000;
  option dhcp-renewal-time 907200;
  option rfc3442-classless-static-routes 32,169,254,169,254,130,56,248,255,0,130,56,240,1;
  option broadcast-address 130.56.255.255;
  option dhcp-rebinding-time 1587600;
  option host-name "host-130-56-248-145";
  option domain-name "openstacklocal";
  renew 6 2019/04/13 01:05:11;
  rebind 2 2019/04/23 06:38:57;
  expire 4 2019/04/25 21:38:57;
}

> However, if there's an optional route present in the response, then
> maybe cloud-init should extract that and apply the route (which
> is what /sbin/dhclient-script is doing).

Yeah, I thought about that too. For us, we don't need that functionality, but I guess others might. I did look at potentially doing that, but I hadn't worked out the format of the rfc3442-classless-static-routes yet.
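
For reference, the rfc3442-classless-static-routes value in the dhclient lease above is a flat list of octets. Per RFC 3442, each route is encoded as a prefix length, then only the significant octets of the destination, then four octets of gateway. The following is a minimal sketch of decoding it; parse_rfc3442 is a hypothetical helper name used for illustration and is not cloud-init's API.

```python
from typing import List, Tuple


def parse_rfc3442(value: str) -> List[Tuple[str, str]]:
    """Decode a dhclient rfc3442-classless-static-routes string.

    Returns a list of (destination CIDR, gateway) pairs.
    """
    tokens = value.strip().rstrip(";").split(",")
    octets = [int(tok) for tok in tokens if tok.strip()]
    routes = []
    i = 0
    while i < len(octets):
        prefix_len = octets[i]
        if not 0 <= prefix_len <= 32:
            raise ValueError("invalid prefix length: %d" % prefix_len)
        # Only ceil(prefix_len / 8) destination octets are transmitted.
        significant = (prefix_len + 7) // 8
        end = i + 1 + significant + 4
        if end > len(octets):
            raise ValueError("truncated route entry at index %d" % i)
        # Pad the destination back out to four octets.
        dest = octets[i + 1:i + 1 + significant] + [0] * (4 - significant)
        gateway = octets[i + 1 + significant:end]
        routes.append((
            "%s/%d" % (".".join(map(str, dest)), prefix_len),
            ".".join(map(str, gateway)),
        ))
        i = end
    return routes


lease_value = "32,169,254,169,254,130,56,248,255,0,130,56,240,1"
print(parse_rfc3442(lease_value))
# [('169.254.169.254/32', '130.56.248.255'), ('0.0.0.0/0', '130.56.240.1')]
```

The decoded pairs match the ROUTES line in the systemd lease file above.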

> Is this baked in networkd configuration a workaround or do you have a
> custom image?

We do build our own images, with optimisations for running on our cloud. We try to make it as light-touch as possible, but in terms of networking, we have been setting the newer distros to use systemd-networkd instead. It allows us to do DHCP on all eth* interfaces, so if users attach multiple networks, they get configured right away by the OS.

Launchpad Janitor (janitor) wrote :

[Expired for cloud-init because there has been no activity for 60 days.]

Changed in cloud-init:
status: Incomplete → Expired
Changed in cloud-init:
status: Expired → Confirmed
Andy Botting (andybotting) wrote :

I've reopened this, and marked it as confirmed. Please let me know if there's any more information I can provide.

I'm happy to look into fixing this if I know which direction we should take: should we implement the routes, or allow a way of ignoring metadata in the first stage?

Thanks

Ryan Harper (raharper) wrote :

Thanks for re-opening; after your response to my request for information, you can move the bug state back to New. Sorry for the trouble and thanks for following up.

I do think that our EphemeralDHCP handler will need to detect and handle the static route options; this should ensure that we can crawl the metadata service early and get the full config.

Next, your request to run dhcp on all interfaces can be done via network-config from your metadata service, and cloud-init can read that openstack network-config and render that config. In Ubuntu Bionic and newer, cloud-init will render netplan config which then handles configuring systemd-networkd, and on Xenial, cloud-init renders /etc/network/interfaces.

What does your network-config from the metadata service look like (standard openstack network_data.json)?

Changed in cloud-init:
status: Confirmed → Incomplete
Andy Botting (andybotting) wrote :

Here's an example of network_data.json from our flat networking setup. This provides a public routable IP address to the instance.

Andy Botting (andybotting) wrote :

This is an example of network_data.json for a private network.

Ryan Harper (raharper) on 2019-06-04
Changed in cloud-init:
status: Incomplete → Triaged
Ryan Harper (raharper) wrote :

Thanks.

The "flat" network looks to be just "dhcp" on the interface with the specified MAC. Note that, since this specifies DHCP, it's not clear whether the DHCP response would include DNS settings that match what's also defined in the JSON.

And the "private" one is also straightforward.

How does either config interact with your "systemd-networkd dhcp on everything" changes?

We do have some in-progress work on handling hotplug interfaces in OpenStack[1] and updating the system config via updated network_data.json. However, in the DHCP on everything case, this does present multiple default routes which by default can take out networking on the instance without some extra work on setting up routing policies. Is that something you've baked into the image with networkd?

I'm interested in seeing those changes if you've got that working as we'd like to have cloud-init emit configs that work with dhcp on multiple interfaces.

And lastly, if we fix the ephemeral DHCP to add the static route to the metadata server, then I don't think you'll need the apply_network_config ds_config setting, IIUC. Could you confirm?

Changed in cloud-init:
status: Triaged → Incomplete
Andy Botting (andybotting) wrote :

Hi Ryan.

So in both cases, DHCP delivers all the information the instances need, so we technically don't require any of that information from network_data.json. Then systemd-networkd just works.

I realise I forgot to mention that in our cloud-init config, to make this work we do set:

network:
  config: disabled

You are correct that two interfaces attached to an instance can break the routing. The usual case for us is that some instances will be attached to a 'data' network which has no default route, so this is fine.

In cases where a user would attach two interfaces from either the flat or a NAT'ted private network, then yes the routing doesn't work. We call this 'Advanced Networking' on our cloud, and it's mostly users who understand what they're doing so it's not been a problem.

I have considered possibly running eth0 with a lower metric than the other interfaces so they can at least get in via that interface if they get in trouble, but it's not been a problem for us.

If cloud-init did handle adding our static metadata route in the ephemeral dhcp, that would certainly fix our issue.

Thanks!

Ryan Harper (raharper) on 2019-06-05
Changed in cloud-init:
status: Incomplete → Triaged
Ben Raymond (benray12) wrote :

Andy, thanks for filing this. I believe I am running into it as well.

Have you been able to identify any workarounds at boot time?

thanks!
Ben

Andy Botting (andybotting) wrote :

Hi Ben,

Unfortunately not. We have been just waiting out the timeout.

Ryan,

After reading the docs again, I see this:

*Local stage*
none: network configuration can be disabled entirely with config like the following in /etc/cloud/cloud.cfg: ‘network: {config: disabled}’

Do you think the correct behavior in this stage would be to disable the ephemeral IPv4?

Ryan Harper (raharper) wrote :

Andy,

I discussed this particular bug with the team on Wednesday. The 'network: {config: disabled}' is designed to tell cloud-init to not create network-configuration; which is typically fallback (dhcp on an interface) or read the metadata from a platform for a richer config. It is not, "don't read cloud metadata over the network".

The EphemeralDHCP setup allows us to fetch more than just network-config, in fact it reads all of the Datasource metadata, including instance-id and one critical part related to setting hostname; which needs to be set (in some cases) prior to bringing networking up (even if cloud-init isn't generating the config).

The 'apply_network_config: false' config was also meant to only disable the rendering of the network-config to remain backwards compatible with how cloud-init (On OpenStack) behaved in Xenial.

So while either of these configs seem to imply that cloud-init could skip running the EphemeralDHCP setup during local time it actually means to not render the configuration; not avoid reading metadata altogether.

Our plan is to have EphemeralDHCP correctly apply the static routes in the response. This will prevent the timeout (IIUC) and let cloud-init read all of the metadata from the service; your local changes (network: config: disabled and apply_network_config: False) will then ensure that cloud-init won't generate network config, as requested.

Parsing the static routes isn't a huge lift so I hope we'll have a branch up quickly.

Andy Botting (andybotting) wrote :

Thanks Ryan, great explanation. I really appreciate you looking into this for me.

Let me know when you have something up and I can give it a test in our environment.

Ryan Harper (raharper) wrote :

Hi Andy,

I've put up a branch to handle classless static routes in DHCP responses. I've also published a test package here:

https://launchpad.net/~raharper/+archive/ubuntu/cloud-init-dev/+packages

I have it for bionic, but I can add Xenial or other Ubuntu releases if you can give that a test to see if it works to resolve the timeout.

If you could capture:

% ip a
% ip addr show
% ip route show

I'd like to confirm I'm adding the static routes correctly.

Typically, I'd take a running instance and install the newer cloud-init, then:

cloud-init clean --logs --reboot

which wipes instance data to make the filesystem look like it's booting as a new instance.

Then once it boots, cloud-init collect-logs and attach the tarball.

Andy Botting (andybotting) wrote :

Hi Ryan,

Apologies for just getting around to this now - completely forgot! Testing looks great - with no more timeout.

Pre-fix
--------------
Cloud-init v. 18.5-45-g3554ffe8-0ubuntu1~18.04.1 running 'init-local' at Thu, 13 Jun 2019 01:47:13 +0000. Up 3.00 seconds.
2019-06-13 01:48:44,083 - util.py[WARNING]: No active metadata service found
Cloud-init v. 18.5-45-g3554ffe8-0ubuntu1~18.04.1 running 'init' at Thu, 13 Jun 2019 01:48:46 +0000. Up 95.42 seconds.

Post-fix
--------------
Cloud-init v. 19.1-9-gd8ea5dca-1~bddeb~18.04.1 running 'init-local' at Thu, 13 Jun 2019 01:55:24 +0000. Up 2.27 seconds.
Cloud-init v. 19.1-9-gd8ea5dca-1~bddeb~18.04.1 running 'init' at Thu, 13 Jun 2019 01:55:29 +0000. Up 7.74 seconds.
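
To put numbers on the difference, the gap between the 'init-local' and 'init' banners can be computed from the timestamps in the two excerpts above. This is a small sketch using only the Python standard library; stage_gap_seconds is just an illustrative helper name.

```python
from email.utils import parsedate_to_datetime


def stage_gap_seconds(init_local: str, init: str) -> float:
    """Seconds elapsed between the two stage banners (RFC 2822 style dates)."""
    delta = parsedate_to_datetime(init) - parsedate_to_datetime(init_local)
    return delta.total_seconds()


# Timestamps copied from the pre-fix and post-fix banners.
pre_fix = stage_gap_seconds(
    "Thu, 13 Jun 2019 01:47:13 +0000", "Thu, 13 Jun 2019 01:48:46 +0000"
)
post_fix = stage_gap_seconds(
    "Thu, 13 Jun 2019 01:55:24 +0000", "Thu, 13 Jun 2019 01:55:29 +0000"
)
print(pre_fix, post_fix)  # 93.0 5.0
```

That is roughly a minute and a half spent waiting before the fix, versus about five seconds after it.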

The relevant log file suggests it works!

2019-06-13 01:55:24,096 - dhcp.py[DEBUG]: Performing a dhcp discovery on eth0
2019-06-13 01:55:24,096 - util.py[DEBUG]: Copying /sbin/dhclient to /var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhclient
2019-06-13 01:55:24,099 - util.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2019-06-13 01:55:24,107 - util.py[DEBUG]: Running command ['/var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhclient', '-1', '-v', '-lf', '/var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhcp.leases', '-pf', '/var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhclient.pid', 'eth0', '-sf', '/bin/true'] with allowed return codes [0] (shell=False, capture=True)
2019-06-13 01:55:24,180 - util.py[DEBUG]: All files appeared after 0 seconds: ['/var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhclient.pid', '/var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhcp.leases']
2019-06-13 01:55:24,180 - util.py[DEBUG]: Reading from /var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhclient.pid (quiet=False)
2019-06-13 01:55:24,180 - util.py[DEBUG]: Read 4 bytes from /var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhclient.pid
2019-06-13 01:55:24,180 - util.py[DEBUG]: Reading from /proc/448/stat (quiet=True)
2019-06-13 01:55:24,180 - util.py[DEBUG]: Read 297 bytes from /proc/448/stat
2019-06-13 01:55:24,180 - dhcp.py[DEBUG]: killing dhclient with pid=448
2019-06-13 01:55:24,181 - util.py[DEBUG]: Reading from /var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhcp.leases (quiet=False)
2019-06-13 01:55:24,181 - util.py[DEBUG]: Read 704 bytes from /var/tmp/cloud-init/cloud-init-dhcp-ff8b_366/dhcp.leases
2019-06-13 01:55:24,182 - dhcp.py[DEBUG]: Received dhcp lease on eth0 for 130.56.249.206/255.255.240.0
2019-06-13 01:55:24,182 - __init__.py[DEBUG]: Attempting setup of ephemeral network on eth0 with 130.56.249.206/20 brd 130.56.255.255
2019-06-13 01:55:24,182 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '130.56.249.206/20', 'broadcast', '130.56.255.255', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2019-06-13 01:55:24,185 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2019-06-13 01:55:24,187 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '169.254.169.254', 'via', '130.56.248.255', 'dev', 'eth0'] with allowed retu...
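
The last command in the excerpt is the new part: a host route to the metadata server via the gateway from the lease. As a rough sketch of how such commands could be derived from (destination, gateway) pairs: route_commands is a hypothetical name, and treating a 0.0.0.0 gateway as an on-link route is an assumption based on RFC 3442 rather than something taken from the cloud-init branch.

```python
from typing import List, Tuple


def route_commands(
    routes: List[Tuple[str, str]], interface: str
) -> List[List[str]]:
    """Build one `ip -4 route add` argv list per (destination, gateway) pair."""
    commands = []
    for destination, gateway in routes:
        cmd = ["ip", "-4", "route", "add", destination]
        # Assumption: a 0.0.0.0 gateway means the destination is on-link,
        # so no "via" hop is added.
        if gateway != "0.0.0.0":
            cmd += ["via", gateway]
        cmd += ["dev", interface]
        commands.append(cmd)
    return commands


for argv in route_commands([("169.254.169.254/32", "130.56.248.255")], "eth0"):
    print(argv)
```

The log above passes the bare address 169.254.169.254; for a host route that is equivalent to the /32 form used in this sketch.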

Ryan Harper (raharper) wrote :

\o/

I would like to see the ip output for additional confirmation if you don't mind.

And I think we still need to bump the timeouts; you suggested that the default url timeouts from DatasourceEC2 are more realistic, correct?

I may file a separate bug for that since this bug covered the initial networking setup. Do you have logs for those timeouts that we could attach to the new bug?

Ryan Harper (raharper) wrote :

I've updated the branch with some unittests and fixes (for other routes besides a /32).

Same ppa:raharper/cloud-init-dev

cloud-init_19.1-12-gb5a47081-1~bddeb~18.04.1

Thanks for filing the bug and testing!

Ryan Harper (raharper) wrote :

@Andy any chance you could give the updated cloud-init package a test?

Andy Botting (andybotting) wrote :

Thanks for the reminder! Oops.

I've just tested your new build cloud-init_19.1-12-gb5a47081-1~bddeb~18.04.1 which seemed to work perfectly.

Also, here's the networking details you asked for:

ubuntu@test-cloudinit:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:6f:f3:84 brd ff:ff:ff:ff:ff:ff
    inet 130.56.249.206/20 brd 130.56.255.255 scope global dynamic eth0
       valid_lft 1814279sec preferred_lft 1814279sec
    inet6 fe80::f816:3eff:fe6f:f384/64 scope link
       valid_lft forever preferred_lft forever

ubuntu@test-cloudinit:~$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:6f:f3:84 brd ff:ff:ff:ff:ff:ff
    inet 130.56.249.206/20 brd 130.56.255.255 scope global dynamic eth0
       valid_lft 1814274sec preferred_lft 1814274sec
    inet6 fe80::f816:3eff:fe6f:f384/64 scope link
       valid_lft forever preferred_lft forever

ubuntu@test-cloudinit:~$ ip route show
default via 130.56.240.1 dev eth0 proto dhcp metric 1024
130.56.240.0/20 dev eth0 proto kernel scope link src 130.56.249.206
169.254.169.254 via 130.56.248.255 dev eth0 proto dhcp metric 1024

Attaching debug logs too.

Thanks!

Andy Botting (andybotting) wrote :

> And I think we still need to bump the timeouts; you suggested that the default url timeouts from DatasourceEC2 are more realistic, correct?

So in our images we currently have these values set:

datasource_list:
  - ConfigDrive
  - OpenStack

datasource:
 OpenStack:
  max_wait: 90
  timeout: 30
  retries: 3

This initially came about because some users were reporting that their SSH host keys were changing after a reboot.

What happened was their instance would initially boot and pick up the OpenStack data source (with default cloud-init config) and get provisioned OK. Some time later they'd reboot and the metadata server wouldn't respond as quickly and so cloud-init would fall back to the EC2 data source.

This would result in their 'instance id' switching from an OpenStack UUID to an EC2 i-xxxxxxx format one, and cloud-init would think it's a different instance and reprovision.

The timeouts aren't normally a problem, but they can stretch out when we're having message queue issues. Our metrics show over the last 30 days our max was 13 secs, so we should probably revisit the values we have set and drop them.
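
For readers unfamiliar with these options: roughly, max_wait bounds the total time spent looking for a responding metadata URL, timeout bounds each individual request, and retries applies to requests once a URL has been selected. The sketch below illustrates only the first two under those assumptions. It is not cloud-init's code; wait_for_metadata, probe and interval are hypothetical names.

```python
import time
from typing import Callable, Optional, Sequence


def wait_for_metadata(
    urls: Sequence[str],
    probe: Callable[[str, float], bool],
    max_wait: float,
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the first URL that answers, or None once max_wait has elapsed.

    probe(url, timeout) is a caller-supplied (hypothetical) function that
    performs one request bounded by `timeout` seconds and reports success.
    """
    deadline = clock() + max_wait
    while True:
        for url in urls:
            if probe(url, timeout):
                return url
        # Give up if another pause would take us past the overall budget.
        if clock() + interval > deadline:
            return None
        sleep(interval)
```

In this sketch a request started just before the deadline can still run for up to timeout seconds, so with max_wait: 90 and timeout: 30 the worst case is somewhat longer than 90 seconds.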

Ryan Harper (raharper) wrote :

Thanks for verifying!

For the timeout, I'll start a new bug and we can discuss changes there.

W.r.t the datasource change; that really shouldn't happen; at least on images using cloud-init's ds-identify.

However, since you're hard-coding the datasource_list, this is going to disable the detection.

OpenStack identifies itself via platform metadata (DMI tables), so cloud-init will detect this value and set the datasource to OpenStack (or config-drive via attached devices' filesystem labels).

EC2 also identifies itself and you'd never see an OpenStack cloud instance be confused for Ec2 resulting in different boots.

It may be worth revisiting your images to see if you can rely on cloud-init's ds-identify (called through the systemd-generator we provide).

Andy Botting (andybotting) wrote :

Hi Ryan,

Just built a Debian 10 image and am seeing this issue again. Just checking in to see if there's a release yet with the fix?

Ryan Harper (raharper) wrote :

On Wed, Jul 10, 2019 at 6:20 PM Andy Botting <email address hidden> wrote:

> Hi Ryan,
>
> Just built a Debian 10 image and am seeing this issue again. Just
> checking in to see if there's a release yet with the fix?
>

The branch has not yet landed. Once it does then the next
cloud-init SRU will release this back through Xenial. We'll
also cut a 19.2 cloud-init release which would include the fix.

Andy Botting (andybotting) wrote :

> The branch has not yet landed. Once it does then the next
> cloud-init SRU will release this back through Xenial. We'll
> also cut a 19.2 cloud-init release which would include the fix.
>

Thanks Ryan.

This bug is fixed with commit 07b17236 to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=07b17236

Changed in cloud-init:
status: Triaged → Fix Committed

This bug is believed to be fixed in cloud-init in version 19.2. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released