Crashes in convert_ec2_metadata_network_config on ScalingStack bos01 (ppc64el)

Bug #1715128 reported by Iain Lane on 2017-09-05
This bug affects 2 people
Affects: cloud-init (Ubuntu)
Importance: Critical
Assigned to: Unassigned

Bug Description

This is all on 0.7.9-259-g7e76c57b-0ubuntu1 in artful.

cloud-init is currently crashing on our ScalingStack ppc64el images as used for autopkgtest. It's bad for us, as we set up proxies in /etc/environment using userdata and this isn't written because of the crash.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 638, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 357, in main_init
    init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 635, in apply_network_config
    netcfg, src = self._find_networking_config()
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 622, in _find_networking_config
    if self.datasource and hasattr(self.datasource, 'network_config'):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py", line 297, in network_config
    self.metadata)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py", line 461, in convert_ec2_metadata_network_config
    macs_metadata = metadata['network']['interfaces']['macs']
KeyError: 'network'

'metadata' is:

{'block-device-mapping': {'ami': 'vda', 'root': '/dev/vda'}, 'placement': {'availability-zone': 'nova'}, 'public-keys': {'testbed-juju-prod-ues-proposed-migration-machine-3': ['ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBJ2WCstti4DGBSSY+QleVTmqlGNfMPjzSoaCvbC4fk6sb4KvrCjNO+Hn8F2Qaw0j2uUVgwDwlcIgVh2pawyenR01hDqBEuHtH/l6VZsVYsiO83WL5HE8Iwy0i+S2W3saaZq38F19mKNrZ5obTJeRK21t84PN4nE107WTnXSj+o6DXrw4dlGPwRlejKmuhiKIn7WEQSve6fyzDXJT/F70SSUW+NqzpffGmLs2DandlJLxKKcAhLxVSZuEaLF1dY6rDPgwejd+vF64X8YVcRBaLXTQs+iWq5BKxpCfL9NsxNfYOD0vxNmENfvHgX4PXxRWefhavIMQhRs2sY967kmLR ubuntu@juju-prod-ues-proposed-migration-machine-3']}, 'ami-id': 'ami-00002bbc', 'ami-launch-index': '0', 'ami-manifest-path': 'FIXME', 'hostname': 'laney.novalocal', 'instance-action': 'none', 'instance-id': 'i-00225797', 'instance-type': 'autopkgtest', 'kernel-id': 'None', 'local-hostname': 'laney.novalocal', 'local-ipv4': '10.43.46.163', 'public-hostname': 'laney.novalocal', 'public-ipv4': '', 'ramdisk-id': 'None', 'reservation-id': 'r-lrjn99ln', 'security-groups': ''}
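The KeyError follows directly from the dictionary above: it has no 'network' key, so the chained lookup in convert_ec2_metadata_network_config fails at the first subscript. A minimal sketch of a defensive lookup is shown below, only to illustrate the failure mode. The helper name macs_from_metadata, the trimmed metadata, and the example MAC address are hypothetical, and this is not the change that was released for this bug.

```python
def macs_from_metadata(metadata):
    """Return the per-MAC metadata dict, or None if the 'network' tree is absent.

    Hypothetical helper for illustration; it is not part of cloud-init.
    """
    try:
        return metadata['network']['interfaces']['macs']
    except (KeyError, TypeError):
        # Nova's EC2-compatible metadata (as on bos01) has no 'network' key.
        return None


# Trimmed copy of the bos01 metadata from this report.
openstack_md = {'instance-id': 'i-00225797', 'local-ipv4': '10.43.46.163'}

# Made-up example of metadata that does carry the 'network' tree.
ec2_md = {
    'network': {
        'interfaces': {
            'macs': {'02:00:00:00:00:01': {'device-number': '0'}},
        },
    },
}

print(macs_from_metadata(openstack_md))   # None
print(sorted(macs_from_metadata(ec2_md)))  # ['02:00:00:00:00:01']
```

With a guard like this the caller could skip network configuration instead of aborting the whole init stage, which is what prevented the userdata from being applied here.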

Note that this does *not* happen on lcy01 or lgw01. In cloud-init-output.log for bos01 (but not lcy01 or lgw01) I have:

2017-09-05 03:00:46,477 - warnings.py[WARNING]: **************************************************************************
# A new feature in cloud-init identified possible datasources for #
# this system as: #
# [] #
# However, the datasource used was: Ec2 #
# #
# In the future, cloud-init will only attempt to use datasources that #
# are identified or specifically configured. #
# For more information see #
# https://bugs.launchpad.net/bugs/1669675 #
# #
# If you are seeing this message, please file a bug against #
# cloud-init at #
# https://bugs.launchpad.net/cloud-init/+filebug?field.tags=dsid #
# Make sure to include the cloud provider your instance is #
# running on. #
# #
# After you have filed a bug, you can disable this warning by launching #
# your instance with the cloud-config below, or putting that content #
# into /etc/cloud/cloud.cfg.d/99-warnings.cfg #
# #
# #cloud-config #
# warnings: #
# dsid_missing_source: off #
**************************************************************************

I don't know if that's relevant.

Iain Lane (laney) wrote :

Ok, I found the commit that introduced this code.

https://git.launchpad.net/cloud-init/commit/?id=3c45330af2a301f2bf219da556844d01cef6778e

Chad, please can you help us out here?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu):
status: New → Confirmed

I came here while tracking a failure in the vagrant-mutate dep8 tests that blocks qemu from migrating.
This effectively blocks every artful package migration that needs access through the proxies, so I am setting the priority to Critical. Since Artful is not released yet, I'm not setting update-regression.

Changed in cloud-init (Ubuntu):
importance: Undecided → Critical

Iain Lane (laney) wrote :

cpaelzer asked for these logs - bos01 broken, lgw01 working.

Copy from IRC:
[10:03] <Laney> cpaelzer: it's probably https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1715128
[10:04] <ubottu> Launchpad bug 1715128 in cloud-init (Ubuntu) "Crashes in convert_ec2_metadata_network_config on ScalingStack bos01 (ppc64el)" [Undecided,New]
[10:20] <cpaelzer> I opened bug 1715555 for what I saw, but the bug you listed makes sense to all I've found
[10:20] <ubottu> bug 1715555 in Auto Package Testing "ppc64el vagrant-mutate tests fail to access atlas.hashicorp.com" [Undecided,New] https://launchpad.net/bugs/1715555
[10:20] <cpaelzer> my issue boils down to proxies missing and the bug you referred is reporting just that as an issue
[10:21] <cpaelzer> I'm dupping mine - thanks for the info
[10:27] <cpaelzer> Laney: do we fake the EC2 datasource in our scalingstack?
[10:27] <cpaelzer> to provide the config data to cloud-init
[10:28] <Laney> I don't know how that stuff works
[10:28] <Laney> it's more or less the normal cloud images if that helps
[10:28] <cpaelzer> that is only the consumer
[10:28] <cpaelzer> how the config is passed is the interesting part in this case
[10:28] <cpaelzer> Laney: it would be great if you could add the full boot log containing all cloud-init output to the bug
[10:29] <wgrant> bos01 is a pretty standard kilo AFAIK
[10:29] <wgrant> We didn't do anything specially EC2ish while deploying it
[10:29] <cpaelzer> so it should be detected as openstack datasources then, which it is not
[10:29] <wgrant> How does cloud-init actually detect the datasource, though?
[10:30] <cpaelzer> Laney: if you could attach the output from the failing ppc as well as the working lcy01 or lgw01 that would be great
[10:30] <wgrant> If it uses DMI or something that might just not work on ppc64el with our relatively old setup
[10:31] <cpaelzer> wgrant: Laney: https://git.launchpad.net/cloud-init/tree/tools/ds-identify
[10:31] <wgrant> I have a religious objection to 1300 line shell scripts.
[10:32] <wgrant> Ah yes, lots of DMI bits in there.
[10:34] <Laney> https://bugs.launchpad.net/cloud-init/+bug/1715241?
[10:34] <ubottu> Launchpad bug 1715241 in cloud-init "ds-identify openstack returns not found on non-intel" [Medium,Confirmed]
[10:34] <cpaelzer> Laney: if you could provide those logs I'm sure smoser and rharper will find that most interesting later on and then has all the data to start finding what is wrong in the detection
[10:34] <wgrant> Aha
[10:34] <cpaelzer> It shoudl still find an openstack config drive no matter of DMI in your case
[10:34] <cpaelzer> (assumption)
[10:35] <wgrant> I have never seen an OpenStack that used configdrive.
[10:35] <wgrant> Oh you can do it when creating the VM, I see.
[10:35] <cpaelzer> which brings me back to my former question what you currently use to pass cloud init config info?
[10:35] <wgrant> Still never seen it used :)
[10:36] <wgrant> It uses the normal Nova metadata HTTP service
[10:36] <cpaelzer> thanks
[10:36] <cpaelzer> I'll copy this chat to the bug, together with the logs Laney can provide this should make some progress when US wakes up later

Further discussion:

[11:11] <apw> cpaelzer, if this is going to take days to fix, can we not revert the whole of the cloud-init update somehow
[11:11] <apw> cpaelzer, given it is cratering testing
[11:11] <cpaelzer> apw: absolutely
[11:12] <cpaelzer> apw: I planned to jump into the cloud-init standup today to dsicuss options
[11:12] <cpaelzer> apw: a CURVER+reallyOLDVER might be a temporary fix in this case
[11:12] <cpaelzer> but since this is around data source identification in scaling stack we have to understand what really is going on)
[11:13] <wgrant> Well we know now
[11:13] <cpaelzer> my current theory is that scaling stack tries to mimic ec2, but doesn't do perfectly (a common case btw)
[11:13] <wgrant> https://bugs.launchpad.net/cloud-init/+bug/1715241
[11:13] <ubot5`> Ubuntu bug 1715241 in cloud-init "ds-identify openstack returns not found on non-intel" [Medium,Confirmed]
[11:13] <wgrant> ds-identify simply doesn't work outside x86.
[11:13] <wgrant> Since it relies on DMI
[11:14] <cpaelzer> IM-veryH-o I'm not 100% convinced that is the issue affecting the current case
[11:14] <apw> they use dmi ... openstack "for the win"
[11:14] <cpaelzer> hmm, yeah you might be right ont hat bug actually
[11:15] <cpaelzer> yet I wonder why this should now be different than before
[11:15] <cpaelzer> anyway smoser and rharper will ahve all the current context later today
[11:16] <wgrant> Becaise cloud-init used to poll all datasources
[11:16] <cpaelzer> Maybe it comes together with fail-detect EC2 + new ipv6 code that can fault if no EC2 data is actually provided
[11:16] <wgrant> there was work this cycle to speed up cloud-init by using ds-identify to skip irrelevant sourcea
[11:17] <cpaelzer> The polling -> detect change happened a while ago, that is why I wonder the recent update breaking it in this case
[11:17] <cpaelzer> I'm rather up to date on the ds-identify move
[11:17] <cpaelzer> just thought that would have been in artful for a while now, which is why I expect it to be partially bad for a while but e.g. the ipv6 code added now making this a hard fail
[11:18] <wgrant> it is either that it has just switched to EC2, or EC2 used to work but doesn't any more
[11:19] <wgrant> but it *should* be using OpenStack on that cloud, so fixing ds-identify would probably work
[11:19] <cpaelzer> ack
[11:19] <cpaelzer> I think copying that discussion context to the current bug is worth too

Per the discussion above, please evaluate the logs provided by Laney and decide whether this is an instance of bug 1715241.
If so, depending on your planned release timing, would a hot-fix back to the last working version make sense?

[13:38] <xnox> apw, awwww... /o\ what cloud-init regression?
[13:38] <xnox> is there a bug number? do you add bug number when pushing hints?
[13:39] <xnox> cpaelzer, we are currently undergoing investigations in scaling stack failing to talk to the metadata sources. and cloud-init was mistakenly identifying and (using/abusing) ec2 data-source on scalingstack on non-intel architectures since introduction of ds-identify it seems, and we are inflight fixing that.
[13:40] <xnox> and separately, I'm currently seeing regression of not being able to talk to the openstack native datasource, on intel, in scalingstack =/
[13:42] <wgrant> xnox: Hm, really? Do you have an example instance?
[13:42] <wgrant> Which network is it on?
[13:42] <wgrant> And which cloud?
[13:42] <xnox> wgrant, yes i have all that. One sec.
[13:43] <xnox> &> some other channel

Ryan Harper (raharper) wrote :

It appears that Ec2Local Datasource gets searched before OpenStack (network mode) does and is selected.

EC2Local should not enable itself without a positive identification via UUID (EC2 strict mode, IIRC). That is, if we're in a MAYBE mode, where we are searching all datasources, we can't select a datasource before we've finished searching.

Additionally, during the metadata crawl, EC2 will have the 'network' metadata space defined, whereas OpenStack does not.

A workaround is to use ConfigDrive when launching OpenStack instances.

Scott Moser (smoser) wrote :

Just commenting to myself as it confused me.
The warning above shows:

  # A new feature in cloud-init identified possible datasources for #
  # this system as: #
  # [] #
  # However, the datasource used was: Ec2 #

I was confused that ds-identify identified nothing but cloud-init
still ran. This is because the default policy on non-intel is
  search,found=all,maybe=all,notfound=${DI_ENABLED}

which means "let cloud-init search if you found nothing."

So it is actually behaving as designed. On intel, though, it would
just disable cloud-init entirely (but on intel Nova is identified).
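
To make the shape of that policy string easier to read, here is a rough sketch that splits it into a mode plus key=value settings. It is only an illustration: ds-identify itself is a shell script and does not work this way internally, the function name parse_policy is hypothetical, and substituting "enabled" for ${DI_ENABLED} is an assumption made for the example.

```python
def parse_policy(policy, di_enabled="enabled"):
    """Split a ds-identify style policy string into a dict.

    Hypothetical helper; 'di_enabled' stands in for the shell
    variable ${DI_ENABLED}, whose real value is assumed here.
    """
    policy = policy.replace("${DI_ENABLED}", di_enabled)
    mode, *pairs = policy.split(",")
    settings = {"mode": mode}
    for pair in pairs:
        key, _, value = pair.partition("=")
        settings[key] = value
    return settings


parsed = parse_policy("search,found=all,maybe=all,notfound=${DI_ENABLED}")
for key, value in parsed.items():
    print(f"{key}: {value}")
```

The 'notfound' entry is the one that matters for this bug: when nothing is identified, as in the "[]" shown in the warning, cloud-init still runs and searches.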

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.9-267-g922c3c5c-0ubuntu1

---------------
cloud-init (0.7.9-267-g922c3c5c-0ubuntu1) artful; urgency=medium

  * New upstream snapshot.
    - Ec2: only attempt to operate at local mode on known platforms.
      (LP: #1715128)
    - Use /run/cloud-init for tempfile operations. (LP: #1707222)
    - ds-identify: Make OpenStack return maybe on arch other than intel.
      (LP: #1715241)
    - tests: mock missed openstack metadata uri network_data.json
      [Chad Smith] (LP: #1714376)
    - relocate tests/unittests/helpers.py to cloudinit/tests
      [Lars Kellogg-Stedman]
    - tox: add nose timer output [Joshua Powers]
    - upstart: do not package upstart jobs, drop ubuntu-init-switch module.
    - tests: Stop leaking calls through unmocked metadata addresses
      [Chad Smith] (LP: #1714117)

 -- Scott Moser <email address hidden> Thu, 07 Sep 2017 16:59:04 -0400

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released