cloud-init regenerating ssh-keys

Bug #1885527 reported by Hadmut Danisch on 2020-06-29
This bug affects 3 people
Affects              Status        Importance  Assigned to
cloud-init           Fix Released  Medium      Markus Schade
cloud-init (Ubuntu)  Fix Released  Undecided   Markus Schade

Bug Description

Hi,

I ran some experiments with virtual machines running Ubuntu 20.04 at a German cloud provider (Hetzner), which uses cloud-init to initialize machines with a basic setup such as IP and SSH access.

During my installation tests I had to reboot the virtual machines several times after installing or removing packages.

Occasionally (not always) I noticed that the ssh host keys had changed; ssh complained. After accepting the new host keys (insecure!) I found that all key files in /etc/ssh had fresh modification times, i.e. were freshly regenerated.

This reminds me of a bug I reported against cloud-init some time ago: I could not change the host name permanently, because cloud-init reset it to its initial configuration at every boot. That is highly dangerous, because it seemed to reset passwords to their original state as well.

Although cloud-init is intended to do an initial configuration for the first boot only, it seems to remain on the system and – even worse: occasionally – change configurations.

I've never understood the purpose of cloud-init remaining active once the machine is up and running.

Hadmut Danisch (hadmut) wrote :

BTW,

docs at https://cloudinit.readthedocs.io completely fail to tell what cloud-init actually is or is supposed to do.

They do not explain that, or why, cloud-init survives the first boot and remains active for future boots, or what this is good for.

There is no warning, no hint, no information that cloud-init keeps continuously twiddling with the system.

Scott Moser (smoser) wrote :

Hi, please attach the output of 'cloud-init collect-logs' when run on a system that demonstrates the problem.

cloud-init uses the "instance-id" from the metadata service to indicate a new instance. Some things run once per instance, some things run once per boot.

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Scott Moser (smoser) wrote :

After replying with the collected logs, please set the status back to 'New'.
Thanks for taking the time to file a bug.

Valery Tschopp (valery-tschopp) wrote :

We have a similar issue:

1. At first boot cloud-init generated the host ssh keys
2. The metadata service crashed and went down :(
3. At reboot, cloud-init can NOT reach the metadata service and regenerates the host ssh keys

Hadmut Danisch (hadmut) wrote :

I currently cannot provide logs, since these were only temporary test machines in a cloud that existed for only tens of minutes to test installation procedures. I will supply logs as soon as I proceed with testing and the problem occurs again.

However, I do not understand and did not find any documentation about why cloud-init even remains active after first boot.

Descriptions like https://help.ubuntu.com/community/CloudInit or https://cloudinit.readthedocs.io/en/latest/ are just misleading, as they suggest that this is only about the initialization of the machine. They don't say that cloud-init remains active and keeps manipulating the system.

I found this to be a severe security issue (which I reported in an earlier bug report for 18.04) when I could not permanently change the hostname of a machine, since cloud-init was resetting it with every reboot, and the file where this was stored was hidden deep somewhere in /var. I'm afraid I cannot even change a password, since cloud-init might reset it to its initial state.

I do consider it as a serious flaw and security problem just that cloud-init is behaving very differently from what's described in the documentation.

AGAIN: Why is cloud-init still manipulating the machine *after* initialization and first boot?

Changed in cloud-init (Ubuntu):
status: Incomplete → New
Scott Moser (smoser) wrote :

"AGAIN: Why is cloud-init still manipulating the machine *after* initialization and first boot?"

Because cloud-init thinks it is a "first boot". A supported use case for cloud-init is:
 * boot instance on cloud
 * ssh in
 * install some packages, prep this instance
 * stop instance
 * snapshot disk
 * register new image from disk
 * start new instances from this image

cloud-init will recognize that these instances are new instances, and initialize them. It recognizes this by comparing the cached value of 'instance-id' versus the current value of 'instance-id'. If they have changed, then you have a new instance.
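The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not cloud-init's actual code: the function names, the cache file location and the example IDs are all made up for the example.

```python
import tempfile
from pathlib import Path


def is_new_instance(cache_file: Path, current_id: str) -> bool:
    """Return True if no instance-id is cached or the cached one differs."""
    try:
        cached_id = cache_file.read_text().strip()
    except FileNotFoundError:
        return True  # nothing cached yet: treat as a first boot
    return cached_id != current_id


def boot(cache_file: Path, current_id: str) -> str:
    """Run per-instance setup only for a new instance, then update the cache."""
    if is_new_instance(cache_file, current_id):
        cache_file.write_text(current_id + "\n")
        return "per-instance setup (e.g. generate SSH host keys)"
    return "per-boot tasks only"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "instance-id"   # hypothetical cache location
    first = boot(cache, "i-0001")       # first boot of the instance
    reboot = boot(cache, "i-0001")      # same ID: nothing is regenerated
    cloned = boot(cache, "i-0002")      # image started as a new instance
    print(first, reboot, cloned, sep="\n")
```

Anything that makes the "current" ID look different from the cached one, including a lookup that fails, ends up on the per-instance branch.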

The other reason for cloud-init to "remain active" is that it offers "per-boot" things.

Valery Tschopp (valery-tschopp) wrote :

We have no problem about cloud-init still being active on the machine after the first boot init.

My issue is:

On first instance boot, the metadata service is successfully contacted, and the initialisation succeeds (host ssh key generated, hostname set, ...)

But on any reboot, if the metadata service is DOWN or not reachable for any reason, then cloud-init regenerates the host ssh keys.

My understanding is that determining if it is a first boot, or not, is only based on the cached instance-id, compared to the data received from the metadata service. So if the metadata service is DOWN or not reachable, cloud-init will always think it is a first boot, right?

Isn't it possible to make this test more robust?

Scott Moser (smoser) wrote :

@Valery,

Some cloud platforms provide the instance id via some non-network channel (dmi data is common). In those cases, cloud-init will check cached value versus the locally-available instance-id before looking for a network available datasource.

So, if Hetzner provides that information in some way, cloud-init can use it.

If not, the only options are for the user to disable cloud-init (touch /etc/cloud/cloud-init.disabled) or set manual_cache_clean (https://bugs.launchpad.net/cloud-init/+bug/1712680/comments/11).

I'm not really sold on "what if the metadata service is DOWN" argument. Your cloud should not have its important services just fail. If it does, things are going to break. You could make a similar argument "What if DNS server is down?". I'm not discounting "Design for failure", and cloud-init could definitely do better here, but we need some support from the platform (locally available instance-id) to do better without sacrificing design goals.

Dan Watkins (oddbloke) wrote :

If Hetzner has (or starts to provide) a way of determining instance ID without using the network, we'd be more than happy to accept patches to use that in cloud-init. However, as it sounds like the issue here is Hetzner's internal services being unreliable, rather than a cloud-init issue, I'm going to mark this Incomplete. If you think this is unreasonable, please comment and change the status back to New.

Thanks!

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Hadmut Danisch (hadmut) on 2020-07-07
Changed in cloud-init (Ubuntu):
status: Incomplete → New
Hadmut Danisch (hadmut) wrote :

Yes, I do think this is unreasonable.

It is definitely not Hetzner's task to fix Ubuntu.

Especially since that process of re-initialization of that instance ID is neither obvious nor documented.

Looking at

https://cloudinit.readthedocs.io/

I did not yet find an explanation of what is going on, and at

https://cloudinit.readthedocs.io/en/latest/topics/instancedata.html

it just says

v1.instance_id

Unique instance_id allocated by the cloud.

Examples output:

    i-<hash>

but does not give a hint that this is to be constantly provided by some internal service.

On the contrary, it says

"Cloud-init is the industry standard multi-distribution method for cross-platform cloud instance initialization. It is supported across all major public cloud providers, provisioning systems for private cloud infrastructure, and bare-metal installations."

It says „instance initialization”. It does not say that it keeps modifying the living instance.

So this is undocumented behaviour, and I am more and more thinking about the question, whether this is a backdoor.

Dan Watkins (oddbloke) wrote :

> It is definitely not Hetzner's task to fix Ubuntu.

To be clear, cloud-init is not used only on Ubuntu; I believe that Hetzner's outage would have this effect across the majority of Linux distributions.

And, that aside, I don't think this characterisation is fair: we're suggesting that if Hetzner are going to allow their internal services to go down, then they should provide a more reliable way for instances to determine their identity. (This is generally done via DMI in other clouds that do it. The hypervisor stores the instance ID and provides it as a DMI value, and obviously instances can only boot if the hypervisor is up; therefore, the instance ID is always available.) To state this more glibly (and therefore less helpfully): it is not cloud-init's task to fix Hetzner.

That said, perhaps there is something that the Hetzner data source could do to handle this Hetzner-specific case. We already perform 60 retries with a 2 second wait between them, and a 2 second timeout. So we allow at least 2 minutes for the services to respond with something before we give up; we could bump that but I don't think it addresses the underlying issue. Any thoughts would be appreciated.
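As a rough check of the "at least 2 minutes" figure, the arithmetic can be written out. This is a back-of-the-envelope sketch, assuming one wait per attempt; the function name is made up for the example and the real retry loop may count waits differently.

```python
from typing import Tuple


def retry_budget(attempts: int, wait_s: float, timeout_s: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds in seconds for a fixed-wait retry loop.

    Assumes one wait per attempt. The lower bound is reached when every
    request fails immediately, the upper bound when every request runs
    into its timeout.
    """
    lower = attempts * wait_s
    upper = attempts * (wait_s + timeout_s)
    return lower, upper


lower, upper = retry_budget(attempts=60, wait_s=2, timeout_s=2)
print(f"between {lower / 60:.0f} and {upper / 60:.0f} minutes")
```

Under that assumption the window is 120 seconds when requests fail fast and up to 240 seconds when each one times out, so an outage longer than a few minutes is not covered by retrying alone.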

Alternatively (or perhaps additionally), this may need a change in the instance ID model that cloud-init uses to handle an explicit "we are not currently able to determine instance ID, so assume it hasn't changed". I think, however, that this would lead to a converse problem: instances launched from instance-capture images which boot for the first time during an "instance ID outage" would not detect that they were new instances, and so would not perform their first boot customisation. This would result in potentially-inaccessible instances (if any credentials remaining in the image are not available to the user launching instances) with SSH host keys not rotated (meaning that they would all have the same host keys as the image; a security issue). Of course, if users are also relying on their cloud-init user-data to perform any actions, that also won't occur; depending on their threat model, some users might also consider this a security issue.

The ultimate problem is that cloud-init cannot determine when it runs within an instance whether or not this is a "first boot": the cloud needs to indicate to us one way or the other, which is done via instance ID. If the cloud cannot do that, then there is no way to determine the correct behaviour.
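The dilemma can be made concrete with a small sketch. It is a simplified model for illustration, not cloud-init's logic: the function, the policy flag and the IDs are hypothetical, and `None` stands for "the cloud could not be asked".

```python
from typing import Optional


def is_first_boot(cached_id: Optional[str],
                  current_id: Optional[str],
                  assume_unchanged_when_unknown: bool) -> bool:
    """Decide whether to run first-boot setup.

    current_id is None when the platform cannot be asked (e.g. an outage).
    """
    if current_id is None:
        # The cloud gave no answer, so the policy flag has to guess.
        if cached_id is not None and assume_unchanged_when_unknown:
            return False
        return True
    return cached_id != current_id


# Case 1: an existing instance reboots during an outage.
reboot_strict = is_first_boot("i-0001", None, assume_unchanged_when_unknown=False)
reboot_relaxed = is_first_boot("i-0001", None, assume_unchanged_when_unknown=True)

# Case 2: a fresh instance launched from a captured image boots during an
# outage; the image still carries the old cached ID "i-0001".
clone_strict = is_first_boot("i-0001", None, assume_unchanged_when_unknown=False)
clone_relaxed = is_first_boot("i-0001", None, assume_unchanged_when_unknown=True)

print(reboot_strict, reboot_relaxed, clone_strict, clone_relaxed)
```

Both cases pass identical inputs, so no policy can be right for both: the strict setting regenerates host keys on a plain reboot, while the relaxed one skips key rotation on a cloned image.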

If you are certain that you will never be capturing instances as images (i.e. you can categorically say that the root filesystem in this instance will _never_ first boot again) and you aren't using any of cloud-init's functionality after first boot (e.g. per-boot scripts), then you can disable cloud-init in the ways described by Scott earlier in this bug.

One convenience we could potentially provide: if cloud-init had a way for image creators to express "when next launched, cloud-init should treat that instance ID as immutable and permanent" (in a way that could be undone on subsequent boots, if a user wants to "unfreeze" an instance for image capture) then we might be able to avoid some of this pain, but that idea would need...


Paride Legovini (paride) on 2020-07-24
Changed in cloud-init (Ubuntu):
status: New → Incomplete
Scott Moser (smoser) wrote :

@Daniel,

> One convenience we could potentially provide: if cloud-init had a way
> for image creators to express "when next launched, cloud-init should
> treat that instance ID as immutable and permanent" (in a way that could
> be undone on subsequent boots, if a user wants to "unfreeze" an instance
> for image capture) then we might be able to avoid some of this pain, but
> that idea would need more fleshing out before it's clear if it even
> makes sense.

I think you're basically describing "manual_cache_clean".

The intent (testing is needed) for manual_cache_clean is that

a.) user-data and system config (/etc/cloud/*.cfg) can set
manual_cache_clean to true or false. As always user-data overrides system
config. vendor-data should also be able to provide the setting.

b.) cloud-init renders /var/lib/cloud/instance/manual-clean
(path_helper.get_ipath_cur("manual_clean_marker")) if

c.) on boot, both ds-identify and cloud-init will check
and respect existence of /var/lib/cloud/instance/manual-clean

So... "unfreeze", if manual_cache_clean was set is just:
 rm -Rf /var/lib/cloud/instance /var/lib/cloud/instance/

I think it would be good to both test that my intent/understanding are
correct, and document it.
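For illustration, the marker check in step c.) might look roughly like this. It is a sketch of the stated intent only, not the real ds-identify or cloud-init code: the function name is made up, and a temporary directory stands in for /var/lib/cloud/instance.

```python
import tempfile
from pathlib import Path


def cache_is_frozen(instance_dir: Path) -> bool:
    """Return True if the manual-clean marker exists in the instance dir."""
    return (instance_dir / "manual-clean").exists()


with tempfile.TemporaryDirectory() as tmp:
    instance_dir = Path(tmp) / "instance"    # stand-in for /var/lib/cloud/instance
    instance_dir.mkdir()
    before = cache_is_frozen(instance_dir)
    (instance_dir / "manual-clean").touch()  # what step b.) is meant to render
    after = cache_is_frozen(instance_dir)
    print(before, after)
```

With the marker present, the cached instance data would be trusted until someone removes it by hand, which is why "unfreezing" comes down to deleting the directory.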

Scott Moser (smoser) wrote :

I opened bug 1888858 to request better documentation on this feature.

Launchpad Janitor (janitor) wrote :

[Expired for cloud-init (Ubuntu) because there has been no activity for 60 days.]

Changed in cloud-init (Ubuntu):
status: Incomplete → Expired
Hadmut Danisch (hadmut) on 2020-09-23
Changed in cloud-init (Ubuntu):
status: Expired → New
Hadmut Danisch (hadmut) wrote :

I've found the problem.

The cloud provider (in this case the German provider Hetzner) provides a machine with a virtual ethernet interface

# lspci | fgrep Ethernet
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device

which is, following the naming standards, named ens3

But then the provider supplies a cloud-init file, which cloud-init fetches over HTTP (I have not yet fully understood the after-boot behaviour of cloud-init), which gives the network configuration as

network-config:
  config:
  - mac_address: 96:00:00:70:f1:8f
    name: eth0
    subnets:
    - dns_nameservers:
      - 213.133.99.99
      - 213.133.100.100
      - 213.133.98.98
      ipv4: true
      type: dhcp
    - address: 2a01:4f9:c010:d0eb::1/64
      gateway: fe80::1
      ipv6: true
      type: static
    type: physical
  version: 1

thus renaming the device from ens3 to eth0

Therefore kern.log contains entries like

Sep 23 14:13:44 worker-00 kernel: [ 1.372316] virtio_net virtio0 ens3: renamed from eth0
Sep 23 14:13:44 worker-00 kernel: [ 4.834687] virtio_net virtio0 eth0: renamed from ens3

proving that the network interface changes between ens3 and eth0.
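To spot this on other machines, the rename entries can be pulled out of kern.log with a short script. A minimal sketch, assuming the line layout shown above; the regular expression and function name are my own, and the two sample lines are copied from this report.

```python
import re

# Matches e.g. "[ 1.372316] virtio_net virtio0 ens3: renamed from eth0"
RENAME = re.compile(r"\[\s*(\d+\.\d+)\]\s+\S+\s+\S+\s+(\S+): renamed from (\S+)")


def interface_renames(lines):
    """Yield (seconds_since_boot, old_name, new_name) for each rename entry."""
    for line in lines:
        match = RENAME.search(line)
        if match:
            yield float(match.group(1)), match.group(3), match.group(2)


kern_log = [
    "Sep 23 14:13:44 worker-00 kernel: [ 1.372316] virtio_net virtio0 ens3: renamed from eth0",
    "Sep 23 14:13:44 worker-00 kernel: [ 4.834687] virtio_net virtio0 eth0: renamed from ens3",
]

renames = list(interface_renames(kern_log))
for seconds, old, new in renames:
    print(f"{seconds:>9.6f}s  {old} -> {new}")
```

Two renames within a few seconds of boot, the second undoing the first, is the pattern to look for.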

This then *sometimes* (logs taken from another machine where the problem occurred) results in the following in cloud-init.log

2020-09-22 15:00:15,611 - __init__.py[DEBUG]: Attempting setup of ephemeral network on ens3 with 169.254.0.1/16 brd 169.254.255.255
2020-09-22 15:00:15,611 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '169.254.0.1/16', 'broadcast', '169.254.255.255', 'dev', 'ens3'] with allowed return codes [0] (shell=False, capture=True)
2020-09-22 15:00:15,637 - handlers.py[DEBUG]: finish: init-local/search-Hetzner: FAIL: no local data found from DataSourceHetzner
2020-09-22 15:00:15,637 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceHetzner.DataSourceHetzner'> failed
2020-09-22 15:00:15,639 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceHetzner.DataSourceHetzner'> failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 770, in find_source
    if s.update_metadata([EventType.BOOT_NEW_INSTANCE]):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 659, in update_metadata
    result = self.get_data()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceHetzner.py", line 53, in get_data
    with cloudnet.EphemeralIPv4Network(nic, "169.254.0.1", 16,
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 990, in __enter__
    self._bringup_device()
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 1027, in _bringup_device
    subp.subp(
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 291, in subp
    raise ProcessExecutionError(stdout=out, stderr=err,
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['ip', '-family', 'inet', 'addr', 'add', '169.254.0.1/16', 'broadcast', '169.254.255.255', 'dev', 'ens3']
Exit code: 1
Reason: -
Stdout:
Stderr: Cannot find device "ens3"
2020-09-22 15:00:15,646 - main.py[DEBUG]: No local datasource found

...


Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu):
status: New → Confirmed
Markus Schade (lp-markusschade) wrote :

Hetzner also provides the instance ID via DMI information (system-serial-number), so this could be used as a fallback should the metadata service not respond.

Scott Moser (smoser) wrote :

@Markus,
Can you please provide a link to documentation showing that?

Markus Schade (lp-markusschade) wrote :

I co-wrote the datasource. ;-)

I will prepare a MR to add the following to our DS:

    def check_instance_id(self, sys_cfg):
        return sources.instance_id_matches_system_uuid(
            self.get_instance_id(), 'system-serial-number')

That should take care of those cases and prevent an already configured instance from being reconfigured in case the metadata server does not respond properly.
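For readers unfamiliar with the DMI route, here is a standalone sketch of the same idea outside cloud-init. It assumes the serial number is exposed through the Linux sysfs file /sys/class/dmi/id/product_serial (usually readable only by root); the helper names and the example IDs are hypothetical and this is not the datasource code itself.

```python
import tempfile
from pathlib import Path
from typing import Optional

DMI_SERIAL = Path("/sys/class/dmi/id/product_serial")


def read_dmi_serial(path: Path = DMI_SERIAL) -> Optional[str]:
    """Return the DMI system serial number, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def same_instance(cached_id: str, serial: Optional[str]) -> bool:
    """Compare the cached instance-id with the DMI serial, ignoring case."""
    return serial is not None and cached_id.lower() == serial.lower()


with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "product_serial"   # stand-in for the sysfs file
    fake.write_text("1234567\n")
    unchanged = same_instance("1234567", read_dmi_serial(fake))
    changed = same_instance("7654321", read_dmi_serial(fake))
    unreadable = same_instance("1234567", read_dmi_serial(Path(tmp) / "missing"))
    print(unchanged, changed, unreadable)
```

Because the value comes from the hypervisor rather than over the network, the check still works when the metadata service is down.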

Markus Schade (lp-markusschade) wrote :
Dan Watkins (oddbloke) on 2020-10-27
Changed in cloud-init (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Markus Schade (lp-markusschade)
Scott Moser (smoser) on 2020-10-29
Changed in cloud-init:
status: New → Fix Committed
importance: Undecided → Medium
assignee: nobody → Markus Schade (lp-markusschade)
Markus Schade (lp-markusschade) wrote :

New images with the patched datasource have been rolled out on the platform.

Chad Smith (chad.smith) wrote : Fixed in cloud-init version 20.4.

This bug is believed to be fixed in cloud-init in version 20.4. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 20.4-0ubuntu1

---------------
cloud-init (20.4-0ubuntu1) hirsute; urgency=medium

  * d/control: add gnupg to Recommends as cc_apt_configure requires it to be
    installed for some operations.
  * New upstream release.
    - Release 20.4 (#686) [James Falcon] (LP: #1905440)
    - tox: avoid tox testenv subsvars for xenial support (#684)
    - Ensure proper root permissions in integration tests (#664) [James Falcon]
    - LXD VM support in integration tests (#678) [James Falcon]
    - Integration test for fallocate falling back to dd (#681) [James Falcon]
    - .travis.yml: correctly integration test the built .deb (#683)
    - Ability to hot-attach NICs to preprovisioned VMs before reprovisioning
      (#613) [aswinrajamannar]
    - Support configuring SSH host certificates. (#660) [Jonathan Lung]
    - add integration test for LP: #1900837 (#679)
    - cc_resizefs on FreeBSD: Fix _can_skip_ufs_resize (#655)
      [Mina Galić] (LP: #1901958, #1901958)
    - DataSourceAzure: push dmesg log to KVP (#670) [Anh Vo]
    - Make mount in place for tests work (#667) [James Falcon]
    - integration_tests: restore emission of settings to log (#657)
    - DataSourceAzure: update password for defuser if exists (#671) [Anh Vo]
    - tox.ini: only select "ci" marked tests for CI runs (#677)
    - Azure helper: Increase Azure Endpoint HTTP retries (#619) [Johnson Shi]
    - DataSourceAzure: send failure signal on Azure datasource failure (#594)
      [Johnson Shi]
    - test_persistence: simplify VersionIsPoppedFromState (#674)
    - only run a subset of integration tests in CI (#672)
    - cli: add --system param to allow validating system user-data on a
      machine (#575)
    - test_persistence: add VersionIsPoppedFromState test (#673)
    - introduce an upgrade framework and related testing (#659)
    - add --no-tty option to gpg (#669) [Till Riedel] (LP: #1813396)
    - Pin pycloudlib to a working commit (#666) [James Falcon]
    - DataSourceOpenNebula: exclude SRANDOM from context output (#665)
    - cloud_tests: add hirsute release definition (#662)
    - split integration and cloud_tests requirements (#652)
    - faq.rst: add warning to answer that suggests running `clean` (#661)
    - Fix stacktrace in DataSourceRbxCloud if no metadata disk is found (#632)
      [Scott Moser]
    - Make wakeonlan Network Config v2 setting actually work (#626)
      [dermotbradley]
    - HACKING.md: unify network-refactoring namespace (#658) [Mina Galić]
    - replace usage of dmidecode with kenv on FreeBSD (#621) [Mina Galić]
    - Prevent timeout on travis integration tests. (#651) [James Falcon]
    - azure: enable pushing the log to KVP from the last pushed byte (#614)
      [Moustafa Moustafa]
    - Fix launch_kwargs bug in integration tests (#654) [James Falcon]
    - split read_fs_info into linux & freebsd parts (#625) [Mina Galić]
    - PULL_REQUEST_TEMPLATE.md: expand commit message section (#642)
    - Make some language improvements in growpart documentation (#649)
      [Shane Frasier]
    - Revert ".travis.yml: use a known-working version of lxd (#643)" (#650)
    - Fix not sourcing default 50-cloud-in...


Changed in cloud-init (Ubuntu):
status: In Progress → Fix Released