> It is definitely not Hetzner's task to fix Ubuntu.

To be clear, cloud-init is not used only on Ubuntu; I believe that Hetzner's outage would have this effect across the majority of Linux distributions.

And, that aside, I don't think this characterisation is fair: we're suggesting that if Hetzner are going to allow their internal services to go down, then they should provide a more reliable way for instances to determine their identity. (In other clouds that address this, it is generally done via DMI: the hypervisor stores the instance ID and exposes it as a DMI value, and obviously instances can only boot if the hypervisor is up; therefore, the instance ID is always available.) To state this more glibly (and therefore less helpfully): it is not cloud-init's task to fix Hetzner.
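
For illustration, the guest side of the DMI approach can be very small. This is a minimal sketch, not cloud-init's actual code: `read_dmi_value` is a hypothetical helper, and the choice of `product_serial` as the field carrying the instance ID is an assumption (the field used varies between clouds). The `/sys/class/dmi/id` directory is the standard Linux sysfs location for DMI values; some entries there are readable only by root.

```python
from pathlib import Path
from typing import Optional

# Standard Linux sysfs location for DMI/SMBIOS values.
DMI_DIR = Path("/sys/class/dmi/id")


def read_dmi_value(key: str, dmi_dir: Path = DMI_DIR) -> Optional[str]:
    """Return the stripped DMI value for `key`, or None if it is unavailable.

    Hypothetical helper for illustration only.
    """
    try:
        value = (dmi_dir / key).read_text().strip()
    except OSError:
        # Missing file, permission denied, or not a DMI-capable platform.
        return None
    # Treat an empty value the same as a missing one.
    return value or None


if __name__ == "__main__":
    # Assumption: the cloud publishes its instance ID in product_serial.
    print(read_dmi_value("product_serial"))
```

Because this reads local state populated by the hypervisor, it involves no network round-trip and so cannot be affected by a metadata service outage.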

That said, perhaps there is something that the Hetzner data source could do to handle this Hetzner-specific case. We already perform 60 retries with a 2-second wait between them and a 2-second timeout on each request. So we allow at least 2 minutes for the services to respond with something before we give up; we could bump that, but I don't think it addresses the underlying issue: no finite retry budget helps if the outage lasts longer than the budget. Any thoughts would be appreciated.
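
To make the retry budget concrete, here is a rough sketch of that kind of loop. It is not the data source's real implementation: `fetch_with_retries` and `fetch_instance_id` are hypothetical names, `METADATA_URL` is an assumed endpoint used only as a placeholder, and I am treating "60 retries" as 60 attempts in total.

```python
import time
import urllib.request
from typing import Callable, Optional

# Assumption: placeholder URL for the metadata service's instance ID endpoint.
METADATA_URL = "http://169.254.169.254/hetzner/v1/metadata/instance-id"


def fetch_instance_id(timeout: float) -> str:
    """Fetch the instance ID over HTTP, raising OSError on failure."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode().strip()


def fetch_with_retries(
    fetch: Callable[[float], str],
    retries: int = 60,
    wait: float = 2.0,
    timeout: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(timeout) up to `retries` times; return None if all fail."""
    for attempt in range(retries):
        try:
            return fetch(timeout)
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            if attempt < retries - 1:
                sleep(wait)  # no point waiting after the final attempt
    return None
```

Calling `fetch_with_retries(fetch_instance_id)` gives up after roughly two to four minutes, depending on whether each attempt fails immediately or runs into its timeout. Raising `retries` only moves that cliff; it does not remove it.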

Alternatively (or perhaps additionally), this may need a change in the instance ID model that cloud-init uses, to handle an explicit "we are not currently able to determine instance ID, so assume it hasn't changed" state. I think, however, that this would lead to a converse problem: instances launched from instance-capture images which boot for the first time during an "instance ID outage" would not detect that they were new instances, and so would not perform their first boot customisation. This would result in potentially-inaccessible instances (if any credentials remaining in the image are not available to the user launching instances) with SSH host keys not rotated (meaning that they would all have the same host keys as the image; a security issue). Of course, if users are also relying on their cloud-init user-data to perform any actions, those won't occur either; depending on their threat model, some users might also consider this a security issue.

The ultimate problem is that cloud-init, running within an instance, cannot determine on its own whether or not this is a "first boot": the cloud needs to indicate to us one way or the other, which is done via instance ID. If the cloud cannot do that, then there is no way to determine the correct behaviour.
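
Sketched as code, the decision looks something like the following. This is a simplified model for discussion, not how cloud-init is actually structured; `BootKind` and `classify_boot` are hypothetical names. The point is the `UNKNOWN` branch: whichever of the other two outcomes we map it to, one class of user gets the wrong behaviour.

```python
from enum import Enum
from typing import Optional


class BootKind(Enum):
    FIRST_BOOT = "first-boot"        # run first-boot customisation
    SAME_INSTANCE = "same-instance"  # keep existing instance state
    UNKNOWN = "unknown"              # the cloud could not tell us


def classify_boot(cached_id: Optional[str], current_id: Optional[str]) -> BootKind:
    """Compare the instance ID cached on disk with the one the cloud reports.

    `current_id` is None when the metadata source could not be reached.
    """
    if cached_id is None:
        # Nothing recorded on this filesystem yet, so this must be a first boot.
        return BootKind.FIRST_BOOT
    if current_id is None:
        # Same instance during an outage, or a fresh launch from a captured
        # image during an outage: indistinguishable from inside the instance.
        return BootKind.UNKNOWN
    if cached_id != current_id:
        return BootKind.FIRST_BOOT
    return BootKind.SAME_INSTANCE
```

Treating `UNKNOWN` as `FIRST_BOOT` re-initialises long-running instances during an outage; treating it as `SAME_INSTANCE` leaves freshly launched captures uncustomised, with the image's host keys.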

If you are certain that you will never be capturing instances as images (i.e. you can categorically say that the root filesystem in this instance will _never_ first boot again) and you aren't using any of cloud-init's functionality after first boot (e.g. per-boot scripts), then you can disable cloud-init in the ways described by Scott earlier in this bug.

One convenience we could potentially provide: if cloud-init had a way for image creators to express "when next launched, cloud-init should treat that instance ID as immutable and permanent" (in a way that could be undone on subsequent boots, if a user wants to "unfreeze" an instance for image capture) then we might be able to avoid some of this pain, but that idea would need more fleshing out before it's clear if it even makes sense.

> Especially since that process of re-initialization of that instance ID is neither obvious nor documented.

Agreed on both counts; cloud-init's documentation is lacking in many respects, including this one.