ds-identify runs too early

Bug #1810859 reported by Robert Schweikert on 2019-01-07
This bug affects 1 person
Affects: cloud-init
Importance: Medium
Assigned to: Unassigned

Bug Description

ds-identify is executed from a systemd generator [1]. Based on my understanding of the intent of both, this creates an unresolvable timing conflict.

Generators run very early in the boot process.

The cloud-init generator runs ds-identify, which in turn runs "blkid" to find filesystems with specific labels, such as "cidata" for the nocloud data source. However, it is possible to construct an environment where the filesystem with the "cidata" label is on an attached device and the generator runs before the kernel knows about that device. In that case the blkid output cannot reflect the real state: the "cidata" label is not found and the "nocloud" data source is not identified. As a result, the cloud-init.target unit is disabled.
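The race can be illustrated with a small Python sketch. This is not how ds-identify works (it is a shell script that consults blkid once); the sketch only shows that a single early look can miss a label that a later, bounded poll would find. The helper name `wait_for_label`, the timeout values, and the choice to poll the udev-populated `/dev/disk/by-label` directory instead of calling blkid are all assumptions made for illustration.

```python
import os
import time


def wait_for_label(label, by_label_dir="/dev/disk/by-label",
                   timeout=5.0, interval=0.1):
    """Return True once an entry named `label` exists, False on timeout.

    Hypothetical helper: it polls a by-label directory rather than
    calling blkid, purely to illustrate the timing problem.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(os.path.join(by_label_dir, label)):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# A one-shot check (timeout=0) is roughly what an early generator can do:
# if the device has not appeared yet, the answer is simply "not found".
print("cidata present at first look:", wait_for_label("cidata", timeout=0))
```

A generator cannot afford the polling variant, since [1] discourages sync points, which is why the one-shot result decides whether cloud-init.target is enabled.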

Observed in a test environment with qemu and the data source on a separate virtual device.

According to [1] we shouldn't add any sync points such as "udevadm settle", thus I am not certain how this could be resolved. Also given that we cannot control the timing of the execution of the generator it appears that this is going to be difficult to get under control.

Would it make sense to give ds-identify the option to simply exit and leave things alone?

In the present setup the generator runs ds-identify, which in turn disables cloud-init.target if no data source can be identified. However, the Python code usually runs late enough that things that are not available in early boot are found and data sources are identified properly. If users who know they run in a specific environment could set a "ds=no-check" flag on the kernel command line, the timing issue could be avoided.

I realize that for the nocloud case a user can set "ds=nocloud" on the kernel command line to work around the timing issue described here. Also, a "ds=no-check" would circumvent the basic intention of the generator, which is to allow cloud-init to be installed anywhere and to quickly detect an environment where the cloud-init Python code should not be executed, and thus save boot time.

My point is that, IMHO, timing issues in general cannot be avoided by ds-identify due to the nature of when systemd executes the generators. Thus giving users the general ability to disable ds-identify may be useful.

I am happy if I can be proven incorrect and the timing issue can be resolved.

[1] https://www.freedesktop.org/wiki/Software/systemd/Generators/

Hi, I have a similar concern here.

We want to use cloud-init with the NoCloud datasource, as we run our instances on VMware vSphere, which has no metadata server.

So we put our meta-data and user-data files in the /var/lib/cloud/seed/nocloud/ directory. But our /var is on a logical volume which is not yet mounted when the cloud-init generator runs the ds-identify script. ds-identify then exits with status 1 and cloud-init is not enabled.
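The effect of the unmounted /var can be sketched in Python. This is only an illustration, not ds-identify's actual shell code; the helper name `seed_dir_usable` is hypothetical, and the assumption is that the seed directory counts as usable only when both files named above are present.

```python
import os

# The two seed files mentioned in the comment above.
SEED_FILES = ("user-data", "meta-data")


def seed_dir_usable(seed_dir):
    """Return True only if every seed file exists under seed_dir.

    Hypothetical helper for illustration. Before /var is mounted, the
    path resolves to an empty (or missing) directory on the root
    filesystem, so the check fails even though the files exist on the
    logical volume.
    """
    return all(
        os.path.isfile(os.path.join(seed_dir, name)) for name in SEED_FILES
    )


print("seed usable right now:", seed_dir_usable("/var/lib/cloud/seed/nocloud"))
```

Run early in boot, such a check sees only the bare mount point; run after /var is mounted, the same check would succeed.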

Would it be possible to have an option as Robert S. suggested, or would it be possible to change the NoCloud seed directory to another directory which is not likely to be on a separate filesystem?

Regards,
Guillaume C.

Ran cloud-init version 18.2-27-g6ef92c98-0ubuntu1~18.04.1

Ryan Harper (raharper) wrote :

The goal of ds-identify is to ensure that cloud-init does not run unless a datasource is present. We don't yet have the idea of "present" being much later than when the generators run. Note that generators may run multiple times, including in the initramfs and as soon as the root device is found.

Cloud-init generally expresses its dependencies to run a proper check, including the presence of required paths (/var/lib/cloud, for example), but it is unaware of more complicated storage configurations which may contain datasources (both the LVM case and the "delayed/slow" block device which is not yet present by the time the rootfs is present).

In the short term, I would suggest that for these scenarios, you will want a different ds-identify configuration than the default which is (in kernel cmd line format):

ci.di.policy=search,found=all,maybe=none,notfound=disabled

Alternatively, I think the two scenarios here could use the policy below, which
puts cloud-init in a mode where _if_ it finds a definitive datasource,
it will use that specific datasource; if none is found,
cloud-init will remain enabled and will search through all known datasources
when the services run.

ci.di.policy=search,found=all,maybe=all,notfound=enabled
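To make the structure of the policy string concrete, here is a minimal Python sketch that splits it into a mode and key/value settings and then answers one question: does cloud-init stay enabled when nothing is found? The function names are hypothetical, the logic deliberately ignores the "found" and "maybe" keys, and none of this is ds-identify's real parser.

```python
def parse_policy(policy):
    """Split 'mode,key=value,...' into (mode, {key: value})."""
    mode, *pairs = policy.split(",")
    settings = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        settings[key] = value
    return mode, settings


def stays_enabled(policy, datasource_found):
    """Simplified reading of the 'notfound' key only (an assumption)."""
    _mode, settings = parse_policy(policy)
    if datasource_found:
        return True
    return settings.get("notfound") == "enabled"


default = "search,found=all,maybe=none,notfound=disabled"
relaxed = "search,found=all,maybe=all,notfound=enabled"
print(stays_enabled(default, datasource_found=False))
print(stays_enabled(relaxed, datasource_found=False))
```

With the first policy a missed datasource disables cloud-init; with the second, the later Python stage still gets a chance to search.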

Alternatively, if you have an image which you know uses a specific datasource and you
don't plan to export the image to a different platform (with different datasource)
you could specify the datasource on the kernel command line which would ensure that
cloud-init enables itself.

ci.datasource=nocloud-net

Note that it's still possible for the race to occur in any of these scenarios.

How long should cloud-init wait during boot for its datasource to arrive? It depends
on the use case.

Looking at a longer-term approach, I suspect that cloud-init, as a daemon, could remain
idle/inert and watch for specific system events on systemd-bus or dbus, or netlink, etc.,
then re-run its identification code and trigger the cloud-init services.

It remains to be discussed what a "late" cloud-init, or an "at runtime" cloud-init, would mean,
where a machine may be up and running without cloud-init having completed any tasks, only to
then trigger those things at some later time.

Ryan Harper (raharper) wrote :

> Would it be possible to have an option as Robert S. suggested
> or would it be possible to change Nocloud seed directory to
> another directory which is not likely to be on a separated filesystem ?

You can control this for NoCloud via command line params:

ds=nocloud-net;seedfrom=/path/to/my/seed/dir
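The shape of that parameter can be shown with a short Python sketch that pulls the first "ds=" token out of a kernel command line and splits it on ";". The function name `parse_ds_token` is hypothetical and this is not cloud-init's own parsing code; it only illustrates how the datasource name and the seedfrom value are packed into one token.

```python
def parse_ds_token(cmdline):
    """Return (datasource, params) for the first 'ds=' token, else None."""
    for token in cmdline.split():
        if token.startswith("ds="):
            name, *rest = token[len("ds="):].split(";")
            # Each remaining item is 'key=value'; split on the first '='.
            params = dict(item.partition("=")[::2] for item in rest if item)
            return name, params
    return None


example = "root=/dev/sda1 ds=nocloud-net;seedfrom=/path/to/my/seed/dir quiet"
print(parse_ds_token(example))
```

In practice the command line would come from /proc/cmdline; the example string here is made up around the parameter shown above.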

Changed in cloud-init:
importance: Undecided → Medium
status: New → Triaged
Scott Moser (smoser) wrote :

@Robert,

I've known that this was a potential problem since cloud-init's first creation, but I'd never seen it in practice. ds-identify makes it more likely but removing it doesn't "fix" the problem entirely either.

You describe the reproduction as "Observed in a test environment with qemu and the data source on a separate virtual device.". That's not very helpful, and not necessarily an issue. I assume that it means that ds-identify disabled cloud-init. The intent of ds-identify is in fact to identify known scenarios when cloud-init should run, and only enable cloud-init when that is the case.

I'm not particularly bothered if someone writes "cloud-init did not identify my
in-house developed EC2 clone as EC2". The response to that is either:
a.) fix your in-house clone to be a perfect clone (bug 1793590).
b.) adjust cloud-init to know about your clone, ideally as a separate
datasource (ie, Aliyun)

So ... I suggest that you have those two options also. If the platform clearly
identifies itself (possibly through dmi information or any other "local" path)
then we can identify it and suitably wait for the block device that we will
then *know* will appear. If this is NoCloud, then I think Ryan gave you some
options, and we can further improve NoCloud too.

@Ryan,
I don't think that "cloud-init as a daemon" actually solves the problem at all. cloud-init aims to provide structured points at boot when a user (or cloud-init) can operate. If it runs later (after such a device shows up), then cloud-init is not providing that point in boot and is thus failing in a different way.

On Tue, Apr 30, 2019 at 10:11 AM Scott Moser <email address hidden>
wrote:

>
> @Ryan,
> I don't think that "cloud-init as a daemon" actually solves the problem at
> all. cloud-init aims to provide structured points at boot when a user (or
> cloud-init) can operate. If running later (after such a device shows
> up) then cloud-init is not providing that point in boot and thus failing in
> a different way.
>

You're right that it does not solve the problem of boot-time configuration
when user-data is present. I do think that transitioning to a daemon that
reacts to events means that scenarios which do not need to run only at
boot time will be able to run later.

Robert Schweikert (rjschwei) wrote :

@Scott,

You are correct I was not very specific in the original description. I was more concerned about describing the general problem and the "race condition" to allow others that may experience similar problems to associate the issue with whatever they are seeing.

In our particular case, yes, it is NoCloud. In addition to setting "ds=NoCloud" on the command line as a workaround, I added Ryan's suggestions to the SUSE bug.

The basic problem remains. When ds-identify runs and the source is not there, for whatever reason, then cloud-init by default gets disabled. In our specific case it just takes longer to attach the device that holds the config.
