cloud-init status --wait returns before cloud-final has finished executing

Bug #1890528 reported by Jeff Jo
This bug affects 2 people
Affects      Status    Importance   Assigned to   Milestone
cloud-init   Triaged   High         Chad Smith

Bug Description

I'm instrumenting our EC2 instances to report whether cloud-init succeeded or failed during the boot process. I've implemented this as a systemd unit, triggered early in the boot process, that runs a script resembling the following:

cloud_init_status=0
cloud-init status --wait || cloud_init_status=$?

if [ "$cloud_init_status" = "0" ]; then
  report_launch_success
else
  report_launch_failure
fi

I was expecting `cloud-init status --wait` to only return after cloud-final has completed, but I discovered today that it can return early if there is an error encountered during the cloud-config stage.

I reported this in IRC, and @blackboxsw thought the issue might be somewhere in this code (https://github.com/canonical/cloud-init/blob/a13febd286d21f1754e32f4a05e722039eb452b8/cloudinit/cmd/status.py#L133-L144) and suggested I file a bug here.

I am using cloud-init 20.2-45 on Xenial.
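
One possible stopgap for the script above, sketched under assumptions: cloud-init is understood to write /run/cloud-init/result.json only once the final stage has finished, so the script could additionally poll for that file after `cloud-init status --wait` returns. `wait_for_file` is a hypothetical helper, not part of cloud-init, and the 600-second deadline is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper (not part of cloud-init): block until a marker file
# exists, or give up after a deadline given in seconds (default 600).
wait_for_file() {
  path=$1
  timeout_secs=${2:-600}
  elapsed=0
  while [ ! -e "$path" ]; do
    if [ "$elapsed" -ge "$timeout_secs" ]; then
      return 1   # deadline reached, file never appeared
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Intended use, assuming result.json appears only after cloud-final completes:
#   cloud_init_status=0
#   cloud-init status --wait || cloud_init_status=$?
#   wait_for_file /run/cloud-init/result.json 600 || cloud_init_status=1
```

The deadline matters because, if the final stage never runs, the marker file never appears and an unbounded poll would hang.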

Dan Watkins (oddbloke) wrote :

Thanks for filing this bug, Jeff! I agree with Chad, this is a bug; the help text for the --wait option reads:

  -w, --wait Block waiting on cloud-init to complete

I can't think of cases where someone might be running `cloud-init status --wait` and relying on the current behaviour of exit-early-in-case-of-error, so I think the correct thing to do here is to modify the behaviour of --wait to match its help text.

Changed in cloud-init:
status: New → Triaged
importance: Undecided → High

Scott Moser (smoser) wrote :

@Dan,

I think the problem is that later stages (cloud-init-final) do not necessarily run if earlier stages error (cloud-init-local).

I could be wrong, but I think that if cloud-init-local failed while 'cloud-init status --wait' was waiting, and you ignored the error, that status would end up waiting forever.
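
Whichever way --wait ends up behaving, a caller can guard against the "waiting forever" case by putting a deadline around the wait. The following is only a sketch: `classify_wait` is a hypothetical name, and it assumes GNU coreutils `timeout`, which exits with status 124 when the deadline expires.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command under a deadline and print one of
# "success", "timed_out" or "failed".
# Assumes GNU coreutils `timeout` (exit status 124 means the deadline expired).
classify_wait() {
  deadline_secs=$1
  shift
  rc=0
  timeout "$deadline_secs" "$@" || rc=$?
  case "$rc" in
    0)   echo success ;;
    124) echo timed_out ;;
    *)   echo failed ;;
  esac
}

# Intended use:
#   result=$(classify_wait 600 cloud-init status --wait)
```

A timeout is reported separately from a failure so that a hung boot can be distinguished from one where cloud-init itself reported an error.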

Jeff Jo (jeffjo) wrote :

@smoser, I was trying to wrap my head around this:

> I think the problem is that later stages (cloud-init-final) do not necessarily run if earlier stages error (cloud-init-local).

I know the documentation[0] states that cloud-init-local.service and cloud-init.service both block "as much of the boot as possible", but I couldn't figure out where this was being enforced. As far as I can tell, there isn't anything in the systemd configuration that requires cloud-init-local to succeed for targets like cloud-config.target to be reached (for example). Is the invariant being enforced within the code? (i.e., cloud-init detects that these earlier stages have failed and doesn't continue)

[0] https://cloudinit.readthedocs.io/en/latest/topics/boot.html#local
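
For anyone wanting to repeat that check, here is a rough sketch. In systemd, only directives such as Requires=, Requisite= and BindsTo= make a unit depend on another unit succeeding; After= is ordering only and Wants= is a weak dependency. `hard_deps` is a hypothetical helper, and the unit file path in the usage comment is an assumption about where the distribution installs it.

```shell
#!/bin/sh
# Hypothetical helper: print the hard-dependency directives found in a
# systemd unit file. Prints nothing (and still returns 0) when there are none.
hard_deps() {
  unit_file=$1
  grep -E '^(Requires|Requisite|BindsTo)=' "$unit_file" || true
}

# Intended use (path is an assumption, adjust for your distribution):
#   hard_deps /lib/systemd/system/cloud-final.service
```

Empty output for a unit suggests that nothing at the systemd level forces it to be skipped when an earlier stage fails.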

Scott Moser (smoser) wrote :

@Jeff,

You may be right. I don't see anything obvious either, and in the current systemd implementation I think all jobs should run. There was a bug that sometimes resulted in /var/lib/cloud/instance being a directory (rather than a symlink), and that could cause subsequent jobs to fail before updating status (I think). I thought that bug had been fixed by odd_bloke, but I don't see it in the upstream changelog.

https://bugs.launchpad.net/cloud-init/+bug/1531880 may be what I was remembering.

Anyway... I don't have hard evidence to prove I'm right.

James Falcon (falcojr)
Changed in cloud-init:
assignee: nobody → Chad Smith (chad.smith)