cloud-init status --wait returns before cloud-final has finished executing

Bug #1890528 reported by Jeff Jo
This bug affects 3 people
Affects     Status   Importance  Assigned to  Milestone
cloud-init  Expired  High        Chad Smith

Bug Description

I'm instrumenting our EC2 instances to report whether cloud-init succeeded or failed during the boot process. I've implemented this as a systemd unit, triggered early in the boot process, that runs a script resembling the following:

cloud_init_status=0
cloud-init status --wait || cloud_init_status=$?

if [ "$cloud_init_status" = "0" ]; then
  report_launch_success
else
  report_launch_failure
fi

I was expecting `cloud-init status --wait` to only return after cloud-final has completed, but I discovered today that it can return early if there is an error encountered during the cloud-config stage.

I reported this in IRC and @blackboxsw thought the issue might be somewhere in this code (https://github.com/canonical/cloud-init/blob/a13febd286d21f1754e32f4a05e722039eb452b8/cloudinit/cmd/status.py#L133-L144) and suggested I file a bug here.

I am using cloud-init 20.2-45 on Xenial.
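
One possible workaround for the early return described above is to wait for a marker that only appears once the final stage has ended, rather than relying on `--wait` alone. The sketch below assumes that cloud-init writes /run/cloud-init/result.json when the final stage finishes; the function names `wait_for_file` and `wait_for_cloud_final`, the `RESULT_FILE` variable, and the 600-second default are hypothetical and only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll until a file exists or a deadline passes.
# $1 = path to wait for, $2 = timeout in seconds (default 600).
wait_for_file() {
  path=$1
  remaining=${2:-600}
  while [ ! -e "$path" ]; do
    if [ "$remaining" -le 0 ]; then
      return 1   # gave up: the file never appeared
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 0
}

# Hypothetical wrapper around the helper above.
# Assumption: cloud-init creates this file when the final stage ends.
wait_for_cloud_final() {
  wait_for_file "${RESULT_FILE:-/run/cloud-init/result.json}" "${1:-600}"
}
```

After `wait_for_cloud_final` returns 0, running `cloud-init status` without `--wait` should reflect the outcome of all stages, so the success/failure branch in the script above could follow it unchanged. A non-zero return means the deadline passed and is probably best reported as a failure.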

Dan Watkins (oddbloke) wrote :

Thanks for filing this bug, Jeff! I agree with Chad, this is a bug; the help text for the --wait option reads:

  -w, --wait Block waiting on cloud-init to complete

I can't think of cases where someone might be running `cloud-init status --wait` and relying on the current behaviour of exit-early-in-case-of-error, so I think the correct thing to do here is to modify the behaviour of --wait to match its help text.

Changed in cloud-init:
status: New → Triaged
importance: Undecided → High
Scott Moser (smoser) wrote :

@Dan,

I think the problem is that later stages (cloud-init-final) do not necessarily run if earlier stages error (cloud-init-local).

I could be wrong, but I think that if cloud-init-local failed while 'cloud-init status --wait' was waiting, and you ignored the error, that status would end up waiting forever.
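
Whether or not that hang can actually happen, a caller can guard against it by bounding the wait. This is a minimal sketch, assuming GNU coreutils `timeout` (which exits with 124 when the deadline is hit); the function name `wait_status` and the labels it prints are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a deadline and classify the result.
# $1 = deadline in seconds, remaining arguments = command to run.
# Prints "done", "timeout", or "error".
wait_status() {
  deadline=$1
  shift
  rc=0
  timeout "$deadline" "$@" || rc=$?
  case "$rc" in
    0)   echo "done" ;;
    124) echo "timeout" ;;  # GNU timeout's exit code when the deadline passes
    *)   echo "error" ;;
  esac
}
```

For example, `wait_status 600 cloud-init status --wait` would print `timeout` instead of blocking indefinitely if a later stage never updated its status.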

Jeff Jo (jeffjo) wrote :

@smoser, I was trying to wrap my head around this:

> I think the problem is that later stages (cloud-init-final) do not necessarily run if earlier stages error (cloud-init-local).

I know the documentation[0] states that cloud-init-local.service and cloud-init.service both block "as much of the boot as possible", but I couldn't figure out where this was being enforced. As far as I can tell, there isn't anything in the systemd configuration that requires cloud-init-local to succeed for targets like cloud-config.target to be reached (for example). Is the invariant being enforced within the code? (i.e., cloud-init detects that these earlier stages have failed and doesn't continue)

[0] https://cloudinit.readthedocs.io/en/latest/topics/boot.html#local

Scott Moser (smoser) wrote :

@Jeff,

You may be right. I don't see anything obvious either, and in the current systemd implementation I think all jobs should run. There was a bug that sometimes resulted in /var/lib/cloud/instance being a directory (rather than a symlink), and that could cause subsequent jobs to fail before updating status (I think). I thought that bug had been fixed by odd_bloke, but I don't see it in the upstream changelog.

https://bugs.launchpad.net/cloud-init/+bug/1531880 may be what I was remembering.

Anyway... I don't have hard evidence to prove I'm right.

James Falcon (falcojr)
Changed in cloud-init:
assignee: nobody → Chad Smith (chad.smith)
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Triaged → Expired