ec2-init: Move ec2-run-user-data to startup priority S99

Bug #431255 reported by Eric Hammond on 2009-09-17
Affects: ec2-init (Ubuntu)
Importance: Medium
Assigned to: Scott Moser
Nominated for Hardy by Eric Hammond
Nominated for Intrepid by Eric Hammond
Nominated for Karmic by Eric Hammond

Bug Description

Binary package hint: ec2-init

AMI: ami-a40fefcd canonical-alphas-us/karmic-i386-alpha5.1.manifest.xml

All of the ec2-init functionality is currently driven by a single /etc/init.d/ec2-init script which is run at a single rc startup level (S15).

Some of the functionality, like regenerating ssh host keys and setting up authorized_keys, is best done before sshd is started.

On the other hand, running the user-data script, should be close to the last thing in the startup process, perhaps S99 (though I've been running it at S71 in my AMIs).

In any case, the user-data script must not be run before sshd is started. When the user-data script takes a long time or gets into an infinite loop, this prevents sshd from running, which prevents the user from being able to ssh in and figure out what went wrong with the startup script.

I ran into this already with a user-data script which waits until an EBS volume is attached (a common boot up function), but the volume was never attached and I couldn't ssh in to the instance to figure out what was going wrong.
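
A wait like this is easier to debug when it is bounded. The following is only a minimal sketch, not the script from this report: the helper name wait_for_path, the device name /dev/sdh, and the 600-second limit are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll until a path exists, giving up after a timeout.
# Usage: wait_for_path PATH TIMEOUT_SECONDS [POLL_INTERVAL_SECONDS]
wait_for_path() {
    path=$1
    timeout=$2
    interval=${3:-1}
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1    # gave up: the path never appeared
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example call (device name and limit are assumptions):
# wait_for_path /dev/sdh 600 5 || logger -s -t user-data "volume never attached"
```

A bounded wait does not remove the need for ssh access during boot, but it does mean a volume that never attaches ends in a logged failure rather than a silent hang.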

Note that the same existing logic should be used to only run the user-data on the first boot of each instance (including rebundled images).
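
The first-boot rule could be sketched with a per-instance marker file. This is an illustration only and not ec2-init's actual implementation: the function name run_once_per_instance, the marker file naming, and the use of /var/lib/ec2 as the state directory are assumptions. The instance ID is passed in as an argument; how it is obtained is left out.

```shell
#!/bin/sh
# Hypothetical sketch: run a command at most once per instance ID.
# STATE_DIR and the marker naming are assumptions, not ec2-init's real layout.
STATE_DIR=${STATE_DIR:-/var/lib/ec2}

run_once_per_instance() {
    instance_id=$1
    shift
    marker="$STATE_DIR/user-data-ran.$instance_id"
    if [ -e "$marker" ]; then
        return 0    # already ran on this instance; skip
    fi
    mkdir -p "$STATE_DIR" || return 1
    # Record the run before starting, so a hung script is not re-run on reboot.
    : > "$marker" || return 1
    "$@"
}

# Example call (instance ID and script path are assumptions):
# run_once_per_instance i-12345678 /path/to/user-data-script
```

Because the marker is keyed on the instance ID, an image rebundled from a running instance boots with a different ID and so runs the command again.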

Eric Hammond (esh) on 2009-09-17
Changed in ec2-init (Ubuntu):
status: New → Confirmed
Scott Moser (smoser) wrote :

> On the other hand, running the user-data script, should be close to the last thing in
> the startup process, perhaps S99 (though I've been running it at S71 in my AMIs).

We definitely have things to work on here. But I don't believe it's possible to simply say "the user-data script, should be close to the last thing in the startup process".

For example, the user-data script may be doing an upgrade, which would install a new sshd. Ideally you wouldn't have sshd start before that portion of the user-data script had run. It may be a contrived example, but the point is that, in the end, I think we need to build in a lot of flexibility in ordering the running of user-data.

tags: added: uec-images
Eric Hammond (esh) wrote :

Until we figure out the ultimate flexible solution, I'd like to be able to log in while the user-data script is running so I can monitor progress and debug.

A security bug in ssh should be considered a good motivating reason to publish updated AMIs.

Eric Hammond (esh) wrote :

Perhaps another approach might be to say that the user-data script functionality is entirely based on the current implementation in the AMIs on http://alestic.com, which run the user-data late in the boot process. With that as the de-facto standard, and to enable easy migration for existing users, the new AMIs should follow suit unless there is an overwhelming reason why this standard was wrong.

Scott Moser (smoser) wrote :

> Until we figure out the ultimate flexible solution, I'd like to be able
> to log in while the user-data script is running so I can monitor
> progress and debug.

I think that is reasonable.

> A security bug in ssh should be considered a good motivating reason to
> publish updated AMIs.

I had thought about this when I posted. The basic point, though, is that
the later the user gets a hook in, the less they can fix or modify (at
least without a reboot).

I don't personally like the hassle/delay of ec2-get-console-output to
verify the ssh fingerprint. I'd much rather generate the new keys on the
system that launches the instance and pass them in the user-data. If
user-data doesn't run until after ssh starts, I have to restart sshd.
Again, not a major problem, but not ideal.

In the mean time, I think I agree to running user data after sshd per the
de-facto standard in place.

Eric Hammond (esh) wrote :

Based on IRC discussion, it may not have been clear that I consider this a reasonably important bug to fix for Karmic, even though it means splitting ec2-init into two init scripts. There are existing architectures using Ubuntu on EC2 which break when run on Canonical's latest Karmic image.

Here are a couple of use cases to help demonstrate why sshd should be running before user-data is run and why not doing so breaks the code:

1) User does not want to put private keys in the non-private user-data, so user-data script waits until they show up on the file system. External process starts the instance, waits for ssh, then scp's the keys in to the instance at which point the user-data automatically continues running.

2) One of my production setups has the user-data script wait until an EBS volume is mounted, but the external process which attaches the volume first has to do some stuff on the instance through ssh. This deadlocks and both processes are waiting for each other.

Yes, it might be possible for people to restructure these applications to not use user-data or to pass information in some other way, but since there seems to be a consensus that user-data should be run after ssh is started, I suggest it is important to do it the right way from the beginning in the Canonical images. I'd like there to be as little barrier to migration as possible. And I'd like my existing code to work so I can continue testing it on the Canonical images :)

Eric Hammond (esh) wrote :

I think this is necessary to be completed for Canonical's images to replace the ones I've been building, but given its place in the whole of Ubuntu, I'm marking it Medium importance.

Changed in ec2-init (Ubuntu):
importance: Undecided → Medium
Scott Moser (smoser) on 2009-09-25
Changed in ec2-init (Ubuntu):
status: Confirmed → In Progress
milestone: none → ubuntu-9.10-beta
assignee: nobody → Scott Moser (smoser)
Scott Moser (smoser) wrote :

I've tested this with a private build by:

$ cat my.userdata
#!/bin/sh
SLEEPTIME=300
echo "I Am Sleeping [$SLEEPTIME]" | logger -s -t "smoser-sleeps"
sleep ${SLEEPTIME}
echo "I Am Awake [$SLEEPTIME later]" | logger -s -t "smoser-sleeps"

$ xc2-run-instances ami-fa658593 --user-data-file my.userdata

# now verify that it takes a long time before I can ssh to it, and that I get connection refused until then
# once it's up, I ssh there and:
% sudo dpkg -i ec2-init_0.4.999-0ubuntu3_all.deb
% sudo rm /var/lib/ec2/*
% sudo reboot

The rm above causes ec2-init to think it needs to run again. It regenerates keys (and ssh warns me), and then runs the long sleep. This time, though, ssh comes up quickly, and I have the 300 seconds to ssh in and watch the 'sleep 300' process.

Scott Moser (smoser) wrote :

I'm requesting sponsorship of this with branch at
lp:~smoser/+junk/ec2-init.karmic

Scott Moser (smoser) wrote :

0.4.999-0ubuntu3 is in karmic archive now. This is fixed.
It will be in 20090927 uec-images. It did not make it into 20090926.

Changed in ec2-init (Ubuntu):
status: In Progress → Fix Released