ssh is started before cloud-init completed

Bug #1633453 reported by Vincent Bernat on 2016-10-14
24
This bug affects 5 people
Affects Status Importance Assigned to Milestone
cloud-init
Medium
Unassigned
cloud-init (Ubuntu)
Medium
Unassigned

Bug Description

Hello,

Previously, ssh was only started after cloud-init finished configuring the host. In Yakkety, it is quite easy to log into a machine with SSH while cloud-init is still running. This enables to log as root or to run apt-get update while cloud-init is still finishing to write system configuration. This is annoying with automation.

With Xenial, this never happens, but with Yakkety, this happens all the time.

Scott Moser (smoser) wrote :

I'm not sure how this would occur in xenial or yakkety.
cloud-init.service runs:
  Before=sshd-keygen.service
  Before=sshd.service

cloud-init.service is what runs the 'ssh' config module, which generates ssh host keys and disables root. And 'ssh' runs after 'user-groups', which sets the user-groups. So I'm not sure how this could happen. I'm not implying that it does not happen, but not sure how.

Scott Moser (smoser) wrote :

So, a bit more on the above. So as far as I can see, you should not be able to ssh into a system until cloud-init.service is done (preventing you from getting in with stale ssh keys), but you most certainly can get in before 'runcmd' or user scripts are all finished.

That is mostly by design.

Vincent Bernat (vbernat) wrote :

Maybe the problem is that cloud-init restarts ssh after generating host keys. Therefore, we are able to log just after that while cloud-init is still running.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu):
status: New → Confirmed
Vincent Bernat (vbernat) wrote :

I attach a output of journalctl when the problem arises. From a client perspective, when I connect to the server too soon, I can get this trace from ssh:

OpenSSH_7.4p1 Debian-6, OpenSSL 1.0.2j 26 Sep 2016
debug1: Reading configuration data /dev/null
debug1: Connecting to 185.19.31.14 [185.19.31.14] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /tmp/tmp.MlvQm06wGu/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /tmp/tmp.MlvQm06wGu/id_rsa-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4p1 Debian-6
ssh_exchange_identification: read: Connection reset by peer

This happens at 01:25:33Z.

My interpretation is that cloud-init is running in parallel with the remaining of the system. At some point, the system starts sshd, then cloud-init restarts it once it has modified its configuration. Since the client is in the early stages of the connection, it didn't get its own process and restarting ssh will close the connection unexpectedly.

This happens about 1 time out of 10 with automatic provisioning. Automatic provisioning tools usually waits for SSH to answer at the TCP level and then expect things to work from here.

Vincent Bernat (vbernat) wrote :
Download full text (6.2 KiB)

The restricted output to SSH stuff:

Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] stages.py[DEBUG]: Running module ssh (<module 'cloudinit.config.cc_ssh' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh.py'>) with frequency once-per-instance
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] handlers.py[DEBUG]: start: init-network/config-ssh: running config-ssh with frequency once-per-instance
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Writing to /var/lib/cloud/instances/a94ea439-445e-447c-9227-64f71802ff61/sem/config_ssh - wb: [420] 24 bytes
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] helpers.py[DEBUG]: Running config-ssh using lock (<FileLock using file '/var/lib/cloud/instances/a94ea439-445e-447c-9227-64f71802ff61/sem/config_ssh'>)
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Running command ['ssh-keygen', '-t', 'rsa', '-N', '', '-f', '/etc/ssh/ssh_host_rsa_key'] with allowed return codes [0] (shell=False, capture=True)
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Running command ['ssh-keygen', '-t', 'dsa', '-N', '', '-f', '/etc/ssh/ssh_host_dsa_key'] with allowed return codes [0] (shell=False, capture=True)
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Running command ['ssh-keygen', '-t', 'ecdsa', '-N', '', '-f', '/etc/ssh/ssh_host_ecdsa_key'] with allowed return codes [0] (shell=False,capture=True)
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Running command ['ssh-keygen', '-t', 'ed25519', '-N', '', '-f', '/etc/ssh/ssh_host_ed25519_key'] with allowed return codes [0] (shell=False, capture=True)
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Changing the ownership of /home/ubuntu/.ssh to 1000:1000
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Read 2531 bytes from /etc/ssh/sshd_config
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Writing to /home/ubuntu/.ssh/authorized_keys - wb: [384] 213 bytes
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Changing the ownership of /home/ubuntu/.ssh/authorized_keys to 1000:1000
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Changing the ownership of /root/.ssh to 0:0
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Read 2531 bytes from /etc/ssh/sshd_config
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Writing to /root/.ssh/authorized_keys - wb: [384] 213 bytes
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] util.py[DEBUG]: Changing the ownership of /root/.ssh/authorized_keys to 0:0
Jan 21 01:25:31 stress-test-23357 cloud-init[990]: [CLOUDINIT] handlers.py[DEBUG]: finish: init-network/config-s...

Read more...

"Automatic provisioning tools usually waits for SSH to answer at the TCP level and then expect things to work from here."

I'll let Scott handle the details, but as he mentioned by design runcmd or user scripts run afterward anyway. To make sure that cloud-init really is complete I happen to check on the existence of /var/lib/cloud/instance/boot-finished - might that be an option for your tools to use?

FYI - There also is a discussion about first initialization vs reboots around this, see bug 1258113. Not really a duplicate, but it might be worth to clarify if you care more about the initial deployment or any later reboot (or both).

Vincent Bernat (vbernat) wrote :

It's totally possible to fix things in the tools, but you have to fix in all provisioning tools. My use case is first boot and is not about user scripts, but proper configuration of SSH and SSH keys.

Looking a bit more, I see there is a Before=sshd.service for cloud-init.service, so sshd should not be started while cloud-init is still running. However, in my case, this is cloud-config.service that finish setting SSH stuff (and restart ssh). I am not familiar with the different steps. Shouldn't cloud-config.service delays SSH start too?

mathew boorman (mathew-boorman) wrote :

I think the problem is that cc_set_passwords.py is doing an immediate sshd service restart, and forcing sshd to get started immediately.

If that 'restart' is changed to a 'condrestart', it will only restart the service if it is already running. That should continue to work in any odd cases where cloud-init is being run after sshd has started, without starting it too early.

https://git.launchpad.net/cloud-init/tree/cloudinit/config/cc_set_passwords.py#n224

I think the previous post is correct.
I got the same issue, and here is my '/var/log/auth.log' at that time

Scott Moser (smoser) on 2018-02-07
Changed in cloud-init (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
Changed in cloud-init:
status: New → Confirmed
importance: Undecided → Medium

I think I solved my issue by doing `cd /etc/systemd/system/multi-user.target.wants && sudo unlink ssh.service`.
Ths ssh service will anyway be restated by cloud-init later.
I can't garantee yet, and am not sure this is a good idea, will let you know.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers