sshd failing to start on undercloud

Bug #1657173 reported by Ben Nemec
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
High
Unassigned

Bug Description

This is killing quite a few of our ci jobs for some reason. Seen in the undercloud install log:

2017-01-17 00:24:54.000 | 2017-01-17 00:24:54,206 INFO: Error: Systemd start for sshd failed!
2017-01-17 00:24:54.000 | 2017-01-17 00:24:54,207 INFO: journalctl log for sshd:
2017-01-17 00:24:54.000 | 2017-01-17 00:24:54,207 INFO: -- Logs begin at Mon 2017-01-16 23:57:57 UTC, end at Tue 2017-01-17 00:24:54 UTC. --
2017-01-17 00:24:54.000 | 2017-01-17 00:24:54,207 INFO: Jan 17 00:24:54 tripleo-centos-7-tripleo-test-cloud-rh1-6704492 systemd[1]: sshd.service start operation timed out. Terminating.
2017-01-17 00:24:54.000 | 2017-01-17 00:24:54,208 INFO: Jan 17 00:24:54 tripleo-centos-7-tripleo-test-cloud-rh1-6704492 systemd[1]: Failed to start OpenSSH server daemon.
2017-01-17 00:24:54.000 | 2017-01-17 00:24:54,208 INFO: Jan 17 00:24:54 tripleo-centos-7-tripleo-test-cloud-rh1-6704492 systemd[1]: Unit sshd.service entered failed state.
2017-01-17 00:24:54.000 | 2017-01-17 00:24:54,208 INFO: Jan 17 00:24:54 tripleo-centos-7-tripleo-test-cloud-rh1-6704492 systemd[1]: sshd.service failed.

Logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%20*tripleo-ci*%20AND%20build_status%3A%20FAILURE%20AND%20message%3A%20%5C%22Systemd%20start%20for%20sshd%20failed!%5C%22

As of this writing, we failed on this 16 times in the past 24 hours.

Tags: ci
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to instack-undercloud (master)

Reviewed: https://review.openstack.org/422146
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=2c93df4db920978769ecd412481f1a6c30d31922
Submitter: Jenkins
Branch: master

commit 2c93df4db920978769ecd412481f1a6c30d31922
Author: Emilien Macchi <email address hidden>
Date: Wed Jan 18 13:00:41 2017 -0500

    ssh: use tripleo profile to deploy SSH

    For consistency with overcloud, manage SSH by using the TripleO profile,
    so sshd is managed the same way as it is on the overcloud.

    Related-Bug: #1657173
    Change-Id: I8c71d6f57ce0ef3f67d162877865f0b4bb5a0d3f

Revision history for this message
Gabriele Cerami (gcerami) wrote :

Looking at the journal of any of the failing jobs, it seems that sshd is indeed started, but systemd is unable to understand it.

For example I see this:

Jan 19 16:31:53.852710 tripleo-centos-7-tripleo-test-cloud-rh1-6780446 systemd[1]: Started OpenSSH Server Key Generation.

then

Jan 19 16:31:53.883921 tripleo-centos-7-tripleo-test-cloud-rh1-6780446 systemd[1]: PID file /var/run/sshd.pid not readable (yet?) after start.

and after some time

Jan 19 16:33:25.117502 tripleo-centos-7-tripleo-test-cloud-rh1-6780446 systemd[1]: sshd.service start operation timed out. Terminating.

So it seems that the service file is unable or forgetting to write pid file, systemd intepret this as a failure, tries to restart

Jan 19 16:36:19.867524 tripleo-centos-7-tripleo-test-cloud-rh1-6780446 systemd[1]: sshd.service holdoff time over, scheduling restart.

then this happens

Jan 19 16:36:19.881772 tripleo-centos-7-tripleo-test-cloud-rh1-6780446 sshd[7345]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.

because sshd is really up and running.

Revision history for this message
Gabriele Cerami (gcerami) wrote :

Forgot to add, I know it's up and running because it's logging connections, like this

Jan 19 16:32:25.628315 tripleo-centos-7-tripleo-test-cloud-rh1-6780446 sshd[955]: Accepted publickey for jenkins from 23.253.73.160 port 46896 ssh2: RSA

Revision history for this message
Gabriele Cerami (gcerami) wrote :

This seems a legit error, I'm not sure changing what is managing it is going to solve the problem. Sshd fails to write the pid file, nothing should report that is starting correctly.
I was suspecting a selinux error, but audit log doesn't support my theory.
Now I've seen that the error about /var/run/sshd.pid not being readable can be seen also in passing job, butl looking at logstash, first occurrence of both this error and the string containing /var/run/sshd.pid not readable from systemd is 10th January in

http://logs.openstack.org/44/418344/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/e4e2994/logs/undercloud/var/log/undercloud_install.txt

package diff from a job on 9th Jan doesn't show anything significant.
The openssh package has been changed recently (middle of december) from

openssh-6.6.1p1-25.el7.x86_64

to

openssh-6.6.1p1-31.el7.x86_64

it may worth a shot opening a bug against openssh-server

Revision history for this message
Gabriele Cerami (gcerami) wrote :
Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
Revision history for this message
Gabriele Cerami (gcerami) wrote :

Following the reply on the bug, seems this is a known problem in the interaction with systemd, and won't be fixed soon.
We may want to pin openssh-server version to the previous release.

Revision history for this message
Michele Baldessari (michele) wrote :

Note that we have a potential workaround here:
https://bugzilla.redhat.com/show_bug.cgi?id=1415218#c4

It requires a change in the systemd service file for openssh. Namely if we revert the following:
--- a/sshd.service
+++ b/sshd.service
@@ -5,8 +5,10 @@ After=network.target sshd-keygen.service
 Wants=sshd-keygen.service

 [Service]
+Type=forking
+PIDFile=/var/run/sshd.pid
 EnvironmentFile=/etc/sysconfig/sshd
-ExecStart=/usr/sbin/sshd -D $OPTIONS
+ExecStart=/usr/sbin/sshd $OPTIONS
 ExecReload=/bin/kill -HUP $MAINPID
 KillMode=process
 Restart=on-failure

We should not get the failures any longer. Until the upstream bug gets fixed we probably want to simply override the sshd.service with the patch above reverted. So do:
1) Create /etc/systemd/system/sshd.service:
[Unit]
Description=OpenSSH server daemon
Documentation=man:sshd(8) man:sshd_config(5)
After=network.target sshd-keygen.service
Wants=sshd-keygen.service

[Service]
EnvironmentFile=/etc/sysconfig/sshd
ExecStart=/usr/sbin/sshd -D $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=42s

[Install]
WantedBy=multi-user.target

2) Launch systemctl daemon-reload

Revision history for this message
Michele Baldessari (michele) wrote :

Exploring the idea of the previous comment here: https://review.openstack.org/423513

Changed in tripleo:
assignee: Gabriele Cerami (gcerami) → Michele Baldessari (michele)
status: Triaged → In Progress
Revision history for this message
Gabriele Cerami (gcerami) wrote :

We merged the injection of a test package here https://review.openstack.org/430688.
But we still see the sshd errors in the logs.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on instack-undercloud (master)

Change abandoned by Michele Baldessari (<email address hidden>) on branch: master
Review: https://review.openstack.org/423513
Reason: openssh will be fixed properly

Revision history for this message
Emilien Macchi (emilienm) wrote :

There are no currently open reviews on this bug, changing the status back to the previous state and unassigning. If there are active reviews related to this bug, please include links in comments.

Changed in tripleo:
status: In Progress → Triaged
assignee: Michele Baldessari (michele) → nobody
Revision history for this message
Alex Schultz (alex-schultz) wrote :

This was fixed in sshd packaging

Changed in tripleo:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.