upgrade to lunar fails due to rescue-ssh.target or port 22 takeover

Bug #2004551 reported by Christian Ehrhardt 
Affects            Status       Importance   Assigned to   Milestone
openssh (Ubuntu)   Confirmed    High         Unassigned
systemd (Ubuntu)   Incomplete   Undecided    Unassigned

Bug Description

Hi,
I just upgraded a system from Jammy to Lunar and openssh-server failed to upgrade cleanly.

Setting up openssh-server (1:9.0p1-1ubuntu8) ...
Replacing config file /etc/ssh/sshd_config with new version
Replacing config file /etc/ssh/sshd_config with new version
Synchronizing state of ssh.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable ssh
rescue-ssh.target is a disabled or a static unit not running, not starting it.
Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
dpkg: error processing package openssh-server (--configure):
 installed openssh-server package post-installation script subprocess returned error exit status 1
Processing triggers for man-db (2.11.2-1) ...
Processing triggers for libc-bin (2.36-0ubuntu4) ...
Errors were encountered while processing:
 openssh-server
Error: Timeout was reached
needrestart is being skipped since dpkg has failed
E: Sub-process /usr/bin/dpkg returned an error code (1)

I'm not sure what exactly the problem is.
The output complains about rescue-ssh.target, and indeed that target cannot be started even directly:

$ sudo systemctl start rescue-ssh.target
A dependency job for rescue-ssh.target failed. See 'journalctl -xe' for details.

And the postinst does try to start it:
$ grep rescue /var/lib/dpkg/info/openssh-server.postinst
  deb-systemd-invoke $_dh_action 'rescue-ssh.target' >/dev/null || true

But I think the underlying issue is that sshd is already running, and I'm logged in via it.
That breaks the restart of the newly added ssh.socket:

Feb 02 10:40:56 node-horsea systemd[104560]: ssh.socket: Failed to create listening socket ([::]:22): Address already in use
Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed to receive listening socket ([::]:22): Input/output error
Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed to listen on sockets: Input/output error
Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed with result 'resources'.
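
(For completeness: one way to see which process is actually holding port 22 at that point would be something like the following - a sketch, any equivalent tool works.)

$ sudo ss -ltnp 'sport = :22'                 # which PID owns the listening socket
$ sudo fuser -n tcp 22                        # same information via fuser
$ systemctl status ssh.service ssh.socket     # what systemd believes the state is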

Whichever it is, it is hard to resolve.
The only way for systemd to take ownership of the socket would be to reboot, so that the old sshd lets go and systemd can take over.
I could reboot, but that is not the point.
What if I want the service and the upgrade completed before rebooting?
Because as of now dpkg considers the system broken, and that is usually a sign of "better not reboot before this is resolved" to me.

One thing though: I did not upgrade with do-release-upgrade - do we have any magic there that makes the ssh socket-activation transition smoother?

Christian Ehrhardt  (paelzer) wrote :

As expected, after a reboot the service status is fine:

ubuntu@node-horsea:~$ systemctl status ssh.service
● ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; disabled; preset: enabled)
    Drop-In: /etc/systemd/system/ssh.service.d
             └─00-socket.conf
     Active: active (running) since Thu 2023-02-02 10:54:40 UTC; 12min ago
TriggeredBy: ● ssh.socket
       Docs: man:sshd(8)
             man:sshd_config(5)
    Process: 2689 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS)
   Main PID: 2690 (sshd)
      Tasks: 1 (limit: 38220)
     Memory: 5.3M
        CPU: 894ms
     CGroup: /system.slice/ssh.service
             └─2690 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

Feb 02 11:06:27 node-horsea sshd[14629]: Accepted publickey for ubuntu from 10.172.196.173 port 47348 ssh2: RSA SHA256:KyONnhWWzlbscZNTHPZ25GWCXDQY5u/UD72EtQcwtqU
Feb 02 11:06:27 node-horsea sshd[14629]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Feb 02 11:06:27 node-horsea sshd[14629]: pam_env(sshd:session): deprecated reading of user environment enabled
Feb 02 11:06:58 node-horsea sshd[14735]: Accepted publickey for ubuntu from 10.172.196.173 port 55016 ssh2: RSA SHA256:KyONnhWWzlbscZNTHPZ25GWCXDQY5u/UD72EtQcwtqU
Feb 02 11:06:58 node-horsea sshd[14735]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Feb 02 11:06:59 node-horsea sshd[14735]: pam_env(sshd:session): deprecated reading of user environment enabled
Feb 02 11:07:03 node-horsea sshd[14796]: Accepted publickey for ubuntu from 10.172.196.173 port 57034 ssh2: RSA SHA256:KyONnhWWzlbscZNTHPZ25GWCXDQY5u/UD72EtQcwtqU
Feb 02 11:07:03 node-horsea sshd[14796]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Feb 02 11:07:03 node-horsea sshd[14796]: pam_env(sshd:session): deprecated reading of user environment enabled
Feb 02 11:07:03 node-horsea sshd[14796]: pam_unix(sshd:session): session closed for user ubuntu
ubuntu@node-horsea:~$ systemctl status ssh.socket
● ssh.socket - OpenBSD Secure Shell server socket
     Loaded: loaded (/lib/systemd/system/ssh.socket; enabled; preset: enabled)
     Active: active (running) since Thu 2023-02-02 10:54:21 UTC; 12min ago
      Until: Thu 2023-02-02 10:54:21 UTC; 12min ago
   Triggers: ● ssh.service
     Listen: [::]:22 (Stream)
      Tasks: 0 (limit: 38220)
     Memory: 8.0K
        CPU: 894us
     CGroup: /system.slice/ssh.socket

Feb 02 10:54:21 node-horsea systemd[1]: Listening on OpenBSD Secure Shell server socket.

And from this state the package configuration can even be completed:

ubuntu@node-horsea:~$ sudo dpkg-reconfigure openssh-server
/usr/sbin/dpkg-reconfigure: openssh-server is broken or not fully installed
ubuntu@node-horsea:~$ sudo apt-get install --fix-broken
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
1 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Setting up openssh-server (1:9.0p1-1ubuntu8) ...
Replacing config file /etc/ssh/ssh...

Steve Langasek (vorlon) wrote :

> rescue-ssh.target is a disabled or a static unit not running, not starting it.

Not really a complaint, just informational.

> But I think the underlying issue is that ssh is already on, and I'm logged in via it.
> And that makes the service restart of the ssh socket which was added break.

I recall that I did test this case without problem. sshd in general doesn't fail if there are open connections and you restart it, because it properly configures its sockets with SO_REUSEADDR.
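
A quick way to sanity-check that on a healthy system (a sketch; keep an interactive ssh session open in another terminal, exact output will differ):

$ sudo systemctl restart ssh.service
$ systemctl is-active ssh.service     # should report "active"
$ sudo ss -ltnp 'sport = :22'         # the restarted sshd owns the listener again
# the already-established session keeps working, since it is served by its own forked sshd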

To debug this I would really want to see what process is holding that port; but as you've rebooted, that's not really possible to see now. If you ever reproduce it on another instance, please grab the output of fuser -n tcp 22 or equivalent! If we failed to stop the ssh.service, we want to see why.

> One thing though, I have not upgraded with do-release-upgrade - would we
> / do we have magic there to make the ssh socket activation transition smoother?

No, this is all supposed to be handled cleanly with the package maintainer scripts alone.

Changed in openssh (Ubuntu):
status: New → Incomplete
importance: Undecided → High
Christian Ehrhardt  (paelzer) wrote :

Thank you Steve, documenting what kind of debug data you'd expect helps me or anyone else who might run into this next time.

Once I'm done with my current tasks on this system I'll try to redeploy and re-upgrade to check if it happens again.

Launchpad Janitor (janitor) wrote :

[Expired for openssh (Ubuntu) because there has been no activity for 60 days.]

Changed in openssh (Ubuntu):
status: Incomplete → Expired
Miriam España Acebal (mirespace) wrote :

Hi,

I'm facing the same issue on the same machine (on MAAS), but I haven't reached lunar yet: first, I did a do-release-upgrade from jammy to kinetic (prompt set to normal).

I collected the info requested in comment #2:

ubuntu@node-horsea:~$ sudo fuser -n tcp 22
22/tcp: 1133 214495 214551
ubuntu@node-horsea:~$ sudo netstat -natpl | grep ":22"
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1133/sshd: /usr/sbi
tcp 0 52 10.245.171.244:22 10.172.195.194:55174 ESTABLISHED 214495/sshd: ubuntu
tcp6 0 0 :::22 :::* LISTEN 1133/sshd: /usr/sbi
ubuntu@node-horsea:~$ sudo ps -ef | grep 1133 | grep -v grep
root 1133 1 0 Jul13 ? 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 214495 1133 0 Jul20 ? 00:00:00 sshd: ubuntu [priv]
ubuntu@node-horsea:~$ sudo ps -ef | grep 214551 | grep -v grep
ubuntu 214551 214495 0 Jul20 ? 00:00:00 sshd: ubuntu@pts/0
ubuntu 214552 214551 0 Jul20 pts/0 00:00:00 -bash

I also saw other messages indicating system degradation, such as:

- ubuntu@node-horsea:~$ systemctl reload ssh.service
Failed to reload ssh.service: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
See system logs and 'systemctl status ssh.service' for details.

- ubuntu@node-horsea:~$ systemctl status ssh.service
Failed to get properties: Connection timed out

So I did:
ubuntu@node-horsea:~$ sudo telinit u
ubuntu@node-horsea:~$ sudo systemctl daemon-reexec
ubuntu@node-horsea:~$ sudo systemctl daemon-reload

And then I get:

ubuntu@node-horsea:~$ sudo systemctl status -l ssh.service
○ ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; disabled; preset: enabled)
    Drop-In: /etc/systemd/system/ssh.service.d
             └─00-socket.conf
     Active: inactive (dead)
TriggeredBy: × ssh.socket
       Docs: man:sshd(8)
             man:sshd_config(5)

Jul 20 10:37:30 node-horsea sshd[225140]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Jul 20 10:37:30 node-horsea sshd[225140]: error: Bind to port 22 on :: failed: Address already in use.
Jul 20 10:37:30 node-horsea sshd[225140]: fatal: Cannot bind any address.
Jul 20 10:37:30 node-horsea systemd[1]: ssh.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 20 10:37:30 node-horsea systemd[1]: ssh.service: Failed with result 'exit-code'.
Jul 20 10:37:30 node-horsea systemd[1]: ssh.service: Unit process 1133 (sshd) remains running after unit stopped.
Jul 20 10:37:30 node-horsea systemd[1]: Failed to start OpenBSD Secure Shell server.
Jul 20 10:54:54 node-horsea sshd[278339]: Accepted publickey for ubuntu from 10.172.195.194 port 49122 ssh2: RSA SHA256:2MSrD9nviIiiRrpcOIF/MA8eD/>
Jul 20 10:54:54 node-horsea sshd[278339]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Jul 20 10:54:55 node-horsea sshd[278339]: pam_env(sshd:session): deprecated reading of user environment enabled

The process is inactive but alive (I'm logged via ssh, and I was able to make a second connectio...

Changed in openssh (Ubuntu):
status: Expired → New
Robie Basak (racb) wrote :

Miriam, if you still have the system, please preserve its state for debugging - that would be helpful.

tags: added: server-todo
Andreas Hasenack (ahasenack) wrote (last edit ):

I'm seeing something that looks just like this in a dep8 mantic test run, where openssh-server is upgraded while preparing the testbed and postinst fails[1]:

977s Preparing to unpack .../openssh-sftp-server_1%3a9.3p1-1ubuntu1_amd64.deb ...
977s Unpacking openssh-sftp-server (1:9.3p1-1ubuntu1) over (1:9.2p1-2ubuntu3) ...
977s Preparing to unpack .../openssh-server_1%3a9.3p1-1ubuntu1_amd64.deb ...
977s Warning: Stopping ssh.service, but it can still be activated by:
977s ssh.socket
977s Unpacking openssh-server (1:9.3p1-1ubuntu1) over (1:9.2p1-2ubuntu3) ...
(...)
992s Setting up openssh-server (1:9.3p1-1ubuntu1) ...
993s rescue-ssh.target is a disabled or a static unit not running, not starting it.
993s Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
993s dpkg: error processing package openssh-server (--configure):
993s installed openssh-server package post-installation script subprocess returned error exit status 1
(...)
1016s Errors were encountered while processing:
1016s openssh-server

1. https://autopkgtest.ubuntu.com/results/autopkgtest-mantic-ahasenack-samba-kb5028166/mantic/amd64/s/samba/20230720_195603_95cad@/log.gz

Miriam España Acebal (mirespace) wrote :

Yes, Robie, I didn't touch it for that reason... It's ready for debugging. It's Christian's machine for dpdk testing on MAAS.

Miriam España Acebal (mirespace) wrote :

Today I tried to reproduce the error in an lxc VM, without success: I logged in via ssh before starting do-release-upgrade there, so that the upgrader would also start an sshd on port 1022. The upgrade process went fine every time I tried it.

Thanks Andreas for spotting the same error in the autopkgtest log. I'm looking for similarities between it and the horsea machine.

Checking horsea's log, I found the following

"FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ubuntu-release-upgrader-x6gsdfa4/removal_denylist.cfg'"

in screenlog.0 under the /var/log/dist-upgrade directory:

Progress: [ 96%]
Processing triggers for man-db (2.10.2-1) ...
packages have been installed but needrestart is suspended
packages have been installed but needrestart is suspended
packages have been installed but needrestart is suspended

Reading package lists... 0%
Reading package lists... 100%
Reading package lists... Done
Building dependency tree... 0%
Building dependency tree... 0%
Building dependency tree... 50%
Building dependency tree... 50%
Building dependency tree... Done
Reading state information... 0%
Reading state information... 0%
Reading state information... Done

Traceback (most recent call last):
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/kinetic", line 8, in <module>
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeMain.py", line 241, in main
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 2042, in run
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 1986, in fullUpgrade
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 207, in openCache
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 231, in _openCache
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeCache.py", line 142, in __init__
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeConfigParser.py", line 89, in getListFromFile
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ubuntu-release-upgrader-x6gsdfa4/removal_denylist.cfg'
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 76, in apport_excepthook
    binary = os.path.realpath(os.path.join(os.getcwd(), sys.argv[0]))
FileNotFoundError: [Errno 2] No such file or directory

Original exception was:
Traceback (most recent call last):
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/kinetic", line 8, in <module>
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeMain.py", line 241, in main
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 2042, in run
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 1986, in fullUpgrade
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 207, in openCache
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 231, in _openCache
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/Dist...

Miriam España Acebal (mirespace) wrote :

In horsea's syslog.1, some errors happen at the same time that do-release-upgrade is running and failing:

Jul 20 10:20:23 node-horsea fwupd[214065]: 10:20:23:0245 FuEngine failed to get releases for UEFI dbx: No releases found: Not compatible with org.freedesktop.fwupd version 1.7.9, requires >= 1.9.1
Jul 20 10:20:23 node-horsea fwupd[214065]: 10:20:23:0252 FuEngine failed to get releases for UEFI dbx: No releases found: Not compatible with org.freedesktop.fwupd version 1.7.9, requires >= 1.9.1
[...]
Jul 20 10:28:26 node-horsea kernel: [610494.600470] systemd[1]: Failed to set up automount Arbitrary Executable File Formats File System Automount Point.
[...]
Jul 20 10:28:26 node-horsea kernel: [610494.648822] systemd[1]: multipathd.socket: Failed to create listening socket (@/org/kernel/linux/storage/multipathd): Address already in use
Jul 20 10:28:26 node-horsea kernel: [610494.707090] systemd[1]: multipathd.socket: Failed to listen on sockets: Address already in use
Jul 20 10:28:26 node-horsea kernel: [610494.707135] systemd[1]: multipathd.socket: Failed with result 'resources'.
Jul 20 10:28:26 node-horsea kernel: [610494.707179] systemd[1]: Failed to listen on multipathd control socket.
[... The next appears several times:]
Jul 20 10:28:26 node-horsea kernel: [610494.788337] systemd-journald[529]: Failed to send stream file descriptor to service manager: Transport endpoint is not connected
[...]
Jul 20 10:28:27 node-horsea systemd[1]: multipathd.service: Found left-over process 576 (multipathd) in control group while starting unit. Ignoring.
Jul 20 10:28:27 node-horsea systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 20 10:28:27 node-horsea systemd[1]: Finished Load/Save Random Seed.
Jul 20 10:28:27 node-horsea systemd[1]: Finished Flush Journal to Persistent Storage.
Jul 20 10:28:27 node-horsea systemd[1]: Condition check resulted in First Boot Complete being skipped.
Jul 20 10:28:27 node-horsea systemd[1]: Finished Create System Users.
Jul 20 10:28:27 node-horsea multipathd[217042]: process is already running
Jul 20 10:28:27 node-horsea multipathd[217042]: failed to create pidfile
Jul 20 10:28:27 node-horsea systemd[1]: Starting Create Static Device Nodes in /dev...
Jul 20 10:28:27 node-horsea systemd[1]: multipathd.service: Main process exited, code=exited, status=1/FAILURE
Jul 20 10:28:27 node-horsea systemd[1]: multipathd.service: Got notification message from PID 576, but reception only permitted for main PID which is currently not known
Jul 20 10:28:27 node-horsea systemd[1]: message repeated 2 times: [ multipathd.service: Got notification message from PID 576, but reception only permitted for main PID which is currently not known]
Jul 20 10:28:27 node-horsea multipathd[576]: exit (signal)
Jul 20 10:28:27 node-horsea multipathd[576]: --------shut down-------

[... A lot of the following: Processes 592, 216916 and then 216963 to 217007, 217010 to 217011]

Jul 20 10:28:27 node-horsea systemd[1]: Finished Create Static Device Nodes in /dev.
Jul 20 10:28:27 node-horsea systemd[1]: systemd-udevd.service: Found left-over process 592 (systemd-udevd) in ...

Miriam España Acebal (mirespace) wrote :

Crash report

Changed in openssh (Ubuntu):
assignee: nobody → Miriam España Acebal (mirespace)
Christian Ehrhardt  (paelzer) wrote :

@Steve
Since the machine's original use case is blocked until we know whether we can go on:
Is the above enough for you to have a deeper look together with us?

If so, please let Miriam know when she can reset the machine to go on with the MRE verifications it was supposed to do :-)

Miriam España Acebal (mirespace) wrote :

I'm sorry, but I had to reset the machine because it was blocking dpdk's MRE process.

I hope the above info serves as a more complete starting point than before for finding out more. Anyway, after the dpdk testing I will try to reproduce the upgrade problem on that machine (if I'm lucky).

Changed in openssh (Ubuntu):
assignee: Miriam España Acebal (mirespace) → nobody
Miriam España Acebal (mirespace) wrote :

I'm starting to think that there is indeed something wrong with the horsea machine: I'm still not able to upgrade it to lunar. Starting from jammy or from kinetic both failed.

#Jammy

Deploying jammy again and running do-release-upgrade failed on ssh again, with different symptoms after rebooting and retrying the release upgrade (I have the outputs, but I don't remember exactly at what point I rebooted the machine; e.g. "Missing privilege separation directory: /run/sshd" could have happened after the reboot, but I'm not sure).

#Kinetic
So, ok... let's assume something is wrong with the jammy image being deployed (a remote possibility, I think, but...) and try the upgrade from Kinetic: it failed too. I checked the system status before starting the upgrade and it was not degraded.

It got stuck at 97% on:

Setting up snapd (2.58+22.10.1) ...
Installing new version of config file /etc/apt/apt.conf.d/20snapd.conf ...
Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
snapd.failure.service is a disabled or a static unit not running, not starting it.
snapd.snap-repair.service is a disabled or a static unit not running, not starting it.
Failed to restart snapd.mounts-pre.target: Operation refused, unit snapd.mounts-pre.target may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status snapd.mounts-pre.target' for details.

system status was:

           │ └─unattended-upgrades.service
           │ ├─ 1227 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
           │ └─11628 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
           └─user.slice
             └─user-1000.slice
               └─session-1.scope
                 ├─ 1856 "sshd: ubuntu [priv]"
                 ├─ 1937 "sshd: ubuntu@pts/0"
                 ├─ 1941 -bash
                 ├─ 2302 sudo apt upgrade -y
                 ├─ 2303 sudo apt upgrade -y
                 ├─ 2304 apt upgrade -y
                 ├─ 5750 /usr/bin/dpkg --status-fd 39 --configure --pending
                 ├─ 5751 sh -c "(test -x /usr/lib/needrestart/dpkg-status && /usr/lib/needrestart/dpkg-status || cat > /dev/null>
                 ├─ 5752 sh -c "(test -x /usr/lib/needrestart/dpkg-status && /usr/lib/needrestart/dpkg-status || cat > /dev/null>
                 ├─ 5753 /bin/sh /usr/lib/needrestart/dpkg-status
                 ├─11336 /bin/sh /var/lib/dpkg/info/snapd.postinst configure 2.57.5+22.10ubuntu0.1
                 ├─11501 /usr/bin/perl /usr/bin/deb-systemd-invoke restart snapd.aa-prompt-listener.service snapd.apparmor.servi>
                 ├─11563 systemctl --quiet --system restart snapd.aa-prompt-listener.service snapd.apparmor.service snapd.autoim>
                 └─11564 /bin/systemd-tty-ask-password-agent --watch

After that, it failed with:

Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
Setting up openssh-sftp-server (1:9.0p1-1ubuntu7.3) ...
Setting up needrestart (3.6-1ubuntu0.1) ...
Setting up openssh-server (1:9.0p1-1ubuntu7...

Miriam España Acebal (mirespace) wrote (last edit ):

Well, a simple apt upgrade makes openssh-server fail to install, and the only thing that gets the package out of that iF state is:

shutdown -r now
apt install --fix-broken

Then a do-release-upgrade again broke the package with the same output plus:

User sessions running outdated binaries:
 ubuntu @ session #1: sshd[2389,2475,2887]

Again, a reboot plus --fix-broken helped there.

Looking also at the snapd package side, I found bug 1958676, which might be related. So... what if systemd is the culprit?

Martin Pitt (pitti) wrote :

We see this in our CI infra as well; it's not just jammy → lunar dist-upgrades, it also happens with regular updates within lunar. The cloud image from a few days ago had an older openssh-server; updating it to 1:9.0p1-1ubuntu8.4 failed [1]:

Setting up openssh-server (1:9.0p1-1ubuntu8.4) ...
Creating SSH2 ECDSA key; this may take some time ...
256 SHA256:r08SEQICcnkgu0I1PQwdYTqHeuYVRkz4oPF9VJeC5xQ root@ubuntu (ECDSA)
Creating SSH2 ED25519 key; this may take some time ...
256 SHA256:Qaa3l3IsQuQpKwtESOnfyg5STi64/5pVX4WfiEGX5CI root@ubuntu (ED25519)
rescue-ssh.target is a disabled or a static unit not running, not starting it.
Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
dpkg: error processing package openssh-server (--configure):
 installed openssh-server package post-installation script subprocess returned error exit status 1

Today's cloud image avoids this problem as it has the current openssh-server package on it already, so it's not part of the dist-upgrade. But this will likely come back.

[1] https://cockpit-logs.us-east-1.linodeobjects.com/image-refresh-logs/ubuntu-stable-20230727-131010.log

Steve Langasek (vorlon) wrote :

I don't have availability to work on this right now but I'm tagging it rls-ll-incoming for the Foundations Team to at least evaluate picking it up no later than this Thursday's meeting.

tags: added: rls-ll-incoming
Robie Basak (racb) wrote :

General note: this seems like a serious issue but I think we're struggling with reproducing it. If anybody can help with steps to reproduce or a snapshot that can be shared in which it reproduces, then we'd very much appreciate that.

Andreas Hasenack (ahasenack) wrote :

Another DEP8 test that failed because openssh-server's postinst failed:

https://autopkgtest.ubuntu.com/results/autopkgtest-mantic/mantic/amd64/s/samba/20230730_140035_350d6@/log.gz

2546s Errors were encountered while processing:
2546s openssh-server
2547s E: Sub-process /usr/bin/dpkg returned an error code (1)

Christian Ehrhardt  (paelzer) wrote :

Sadly - or rather gladly - Pitti killed most of my theories, which were all about the transition from pre-socket to socket-activated mode. Thanks for stopping me from doing more debugging in that direction :-)

FYI - bug 2020474 was recently fixed in Mantic.
It seems rather similar and throws another tunable into the mix: whether -proposed is enabled or not.
See comment #17 in that bug for more.
Sadly, the example by Andreas in comment #19 is from after that fix landed in mantic, so it can't be all that is needed.

Anyway, this made me realize I was trying to recreate the issue on mantic (which I now know has that fix) while it was originally seen on lunar.

So I removed everything I had on mantic and went back to Lunar ... trying to find weird state combinations that trigger this.

In the background I ran:
$ while /bin/true; do sleep 0.1s; for s in ssh.service ssh.socket; do printf '%15s: %15s %15s ' "$s" "$(systemctl is-enabled $s)" "$(systemctl is-active $s)"; done; printf "\n"; done

And in another session I ran server actions or loops of actions.

After install:
    ssh.service: disabled inactive ssh.socket: enabled active
logging in via ssh:
    ssh.service: disabled activating ssh.socket: enabled active
    ssh.service: disabled active ssh.socket: enabled active
upgrade or re-install pkg
    ssh.service: disabled inactive ssh.socket: enabled inactive
    ssh.service: disabled inactive ssh.socket: enabled active

So much for the states, and which states are to be expected.
But so far, no matter how many concurrent re-installs, downgrades (to 1:9.0p1-1ubuntu8), upgrades, enable/disable cycles, ssh logins, ... I did, I could not yet trigger the problem in an isolated, reproducible environment.
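
The kind of action loop I mean is roughly this (a sketch, not the exact commands I ran; the iteration count and the BatchMode login are placeholders):

for i in $(seq 1 50); do
    sudo apt-get install --reinstall -y openssh-server   # or up-/downgrade between versions
    sudo systemctl disable --now ssh.socket               # cycle the socket unit
    sudo systemctl enable --now ssh.socket
    ssh -o BatchMode=yes localhost true || echo "ssh login failed in iteration $i"
done
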
As Robie said, we are looking for someone able to fill that gap to continue ...

Christian Ehrhardt  (paelzer) wrote :

Yet another hit of the same issue:

1. MAAS deployed jammy GA 22.04
   This has openssh 1:8.9p1-3

2. Tonight unattended-upgrades ran
   and upgraded a lot of things
   # This is massive, but it might be the first way to recreate this

3. It triggered the same issue again

Aug 03 06:44:41 node-horsea sshd[8697]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Aug 03 06:44:41 node-horsea sshd[8697]: error: Bind to port 22 on :: failed: Address already in use.
Aug 03 06:44:41 node-horsea sshd[8697]: fatal: Cannot bind any address.

Extra info:

There are a few things on this machine that might play into this.
It has some special network adapters, not all of which are connected (intentionally).

ubuntu@node-horsea:~$ networkctl
IDX LINK TYPE OPERATIONAL SETUP
  1 lo loopback carrier unmanaged
  2 eno1 ether routable configured
  3 eno2 ether degraded configured
  4 eno3 ether no-carrier configuring
  5 eno4 ether no-carrier configuring
  6 eno49 ether degraded configured
  7 eno50 ether degraded configured
  8 ens1 ether no-carrier configuring

e.g. this fails:

ubuntu@node-horsea:~$ /lib/systemd/systemd-networkd-wait-online
managing: eno2
managing: eno50
managing: eno49
managing: eno1

Timeout occurred while waiting for network connectivity.

When we activate apt-daily directly, it fails because of that.
But if not, it runs the updates and causes this breakage.

Sergio Durigan Junior (sergiodj) wrote :

I spent some time debugging this issue today. One of the MAAS nodes the Server team uses (horsea) is affected by it (what Christian reported above was actually part of a joint debugging effort between him and me earlier today), so I took the opportunity to redeploy the system and try to understand what was really going on.

I already had some good indication that the problem manifested when apt-daily-upgrade.service was executed, so that's the first thing I did after the machine booted, and voilà, the problem happened again.

I could not reproduce the issue just by reinstalling openssh-server, so my next best guess was that something else was interfering with the service restart. I tried cloud-init, to no avail. Then, I decided to try to forcefully upgrade systemd, and that caused the issue to manifest again. Good, now I had a small reproducer that did not involve redeploying the machine.
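
(Roughly, the reproducer boils down to something like this - a sketch; the assumption is that re-running systemd's postinst, e.g. via a plain reinstall or a forced up-/downgrade, is enough to trigger the daemon-reexec it performs:)

$ sudo apt-get install --reinstall systemd          # postinst re-execs PID 1
$ sudo apt-get install --reinstall openssh-server   # now fails in postinst as described above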

With that in mind, I decided to check what systemd (or its maintscripts) were doing that could cause such problem. I remembered that one of the first things that caught my attention when I was debugging the issue earlier with Christian was the fact that there were *a lot* of systemd-related warnings on journalctl like the following:

Aug 03 21:05:35 node-horsea systemd[1]: cron.service: Found left-over process 970 (cron) in control group while starting unit. Ignoring.
Aug 03 21:05:35 node-horsea systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

Had these been just related to ssh, I'd have thought that there was something wrong with openssh-server's maintscripts, maybe. But as I said, these warnings were related to *many* services. They're basically saying that systemd somehow "lost" control of the process it was managing, which is concerning. It also seemed to explain why sshd is complaining about not being able to bind to port 22 because it's in use. I could verify that there is indeed a "left-over" sshd process still bound to port 22.
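
(A sketch of how to verify that on an affected system - compare what systemd tracks with what actually holds the port:)

$ systemctl show -p MainPID ssh.service      # the PID systemd tracks for the unit
$ systemd-cgls /system.slice/ssh.service     # processes actually left in the unit's cgroup
$ sudo ss -ltnp 'sport = :22'                # the PID that really owns port 22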

After some more thinking & log reading, I grew very suspicious that the command "systemctl daemon-reexec" was the culprit here. This command is performed by systemd's postinst script. And finally, after a reboot to make sure that the system was back to a sane state, I was able to confirm the suspicion:

ubuntu@node-horsea:~$ sudo systemctl daemon-reexec
ubuntu@node-horsea:~$ sudo systemctl status ssh
○ ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
     Active: inactive (dead) ...
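
(To double-check that it is really the re-exec and not the package upgrade itself, the confirmation sequence is roughly the following - a sketch:)

$ sudo reboot                                  # start from a clean state
$ systemctl is-active ssh.socket ssh.service   # both healthy after boot
$ sudo systemctl daemon-reexec                 # no package operation involved at all
$ systemctl is-active ssh.socket ssh.service   # on an affected systemd the state is now wrong
$ sudo ss -ltnp 'sport = :22'                  # port 22 is still held by the old sshd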

Christian Ehrhardt  (paelzer) wrote :

Thanks Sergio for continuing on this!

I've added a systemd task here due to your findings (but have not yet marked this a duplicate, as we can't be sure yet) and left a comment in bug 2013543 asking to be notified when there is a test build we can try against our problem.

Changed in openssh (Ubuntu):
status: New → Confirmed
Nick Rosbrook (enr0n) wrote :

Thanks for digging into this, Sergio! To help determine whether this is caused by bug 2013543, can you please share the output of the following after re-execing systemd?

$ cat /proc/1/cmdline | tr '\0' '\n'

Changed in systemd (Ubuntu):
status: New → Incomplete
Sergio Durigan Junior (sergiodj) wrote :

Hi Nick,

Sorry, I missed your comment before marking the bug as duplicate. But as I commented on bug #2013543, I was able to verify that your systemd PPA fixes the issue.

Either way, here's the output of the command:

$ cat /proc/1/cmdline | tr '\0' '\n'
/lib/systemd/systemd
---
--system
--deserialize
79

Martin Pitt (pitti) wrote :

Whoa, what a convoluted bug.. Thanks Miriam, Sergio, and Christian for tracking this down! Glad to hear that this is fixed in systemd upstream already.
