upgrade to lunar fails due to rescue-ssh.target or port 22 takeover

Bug #2004551 reported by Christian Ehrhardt 
Affects            Status       Importance   Assigned to   Milestone
openssh (Ubuntu)   Confirmed    High         Unassigned
systemd (Ubuntu)   Incomplete   Undecided    Unassigned

Bug Description

Hi,
I just upgraded a system from Jammy to Lunar and openssh-server failed to upgrade cleanly.

Setting up openssh-server (1:9.0p1-1ubuntu8) ...
Replacing config file /etc/ssh/sshd_config with new version
Replacing config file /etc/ssh/sshd_config with new version
Synchronizing state of ssh.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable ssh
rescue-ssh.target is a disabled or a static unit not running, not starting it.
Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
dpkg: error processing package openssh-server (--configure):
 installed openssh-server package post-installation script subprocess returned error exit status 1
Processing triggers for man-db (2.11.2-1) ...
Processing triggers for libc-bin (2.36-0ubuntu4) ...
Errors were encountered while processing:
 openssh-server
Error: Timeout was reached
needrestart is being skipped since dpkg has failed
E: Sub-process /usr/bin/dpkg returned an error code (1)

I'm not sure what exactly the problem is.
The output complains about rescue-ssh.target, and indeed that target cannot be started even directly:

$ sudo systemctl start rescue-ssh.target
A dependency job for rescue-ssh.target failed. See 'journalctl -xe' for details.

And the postinst does try to start it:
$ grep rescue /var/lib/dpkg/info/openssh-server.postinst
  deb-systemd-invoke $_dh_action 'rescue-ssh.target' >/dev/null || true

But I think the underlying issue is that sshd is already running, and I'm logged in via it.
That breaks the restart of the newly added ssh.socket:

Feb 02 10:40:56 node-horsea systemd[104560]: ssh.socket: Failed to create listening socket ([::]:22): Address already in use
Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed to receive listening socket ([::]:22): Input/output error
Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed to listen on sockets: Input/output error
Feb 02 10:40:56 node-horsea systemd[1]: ssh.socket: Failed with result 'resources'.
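
(For completeness: one way to see which process is actually holding port 22 at that point would be something like the following - a sketch, any equivalent tool works.)

$ sudo ss -ltnp 'sport = :22'                 # which PID owns the listening socket
$ sudo fuser -n tcp 22                        # same information via fuser
$ systemctl status ssh.service ssh.socket     # what systemd believes the state is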

Whichever it is, it is hard to resolve.
The only way for systemd to take ownership of the socket would be to reboot, so that the old sshd lets go and systemd can take over.
I could reboot, but that is not the point.
What if I want the service and the upgrade completed before rebooting?
Because as of now dpkg considers the system broken, and that is usually a sign of "better not reboot before this is resolved" to me.

One thing though: I did not upgrade with do-release-upgrade - do we have any magic there that makes the ssh socket-activation transition smoother?

Christian Ehrhardt  (paelzer) wrote :

As expected, after a reboot the service status is fine:

ubuntu@node-horsea:~$ systemctl status ssh.service
● ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; disabled; preset: enabled)
    Drop-In: /etc/systemd/system/ssh.service.d
             └─00-socket.conf
     Active: active (running) since Thu 2023-02-02 10:54:40 UTC; 12min ago
TriggeredBy: ● ssh.socket
       Docs: man:sshd(8)
             man:sshd_config(5)
    Process: 2689 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS)
   Main PID: 2690 (sshd)
      Tasks: 1 (limit: 38220)
     Memory: 5.3M
        CPU: 894ms
     CGroup: /system.slice/ssh.service
             └─2690 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

Feb 02 11:06:27 node-horsea sshd[14629]: Accepted publickey for ubuntu from 10.172.196.173 port 47348 ssh2: RSA SHA256:KyONnhWWzlbscZNTHPZ25GWCXDQY5u/UD72EtQcwtqU
Feb 02 11:06:27 node-horsea sshd[14629]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Feb 02 11:06:27 node-horsea sshd[14629]: pam_env(sshd:session): deprecated reading of user environment enabled
Feb 02 11:06:58 node-horsea sshd[14735]: Accepted publickey for ubuntu from 10.172.196.173 port 55016 ssh2: RSA SHA256:KyONnhWWzlbscZNTHPZ25GWCXDQY5u/UD72EtQcwtqU
Feb 02 11:06:58 node-horsea sshd[14735]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Feb 02 11:06:59 node-horsea sshd[14735]: pam_env(sshd:session): deprecated reading of user environment enabled
Feb 02 11:07:03 node-horsea sshd[14796]: Accepted publickey for ubuntu from 10.172.196.173 port 57034 ssh2: RSA SHA256:KyONnhWWzlbscZNTHPZ25GWCXDQY5u/UD72EtQcwtqU
Feb 02 11:07:03 node-horsea sshd[14796]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Feb 02 11:07:03 node-horsea sshd[14796]: pam_env(sshd:session): deprecated reading of user environment enabled
Feb 02 11:07:03 node-horsea sshd[14796]: pam_unix(sshd:session): session closed for user ubuntu
ubuntu@node-horsea:~$ systemctl status ssh.socket
● ssh.socket - OpenBSD Secure Shell server socket
     Loaded: loaded (/lib/systemd/system/ssh.socket; enabled; preset: enabled)
     Active: active (running) since Thu 2023-02-02 10:54:21 UTC; 12min ago
      Until: Thu 2023-02-02 10:54:21 UTC; 12min ago
   Triggers: ● ssh.service
     Listen: [::]:22 (Stream)
      Tasks: 0 (limit: 38220)
     Memory: 8.0K
        CPU: 894us
     CGroup: /system.slice/ssh.socket

Feb 02 10:54:21 node-horsea systemd[1]: Listening on OpenBSD Secure Shell server socket.

And from this state the package configuration can even be completed:

ubuntu@node-horsea:~$ sudo dpkg-reconfigure openssh-server
/usr/sbin/dpkg-reconfigure: openssh-server is broken or not fully installed
ubuntu@node-horsea:~$ sudo apt-get install --fix-broken
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
1 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Setting up openssh-server (1:9.0p1-1ubuntu8) ...
Replacing config file /etc/ssh/ssh...

Steve Langasek (vorlon) wrote :

> rescue-ssh.target is a disabled or a static unit not running, not starting it.

Not really a complaint, just informational.

> But I think the underlying issue is that ssh is already on, and I'm logged in via it.
> And that makes the service restart of the ssh socket which was added break.

I recall that I did test this case without problem. sshd in general doesn't fail if there are open connections and you restart it, because it properly configures its sockets with SO_REUSEADDR.
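
A quick way to sanity-check that on a healthy system (a sketch; keep an interactive ssh session open in another terminal, exact output will differ):

$ sudo systemctl restart ssh.service
$ systemctl is-active ssh.service     # should report "active"
$ sudo ss -ltnp 'sport = :22'         # the restarted sshd owns the listener again
# the already-established session keeps working, since it is served by its own forked sshd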

To debug this I would really want to see what process is holding that port; but as you've rebooted, that's not really possible to see now. If you ever reproduce it on another instance, please grab the output of fuser -n tcp 22 or equivalent! If we failed to stop the ssh.service, we want to see why.

> One thing though, I have not upgraded with do-release-upgrade - would we
> / do we have magic there to make the ssh socket activation transition smoother?

No, this is all supposed to be handled cleanly with the package maintainer scripts alone.

Changed in openssh (Ubuntu):
status: New → Incomplete
importance: Undecided → High
Christian Ehrhardt  (paelzer) wrote :

Thank you Steve, documenting what kind of debug data you'd expect helps me or anyone else who might run into this next time.

Once I'm done with my current tasks on this system I'll try to redeploy and re-upgrade to check if it happens again.

Launchpad Janitor (janitor) wrote :

[Expired for openssh (Ubuntu) because there has been no activity for 60 days.]

Changed in openssh (Ubuntu):
status: Incomplete → Expired
Miriam España Acebal (mirespace) wrote :

Hi,

I'm facing the same issue on the same machine (on MAAS), but I haven't reached lunar yet: first, I did a do-release-upgrade from jammy to kinetic (prompt set to normal).

I collected the info requested in comment #2:

ubuntu@node-horsea:~$ sudo fuser -n tcp 22
22/tcp: 1133 214495 214551
ubuntu@node-horsea:~$ sudo netstat -natpl | grep ":22"
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1133/sshd: /usr/sbi
tcp 0 52 10.245.171.244:22 10.172.195.194:55174 ESTABLISHED 214495/sshd: ubuntu
tcp6 0 0 :::22 :::* LISTEN 1133/sshd: /usr/sbi
ubuntu@node-horsea:~$ sudo ps -ef | grep 1133 | grep -v grep
root 1133 1 0 Jul13 ? 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 214495 1133 0 Jul20 ? 00:00:00 sshd: ubuntu [priv]
ubuntu@node-horsea:~$ sudo ps -ef | grep 214551 | grep -v grep
ubuntu 214551 214495 0 Jul20 ? 00:00:00 sshd: ubuntu@pts/0
ubuntu 214552 214551 0 Jul20 pts/0 00:00:00 -bash

I also saw other messages indicating system degradation, such as:

- ubuntu@node-horsea:~$ systemctl reload ssh.service
Failed to reload ssh.service: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
See system logs and 'systemctl status ssh.service' for details.

- ubuntu@node-horsea:~$ systemctl status ssh.service
Failed to get properties: Connection timed out

So I did:
ubuntu@node-horsea:~$ sudo telinit u
ubuntu@node-horsea:~$ sudo systemctl daemon-reexec
ubuntu@node-horsea:~$ sudo systemctl daemon-reload

And then I get:

ubuntu@node-horsea:~$ sudo systemctl status -l ssh.service
○ ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; disabled; preset: enabled)
    Drop-In: /etc/systemd/system/ssh.service.d
             └─00-socket.conf
     Active: inactive (dead)
TriggeredBy: × ssh.socket
       Docs: man:sshd(8)
             man:sshd_config(5)

Jul 20 10:37:30 node-horsea sshd[225140]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Jul 20 10:37:30 node-horsea sshd[225140]: error: Bind to port 22 on :: failed: Address already in use.
Jul 20 10:37:30 node-horsea sshd[225140]: fatal: Cannot bind any address.
Jul 20 10:37:30 node-horsea systemd[1]: ssh.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 20 10:37:30 node-horsea systemd[1]: ssh.service: Failed with result 'exit-code'.
Jul 20 10:37:30 node-horsea systemd[1]: ssh.service: Unit process 1133 (sshd) remains running after unit stopped.
Jul 20 10:37:30 node-horsea systemd[1]: Failed to start OpenBSD Secure Shell server.
Jul 20 10:54:54 node-horsea sshd[278339]: Accepted publickey for ubuntu from 10.172.195.194 port 49122 ssh2: RSA SHA256:2MSrD9nviIiiRrpcOIF/MA8eD/>
Jul 20 10:54:54 node-horsea sshd[278339]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Jul 20 10:54:55 node-horsea sshd[278339]: pam_env(sshd:session): deprecated reading of user environment enabled

The process is inactive but alive (I'm logged via ssh, and I was able to make a second connectio...

Changed in openssh (Ubuntu):
status: Expired → New
Robie Basak (racb) wrote :

Miriam, if you still have the system, please preserve its state for debugging - that would be helpful.

tags: added: server-todo
Andreas Hasenack (ahasenack) wrote (last edit ):

I'm seeing something that looks just like this in a dep8 mantic test run, where openssh-server is upgraded while preparing the testbed and postinst fails[1]:

977s Preparing to unpack .../openssh-sftp-server_1%3a9.3p1-1ubuntu1_amd64.deb ...
977s Unpacking openssh-sftp-server (1:9.3p1-1ubuntu1) over (1:9.2p1-2ubuntu3) ...
977s Preparing to unpack .../openssh-server_1%3a9.3p1-1ubuntu1_amd64.deb ...
977s Warning: Stopping ssh.service, but it can still be activated by:
977s ssh.socket
977s Unpacking openssh-server (1:9.3p1-1ubuntu1) over (1:9.2p1-2ubuntu3) ...
(...)
992s Setting up openssh-server (1:9.3p1-1ubuntu1) ...
993s rescue-ssh.target is a disabled or a static unit not running, not starting it.
993s Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
993s dpkg: error processing package openssh-server (--configure):
993s installed openssh-server package post-installation script subprocess returned error exit status 1
(...)
1016s Errors were encountered while processing:
1016s openssh-server

1. https://autopkgtest.ubuntu.com/results/autopkgtest-mantic-ahasenack-samba-kb5028166/mantic/amd64/s/samba/20230720_195603_95cad@/log.gz

Miriam España Acebal (mirespace) wrote :

Yes, Robie, I didn't touch it for that reason... It's ready for debugging. It's Christian's machine for dpdk testing on MAAS.

Miriam España Acebal (mirespace) wrote :

Today I tried to reproduce the error in an lxc VM, without success: I logged in via ssh before starting do-release-upgrade there, so that the upgrader would also start an sshd on port 1022. The upgrade process went fine every time I tried it.

Thanks Andreas for spotting the same error in the autopkgtest log. I'm looking for similarities between it and the horsea machine.

Checking horsea's log, I found the following

"FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ubuntu-release-upgrader-x6gsdfa4/removal_denylist.cfg'"

in screenlog.0 under the /var/log/dist-upgrade directory:

Progress: [ 96%]
Processing triggers for man-db (2.10.2-1) ...
packages have been installed but needrestart is suspended
packages have been installed but needrestart is suspended
packages have been installed but needrestart is suspended

Reading package lists... 0%
Reading package lists... 100%
Reading package lists... Done
Building dependency tree... 0%
Building dependency tree... 0%
Building dependency tree... 50%
Building dependency tree... 50%
Building dependency tree... Done
Reading state information... 0%
Reading state information... 0%
Reading state information... Done

Traceback (most recent call last):
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/kinetic", line 8, in <module>
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeMain.py", line 241, in main
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 2042, in run
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 1986, in fullUpgrade
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 207, in openCache
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 231, in _openCache
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeCache.py", line 142, in __init__
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeConfigParser.py", line 89, in getListFromFile
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ubuntu-release-upgrader-x6gsdfa4/removal_denylist.cfg'
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 76, in apport_excepthook
    binary = os.path.realpath(os.path.join(os.getcwd(), sys.argv[0]))
FileNotFoundError: [Errno 2] No such file or directory

Original exception was:
Traceback (most recent call last):
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/kinetic", line 8, in <module>
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeMain.py", line 241, in main
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 2042, in run
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 1986, in fullUpgrade
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 207, in openCache
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/DistUpgradeController.py", line 231, in _openCache
  File "/tmp/ubuntu-release-upgrader-x6gsdfa4/DistUpgrade/Dist...

Miriam España Acebal (mirespace) wrote :

In horsea's syslog.1, some errors happen at the same time that do-release-upgrade is running and failing:

Jul 20 10:20:23 node-horsea fwupd[214065]: 10:20:23:0245 FuEngine failed to get releases for UEFI dbx: No releases found: Not compatible with org.freedesktop.fwupd version 1.7.9, requires >= 1.9.1
Jul 20 10:20:23 node-horsea fwupd[214065]: 10:20:23:0252 FuEngine failed to get releases for UEFI dbx: No releases found: Not compatible with org.freedesktop.fwupd version 1.7.9, requires >= 1.9.1
[...]
Jul 20 10:28:26 node-horsea kernel: [610494.600470] systemd[1]: Failed to set up automount Arbitrary Executable File Formats File System Automount Point.
[...]
Jul 20 10:28:26 node-horsea kernel: [610494.648822] systemd[1]: multipathd.socket: Failed to create listening socket (@/org/kernel/linux/storage/multipathd): Address already in use
Jul 20 10:28:26 node-horsea kernel: [610494.707090] systemd[1]: multipathd.socket: Failed to listen on sockets: Address already in use
Jul 20 10:28:26 node-horsea kernel: [610494.707135] systemd[1]: multipathd.socket: Failed with result 'resources'.
Jul 20 10:28:26 node-horsea kernel: [610494.707179] systemd[1]: Failed to listen on multipathd control socket.
[... The next appears several times:]
Jul 20 10:28:26 node-horsea kernel: [610494.788337] systemd-journald[529]: Failed to send stream file descriptor to service manager: Transport endpoint is not connected
[...]
Jul 20 10:28:27 node-horsea systemd[1]: multipathd.service: Found left-over process 576 (multipathd) in control group while starting unit. Ignoring.
Jul 20 10:28:27 node-horsea systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 20 10:28:27 node-horsea systemd[1]: Finished Load/Save Random Seed.
Jul 20 10:28:27 node-horsea systemd[1]: Finished Flush Journal to Persistent Storage.
Jul 20 10:28:27 node-horsea systemd[1]: Condition check resulted in First Boot Complete being skipped.
Jul 20 10:28:27 node-horsea systemd[1]: Finished Create System Users.
Jul 20 10:28:27 node-horsea multipathd[217042]: process is already running
Jul 20 10:28:27 node-horsea multipathd[217042]: failed to create pidfile
Jul 20 10:28:27 node-horsea systemd[1]: Starting Create Static Device Nodes in /dev...
Jul 20 10:28:27 node-horsea systemd[1]: multipathd.service: Main process exited, code=exited, status=1/FAILURE
Jul 20 10:28:27 node-horsea systemd[1]: multipathd.service: Got notification message from PID 576, but reception only permitted for main PID which is currently not known
Jul 20 10:28:27 node-horsea systemd[1]: message repeated 2 times: [ multipathd.service: Got notification message from PID 576, but reception only permitted for main PID which is currently not known]
Jul 20 10:28:27 node-horsea multipathd[576]: exit (signal)
Jul 20 10:28:27 node-horsea multipathd[576]: --------shut down-------

[... A lot of the following: Processes 592, 216916 and then 216963 to 217007, 217010 to 217011]

Jul 20 10:28:27 node-horsea systemd[1]: Finished Create Static Device Nodes in /dev.
Jul 20 10:28:27 node-horsea systemd[1]: systemd-udevd.service: Found left-over process 592 (systemd-udevd) in ...

Miriam España Acebal (mirespace) wrote :

Crash report

Changed in openssh (Ubuntu):
assignee: nobody → Miriam España Acebal (mirespace)
Christian Ehrhardt  (paelzer) wrote :

@Steve
Since the machine's original use case is blocked until we know whether we can go on:
Is the above enough for you to have a deeper look together with us?

If so, please let Miriam know when she can reset the machine to go on with the MRE verifications it was supposed to do :-)

Miriam España Acebal (mirespace) wrote :

I'm sorry, but I had to reset the machine because it was blocking dpdk's MRE process.

I hope the above info serves as a more complete starting point than before for finding out more. Anyway, after the dpdk testing I will try to reproduce the upgrade problem on that machine (if I'm lucky).

Changed in openssh (Ubuntu):
assignee: Miriam España Acebal (mirespace) → nobody
Miriam España Acebal (mirespace) wrote :

I'm starting to think that there is indeed something wrong with the horsea machine: I'm still not able to upgrade it to lunar. Starting from jammy or from kinetic both failed.

#Jammy

Deploying jammy again and running do-release-upgrade failed on ssh again, with different symptoms after rebooting and retrying the release upgrade (I have the outputs, but I don't remember exactly at what point I rebooted the machine; e.g. "Missing privilege separation directory: /run/sshd" could have happened after the reboot, but I'm not sure).

#Kinetic
So, ok... let's assume something is wrong with the jammy image being deployed (a remote possibility, I think, but...) and try the upgrade from Kinetic: it failed too. I checked the system status before starting the upgrade and it was not degraded.

It got stuck at 97% on:

Setting up snapd (2.58+22.10.1) ...
Installing new version of config file /etc/apt/apt.conf.d/20snapd.conf ...
Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
snapd.failure.service is a disabled or a static unit not running, not starting it.
snapd.snap-repair.service is a disabled or a static unit not running, not starting it.
Failed to restart snapd.mounts-pre.target: Operation refused, unit snapd.mounts-pre.target may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status snapd.mounts-pre.target' for details.

system status was:

           │ └─unattended-upgrades.service
           │ ├─ 1227 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
           │ └─11628 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
           └─user.slice
             └─user-1000.slice
               └─session-1.scope
                 ├─ 1856 "sshd: ubuntu [priv]"
                 ├─ 1937 "sshd: ubuntu@pts/0"
                 ├─ 1941 -bash
                 ├─ 2302 sudo apt upgrade -y
                 ├─ 2303 sudo apt upgrade -y
                 ├─ 2304 apt upgrade -y
                 ├─ 5750 /usr/bin/dpkg --status-fd 39 --configure --pending
                 ├─ 5751 sh -c "(test -x /usr/lib/needrestart/dpkg-status && /usr/lib/needrestart/dpkg-status || cat > /dev/null>
                 ├─ 5752 sh -c "(test -x /usr/lib/needrestart/dpkg-status && /usr/lib/needrestart/dpkg-status || cat > /dev/null>
                 ├─ 5753 /bin/sh /usr/lib/needrestart/dpkg-status
                 ├─11336 /bin/sh /var/lib/dpkg/info/snapd.postinst configure 2.57.5+22.10ubuntu0.1
                 ├─11501 /usr/bin/perl /usr/bin/deb-systemd-invoke restart snapd.aa-prompt-listener.service snapd.apparmor.servi>
                 ├─11563 systemctl --quiet --system restart snapd.aa-prompt-listener.service snapd.apparmor.service snapd.autoim>
                 └─11564 /bin/systemd-tty-ask-password-agent --watch

After that, it failed with:

Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
Setting up openssh-sftp-server (1:9.0p1-1ubuntu7.3) ...
Setting up needrestart (3.6-1ubuntu0.1) ...
Setting up openssh-server (1:9.0p1-1ubuntu7...

Miriam España Acebal (mirespace) wrote (last edit ):

Well, a simple apt upgrade makes openssh-server fail to install, and the only thing that gets the package out of that iF state is:

shutdown -r now
apt install --fix-broken

Then a do-release-upgrade again broke the package with the same output plus:

User sessions running outdated binaries:
 ubuntu @ session #1: sshd[2389,2475,2887]

Again, a reboot plus --fix-broken helped there.

Looking also at the snapd package side, I found bug 1958676, which might be related. So... what if systemd is the culprit?

Martin Pitt (pitti) wrote :

We see this in our CI infra as well; it's not just jammy → lunar dist-upgrades, it also happens with regular updates within lunar. The cloud image from a few days ago had an older openssh-server; updating it to 1:9.0p1-1ubuntu8.4 failed [1]:

Setting up openssh-server (1:9.0p1-1ubuntu8.4) ...
Creating SSH2 ECDSA key; this may take some time ...
256 SHA256:r08SEQICcnkgu0I1PQwdYTqHeuYVRkz4oPF9VJeC5xQ root@ubuntu (ECDSA)
Creating SSH2 ED25519 key; this may take some time ...
256 SHA256:Qaa3l3IsQuQpKwtESOnfyg5STi64/5pVX4WfiEGX5CI root@ubuntu (ED25519)
rescue-ssh.target is a disabled or a static unit not running, not starting it.
Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 145.
dpkg: error processing package openssh-server (--configure):
 installed openssh-server package post-installation script subprocess returned error exit status 1

Today's cloud image avoids this problem as it has the current openssh-server package on it already, so it's not part of the dist-upgrade. But this will likely come back.

[1] https://cockpit-logs.us-east-1.linodeobjects.com/image-refresh-logs/ubuntu-stable-20230727-131010.log

Steve Langasek (vorlon) wrote :

I don't have availability to work on this right now but I'm tagging it rls-ll-incoming for the Foundations Team to at least evaluate picking it up no later than this Thursday's meeting.

tags: added: rls-ll-incoming
Robie Basak (racb) wrote :

General note: this seems like a serious issue but I think we're struggling with reproducing it. If anybody can help with steps to reproduce or a snapshot that can be shared in which it reproduces, then we'd very much appreciate that.

Andreas Hasenack (ahasenack) wrote :

Another DEP8 test that failed because openssh-server's postinst failed:

https://autopkgtest.ubuntu.com/results/autopkgtest-mantic/mantic/amd64/s/samba/20230730_140035_350d6@/log.gz

2546s Errors were encountered while processing:
2546s openssh-server
2547s E: Sub-process /usr/bin/dpkg returned an error code (1)

Christian Ehrhardt  (paelzer) wrote :

Sadly - or rather gladly - Pitti killed most of my theories, which were all about the transition from pre-socket to socket-activated mode. Thanks for stopping me from doing more debugging in that direction :-)

FYI - bug 2020474 was recently fixed in Mantic.
It seems rather similar and throws another tunable into the mix: whether -proposed is enabled or not.
See comment #17 in that bug for more.
Sadly, the example by Andreas in comment #19 is from after that fix landed in mantic, so it can't be all that is needed.

Anyway, this made me realize I was trying to recreate the issue on mantic (which I now know has that fix) while it was originally seen on lunar.

So I removed everything I had on mantic and went back to Lunar ... trying to find weird state combinations that trigger this.

In the background I ran:
$ while /bin/true; do sleep 0.1s; for s in ssh.service ssh.socket; do printf '%15s: %15s %15s ' "$s" "$(systemctl is-enabled $s)" "$(systemctl is-active $s)"; done; printf "\n"; done

And in another session I ran server actions or loops of actions.

After install:
    ssh.service: disabled inactive ssh.socket: enabled active
logging in via ssh:
    ssh.service: disabled activating ssh.socket: enabled active
    ssh.service: disabled active ssh.socket: enabled active
upgrade or re-install pkg
    ssh.service: disabled inactive ssh.socket: enabled inactive
    ssh.service: disabled inactive ssh.socket: enabled active

So much for the states, and which states are to be expected.
But so far, no matter how many concurrent re-installs, downgrades (to 1:9.0p1-1ubuntu8), upgrades, enable/disable cycles, ssh logins, ... I did, I could not yet trigger the problem in an isolated, reproducible environment.
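
The kind of action loop I mean is roughly this (a sketch, not the exact commands I ran; the iteration count and the BatchMode login are placeholders):

for i in $(seq 1 50); do
    sudo apt-get install --reinstall -y openssh-server   # or up-/downgrade between versions
    sudo systemctl disable --now ssh.socket               # cycle the socket unit
    sudo systemctl enable --now ssh.socket
    ssh -o BatchMode=yes localhost true || echo "ssh login failed in iteration $i"
done
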
As Robie said, we are looking for someone able to fill that gap to continue ...

Christian Ehrhardt  (paelzer) wrote :

Yet another hit of the same issue:

1. MAAS deployed jammy GA 22.04
   This has openssh 1:8.9p1-3

2. Tonight unattended-upgrades ran
   and upgraded a lot of things
   # This is massive, but it might be the first way to recreate this

3. It triggered the same issue again

Aug 03 06:44:41 node-horsea sshd[8697]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Aug 03 06:44:41 node-horsea sshd[8697]: error: Bind to port 22 on :: failed: Address already in use.
Aug 03 06:44:41 node-horsea sshd[8697]: fatal: Cannot bind any address.

Extra info:

There are a few things on this machine that might play into this.
It has some special network adapters, not all of which are connected (intentionally).

ubuntu@node-horsea:~$ networkctl
IDX LINK TYPE OPERATIONAL SETUP
  1 lo loopback carrier unmanaged
  2 eno1 ether routable configured
  3 eno2 ether degraded configured
  4 eno3 ether no-carrier configuring
  5 eno4 ether no-carrier configuring
  6 eno49 ether degraded configured
  7 eno50 ether degraded configured
  8 ens1 ether no-carrier configuring

e.g. this fails:

ubuntu@node-horsea:~$ /lib/systemd/systemd-networkd-wait-online
managing: eno2
managing: eno50
managing: eno49
managing: eno1

Timeout occurred while waiting for network connectivity.

When we activate apt-daily directly, it fails because of that.
But if not, it runs the updates and causes this breakage.

Sergio Durigan Junior (sergiodj) wrote :

I spent some time debugging this issue today. One of the MAAS nodes the Server team uses (horsea) is affected by it (what Christian reported above was actually part of a joint debugging effort between him and me earlier today), so I took the opportunity to redeploy the system and try to understand what was really going on.

I already had some good indication that the problem manifested when apt-daily-upgrade.service was executed, so that's the first thing I did after the machine booted, and voilà, the problem happened again.

I could not reproduce the issue just by reinstalling openssh-server, so my next best guess was that something else was interfering with the service restart. I tried cloud-init, to no avail. Then, I decided to try to forcefully upgrade systemd, and that caused the issue to manifest again. Good, now I had a small reproducer that did not involve redeploying the machine.
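
(Roughly, the reproducer boils down to something like this - a sketch; the assumption is that re-running systemd's postinst, e.g. via a plain reinstall or a forced up-/downgrade, is enough to trigger the daemon-reexec it performs:)

$ sudo apt-get install --reinstall systemd          # postinst re-execs PID 1
$ sudo apt-get install --reinstall openssh-server   # now fails in postinst as described above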

With that in mind, I decided to check what systemd (or its maintscripts) were doing that could cause such problem. I remembered that one of the first things that caught my attention when I was debugging the issue earlier with Christian was the fact that there were *a lot* of systemd-related warnings on journalctl like the following:

Aug 03 21:05:35 node-horsea systemd[1]: cron.service: Found left-over process 970 (cron) in control group while starting unit. Ignoring.
Aug 03 21:05:35 node-horsea systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

Had these been just related to ssh, I'd have thought that there was something wrong with openssh-server's maintscripts, maybe. But as I said, these warnings were related to *many* services. They're basically saying that systemd somehow "lost" control of the process it was managing, which is concerning. It also seemed to explain why sshd is complaining about not being able to bind to port 22 because it's in use. I could verify that there is indeed a "left-over" sshd process still bound to port 22.
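
(A sketch of how to verify that on an affected system - compare what systemd tracks with what actually holds the port:)

$ systemctl show -p MainPID ssh.service      # the PID systemd tracks for the unit
$ systemd-cgls /system.slice/ssh.service     # processes actually left in the unit's cgroup
$ sudo ss -ltnp 'sport = :22'                # the PID that really owns port 22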

After some more thinking & log reading, I grew very suspicious that the command "systemctl daemon-reexec" was the culprit here. This command is performed by systemd's postinst script. And finally, after a reboot to make sure that the system was back to a sane state, I was able to confirm the suspicion:

ubuntu@node-horsea:~$ sudo systemctl daemon-reexec
ubuntu@node-horsea:~$ sudo systemctl status ssh
○ ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
     Active: inactive (dead) ...
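
(To double-check that it is really the re-exec and not the package upgrade itself, the confirmation sequence is roughly the following - a sketch:)

$ sudo reboot                                  # start from a clean state
$ systemctl is-active ssh.socket ssh.service   # both healthy after boot
$ sudo systemctl daemon-reexec                 # no package operation involved at all
$ systemctl is-active ssh.socket ssh.service   # on an affected systemd the state is now wrong
$ sudo ss -ltnp 'sport = :22'                  # port 22 is still held by the old sshd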

Christian Ehrhardt  (paelzer) wrote :

Thanks Sergio for continuing on this!

I've added a systemd task here due to your findings (but have not yet marked this a duplicate, as we can't be sure yet) and left a comment in bug 2013543 asking to be notified when there is a test build we can try against our problem.

Changed in openssh (Ubuntu):
status: New → Confirmed
Nick Rosbrook (enr0n) wrote :

Thanks for digging into this, Sergio! To help determine whether this is caused by bug 2013543, can you please share the output of the following after re-execing systemd?

$ cat /proc/1/cmdline | tr '\0' '\n'

Changed in systemd (Ubuntu):
status: New → Incomplete
Sergio Durigan Junior (sergiodj) wrote :

Hi Nick,

Sorry, I missed your comment before marking the bug as duplicate. But as I commented on bug #2013543, I was able to verify that your systemd PPA fixes the issue.

Either way, here's the output of the command:

$ cat /proc/1/cmdline | tr '\0' '\n'
/lib/systemd/systemd
---
--system
--deserialize
79

Martin Pitt (pitti) wrote :

Whoa, what a convoluted bug.. Thanks Miriam, Sergio, and Christian for tracking this down! Glad to hear that this is fixed in systemd upstream already.
