I spent some time debugging this issue today. One of the MAAS nodes the Server team uses (horsea) is affected by it (what Christian reported above was actually part of a joint debugging effort between him and me earlier today), so I took the opportunity to redeploy the system and try to understand what was really going on.

I already had a good indication that the problem manifested when apt-daily-upgrade.service was executed, so that was the first thing I triggered after the machine booted, and voilà, the problem happened again. I could not reproduce the issue just by reinstalling openssh-server, so my next best guess was that something else was interfering with the service restart. I tried cloud-init, to no avail. Then I tried forcefully upgrading systemd, and that made the issue manifest again. Good: now I had a small reproducer that did not involve redeploying the machine.

With that in mind, I decided to check what systemd (or its maintscripts) was doing that could cause such a problem. I remembered that one of the first things that had caught my attention when I was debugging the issue earlier with Christian was that there were *a lot* of systemd-related warnings in journalctl like the following:

Aug 03 21:05:35 node-horsea systemd[1]: cron.service: Found left-over process 970 (cron) in control group while starting unit. Ignoring.
Aug 03 21:05:35 node-horsea systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

Had these warnings been related only to ssh, I might have thought that something was wrong with openssh-server's maintscripts. But as I said, they showed up for *many* services. They are basically saying that systemd somehow "lost" track of the processes it was managing, which is concerning. It also seemed to explain why sshd was complaining that it could not bind to port 22: I could verify that there was indeed a "left-over" sshd process still bound to it.

After some more thinking and log reading, I grew very suspicious that the command "systemctl daemon-reexec" was the culprit here. This command is run by systemd's postinst script. And finally, after a reboot to make sure that the system was back in a sane state, I was able to confirm the suspicion:

ubuntu@node-horsea:~$ sudo systemctl daemon-reexec
ubuntu@node-horsea:~$ sudo systemctl status ssh
○ ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
     Active: inactive (dead)
       Docs: man:sshd(8)
             man:sshd_config(5)

Aug 03 21:26:21 node-horsea systemd[1]: Starting OpenBSD Secure Shell server...
Aug 03 21:26:21 node-horsea sshd[913]: Server listening on 0.0.0.0 port 22.
Aug 03 21:26:21 node-horsea sshd[913]: Server listening on :: port 22.
Aug 03 21:26:21 node-horsea systemd[1]: Started OpenBSD Secure Shell server.
Aug 03 21:29:03 node-horsea sshd[1189]: Accepted publickey for ubuntu from 10.172.68.194 port 36222 ssh2: RSA SHA256:7JLFXGJ/lkuhcSu8/OF8mHjZ8wMD1zEfNSuNGM5u2Lk
Aug 03 21:29:03 node-horsea sshd[1189]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)

You can see that the service is listed as inactive (dead) after the daemon-reexec.
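Even though the unit now shows as inactive (dead), the sshd listener that was started at boot (sshd[913] in the snippet above) is still running and still holds port 22. For anyone who wants to double-check that on an affected machine, something along these lines should show it (commands only, exact output omitted; the PID will differ per boot):

# is the old sshd still listening on port 22?
$ sudo ss -tlnp | grep ':22'
# what is left behind in the unit's control group?
$ systemd-cgls /system.slice/ssh.service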
Now let's try to restart it:

ubuntu@node-horsea:~$ sudo systemctl restart ssh
Job for ssh.service failed because the control process exited with error code.
See "systemctl status ssh.service" and "journalctl -xeu ssh.service" for details.

ubuntu@node-horsea:~$ sudo systemctl status ssh
× ssh.service - OpenBSD Secure Shell server
     Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2023-08-03 21:31:25 UTC; 11s ago
       Docs: man:sshd(8)
             man:sshd_config(5)
    Process: 1914 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS)
    Process: 1943 ExecStart=/usr/sbin/sshd -D $SSHD_OPTS (code=exited, status=255/EXCEPTION)
   Main PID: 1943 (code=exited, status=255/EXCEPTION)
      Tasks: 1 (limit: 38215)
     Memory: 7.6M
        CPU: 27ms
     CGroup: /system.slice/ssh.service
             └─913 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

Aug 03 21:31:25 node-horsea systemd[1]: Starting OpenBSD Secure Shell server...
Aug 03 21:31:25 node-horsea systemd[1]: ssh.service: Found left-over process 913 (sshd) in control group while starting unit. Ignoring.
Aug 03 21:31:25 node-horsea systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Aug 03 21:31:25 node-horsea sshd[1943]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Aug 03 21:31:25 node-horsea sshd[1943]: error: Bind to port 22 on :: failed: Address already in use.
Aug 03 21:31:25 node-horsea sshd[1943]: fatal: Cannot bind any address.
Aug 03 21:31:25 node-horsea systemd[1]: ssh.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 03 21:31:25 node-horsea systemd[1]: ssh.service: Failed with result 'exit-code'.
Aug 03 21:31:25 node-horsea systemd[1]: ssh.service: Unit process 913 (sshd) remains running after unit stopped.
Aug 03 21:31:25 node-horsea systemd[1]: Failed to start OpenBSD Secure Shell server.

And there you go: the problem now happens deterministically.

While looking into this, I also found bug #2013543, which is exactly about this issue. Upstream has fixed the problem, but work still needs to be done to SRU the patch into Jammy/Lunar.
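Until the fix lands, a reboot obviously clears the state and is the safest way back to a sane system. If rebooting is not an option, killing the stale listener and then restarting the unit should in principle be enough to let sshd bind to port 22 again. A rough sketch only, not a verified recovery procedure (the PID is the one reported in the "Found left-over process" message, 913 in the example above):

# confirm which sshd still holds port 22 (it will be the left-over listener)
$ sudo ss -tlnp | grep ':22'
# kill only that listener; per-connection sshd processes are separate,
# so existing SSH sessions should survive
$ sudo kill <PID>
$ sudo systemctl restart ssh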