sshd segmentation fault on 20.04.6 (focal)

Bug #2043114 reported by Ivaylo Markov
This bug affects 2 people
Affects: openssh (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

We have a physical server running Ubuntu 20.04.6 LTS (amd64) and openssh-server 1:8.2p1-4ubuntu0.9. Sometimes sshd crashes with a segmentation fault on remote login with key authentication:
[193107.651745] sshd[1229630]: segfault at 5557eba6a008 ip 00007f2326a2ca53 sp 00007ffcba40c510 error 4 in libc-2.31.so[7f23269b8000+178000]
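
(These messages come from the kernel log; something along these lines can be used to spot them, with the grep pattern being only an example:)

dmesg -T | grep -i 'sshd.*segfault'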

We’ve changed only the following values in the stock sshd_config file:

LogLevel DEBUG
PasswordAuthentication no
MaxStartups 100:30:100
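
(For reference, the effective values can be double-checked with `sshd -T`, for example:)

sshd -T | grep -iE 'loglevel|passwordauthentication|maxstartups'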

The server is used for automated software testing, and our test suite sometimes makes a large number of SSH connections in a short period of time, which seems to be correlated with the crashes. That said, the connection count was nowhere near the MaxStartups limit, and we had seen crashes before adding that setting.
Since the backtrace shows the debug logging function in the stack, we are currently experimenting with `LogLevel INFO` to try to isolate the issue.

I am attaching the backtrace. I could provide the full dump file, although I am hesitant due to the possibility of private keys or other sensitive information leaking.
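
(A rough and deliberately non-exhaustive way to check whether obvious PEM-style key material ended up in a dump before sharing it, with the path being just a placeholder:)

strings /path/to/coredump | grep -c 'PRIVATE KEY'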

# apt-cache policy openssh-server
openssh-server:
  Installed: 1:8.2p1-4ubuntu0.9
  Candidate: 1:8.2p1-4ubuntu0.9
  Version table:
 *** 1:8.2p1-4ubuntu0.9 500
        500 http://mirrors.storpool.com/ubuntu/archive focal-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
        100 /var/lib/dpkg/status
     1:8.2p1-4 500
        500 http://mirrors.storpool.com/ubuntu/archive focal/main amd64 Packages
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu27.27
Architecture: amd64
CasperMD5CheckResult: skip
DistroRelease: Ubuntu 20.04
Package: openssh-server 1:8.2p1-4ubuntu0.9
PackageArchitecture: amd64
ProcVersionSignature: Ubuntu 5.4.0-128.144-generic 5.4.210
Tags: focal
Uname: Linux 5.4.0-128-generic x86_64
UpgradeStatus: Upgraded to focal on 2021-01-13 (1030 days ago)
UserGroups: N/A
_MarkForUpload: True

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote :
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openssh (Ubuntu):
status: New → Confirmed
Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Thank you Ivaylo for creating this bug report and helping make Ubuntu better!

I have two follow-up questions:
1. How often do these segmentation faults occur for you?
2. Have you tried to SSH to the Focal server from multiple other systems? Do you encounter this issue with multiple systems, or is it only observed with one?

Could you also run `apport-collect 2043114` on the server that is encountering the segfaults? This will gather debugging information that may be valuable.

I see the backtrace has the error
        nextinuse = <error reading variable nextinuse (Cannot access memory at address 0x5557eba6a008)>

I briefly looked through the upstream repository for commits related to this issue, but nothing stood out to me.

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote :

Thanks for getting back to me!

1. The crash seems to happen in a large percentage of the runs of a particular test that makes a lot of rapid SSH connections to the machine. Sadly, I don’t have an exact number, but 50% probably wouldn’t be an exaggeration. It was also happening in the setup part of the test harness when it was making a large number of parallel connections, but we have spread those out a bit.
2. The connections all happen from a single test machine. I could try to test if this is repeatable when they’re spread over multiple sources, but that’s not how our test suite works.

As for the debugging information, I will first try to reproduce the crash on a throwaway machine (a VM shouldn’t make a difference, I think), so as to avoid any potential leaks of host keys or other sensitive information.

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote (last edit ):

While trying to reproduce the segfault, I encountered another crash, this time an abort in malloc. I cannot be absolutely sure, but they might be related, so I am uploading the report now.

Edit: forgot to mention, it seems that the abort has also occurred on the physical machine that I initially reported the crash on.

tags: added: apport-collected focal
description: updated
Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote : Dependencies.txt

apport information

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote : ProcEnviron.txt

apport information

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote :

Apologies for the multiple messages, but I was under the impression that apport-collect would upload the actual core dump file here :)

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote :

After some more "stress testing" with GNU parallel, I was able to reproduce the segmentation fault, too. See attachment.

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Hi Ivaylo, thanks for all the information so far!

Can you consistently reproduce it?

Would you also mind sharing the exact steps it would take for us to reproduce it from a fresh focal install?

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote :

Yes, it seems pretty consistent. I tried it again just before writing out these reproduction steps, and it took maybe a couple of minutes.

The configuration is basically just:

sed -i.bak 's/#LogLevel INFO/LogLevel DEBUG/g' /etc/ssh/sshd_config
systemctl restart sshd

As for making enough connections to cause the issue, I use something along the lines of:

parallel -j 99 -N0 "ssh root@example 'mount; sleep 1; cat /proc/cpuinfo; free -h; dd if=/dev/zero of=/dev/null bs=1 count=8192; mount -av; sleep $(($RANDOM % 5)); lshw'" ::: {1..5000}

It is essentially running random commands to simulate some sort of activity on the machine. The number of parallel connections (`-j 99`) might need to be adjusted: too few connections won't trigger the bug, too many can cause an OOM situation.
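
(While the loop runs, the crash is easy to spot from another terminal, since systemd logs the listener dying and restarts the unit; the unit name may be `ssh` or `sshd` depending on the system:)

journalctl -u ssh -f
systemctl show ssh --property=MainPID,NRestarts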

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Thank you for providing more information.

Unfortunately I am still unable to reproduce the problem. I tried using a container and a VM, to no avail.

But I did open the coredump:

(gdb) bt
#0 _int_free (av=av@entry=0x7fcbaccd8b80 <main_arena>, p=p@entry=0x558afb81e0c0, have_lock=<optimized out>, have_lock@entry=1) at malloc.c:4341
#1 0x00007fcbacb84f22 in _int_realloc (av=av@entry=0x7fcbaccd8b80 <main_arena>, oldp=oldp@entry=0x558afb81e070, oldsize=oldsize@entry=8208, nb=80) at malloc.c:4644
#2 0x00007fcbacb86fb6 in __GI___libc_realloc (oldmem=0x558afb81e080, bytes=64) at malloc.c:3226
#3 0x00007fcbacb77748 in _IO_mem_finish (fp=0x558afb805e80, dummy=<optimized out>) at memstream.c:131
#4 0x00007fcbacb6de41 in _IO_new_fclose (fp=fp@entry=0x558afb805e80) at libioP.h:948
#5 0x00007fcbacc03ddb in __vsyslog_internal (pri=<optimized out>, fmt=0x558afa13ac80 "%.500s", ap=0x7ffd8170e5c0, mode_flags=2) at ../misc/syslog.c:237
#6 0x00007fcbacc04363 in __syslog_chk (pri=pri@entry=7, flag=flag@entry=1, fmt=fmt@entry=0x558afa13ac80 "%.500s") at ../misc/syslog.c:136
#7 0x0000558afa0f8b78 in syslog (__fmt=0x558afa13ac80 "%.500s", __pri=7) at /usr/include/x86_64-linux-gnu/bits/syslog.h:31
#8 do_log (level=level@entry=SYSLOG_LEVEL_DEBUG1, fmt=<optimized out>, args=args@entry=0x7ffd8170ef00) at ../../log.c:476
#9 0x0000558afa0f8ff8 in debug (fmt=<optimized out>) at ../../log.c:229
#10 0x0000558afa0ae3fe in server_accept_loop (config_s=0x7ffd8170f050, newsock=<synthetic pointer>, sock_out=<synthetic pointer>, sock_in=<synthetic pointer>) at ../../sshd.c:1338
#11 main (ac=<optimized out>, av=<optimized out>) at ../../sshd.c:2040
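
(For anyone else wanting to look at the dump, the above can presumably be reproduced with something along these lines, assuming matching debug symbols are installed; the package names are the usual focal ones and may need adjusting, and the core path is a placeholder:)

apt install libc6-dbg openssh-server-dbgsym   # -dbgsym packages come from the ddebs.ubuntu.com repository
gdb /usr/sbin/sshd /path/to/coredump
(gdb) bt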

This stack trace is a bit intriguing. It's realloc that is crashing, way too deep into glibc. It seems to point at some weird interaction between your setup and the memory management involved in syslog.

I spent time trying to find upstream bugs to see if there was anything remotely similar, but couldn't find anything either. Can you provide more details on the setup you're using to reproduce the problem? For example, are you using a VM, a container, bare metal? How many (v)CPUs? What about memory? If it's a container/VM, what's the underlying host?

Also, since you can reproduce the issue pretty reliably, could you perhaps check if the same crash happens on Jammy?

Thank you.

Revision history for this message
Ivaylo Markov (imarkov-storpool) wrote :

We’re seeing it happen on both bare metal and virtual machines.

The VMs are libvirt/QEMU/KVM, managed by OpenNebula. The hypervisor is running AlmaLinux 8 with the 4.18.0-477.15.1.el8_8.x86_64 kernel. The VMs have 2 physical/8 vCPUs allocated to them. The VM I used for the reproduction has 2 GB of RAM. The VMs where we initially saw the problem have 22 GB, but the setup there is a bit more complicated: there are various cgroup v1 limits, such as 2.5 GB for system.slice and 2 GB for the user slice. However, none of that is necessary to reproduce the issue.

Since the initial report, we've done a lot of test runs with INFO-level logging on 20.04, 18.04, and 22.04, as well as on RedHat-family distributions, and we haven’t seen the issue again. It could be some memory corruption bug that is present only on 20.04, I guess.

So far I’ve been unable to reproduce it on a 22.04 VM with DEBUG logging on.
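
(One idea we have not tried yet: running sshd with glibc's malloc consistency checks enabled, in the hope of catching the corruption closer to its source. Something like the following drop-in, where the file name is just an example:)

# /etc/systemd/system/ssh.service.d/malloc-check.conf
[Service]
Environment=MALLOC_CHECK_=3

(followed by `systemctl daemon-reload && systemctl restart ssh`)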

Thanks.
