accept returns duplicate endpoints under load

Bug #1924298 reported by Mark Thomas
This bug affects 20 people
Affects         Status     Importance  Assigned to  Milestone
Linux           New        Undecided   Unassigned
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

When accepting client connections under load, duplicate endpoints may be returned. These endpoints will have different (usually sequential) file descriptors but will refer to the same connection (same server IP, same server port, same client IP, same client port). Both copies of the endpoint appear to be functional.

Reproduction requires:
- compilation of the attached server.c program
- wrk (https://github.com/wg/wrk) to generate load

The steps to reproduce are:
- run the 'server' application in one console window
- run 'for i in {1..50}; do /opt/wrk/wrk -t 2 -c 1000 -d 5s --latency --timeout 1s http://localhost:5555/post; done' in a second console window
- run the same command in a third window to generate concurrent load

You may need to run additional instances of the wrk command in multiple windows to trigger the issue.

When the problem occurs the server executable will exit and print some debugging info. e.g.:
accerror = 1950892, counter = 10683, port = 59892, clientfd = 233, lastClient = 232

This indicates that the sockets with file descriptors 233 and 232 are duplicates.
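
As an illustration of the kind of check that produces the output above, here is a minimal sketch of how two accepted sockets could be compared. It is not taken from the attached server.c; the function names `same_endpoint` and `same_peer` are hypothetical, and the sketch assumes IPv4 and a single listening address, so comparing the client side is enough.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Returns 1 if both IPv4 addresses carry the same IP and port, else 0. */
static int same_endpoint(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
    return a->sin_addr.s_addr == b->sin_addr.s_addr && a->sin_port == b->sin_port;
}

/*
 * Returns 1 if two accepted sockets report the same IPv4 peer,
 * 0 if the peers differ, and -1 if either lookup fails or is not IPv4.
 */
static int same_peer(int fd_a, int fd_b)
{
    struct sockaddr_in a, b;
    socklen_t len_a = sizeof a, len_b = sizeof b;

    memset(&a, 0, sizeof a);
    memset(&b, 0, sizeof b);
    if (getpeername(fd_a, (struct sockaddr *)&a, &len_a) != 0 ||
        getpeername(fd_b, (struct sockaddr *)&b, &len_b) != 0)
        return -1;
    if (a.sin_family != AF_INET || b.sin_family != AF_INET)
        return -1;
    return same_endpoint(&a, &b);
}
```

An accept loop could call `same_peer(clientfd, lastClient)` after each `accept` and treat a result of 1 as a duplicate, provided the previous socket is still open at that point.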

The issue has been reproduced on fully patched versions of Ubuntu 20.04 and 18.04. Other versions have not been tested.

This issue was originally observed in Java and was reported against the Spring Framework:
https://github.com/spring-projects/spring-framework/issues/26434

Investigation from the Spring team and the Apache Tomcat team identified that it appeared to be a JDK issue:
https://bugs.openjdk.java.net/browse/JDK-8263243

Further research from the JDK team determined that the issue was at the OS level. Hence this report.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-71-generic 5.4.0-71.79
ProcVersionSignature: Ubuntu 5.4.0-71.79-generic 5.4.101
Uname: Linux 5.4.0-71-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu27.16
Architecture: amd64
CasperMD5CheckResult: skip
CurrentDesktop: ubuntu:GNOME
Date: Thu Apr 15 12:52:53 2021
HibernationDevice: RESUME=UUID=f5a46e09-d99b-4475-8ab6-2cd70da8418d
InstallationDate: Installed on 2017-02-02 (1532 days ago)
InstallationMedia: Ubuntu 16.04.1 LTS "Xenial Xerus" - Release amd64 (20160719)
IwConfig:
 lo no wireless extensions.

 docker0 no wireless extensions.

 eno1 no wireless extensions.
MachineType: Gigabyte Technology Co., Ltd. Default string
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-71-generic root=/dev/mapper/ubuntu--vg-root ro text
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-71-generic N/A
 linux-backports-modules-5.4.0-71-generic N/A
 linux-firmware 1.187.10
RfKill:

SourcePackage: linux
UpgradeStatus: Upgraded to focal on 2020-09-07 (219 days ago)
dmi.bios.date: 06/13/2016
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: F22
dmi.board.asset.tag: Default string
dmi.board.name: X99-SLI-CF
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: x.x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrF22:bd06/13/2016:svnGigabyteTechnologyCo.,Ltd.:pnDefaultstring:pvrDefaultstring:rvnGigabyteTechnologyCo.,Ltd.:rnX99-SLI-CF:rvrx.x:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: Default string
dmi.product.name: Default string
dmi.product.sku: Default string
dmi.product.version: Default string
dmi.sys.vendor: Gigabyte Technology Co., Ltd.

Mark Thomas (asfmarkt) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Hari Krishna Sidhanthi (hsidhanthi) wrote :

The issue has been reproduced on fully patched versions of Ubuntu 21.10 as well.

Thomas Kiesl (thomaskiesl) wrote :

I get the same issue on "CentOS Linux 7.9.2009 (Core)". My Spring application (with embedded Tomcat) logs the following message:

  2021-12-25 08:38:31,250 ERROR | https-jsse-nio-18443-Acceptor | org.apache.tomcat.util.net.Acceptor | Socket accept failed
  java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
          at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:545)
          at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:78)
          at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:129)
          at java.base/java.lang.Thread.run(Thread.java:833)

Mark Thomas (asfmarkt) wrote :

CentOS version of this bug report:
https://bugs.centos.org/view.php?id=18383

Branimir Amidžić (ambraspace) wrote :

Arch Linux is also affected.

Operating System: Arch Linux
KDE Plasma Version: 5.23.4
KDE Frameworks Version: 5.89.0
Qt Version: 5.15.2
Kernel Version: 5.15.12-arch1-1 (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
Memory: 14,6 GiB of RAM
Graphics Processor: AMD Radeon Vega 8 Graphics

I'm using Spring Boot 2.6.2 and:
openjdk 17.0.1 2021-10-19
OpenJDK Runtime Environment (build 17.0.1+12)
OpenJDK 64-Bit Server VM (build 17.0.1+12, mixed mode)

information type: Public → Public Security
Chen (genghischen) wrote :

Amazon Linux is affected as well.

Operating System: Amazon Linux 2
Kernel: Linux 4.14.256-197.484.amzn2.aarch64
Architecture: arm64

openjdk version "17" 2021-09-14
OpenJDK Runtime Environment Temurin-17+35 (build 17+35)
OpenJDK 64-Bit Server VM Temurin-17+35 (build 17+35, mixed mode, sharing)

Tomcat 9.0.56

Brooke Hedrick (brooke-t-hedrick) wrote :

We are seeing this error on Windows Server 2019

Jan 21, 2022 10:30:21 AM org.apache.tomcat.util.net.AprEndpoint setSocketOptions
SEVERE: Error allocating socket processor
java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
 at org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(AprEndpoint.java:811)
 at org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(AprEndpoint.java:83)
 at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:149)
 at java.base/java.lang.Thread.run(Thread.java:829)

Adoptium JDK jdk-11.0.13+8 64bit
Apache Tomcat 9.0.56
tomcat-native-1.2.31
openssl-1.1.1l
apr-1.7.0
apr-util-1.6.1

2022-01-21 09:58:59 Apache Commons Daemon procrun stdout initialized.
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: Loaded Apache Tomcat Native library [1.2.31] using APR version [1.7.0].
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: APR capabilities: IPv6 [true], sendfile [true], accept filters [false], random [true], UDS [true].
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: APR/OpenSSL configuration: useAprConnector [false], useOpenSSL [true]
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.AprLifecycleListener initializeSSL
INFO: OpenSSL successfully initialized [OpenSSL 1.1.1l 24 Aug 2021]
Jan 21, 2022 9:59:01 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-apr-0.0.0.0-xxxxx"]
Jan 21, 2022 9:59:01 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["https-openssl-apr-0.0.0.0-xxxxx"]
Jan 21, 2022 9:59:01 AM org.apache.tomcat.util.net.openssl.OpenSSLUtil getKeyManagers
Jan 21, 2022 9:59:01 AM org.apache.catalina.startup.Catalina load
INFO: Server initialization in [667] milliseconds
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.StandardService startInternal
INFO: Starting service [Catalina]
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet engine: [Apache Tomcat/9.0.56]

We are also seeing the error on Ubuntu 20.04.3 LTS

Jan 21, 2022 8:55:02 AM org.apache.tomcat.util.net.AprEndpoint setSocketOptions
SEVERE: Error allocating socket processor
java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
        at org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(AprEndpoint.java:811)
        at org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(AprEndpoint.java:83)
        at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:149)
        at java.base/java.lang.Thread.run(Thread.java:829)

Jan 21, 2022 7:08:11 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: Loaded Apache Tomcat Native library [1.2.31] using APR version [1.7.0].
Jan 21, 2022 7:08:11 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: APR capabilities: IPv6 [true], sendfile [true], accept filters [false], random [true]...


Seth Arnold (seth-arnold) wrote : Bug is not a security issue

Thanks for taking the time to report this bug and helping to make Ubuntu better. We appreciate the difficulties you are facing, but this appears to be a "regular" (non-security) bug. I have unmarked it as a security issue since this bug does not show evidence of allowing attackers to cross privilege boundaries nor directly cause loss of data/privacy. Please feel free to report any other bugs you may find.

information type: Public Security → Public
Brooke Hedrick (brooke-t-hedrick) wrote :

Not sure if this is useful, but I ran the test under WSL with Ubuntu 20.04.3 LTS, with two "console windows" running wrk at the same time.

This is how the server failed:
Listening on port 5555
pthread_create: Cannot allocate memory

I realize this is WSL and not native Windows. We have spent about 90 minutes trying to port server.c to run natively on Windows and haven't succeeded yet. We are not C/C++ developers, though.

Martin Huch (martin-huch) wrote :

Hi,

I am pretty sure that this bug also affects me.
I'm running a Spring application on Ubuntu 20.04.

It is not easy for me to run the reproduction steps, because I only have limited access to the OS.
The reason I am pretty sure is that I have the same situation as described here for the Spring Framework:
https://github.com/spring-projects/spring-framework/issues/26434

As a consequence, when I update Spring Boot to 2.5 I get critical errors.
If I do not update (and stay with Spring Boot 2.4), I do not get this error.
I want and need to update.

Please reconsider fixing this bug.

Martin Huch (martin-huch) wrote :

Ohhh ... this was not intelligent at all, what I just wrote.
(This reads like Spring has the issue and not the OS in my case ... although I believe the OS has the issue)

Sorry for the confusion. If I can reproduce it I will.

(Note: Editing/Deleting my prev. comment is not possible although at least the icons are there)

Matt Magoffin (msqr) wrote :

I am impacted by this bug as well, as reported in my Tomcat 9-based application logs:

[2022-04-18 21:45:44.764] ERROR [http-nio-9083-Acceptor ] org.apache.tomcat.util.net.Acceptor Socket accept failed
 java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298

This app runs on AWS ECS Fargate v1.4 in an Ubuntu 18.04 amd64 based image using the BellSoft Liberica JRE 11. I get these log messages a couple of times a day on average.

The application continues running, seemingly without any problems. The ERROR logs trigger alerts, which we will adapt to ignore, but I wanted to add my experience with this issue here.

Christopher Gual (cgual) wrote :

We have been hitting this bug quite often while running Tomcat 8.5 on Amazon AWS Linux 2 with a kernel of 4.14.268-205.500.amzn2.x86_64

I wanted to see if the bug could be reproduced using an updated kernel, so I attempted to repro it using the server code and methodology provided by Mark Thomas on Ubuntu Server 21.10 (running on a Raspberry Pi 4 with 4GB RAM) and was NOT able to repro the bug (kernel 5.13.0-1008-raspi). I then installed Ubuntu Server 20.04 LTS on the same machine and WAS able to repro the bug (kernel 5.4.0-1052-raspi). The bug was fairly easy to repro and did not take multiple attempts.

Since then I have been able to repro the bug using the server code on AWS Linux 2 with the 4.14.268-205.500.amzn2.x86_64 kernel, but not on AWS Linux 2 with a 5.10.109-104.500.amzn2.x86_64 kernel.

I think there is a slight problem with the server code used in the repro, as it is calling `pthread_create` with no thread attributes, which will create joinable threads instead of detached threads. The documentation for `pthread_create` says that "Only when a terminated joinable thread has been joined are the last of its resources released back to the system." Because the server code never joins the threads, I think this is preventing the OS from releasing the thread resources. This results in the server eventually running out of memory and failing with "pthread_create: Cannot allocate memory", as mentioned by Brooke Hedrick in their comment. I was also not able to repro the bug on WSL (kernel 4.4.0-19041-Microsoft), but perhaps their underlying network drivers are different?

I also was running into this issue when running the server code. I made a slight modification to the server code to set the pthread attribute to create the new threads in a detached state. This seemed to solve the memory issue and I was able to repro the bug with this server. I've attached the code.
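
For readers without access to the attachment, the detached-thread change described above might look roughly like the following sketch. It is an approximation, not the attached code; `spawn_detached` and `handle_client` are hypothetical names, and the worker here only closes the descriptor where a real server would service the request first.

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical per-connection worker: receives the client fd and closes it. */
static void *handle_client(void *arg)
{
    int fd = (int)(intptr_t)arg;
    close(fd);
    return NULL;
}

/*
 * Starts fn(arg) on a detached thread, so its resources are released when it
 * exits without anyone calling pthread_join.
 * Returns 0 on success, otherwise the error number from the pthread call.
 */
static int spawn_detached(void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}
```

In an accept loop this would be called as `spawn_detached(handle_client, (void *)(intptr_t)clientfd)`. Calling `pthread_detach` on the new thread after a plain `pthread_create` would be an alternative with the same effect.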

Additionally, I found it useful to use `prlimit` to update the maximum number of open files for the server process, once it was running. This made the server less likely to run into an EMFILE error when calling `accept`.
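
If adjusting the limit from outside with `prlimit` is inconvenient, a similar effect can probably be had from inside the server at start-up. The sketch below is an assumption on my part rather than something from the attached code: it raises the soft open-files limit of the calling process to its hard limit using the POSIX `getrlimit`/`setrlimit` calls, and `raise_nofile_limit` is a hypothetical helper name.

```c
#include <sys/resource.h>

/*
 * Raises the soft RLIMIT_NOFILE of the calling process to its hard limit.
 * Returns 0 on success and -1 on error (errno is set by the failing call).
 */
static int raise_nofile_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

This cannot exceed the hard limit, so if the hard limit itself is too low for the test, it still has to be raised by a privileged user.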

Christopher Gual (cgual) wrote :

After spending some time I think I have narrowed down the bug fix to Linux kernel 5.10-rc6.

The bug reproed on Ubuntu with kernel 5.10-rc4 but not on 5.10-rc6.

Here is a diff of the kernel sources between those two versions: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?id=v5.10-rc6&id2=v5.10-rc4&dt=2

Of note are some changes to `net/ipv4/inet_hashtables.c`

Using `git blame` I think the most likely source of the bug fix was this commit by Ricardo Dias: https://github.com/torvalds/linux/commit/01770a166165738a6e05c3d911fb4609cc4eb416

The description for the commit describes a race condition which looks like it could cause the bug:

"When such event happens, the TCP stack code has a race condition that occurs between the momement a lookup is done to the established connections hashtable to check for the existence of a connection for the same client, and the moment that the child socket is added to the established connections hashtable. As a consequence, this race condition can lead to a situation where we add two child sockets to the established connections hashtable and deliver two sockets to the userspace program to the same client."

So for anyone who comes across this bug, the likely solution is to update your OS to a version which uses kernel 5.10 or greater.
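
To make the race in the quoted commit message more concrete, here is a toy user-space model. It is not kernel code and does not reflect how the kernel hashtable is implemented; the types `four_tuple` and `conn_table` and the function `insert_if_absent` are invented for illustration. The point it tries to show is that the lookup and the insert have to happen under the same lock. If they are two separate steps, two threads handling packets for the same client can both see "not present" and both insert.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONNS 1024

/* Toy connection key: source/destination address and port. */
struct four_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Toy "established connections" table guarded by one mutex. */
struct conn_table {
    pthread_mutex_t lock;
    struct four_tuple items[MAX_CONNS];
    size_t count;
};

static int tuple_equal(const struct four_tuple *a, const struct four_tuple *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

/*
 * Looks up t and inserts it only if absent, all while holding the lock.
 * Returns 1 if inserted, 0 if an equal entry already existed, -1 if full.
 */
static int insert_if_absent(struct conn_table *tbl, const struct four_tuple *t)
{
    int result = 1;

    pthread_mutex_lock(&tbl->lock);
    for (size_t i = 0; i < tbl->count; i++) {
        if (tuple_equal(&tbl->items[i], t)) {
            result = 0;
            break;
        }
    }
    if (result == 1) {
        if (tbl->count == MAX_CONNS)
            result = -1;
        else
            tbl->items[tbl->count++] = *t;
    }
    pthread_mutex_unlock(&tbl->lock);
    return result;
}
```

In this model, a caller that gets 0 back would discard its newly created child socket instead of handing a second copy to the application.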

Mark Thomas (asfmarkt) wrote :

Christopher,

Thank-you for the work you have put into researching this. It is really appreciated.

The system I originally used to reproduce this bug has since been updated to:
linux-image-5.13.0-40-generic 5.13.0-40.45~20.04.1

I have just attempted to reproduce the issue and I have been unable to. That is consistent with your findings.

I have also reviewed the git commit you identified and I agree that it is very likely that that commit fixed the bug reported here.

I am going to update the error message reported by Apache Tomcat to recommend updating the OS to a version that uses kernel 5.10 or greater.
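
For anyone who wants to apply the "kernel 5.10 or greater" guidance programmatically, a release string can be compared against that threshold with a sketch like the one below. `kernel_at_least` is a hypothetical helper, not part of Tomcat or any library, and it only looks at the leading major.minor numbers. It is a heuristic at best, since distributions backport fixes to older kernels.

```c
#include <stdio.h>

/*
 * Compares the leading "major.minor" of a kernel release string
 * (for example the release field filled in by uname(2)) with a threshold.
 * Returns 1 if release >= major.minor, 0 if it is older, -1 if unparsable.
 */
static int kernel_at_least(const char *release, int major, int minor)
{
    int maj, min;

    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return -1;
    return maj > major || (maj == major && min >= minor);
}
```

With the kernels mentioned in this report, `kernel_at_least("5.4.0-71-generic", 5, 10)` gives 0 and `kernel_at_least("5.13.0-40-generic", 5, 10)` gives 1.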

I would close this issue as fixed but I don't appear to have the necessary karma to do that.

Thanks again for your extremely helpful research.

Christopher Gual (cgual) wrote :

Hi Mark,

Thanks for the reply and confirmation.

I was talking to a co-worker today about this bug and they pointed out that even though it is fixed in the newer Linux kernel versions, Ubuntu 20.04 and 18.04 are under LTS until April 2025 and April 2023 respectively.

It might be good to leave this bug open to see if there is a way that Ubuntu can backport the fix for this issue to those versions.

I think there is a good case to argue that they should consider doing this, as the bug affects the reliability of TCP networking under load; however, I don't know the process for how Ubuntu bugs are triaged.

Chris

Christopher Gual (cgual) wrote :

I wrote to Ricardo Dias & Eric Dumazet to ask if the patch for this bug would be backported, and it looks like the fix was applied to the Linux stable kernels very recently: https://www.spinics.net/lists/stable-commits/msg244651.html

Eric Dumazet also mentioned that:
"the bug only happens if networking configuration is not optimal. (We never hit the bug at Google) Normally, all packets for a given 4-tuple should be handled by the same cpu. On multi-queues NIC, RSS ensures this, if only one cpu is servicing interrupts for any receive queue. Otherwise, cpus compete over the same spinlocks, and could hit the race that Ricardo fixed. I suggest you also work on networking configuration, as this could help even without this race, but once flows are established."

He suggested looking at `Documentation/networking/scaling.rst` as a starting point for tackling the network configuration.

Rushikesh (rushikesh7) wrote :

Hello,

I'm having the same issue on a CentOS 7 host with Dockerized Ubuntu 20.04.4 LTS. The container is running a Tomcat application, Liferay.

20-Jun-2022 05:46:35.598 SEVERE [http-nio-8080-Acceptor] org.apache.tomcat.util.net.Acceptor.run Socket accept failed
 java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
  at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:545)
  at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:78)
  at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:129)
  at java.lang.Thread.run(Thread.java:750)

y0zg (y0zg) wrote :

The same bug is reproducible on AWS EKS 1.23

mnmtanish@gmail.com (mnmtanish) wrote :

Hi Christopher,

I came across this bug on 5.10.124 ( Amazon Linux 2 on AWS ECS ). Perhaps it was fixed on a later version?

Marcel Guzmán de Rojas (mmrcel) wrote :

I have the same bug on fully patched Ubuntu on AWS. The error was reported once by a Java EE application under load, ~250 requests/min.

java -version
openjdk version "17.0.4" 2022-07-19
OpenJDK Runtime Environment (build 17.0.4+8-Ubuntu-120.04)
OpenJDK 64-Bit Server VM (build 17.0.4+8-Ubuntu-120.04, mixed mode, sharing)

uname -a
Linux g2 5.15.0-1022-aws #26~20.04.1-Ubuntu SMP Sat Oct 15 03:23:19 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Jean-Max Reymond (jmreymond-free) wrote :

The issue has been reproduced on fully patched versions of Ubuntu 22.04 as well.

nov. 30 06:42:04 sd-170774 tomcat9[395629]: java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
nov. 30 06:42:04 sd-170774 tomcat9[395629]: at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:548)
nov. 30 06:42:04 sd-170774 tomcat9[395629]: at org.apache.tomcat.util.net.NioEndpoint.serverSocketAccept(NioEndpoint.java:78)
nov. 30 06:42:04 sd-170774 tomcat9[395629]: at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:129)
nov. 30 06:42:04 sd-170774 tomcat9[395629]: at java.lang.Thread.run(Thread.java:750)

% uname -a
Linux sd-170774 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
% cat /etc/issue
Ubuntu 22.04.1 LTS \n \l
