Comment 14 for bug 1924298

Christopher Gual (cgual) wrote :

We have been hitting this bug quite often while running Tomcat 8.5 on Amazon Linux 2 with kernel 4.14.268-205.500.amzn2.x86_64.

I wanted to see if the bug could be reproduced on an updated kernel, so I attempted to repro it using the server code and methodology provided by Mark Thomas. On Ubuntu Server 21.10 (running on a Raspberry Pi 4 with 4GB RAM, kernel 5.13.0-1008-raspi) I was NOT able to repro the bug. I then installed Ubuntu Server 20.04 LTS on the same machine (kernel 5.4.0-1052-raspi) and WAS able to repro it. The bug was fairly easy to repro there and did not take multiple attempts.

Since then I have been able to repro the bug using the server code on Amazon Linux 2 with the 4.14.268-205.500.amzn2.x86_64 kernel, but not on Amazon Linux 2 with a 5.10.109-104.500.amzn2.x86_64 kernel. I was also not able to repro the bug on WSL (kernel 4.4.0-19041-Microsoft), but perhaps its underlying network drivers are different?

I think there is a slight problem with the server code used in the repro: it calls `pthread_create` with no thread attributes, which creates joinable threads rather than detached ones. The documentation for `pthread_create` says that "Only when a terminated joinable thread has been joined are the last of its resources released back to the system." Because the server code never joins its threads, I think this prevents the OS from releasing the thread resources. The server eventually runs out of memory and fails with "pthread_create: Cannot allocate memory", as mentioned by Brooke Hedrick in their comment.

I ran into this out-of-memory issue as well when running the server code. I made a slight modification so that the thread attribute passed to `pthread_create` creates the new threads in a detached state. This seemed to solve the memory issue, and I was able to repro the bug with the modified server. I've attached the code.
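For anyone who does not want to open the attachment, the change amounts to roughly the following. This is only a minimal sketch of the detached-thread approach, not the attached code itself; `spawn_detached` and `handle_connection` are hypothetical names used for illustration.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-connection worker; stands in for whatever the
 * repro server runs for each accepted socket. */
void *handle_connection(void *arg)
{
    (void)arg;
    return NULL;
}

/* Start a thread that is detached from the outset, so its resources are
 * released when it exits without anyone calling pthread_join.
 * Returns 0 on success or a pthread error number on failure. */
int spawn_detached(void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, fn, arg);

    /* The attribute object is no longer needed once the thread exists. */
    pthread_attr_destroy(&attr);

    if (rc != 0)
        fprintf(stderr, "pthread_create: %s\n", strerror(rc));
    return rc;
}
```

In the accept loop, the existing `pthread_create(&tid, NULL, ...)` call would be replaced by something like `spawn_detached(handle_connection, arg)`. Calling `pthread_detach` on the new thread right after a default `pthread_create` should have the same effect.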

Additionally, I found it useful to use `prlimit` to raise the maximum number of open files for the server process once it was running. This made the server less likely to hit an EMFILE error when calling `accept`.
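If you would rather not adjust the limit from outside, the server could probably raise it itself at startup. Note this is a different technique from the `prlimit` command I used: the sketch below uses the standard `getrlimit`/`setrlimit` calls from inside the process, and `raise_nofile_limit` is a hypothetical helper name. It assumes Linux, where the hard limit for `RLIMIT_NOFILE` is a finite value.

```c
#include <sys/resource.h>

/* Raise the soft limit on open file descriptors to the current hard
 * limit. An unprivileged process may do this; going beyond the hard
 * limit needs extra privileges.
 * Returns 0 on success, -1 on failure with errno set. */
int raise_nofile_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Calling `raise_nofile_limit()` before the accept loop starts should reduce the chance of EMFILE in the same way, as long as the hard limit is high enough for the test load.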